Scraping in R (using rvest): looping over multiple years



I am new to web scraping in R. I am using rvest.

I can manually get the match records for an individual year as follows:

## The URL
## http://stats.espncricinfo.com/ci/engine/records/index.html
## Structure
## RECORDS / ONE-DAY INTERNATIONALS / TEAM RECORDS / LIST OF MATCH RESULTS (BY YEAR)
library(rvest)
cricket_record <- read_html('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000;type=year')
cricket_record %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()

          Team 1       Team 2       Winner     Margin                Ground      Match Date  Scorecard
1    New Zealand  West Indies  New Zealand  3 wickets              Auckland     Jan 2, 2000 ODI # 1532
2    New Zealand  West Indies  New Zealand  7 wickets                 Taupo     Jan 4, 2000 ODI # 1533
3    New Zealand  West Indies  New Zealand  4 wickets                Napier     Jan 6, 2000 ODI # 1534
4    New Zealand  West Indies  New Zealand  8 wickets            Wellington   Jan 8-9, 2000 ODI # 1535
5      Australia     Pakistan     Pakistan    45 runs              Brisbane     Jan 9, 2000 ODI # 1536
6          India     Pakistan     Pakistan  2 wickets              Brisbane    Jan 10, 2000 ODI # 1537
7    New Zealand  West Indies  New Zealand    20 runs          Christchurch    Jan 11, 2000 ODI # 1538
8      Australia        India    Australia    28 runs             Melbourne    Jan 12, 2000 ODI # 1539
9      Australia        India    Australia  5 wickets                Sydney    Jan 14, 2000 ODI # 1540
10     Australia     Pakistan    Australia  6 wickets             Melbourne    Jan 16, 2000 ODI # 1541
11     Australia     Pakistan    Australia    81 runs                Sydney    Jan 19, 2000 ODI # 1542
12         India     Pakistan     Pakistan    32 runs                Hobart    Jan 21, 2000 ODI # 1543
13  South Africa     Zimbabwe South Africa  6 wickets          Johannesburg    Jan 21, 2000 ODI # 1544
14     Australia     Pakistan    Australia    15 runs             Melbourne    Jan 23, 2000 ODI # 1545
15  South Africa      England      England  9 wickets          Bloemfontein    Jan 23, 2000 ODI # 1546
16         India     Pakistan        India    48 runs              Adelaide    Jan 25, 2000 ODI # 1547
17     Australia        India    Australia   152 runs              Adelaide    Jan 26, 2000 ODI # 1548
18  South Africa      England South Africa      1 run             Cape Town    Jan 26, 2000 ODI # 1549
19         India     Pakistan     Pakistan   104 runs                 Perth    Jan 28, 2000 ODI # 1550
20       England     Zimbabwe     Zimbabwe   104 runs             Cape Town    Jan 28, 2000 ODI # 1551
21     Australia        India    Australia  4 wickets                 Perth    Jan 30, 2000 ODI # 1552
22       England     Zimbabwe      England  8 wickets             Kimberley    Jan 30, 2000 ODI # 1553
23     Australia     Pakistan    Australia  6 wickets             Melbourne     Feb 2, 2000 ODI # 1554
24  South Africa     Zimbabwe     Zimbabwe  2 wickets                Durban     Feb 2, 2000 ODI # 1555
25     Australia     Pakistan    Australia   152 runs                Sydney     Feb 4, 2000 ODI # 1556
26  South Africa      England South Africa  2 wickets           East London     Feb 4, 2000 ODI # 1557
27  South Africa     Zimbabwe South Africa    53 runs        Port Elizabeth     Feb 6, 2000 ODI # 1558
28      Pakistan    Sri Lanka    Sri Lanka    29 runs               Karachi    Feb 13, 2000 ODI # 1559
29  South Africa      England South Africa    38 runs          Johannesburg    Feb 13, 2000 ODI # 1560
30      Pakistan    Sri Lanka    Sri Lanka    34 runs            Gujranwala    Feb 16, 2000 ODI # 1561
31      Zimbabwe      England      England  5 wickets              Bulawayo    Feb 16, 2000 ODI # 1562
32   New Zealand    Australia    no result                       Wellington    Feb 17, 2000 ODI # 1563
33      Zimbabwe      England      England   1 wicket              Bulawayo    Feb 18, 2000 ODI # 1564
34   New Zealand    Australia    Australia  5 wickets              Auckland    Feb 19, 2000 ODI # 1565
35      Pakistan    Sri Lanka    Sri Lanka   104 runs                Lahore    Feb 19, 2000 ODI # 1566
36      Zimbabwe      England      England    85 runs                Harare    Feb 20, 2000 ODI # 1567
37   New Zealand    Australia    Australia    50 runs               Dunedin    Feb 23, 2000 ODI # 1568
38   New Zealand    Australia    Australia    48 runs          Christchurch    Feb 26, 2000 ODI # 1569
39   New Zealand    Australia    Australia  5 wickets                Napier     Mar 1, 2000 ODI # 1570
40   New Zealand    Australia  New Zealand  7 wickets              Auckland     Mar 3, 2000 ODI # 1571
41         India South Africa        India  3 wickets                 Kochi     Mar 9, 2000 ODI # 1572
42         India South Africa        India  6 wickets            Jamshedpur    Mar 12, 2000 ODI # 1573
43         India South Africa South Africa  2 wickets             Faridabad    Mar 15, 2000 ODI # 1574
44         India South Africa        India  4 wickets              Vadodara    Mar 17, 2000 ODI # 1575
45         India South Africa South Africa    10 runs                Nagpur    Mar 19, 2000 ODI # 1576
46         India South Africa South Africa 10 wickets               Sharjah    Mar 22, 2000 ODI # 1577
47         India     Pakistan        India  5 wickets               Sharjah    Mar 23, 2000 ODI # 1578
48      Pakistan South Africa South Africa  3 wickets               Sharjah    Mar 24, 2000 ODI # 1579
49         India     Pakistan     Pakistan    98 runs               Sharjah    Mar 26, 2000 ODI # 1580
50         India South Africa South Africa  6 wickets               Sharjah    Mar 27, 2000 ODI # 1581
51      Pakistan South Africa     Pakistan    67 runs               Sharjah    Mar 28, 2000 ODI # 1582
52      Pakistan South Africa     Pakistan    16 runs               Sharjah    Mar 31, 2000 ODI # 1583
53   West Indies     Zimbabwe  West Indies    87 runs              Kingston     Apr 1, 2000 ODI # 1584
54   West Indies     Zimbabwe  West Indies    41 runs              Kingston     Apr 2, 2000 ODI # 1585
55      Pakistan     Zimbabwe     Pakistan  5 wickets             St John's     Apr 5, 2000 ODI # 1586
56  South Africa    Australia South Africa  6 wickets                Durban    Apr 12, 2000 ODI # 1587
57   West Indies     Pakistan  West Indies    96 runs             Kingstown    Apr 12, 2000 ODI # 1588
58  South Africa    Australia    Australia  5 wickets             Cape Town    Apr 14, 2000 ODI # 1589
59      Pakistan     Zimbabwe     Pakistan  6 wickets           St George's    Apr 15, 2000 ODI # 1590
60  South Africa    Australia South Africa  4 wickets          Johannesburg    Apr 16, 2000 ODI # 1591
61   West Indies     Pakistan  West Indies    17 runs           St George's    Apr 16, 2000 ODI # 1592
62   West Indies     Pakistan     Pakistan    17 runs            Bridgetown    Apr 19, 2000 ODI # 1593
63   West Indies     Pakistan  West Indies    60 runs         Port of Spain    Apr 22, 2000 ODI # 1594
64   West Indies     Pakistan     Pakistan  4 wickets         Port of Spain    Apr 23, 2000 ODI # 1595
65    Bangladesh    Sri Lanka    Sri Lanka  9 wickets                 Dhaka    May 29, 2000 ODI # 1596
66    Bangladesh        India        India  8 wickets                 Dhaka May 30-31, 2000 ODI # 1597
67         India    Sri Lanka    Sri Lanka    71 runs                 Dhaka     Jun 1, 2000 ODI # 1598
68    Bangladesh     Pakistan     Pakistan   233 runs                 Dhaka     Jun 2, 2000 ODI # 1599
69         India     Pakistan     Pakistan    44 runs                 Dhaka     Jun 3, 2000 ODI # 1600
70      Pakistan    Sri Lanka     Pakistan  7 wickets                 Dhaka     Jun 5, 2000 ODI # 1601
71      Pakistan    Sri Lanka     Pakistan    39 runs                 Dhaka     Jun 7, 2000 ODI # 1602
72     Sri Lanka     Pakistan    Sri Lanka  5 wickets                 Galle     Jul 5, 2000 ODI # 1603
73     Sri Lanka South Africa    Sri Lanka    37 runs                 Galle     Jul 6, 2000 ODI # 1604
74   West Indies     Zimbabwe     Zimbabwe  6 wickets               Bristol     Jul 6, 2000 ODI # 1605
75      Pakistan South Africa South Africa    18 runs         Colombo (RPS)     Jul 8, 2000 ODI # 1606
76       England     Zimbabwe     Zimbabwe  5 wickets              The Oval     Jul 8, 2000 ODI # 1607
77     Sri Lanka     Pakistan    Sri Lanka  6 wickets         Colombo (RPS)     Jul 9, 2000 ODI # 1608
78       England  West Indies    no result                           Lord's     Jul 9, 2000 ODI # 1609
79     Sri Lanka South Africa    Sri Lanka  8 wickets         Colombo (SSC)    Jul 11, 2000 ODI # 1610
80   West Indies     Zimbabwe     Zimbabwe    70 runs            Canterbury    Jul 11, 2000 ODI # 1611
81      Pakistan South Africa South Africa  7 wickets         Colombo (SSC)    Jul 12, 2000 ODI # 1612
82       England     Zimbabwe      England  8 wickets            Manchester    Jul 13, 2000 ODI # 1613
83     Sri Lanka South Africa    Sri Lanka    30 runs         Colombo (RPS)    Jul 14, 2000 ODI # 1614
84       England  West Indies      England 10 wickets     Chester-le-Street    Jul 15, 2000 ODI # 1615
85   West Indies     Zimbabwe     Zimbabwe  6 wickets     Chester-le-Street    Jul 16, 2000 ODI # 1616
86       England     Zimbabwe      England    52 runs            Birmingham    Jul 18, 2000 ODI # 1617
87       England  West Indies  West Indies     3 runs            Nottingham    Jul 20, 2000 ODI # 1618
88       England     Zimbabwe      England  6 wickets                Lord's    Jul 22, 2000 ODI # 1619
89     Australia South Africa    Australia    94 runs Melbourne (Docklands)    Aug 16, 2000 ODI # 1620
90     Australia South Africa         tied            Melbourne (Docklands)    Aug 18, 2000 ODI # 1621
91     Australia South Africa South Africa     8 runs Melbourne (Docklands)    Aug 20, 2000 ODI # 1622
92   New Zealand     Pakistan     Pakistan    12 runs             Singapore    Aug 20, 2000 ODI # 1623
93      Pakistan South Africa     Pakistan    28 runs             Singapore    Aug 23, 2000 ODI # 1624
94   New Zealand South Africa South Africa  8 wickets             Singapore    Aug 25, 2000 ODI # 1625
95      Pakistan South Africa South Africa    93 runs             Singapore    Aug 27, 2000 ODI # 1626
96      Zimbabwe  New Zealand  New Zealand  7 wickets                Harare    Sep 27, 2000 ODI # 1627
97      Zimbabwe  New Zealand     Zimbabwe    21 runs              Bulawayo    Sep 30, 2000 ODI # 1628
98      Zimbabwe  New Zealand     Zimbabwe  6 wickets              Bulawayo     Oct 1, 2000 ODI # 1629
99         Kenya        India        India  8 wickets         Nairobi (Gym)     Oct 3, 2000 ODI # 1630
100    Sri Lanka  West Indies    Sri Lanka   108 runs         Nairobi (Gym)     Oct 4, 2000 ODI # 1631
101   Bangladesh      England      England  8 wickets         Nairobi (Gym)     Oct 5, 2000 ODI # 1632
102    Australia        India        India    20 runs         Nairobi (Gym)     Oct 7, 2000 ODI # 1633
103     Pakistan    Sri Lanka     Pakistan  9 wickets         Nairobi (Gym)     Oct 8, 2000 ODI # 1634
104  New Zealand     Zimbabwe  New Zealand    64 runs         Nairobi (Gym)     Oct 9, 2000 ODI # 1635
105      England South Africa South Africa  8 wickets         Nairobi (Gym)    Oct 10, 2000 ODI # 1636
106  New Zealand     Pakistan  New Zealand  4 wickets         Nairobi (Gym)    Oct 11, 2000 ODI # 1637
107        India South Africa        India    95 runs         Nairobi (Gym)    Oct 13, 2000 ODI # 1638
108        India  New Zealand  New Zealand  4 wickets         Nairobi (Gym)    Oct 15, 2000 ODI # 1639
109        India    Sri Lanka    Sri Lanka  5 wickets               Sharjah    Oct 20, 2000 ODI # 1640
110 South Africa  New Zealand    no result                    Potchefstroom    Oct 20, 2000 ODI # 1641
111    Sri Lanka     Zimbabwe    Sri Lanka  7 wickets               Sharjah    Oct 21, 2000 ODI # 1642
112 South Africa  New Zealand South Africa  6 wickets                Benoni    Oct 22, 2000 ODI # 1643
113        India     Zimbabwe        India    13 runs               Sharjah    Oct 22, 2000 ODI # 1644
114     Pakistan      England      England  5 wickets               Karachi    Oct 24, 2000 ODI # 1645
115    Sri Lanka     Zimbabwe    Sri Lanka   123 runs               Sharjah    Oct 25, 2000 ODI # 1646
116 South Africa  New Zealand South Africa   115 runs             Centurion    Oct 25, 2000 ODI # 1647
117        India     Zimbabwe        India  3 wickets               Sharjah    Oct 26, 2000 ODI # 1648
118     Pakistan      England     Pakistan  8 wickets                Lahore    Oct 27, 2000 ODI # 1649
119        India    Sri Lanka    Sri Lanka    68 runs               Sharjah    Oct 27, 2000 ODI # 1650
120 South Africa  New Zealand South Africa  5 wickets             Kimberley    Oct 28, 2000 ODI # 1651
121        India    Sri Lanka    Sri Lanka   245 runs               Sharjah    Oct 29, 2000 ODI # 1652
122     Pakistan      England     Pakistan  6 wickets            Rawalpindi    Oct 30, 2000 ODI # 1653
123 South Africa  New Zealand South Africa  6 wickets                Durban     Nov 1, 2000 ODI # 1654
124 South Africa  New Zealand South Africa  3 wickets             Cape Town     Nov 4, 2000 ODI # 1655
125        India     Zimbabwe        India  3 wickets               Cuttack     Dec 2, 2000 ODI # 1656
126        India     Zimbabwe        India    61 runs             Ahmedabad     Dec 5, 2000 ODI # 1657
127        India     Zimbabwe     Zimbabwe   1 wicket               Jodhpur     Dec 8, 2000 ODI # 1658
128        India     Zimbabwe        India  9 wickets                Kanpur    Dec 11, 2000 ODI # 1659
129        India     Zimbabwe        India    39 runs                Rajkot    Dec 14, 2000 ODI # 1660
130 South Africa    Sri Lanka South Africa  4 wickets        Port Elizabeth    Dec 15, 2000 ODI # 1661
131 South Africa    Sri Lanka South Africa    95 runs           East London    Dec 17, 2000 ODI # 1662

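Q1: How can I loop over the years so that I get the match results for every year?

Second, I also need to scrape some information from the Scorecard column of the table. For example, ODI # 1532 has a link to the scorecard and match summary. Again, I can get this individually by giving each match link as input: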

cricket_score_odi <- read_html('http://www.espncricinfo.com/series/15743/scorecard/64640/new-zealand-vs-west-indies-1st-odi-west-indies-tour-of-new-zealand-1999-00')
cricket_score_odi %>%
    html_nodes('.cscore_info-overview , .match-detail--item:nth-child(3) h4 , .match-detail--item:nth-child(3) span , .cscore_name--long , #main-container .cscore_score') %>%
    html_text(trim = TRUE)
[1] "1st ODI, West Indies tour of New Zealand at Auckland, Jan 2 2000"
 [2] "West Indies"                                                     
 [3] "268/7"                                                           
 [4] "New Zealand"                                                     
 [5] "250/7 (45.1/46 ov, target 250)"                                  
 [6] "1st ODI, West Indies tour of New Zealand at Auckland, Jan 2 2000"
 [7] "West Indies"                                                     
 [8] "268/7"                                                           
 [9] "New Zealand"                                                     
[10] "250/7 (45.1/46 ov, target 250)"                                  
[11] "Toss"                                                            
[12] "West Indies , elected to bat first"  

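Q2: How can I get the required information from each match link in the Scorecard column?

Thanks very much indeed!

Based on @ulfelder's suggestion, I propose a purrr solution to both of your questions.

1. Preparation: I create a data frame with all the yearly URLs to map the scraping over.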

library(progress)
library(rvest)
library(tidyverse)
(df_url <- tibble(year = 2000:2001) %>%
  mutate(url = str_c("http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=", year, ";type=year", sep = "")))
# A tibble: 2 x 2
   year url                                                                                              
  <int> <chr>                                                                                            
1  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000;type=year
2  2001 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2001;type=year
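
As a side note, the same URL pattern can be wrapped in a small helper so that the year range is easy to change. This is only a sketch in base R; the function name `year_url` is made up for illustration, and it simply copies the query string from the URL above.

```r
# Hypothetical helper: build the yearly match-results URL for a vector of years.
# The query string is copied from the URL used above; only the year changes.
year_url <- function(year) {
  sprintf(
    "http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=%d;type=year",
    as.integer(year)
  )
}

# sprintf() is vectorised, so this returns one URL per year.
urls <- year_url(2000:2003)
print(urls)
```

A longer range such as `year_url(1990:2010)` would work the same way, at the cost of more requests in the later steps.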

2. Scrape the team records: I map the rvest functions over the data frame of yearly URLs.

(df_records <- df_url %>%
  mutate(record = map(url, ~ {read_html(.x) %>%
      html_nodes("table") %>%
      purrr::pluck(1) %>%
      html_table()
    })) %>%
  unnest(record))
# A tibble: 251 x 9
    year url                                                                      `Team 1`    `Team 2`    Winner     Margin   Ground     `Match Date` Scorecard
   <int> <chr>                                                                    <chr>       <chr>       <chr>      <chr>    <chr>      <chr>        <chr>    
 1  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html~ New Zealand West Indies New Zeala~ 3 wicke~ Auckland   Jan 2, 2000  ODI # 15~
 2  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html~ New Zealand West Indies New Zeala~ 7 wicke~ Taupo      Jan 4, 2000  ODI # 15~
 3  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html~ New Zealand West Indies New Zeala~ 4 wicke~ Napier     Jan 6, 2000  ODI # 15~
 4  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html~ New Zealand West Indies New Zeala~ 8 wicke~ Wellington Jan 8-9, 20~ ODI # 15~
 5  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html~ Australia   Pakistan    Pakistan   45 runs  Brisbane   Jan 9, 2000  ODI # 15~
# ... with 246 more rows
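
When many pages are requested in a row, a single failed request would stop the whole `map()` call. The sketch below shows one possible way to guard against that with `purrr::possibly()` and a short pause between requests. The function names `scrape_first_table`, `safe_scrape` and `polite_scrape` are made up for illustration, and the demo runs on a literal HTML string rather than on the real site.

```r
library(rvest)
library(purrr)

# Read a page (URL, file path, or literal HTML) and return its first table.
scrape_first_table <- function(src) {
  read_html(src) %>%
    html_nodes("table") %>%
    pluck(1) %>%
    html_table()
}

# possibly() returns `otherwise` instead of raising an error when the call fails.
safe_scrape <- possibly(scrape_first_table, otherwise = NULL)

# Hypothetical wrapper: wait `pause` seconds before each request.
polite_scrape <- function(src, pause = 1) {
  Sys.sleep(pause)
  safe_scrape(src)
}

# Demo on made-up HTML (not the real page) and on a path that does not exist.
demo_html <- "<table><tr><th>Team 1</th><th>Winner</th></tr><tr><td>India</td><td>India</td></tr></table>"
demo <- polite_scrape(demo_html, pause = 0)
bad  <- polite_scrape("no-such-file.html", pause = 0)
```

In step 2, `polite_scrape` could take the place of the anonymous function inside `map()`; years that fail would then hold `NULL` in the list column instead of aborting the run.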

3. Extract the scorecard URLs: I pull the URLs of the scorecards from the href attributes.

(df_url_card <- df_url %>%
  mutate(url_card = map(url, ~{read_html(.x) %>%
      html_nodes("td:nth-child(7) .data-link") %>%
      html_attr("href")
  })) %>%
  unnest(url_card) %>%
  mutate(url_card = str_c("http://stats.espncricinfo.com", url_card, sep = "")))
# A tibble: 251 x 3
    year url                                                                                             url_card                                              
   <int> <chr>                                                                                           <chr>                                                 
 1  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000;type=y~ http://stats.espncricinfo.com/ci/engine/match/64640.h~
 2  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000;type=y~ http://stats.espncricinfo.com/ci/engine/match/64641.h~
 3  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000;type=y~ http://stats.espncricinfo.com/ci/engine/match/64642.h~
 4  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000;type=y~ http://stats.espncricinfo.com/ci/engine/match/64643.h~
 5  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000;type=y~ http://stats.espncricinfo.com/ci/engine/match/65587.h~
# ... with 246 more rows
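
To make the href step more concrete, here is a minimal sketch on a made-up HTML fragment. The fragment only imitates the kind of markup the `.data-link` selector targets; it is an assumption, not a copy of the real page. It also keeps the link text next to each URL, which can be useful later as a key, and uses `xml2::url_absolute()` as an alternative to pasting the host name in front.

```r
library(rvest)
library(xml2)

# Made-up fragment imitating one table row with a scorecard link.
demo_html <- '<table><tr>
  <td>Auckland</td>
  <td><a class="data-link" href="/ci/engine/match/64640.html">ODI # 1532</a></td>
</tr></table>'

links <- read_html(demo_html) %>% html_nodes(".data-link")

# Relative link from the href attribute, resolved against the site root.
rel_url <- html_attr(links, "href")
abs_url <- url_absolute(rel_url, base = "http://stats.espncricinfo.com/")

# Visible link text, e.g. the value shown in the Scorecard column.
label <- html_text(links, trim = TRUE)

link_table <- data.frame(Scorecard = label, url_card = abs_url,
                         stringsAsFactors = FALSE)
print(link_table)
```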

4. Scrape the scorecards: I map the rvest functions over the scorecard URLs. Since this can be a large number of URLs, I recommend using a progress bar.

pb <- progress_bar$new(format = "  downloading [:bar] :percent eta: :eta", total = nrow(df_url_card))
(df_scorecard <- df_url_card %>%
  mutate(scorecard = map(url_card, ~{pb$tick()
    read_html(.x) %>%
      html_nodes('.cscore_info-overview , .match-detail--item:nth-child(3) h4 , .match-detail--item:nth-child(3) span , .cscore_name--long , #main-container .cscore_score') %>%
      html_text(trim = TRUE)
  })))
# A tibble: 251 x 4
    year url                                                                                      url_card                                            scorecard
   <int> <chr>                                                                                    <chr>                                               <list>   
 1  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000~ http://stats.espncricinfo.com/ci/engine/match/6464~ <chr [12~
 2  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000~ http://stats.espncricinfo.com/ci/engine/match/6464~ <chr [12~
 3  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000~ http://stats.espncricinfo.com/ci/engine/match/6464~ <chr [12~
 4  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000~ http://stats.espncricinfo.com/ci/engine/match/6464~ <chr [12~
 5  2000 http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2000~ http://stats.espncricinfo.com/ci/engine/match/6558~ <chr [12~
# ... with 246 more rows
df_scorecard$scorecard[1][[1]]
 [1] "1st ODI, West Indies tour of New Zealand at Auckland, Jan 2 2000" "West Indies"                                                     
 [3] "268/7"                                                            "New Zealand"                                                     
 [5] "250/7 (45.1/46 ov, target 250)"                                   "1st ODI, West Indies tour of New Zealand at Auckland, Jan 2 2000"
 [7] "West Indies"                                                      "268/7"                                                           
 [9] "New Zealand"                                                      "250/7 (45.1/46 ov, target 250)"                                  
[11] "Toss"                                                             "West Indies , elected to bat first" 
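
Each element of the `scorecard` list column is a plain character vector, so it still has to be turned into columns. The following base R sketch shows one way to do that for the vector printed above. The helper name `tidy_scorecard` and the column names are made up, and it assumes every scorecard follows the same layout as this example (five header fields first, and the toss result right after the "Toss" label), which may not hold for abandoned or unusual matches.

```r
# The 12 strings returned for the first match, as shown above.
raw <- c(
  "1st ODI, West Indies tour of New Zealand at Auckland, Jan 2 2000",
  "West Indies", "268/7", "New Zealand", "250/7 (45.1/46 ov, target 250)",
  "1st ODI, West Indies tour of New Zealand at Auckland, Jan 2 2000",
  "West Indies", "268/7", "New Zealand", "250/7 (45.1/46 ov, target 250)",
  "Toss", "West Indies , elected to bat first"
)

# Hypothetical helper: pick fields by position and look up the toss by its label.
tidy_scorecard <- function(x) {
  toss_pos <- match("Toss", x)
  toss <- if (is.na(toss_pos)) NA_character_ else x[toss_pos + 1]
  data.frame(
    match   = x[1],
    side_a  = x[2], score_a = x[3],
    side_b  = x[4], score_b = x[5],
    toss    = sub("\\s+,", ",", toss),  # remove the stray space before the comma
    stringsAsFactors = FALSE
  )
}

row <- tidy_scorecard(raw)
print(row)
```

Applied to the whole column, something like `do.call(rbind, lapply(df_scorecard$scorecard, tidy_scorecard))` would give one row per match, provided the layout assumption holds.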

Using url and url_card (with some processing), you can join the scorecards back onto your match records.
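
One possible form of that processing is sketched below with toy data. It assumes that, when collecting the scorecard links, you also keep the link text (for example "ODI # 1532") in a column named `Scorecard`, so that year plus `Scorecard` can serve as the join key. The tibbles and the `url_card` values here are placeholders, not scraped results.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the match records from step 2.
records <- tibble(
  year      = c(2000L, 2000L),
  Winner    = c("New Zealand", "New Zealand"),
  Scorecard = c("ODI # 1532", "ODI # 1533")
)

# Toy stand-in for the scorecard links, deliberately in a different order.
# The url_card values are placeholders for the real scorecard URLs.
cards <- tibble(
  year      = c(2000L, 2000L),
  Scorecard = c("ODI # 1533", "ODI # 1532"),
  url_card  = c("placeholder-b", "placeholder-a")
)

# Join on the shared keys so that row order does not matter.
joined <- left_join(records, cards, by = c("year", "Scorecard"))
print(joined)
```

Joining on explicit keys is a design choice here: it avoids relying on the two scrapes returning rows in the same order.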
