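I want to scrape a list of elements (player name, cost, buyer, seller, day) from a local HTML file, but when I try to scrape the buyer and the seller I have a problem with the 2 and the 3 in the selectors below (in this case "computer" and "peter" for the first transfer, and "computer" and "james" for the second transfer):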
document.querySelector("#pressReleases > ul > li:nth-child(**2**) > ul > li.text > div > strong:nth-child(2)")
document.querySelector("#pressReleases > ul > li:nth-child(**3**) > ul > li.text > div > strong:nth-child(2)")
How can I scrape the li elements that these two variables come from?
I have already tried this in R:
dades<- mylocalfile
player<-dades %>% html_nodes("ul.player li.text strong") %>% html_text() %>% trimws()
cost<-dades %>% html_nodes("ul.player li.text span") %>% html_text() %>% trimws()
buyer<-dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(2)") %>% html_text() %>% trimws()
seller<-dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(1)") %>% html_text() %>% trimws()
day<-dades %>% html_nodes("ul.player li.text time") %>% html_text() %>% trimws()
I noticed that the 2 in #pressReleases > ul > li:nth-child(2) is different for each li class="post pressRelease", so a fixed index only matches one transfer.
The HTML code:
<div class="newsList" id="pressReleases">
<ul>
<li class="date" style="background-color: rgb(128, 128, 128);">
<strong>Fitxatges del dia</strong>
09/08/2019
</li>
<li class="post pressRelease">
<ul class="player">
<li class="photo">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, "L", "espanyol.png")">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol(1).png" alt="Espanyol" class="crest">
</li>
<li class="text">
<strong>Player1</strong>
<time>09/08/2019 - 05:30</time>
<span>16.245.485 €</span>
<div class="from">
D'
<strong>computer</strong>
a
<strong>peter</strong>
</div>
</li>
<a class="icon-revert">
</a>
</ul>
<div class="bid second">
<span class="triangle"></span>
<strong class="second">2º puja</strong>
<strong>matheu:</strong>
<span class="price">15.925.828 €</span>
</div>
</li>
<li class="post pressRelease">
<ul class="player">
<li class="photo">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, "L", "real-sociedad.png")">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad(1).png" alt="Real Sociedad" class="crest">
</li>
<li class="text">
<strong>Player2</strong>
<time>09/08/2019 - 05:30</time>
<span>1.111.711 €</span>
<div class="from">
D'
<strong>computer</strong>
a
<strong>james</strong>
</div>
</li>
<a class="icon-revert">
</a>
</ul>
</li>
Here is a possible solution for getting the buyer and the seller:
# Read the local file
URL <- 'D:/Test/Test.html'
wp <- xml2::read_html(URL, encoding = 'utf-8')
# Extract the relevant nodes
node <- rvest::html_nodes(wp, '.from')
# Extract the names
seller <- gsub('.*D\'\r\n\\s+(.*?)\r\n\\s+a\\s?\r\n\\s+(.*?)\r\n.*', '\\1', rvest::html_text(node))
# [1] "computer" "computer"
buyer <- gsub('.*D\'\r\n\\s+(.*?)\r\n\\s+a\\s?\r\n\\s+(.*?)\r\n.*', '\\2', rvest::html_text(node))
# [1] "peter" "james"
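The pattern above depends on the exact line endings and indentation of the saved file. As a hedged alternative, here is a minimal sketch of a whitespace-tolerant variant in base R. The `from_text` vector is made-up sample input that only imitates what `html_text()` might return for the two `.from` nodes (one with CRLF and indentation, one with bare LF); it is not taken from the real file.

```r
# Hypothetical sample input: the text of two .from nodes, with different
# line endings and indentation, as html_text() might return them.
from_text <- c(
  "\r\n      D'\r\n      computer\r\n      a\r\n      peter\r\n    ",
  "\nD'\ncomputer\na\njames\n"
)

# Any run of whitespace (spaces, CR, LF) may separate the pieces:
# D' <seller> a <buyer>
pattern <- "^\\s*D'\\s+(.*?)\\s+a\\s+(.*?)\\s*$"

# perl = TRUE selects the PCRE engine for the lazy groups.
seller <- sub(pattern, "\\1", from_text, perl = TRUE)
buyer  <- sub(pattern, "\\2", from_text, perl = TRUE)

print(seller)
print(buyer)
```

This sketch assumes each name sits on a single line; a user name that itself contains " a " surrounded by whitespace would be split in the wrong place.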
Have you tried, for the seller,
#pressReleases .from strong:nth-child(1)
and for the buyer,
#pressReleases .from strong:nth-child(2)
Assuming you have already read the HTML into a variable page, then (extend this to include your other variables):
sellers <- page %>% html_nodes("#pressReleases .from strong:nth-child(1)") %>% html_text
buyers <- page %>% html_nodes("#pressReleases .from strong:nth-child(2)") %>% html_text
df <- as.data.frame(cbind(buyers,sellers))
The data frame should then be easy to export.
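To illustrate the "extend to include your other variables" step, here is a minimal sketch that builds all five columns at once. It selects every `li.pressRelease` node first and then uses `html_node()` (singular) inside each one, so a transfer with a missing field yields `NA` instead of shifting the rows. Following the question's own code, the first `strong` in `.from` is taken as the seller and the second as the buyer. The inline `html` string is a trimmed copy of the markup from the question, and the helper `text_of` and the output file name `transfers.csv` are hypothetical.

```r
library(rvest)  # assumes rvest (and its dependency xml2) are installed

# Trimmed copy of the markup from the question, inlined so the sketch runs
# without the local file; in practice use read_html("path/to/file.html").
html <- '<div class="newsList" id="pressReleases"><ul>
  <li class="date"><strong>Fitxatges del dia</strong> 09/08/2019</li>
  <li class="post pressRelease"><ul class="player"><li class="text">
    <strong>Player1</strong><time>09/08/2019 - 05:30</time>
    <span>16.245.485 EUR</span>
    <div class="from">D\' <strong>computer</strong> a <strong>peter</strong></div>
  </li></ul></li>
  <li class="post pressRelease"><ul class="player"><li class="text">
    <strong>Player2</strong><time>09/08/2019 - 05:30</time>
    <span>1.111.711 EUR</span>
    <div class="from">D\' <strong>computer</strong> a <strong>james</strong></div>
  </li></ul></li>
</ul></div>'

page <- read_html(html)

# One node per transfer; no positional index on the outer li is needed.
posts <- html_nodes(page, "#pressReleases li.pressRelease")

# Hypothetical helper: first match of `css` inside each post, trimmed.
text_of <- function(css) trimws(html_text(html_node(posts, css)))

transfers <- data.frame(
  player = text_of("li.text > strong"),
  cost   = text_of("li.text > span"),
  seller = text_of(".from strong:nth-child(1)"),
  buyer  = text_of(".from strong:nth-child(2)"),
  day    = text_of("li.text > time"),
  stringsAsFactors = FALSE
)

print(transfers)

# Export; the file name is only an example.
write.csv(transfers, file.path(tempdir(), "transfers.csv"), row.names = FALSE)
```

The child combinator in `li.text > strong` keeps the player name apart from the two `strong` elements nested inside `div.from`.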