Scraping a list in R

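I want to scrape a list of elements (player name, cost, buyer, seller, day) from a local HTML file, but when I try to scrape the buyer and seller I run into a problem with the indices 2 and 3 in the selectors below (in this case "computer" and "peter" for the first transfer, and "computer" and "james" for the second):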

document.querySelector("#pressReleases > ul > li:nth-child(**2**) > ul > li.text > div > strong:nth-child(2)")
document.querySelector("#pressReleases > ul > li:nth-child(**3**) > ul > li.text > div > strong:nth-child(2)")

How can I scrape the li elements that make up these 2 variables?

I have already tried this in R:

dades<- mylocalfile
player<-dades %>% html_nodes("ul.player li.text strong") %>% html_text() %>% trimws()
cost<-dades %>% html_nodes("ul.player li.text span") %>% html_text() %>% trimws()
buyer<-dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(2)") %>% html_text() %>% trimws()
seller<-dades %>% html_nodes("#pressReleases > ul > li:nth-child(2) > ul > li.text > div > strong:nth-child(1)") %>% html_text() %>% trimws()
day<-dades %>% html_nodes("ul.player li.text time") %>% html_text() %>% trimws()

I noticed that the 2 in #pressReleases > ul > li:nth-child(2) changes for each li class="post pressRelease", so a fixed index only matches one transfer.

The HTML code:

<div class="newsList" id="pressReleases">
<ul>
<li class="date" style="background-color: rgb(128, 128, 128);">
<strong>Fitxatges del dia</strong>
09/08/2019
</li>
<li class="post pressRelease">
<ul class="player">
<li class="photo">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, &quot;L&quot;, &quot;espanyol.png&quot;)">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/espanyol(1).png" alt="Espanyol" class="crest">
</li>
<li class="text">
<strong>Player1</strong>
<time>09/08/2019 - 05:30</time>
<span>16.245.485 €</span>
<div class="from">
D'
<strong>computer</strong>
a 
<strong>peter</strong>
</div>
</li>
<a class="icon-revert">
</a>
</ul>
<div class="bid second">
<span class="triangle"></span>
<strong class="second">2º puja</strong>
<strong>matheu:</strong>
<span class="price">15.925.828 €</span>
</div>
</li>
<li class="post pressRelease">
<ul class="player">
<li class="photo">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad.png" onerror="Futmondo.Helpers.Resources.onErrorPlayerPhoto(this, &quot;L&quot;, &quot;real-sociedad.png&quot;)">
<img src="./futmondo - Fútbol fantasy manager - futmondo_files/real-sociedad(1).png" alt="Real Sociedad" class="crest">
</li>
<li class="text">
<strong>Player2</strong>
<time>09/08/2019 - 05:30</time>
<span>1.111.711 €</span>
<div class="from">
D'
<strong>computer</strong>
a 
<strong>james</strong>
</div>
</li>
<a class="icon-revert">
</a>
</ul>
</li>

Here is a possible solution for getting the buyer/seller:

# Read the local file
URL <- 'D:/Test/Test.html'
wp <- xml2::read_html(URL, encoding = 'utf-8')
# Extract the relevant nodes
node <- rvest::html_nodes(wp, '.from')
# Extract the names
seller <- gsub('.*D\'\r\n\\s+(.*?)\r\n\\s+a\\s?\r\n\\s+(.*?)\r\n.*', '\\1', rvest::html_text(node))
# [1] "computer" "computer"
buyer <- gsub('.*D\'\r\n\\s+(.*?)\r\n\\s+a\\s?\r\n\\s+(.*?)\r\n.*', '\\2', rvest::html_text(node))
# [1] "peter" "james"
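
The pattern in this answer expects Windows line breaks (`\r\n`) followed by indentation between the parts of each `.from` block, so it may not match if the file was saved with different line endings. As a hedged sketch, a whitespace-tolerant variant can be checked on plain strings first. The two input strings below are made-up stand-ins for what `html_text()` might return, not output from the real file:

```r
# Made-up inputs shaped like html_text() output for a .from node:
# one with Unix line endings and no indentation,
# one with Windows line endings and indentation.
txt <- c("\nD'\ncomputer\na \npeter\n",
         "\r\n  D'\r\n  computer\r\n  a \r\n  james\r\n")

# "D'", then the seller, then a standalone "a", then the buyer;
# \\s matches spaces as well as \r and \n, so the line-ending style does not matter.
pattern <- "^\\s*D'\\s*(.*?)\\s+a\\s+(.*?)\\s*$"

seller <- sub(pattern, "\\1", txt, perl = TRUE)
buyer  <- sub(pattern, "\\2", txt, perl = TRUE)

print(seller)  # "computer" "computer"
print(buyer)   # "peter" "james"
```

The same `pattern` could then be applied to `rvest::html_text(node)` in place of the stricter one.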

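Have you tried the following? In each .from block the first strong is the seller and the second is the buyer, so for the seller:

#pressReleases .from strong:nth-child(1)

and for the buyer:

#pressReleases .from strong:nth-child(2)

Assuming you have already read the HTML into a variable page, then (extend this to include your other variables):

sellers <- page %>% html_nodes("#pressReleases .from strong:nth-child(1)") %>% html_text()
buyers <- page %>% html_nodes("#pressReleases .from strong:nth-child(2)") %>% html_text()
df <- data.frame(sellers, buyers)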

The data frame should then be easy to export.
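
To extend this to all five fields, one option is to select each li.post.pressRelease first and then take one match per post with html_node(), which keeps the columns aligned even if a post lacks a field. The following is a minimal sketch rather than a tested solution for the real page: the inline HTML is a trimmed copy of the markup from the question, and the helper `field()` and the output file name `transfers.csv` are made up for illustration.

```r
library(rvest)

# Trimmed copy of the question's markup (assumption: the real file has the same structure).
html <- '<div class="newsList" id="pressReleases"><ul>
<li class="date"><strong>Fitxatges del dia</strong> 09/08/2019</li>
<li class="post pressRelease">
  <ul class="player"><li class="text">
    <strong>Player1</strong><time>09/08/2019 - 05:30</time><span>16.245.485 €</span>
    <div class="from">D\' <strong>computer</strong> a <strong>peter</strong></div>
  </li></ul>
  <div class="bid second"><strong>matheu:</strong><span class="price">15.925.828 €</span></div>
</li>
<li class="post pressRelease">
  <ul class="player"><li class="text">
    <strong>Player2</strong><time>09/08/2019 - 05:30</time><span>1.111.711 €</span>
    <div class="from">D\' <strong>computer</strong> a <strong>james</strong></div>
  </li></ul>
</li>
</ul></div>'

page  <- read_html(html)
posts <- html_nodes(page, "#pressReleases li.post.pressRelease")

# Hypothetical helper: first match of `css` inside each post (NA if a post has none).
field <- function(css) trimws(html_text(html_node(posts, css)))

transfers <- data.frame(
  player = field("li.text > strong"),   # direct child only, so .from names are skipped
  day    = field("li.text > time"),
  # "16.245.485 €" -> 16245485 (the dots are thousands separators)
  cost   = as.numeric(gsub("[^0-9]", "", field("li.text > span"))),
  seller = field(".from strong:nth-child(1)"),
  buyer  = field(".from strong:nth-child(2)"),
  stringsAsFactors = FALSE
)

# Export; the file name is only an example.
out <- file.path(tempdir(), "transfers.csv")
write.csv(transfers, out, row.names = FALSE)
```

Note that the child combinator in "li.text > strong" matters: a plain "li.text strong" would also pick up the seller and buyer names inside the .from block.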
