r - Creating Dataset with Python, scraping web

I've read through a lot of posts but haven't found a solution that fully meets my needs. To start, I should say that I'm new to Python (I'm using Python 2).

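I'm trying to collect data from a web page (http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html). Notice the nice HTML table. I've been able to read it into a list without any problem. However, notice also that two of the columns contain links. I'd like to drop the first link column (but I'm not sure how to do that, since my data is in a list).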
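A minimal sketch of one way to drop a column when the table is held as a list of rows. It assumes each row is a plain list of cell strings (not the zipped header/value pairs built in the code further down), it is written for Python 3, and the sample rows and the drop_column helper are hypothetical, for illustration only.

```python
# Hypothetical sample rows standing in for the scraped table (not real data).
headers = ["Execution", "Link", "Link", "Last Name"]
rows = [
    ["2", "Offender Information", "Last Statement", "Doe"],
    ["1", "Offender Information", "Last Statement", "Roe"],
]


def drop_column(table_rows, index):
    """Return new rows with the cell at position `index` removed from each row."""
    return [row[:index] + row[index + 1:] for row in table_rows]


DROP = 1  # position of the first "Link" column in this sample
headers = drop_column([headers], DROP)[0]
rows = drop_column(rows, DROP)

print(headers)
print(rows)
```

Slicing builds new lists rather than deleting in place, so the original rows remain available if something goes wrong later.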

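The second link column is a bit more complicated. I'd like to replace the header "Link" with "Last Statement". Then I'd like to visit each of the links provided, retrieve the last statement, and put it in the corresponding row of the original table that I've built the list from.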
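The steps just described could be sketched roughly as follows, in Python 3. Several things here are assumptions rather than part of the original post: the add_last_statements helper, the fake_fetch stand-in (a real fetcher would download and parse each page, for example with requests and BeautifulSoup), and the sample href, which is made up. Only the standard library's urljoin is used, to turn each relative link into an absolute URL.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"


def add_last_statements(headers, rows, link_index, hrefs, fetch_statement):
    """Rename the link column and replace each link cell with fetched text.

    hrefs holds one relative href per row. fetch_statement is a
    caller-supplied function (hypothetical here) that takes an absolute
    URL and returns the statement text.
    """
    new_headers = list(headers)
    new_headers[link_index] = "Last Statement"
    new_rows = []
    for row, href in zip(rows, hrefs):
        new_row = list(row)
        new_row[link_index] = fetch_statement(urljoin(BASE_URL, href))
        new_rows.append(new_row)
    return new_headers, new_rows


def fake_fetch(url):
    """Stand-in fetcher so the sketch runs offline; it downloads nothing."""
    return "statement from " + url


headers, rows = add_last_statements(
    ["Execution", "Link", "Last Name"],
    [["1", "Last Statement", "Roe"]],
    1,
    ["dr_info/roelast.html"],  # made-up href, for illustration only
    fake_fetch,
)
print(headers)
print(rows)
```

Passing the fetcher in as an argument keeps the table logic testable without touching the network, and makes it easy to add a delay between requests to go easy on the server.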

Finally, I'd like to print this list as a tab-delimited file that can be read into R as a data frame.
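One possible way to produce such a file is the standard library's csv module with a tab delimiter. This is a sketch in Python 3 under stated assumptions: the write_tsv helper and the sample rows are hypothetical, and an in-memory buffer is used in place of a real file so the example runs anywhere.

```python
import csv
import io


def write_tsv(handle, headers, rows):
    """Write a header line and data rows to `handle` as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)


# Hypothetical sample data; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
write_tsv(buffer, ["Execution", "Last Name"], [["2", "Doe"], ["1", "Roe"]])
tsv_text = buffer.getvalue()
print(tsv_text)
```

For a real file, the same helper can be given a handle from open("output.txt", "w", newline=""). The csv writer quotes any cell that itself contains a tab or a newline, which matters for free-text fields such as the statements. On the R side, read.delim("output.txt") reads tab-separated files with a header row by default.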

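That's a lot for a newbie. Please let me know whether I'm approaching this correctly. Below is the code I have so far. It is missing some of the things I want to do, because I'm not sure how to get started.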

from bs4 import BeautifulSoup
import requests
from lxml import html
import csv
import string
import sys
#obtain the main url with bigger data
main_url = "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
#convert the html to BeautifulSoup
doc = requests.get(main_url)
soup = BeautifulSoup(doc.text, 'lxml')
#find in html the table
tbl = soup.find("table", attrs = {"class":"os"})
#create labels for list rows by table headers
headings = [th.get_text() for th in tbl.find("tr").find_all("th")]
#convert the unicode to string (every heading, including the last one)
headers = [str(h) for h in headings]
#access the remaining information
prisoners = []
for row in tbl.find_all("tr")[1:]:
    #attach the appropriate header to the appropriate corresponding data
    #also, converts unicode to string
    info = zip(headers, (str(td.get_text()) for td in row.find_all("td")))    
    #append each of the newly made rows
    prisoners.append(info)
#print each row of the list to a file for R
with open('output.txt', 'a') as output:
    for p in prisoners:
        output.write(str(p)+'\n')
output.close()

If you could help me figure out any of the three parts I'm struggling with, I'd greatly appreciate it!

No need for UglyStew. R's concise, expressive scraping works just fine.

library(xml2)
library(rvest)
library(pbapply)
library(dplyr)
URL <- "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
pg <- read_html(URL)
nod <- html_nodes(pg, "table.os")[[1]]
tab <- html_table(nod)
last_urls <- html_attr(html_nodes(nod, xpath=".//tr/td[3]/a"), "href")
last_urls <- sprintf("http://www.tdcj.state.tx.us/death_row/%s", last_urls)
last_st <- pbsapply(last_urls, function(x) {
  pg2 <- read_html(x)
  trimws(html_text(html_nodes(pg2, 
                              xpath=".//p[contains(., 'Last Statement')]/following-sibling::p")))
})
death_row <- mutate(tab[, -c(2:3)], last_statement=last_st)
death_row <- setNames(death_row, gsub("\\.", "_", tolower(make.names(colnames(death_row)))))
death_row <- mutate(death_row, date=as.Date(date, "%m/%d/%Y"))
glimpse(death_row)
## Observations: 537
## Variables: 9
## $ execution      (int) 537, 536, 535, 534, 533, 532, 531, 530, 529, 528, 527, 5...
## $ last_name      (chr) "Vasquez", "Ward", "Wesbrook", "Garcia", "Freeman", "Mas...
## $ first_name     (chr) "Pablo", "Adam", "Coy", "Gustavo", "James", "Richard", "...
## $ tdcj_number    (int) 999297, 999525, 999281, 999018, 999539, 999414, 999419, ...
## $ age            (int) 38, 33, 58, 43, 35, 43, 36, 33, 35, 27, 46, 67, 32, 34, ...
## $ date           (date) 2016-04-06, 2016-03-22, 2016-03-09, 2016-02-16, 2016-01...
## $ race           (chr) "Hispanic", "White", "White", "Hispanic", "White", "Whit...
## $ county         (chr) "Hidalgo", "Hunt", "Harris", "Collin", "Wharton", "Harri...
## $ last_statement (list) I just want to tell my family thank you, my mom and  da...

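This takes a minute or two to scrape, and having a site like this is super nice of TX, so here's a link to an R Data file of the resulting data frame, to avoid putting any more load on their servers.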
