Creating new columns from scraped information



I am trying to add information scraped from a website as new columns. I have a dataset that looks like this:

COL1   COL2    COL3
...     ...    bbc.co.uk

I would like a dataset with the new columns added:

COL1   COL2   COL3        Website Address   Last Analysis   Blacklist Status   IP Address   Server Location   City   Region
...    ...    bbc.co.uk

The new columns come from this website: https://www.urlvoid.com/scan/bbc.co.uk. I need to fill each of them with the relevant information.

For example:

COL1   COL2   COL3        Website Address   Last Analysis   Blacklist Status
...    ...    bbc.co.uk   Bbc.co.uk         9 days ago      0/35

(columns continued)
Domain Registration         IP Address      Server Location      City      Region
1996-08-01 | 24 years ago   151.101.64.81   (US) United States   Unknown   Unknown

Unfortunately, I'm having trouble creating the new columns and filling them with the scraped information. I will probably have more websites to check, not just bbc.co.uk. See the code I'm using below. I'm sure there's a better (and less messy) way to do this, and I'd really appreciate it if you could help me figure it out. Thanks.

Edit:

As shown in the example above, to the existing dataset with three columns (col1, col2 and col3) I need to add the fields that come from the scraping (Website Address, Last Analysis, Blacklist Status, ...). Then, for each URL, I should have the information related to it (e.g. bbc.co.uk in the example).

COL1   COL2   COL3                Website Address   Last Analysis   Blacklist Status
...    ...    bbc.co.uk           Bbc.co.uk         9 days ago      0/35
...    ...    stackoverflow.com
...    ...    ...

and, after scraping all the URLs:

COL1   COL2   COL3                Website Address     Last Analysis   Blacklist Status
...    ...    bbc.co.uk           Bbc.co.uk           9 days ago      0/35
...    ...    stackoverflow.com   Stackoverflow.com   7 days ago      0/35

(columns continued)
Domain Registration         IP Address      Server Location      ...
1996-08-01 | 24 years ago   151.101.64.81   (US) United States    ...
2003-12-26 | 17 years ago   ...

(The formatting isn't great, but I think it's enough to give you an idea of the expected output.)

Updated code:

import requests
from bs4 import BeautifulSoup

urls = ['bbc.co.uk', 'stackoverflow.com', ...]
for x in urls:
    print(x)
    r = requests.get('https://www.urlvoid.com/scan/' + x)
    soup = BeautifulSoup(r.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    for d in dat:
        row = d.select('td')
        # assigning a scalar here broadcasts the same value to every row of the column
        original_dataset[row[0].text] = row[1].text

Unfortunately, I'm doing something wrong: it only takes the information from the first URL checked on the site (i.e. bbc.co.uk) and copies it into every row under the new columns.

Let me know if this is what you're looking for:

import pandas as pd

cols = ['Col1', 'Col2']
rows = ['something', 'something else']
my_df = pd.DataFrame(rows, index=cols).transpose()
my_df

Keep your existing code up to and including this line:

dat = tab[0].select('tr')

and then add:

for d in dat:
    row = d.select('td')
    my_df[row[0].text] = row[1].text
my_df

Output (apologies for the formatting):

Col1       Col2       Website Address   Last Analysis   Blacklist Status    Domain Registration     Domain Information  IP Address  Reverse DNS     ASN     Server Location     LatitudeLongitude  City    Region
0   something   something else  Bbc.com     11 days ago  |  Rescan  0/35    1989-07-15 | 31 years ago   WHOIS Lookup | DNS Records | Ping   151.101.192.81   Find Websites  |  IPVoid  |  ...   Unknown     AS54113 FASTLY  (US) United States  37.751 / -97.822   Google Map   Unknown     Unknown

Edit:

To handle multiple URLs, try something like this:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['bbc.com', 'stackoverflow.com']
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)

rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    line = []
    for d in dat:
        row = d.select('td')
        line.append(row[1].text)
        new_header = row[0].text
        if new_header not in cols:
            cols.append(new_header)
    rows.append(line)

my_df = pd.DataFrame(rows, columns=cols)
my_df

Output:

Website Address     Last Analysis   Blacklist Status    Domain Registration     Domain Information  IP Address  Reverse DNS     ASN     Server Location     LatitudeLongitude  City    Region
0   Bbc.com     12 days ago  |  Rescan  0/35    1989-07-15 | 31 years ago   WHOIS Lookup | DNS Records | Ping   151.101.192.81   Find Websites  |  IPVoid  |  ...   Unknown     AS54113 FASTLY  (US) United States  37.751 / -97.822   Google Map   Unknown     Unknown
1   Stackoverflow.com   5 minutes ago  |  Rescan    0/35    2003-12-26 | 17 years ago   WHOIS Lookup | DNS Records | Ping   151.101.1.69   Find Websites  |  IPVoid  |  Whois   Unknown     AS54113 FASTLY  (US) United States  37.751 / -97.822   Google Map   Unknown     Unknown

Note that it doesn't have your two existing columns (since I don't know what they are), so you'll have to attach those to the dataframe separately.
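If the existing dataset is itself a pandas dataframe with the domain in COL3 (as in the question's example), one way to line each scraped row up with the right original row is to merge on a lowercased key instead of relying on row order. This is only a minimal sketch; original_dataset, COL1/COL2/COL3 and the sample values are assumptions taken from the question, not part of the code above:

import pandas as pd

# Assumed stand-in for the asker's existing dataset (COL3 holds the domain).
original_dataset = pd.DataFrame({
    'COL1': ['a', 'b'],
    'COL2': ['c', 'd'],
    'COL3': ['bbc.com', 'stackoverflow.com'],
})

# urlvoid reports the address with a capitalized first letter (e.g. "Bbc.com"),
# so compare lowercased keys on both sides.
original_dataset['key'] = original_dataset['COL3'].str.lower()
my_df['key'] = my_df['Website Address'].str.lower()

result = original_dataset.merge(my_df, on='key', how='left').drop(columns='key')

A left merge keeps any rows whose URL hasn't been scanned yet; their scraped columns simply stay empty.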

You can get the data in a much simpler way using pandas' read_html method. Here's my shot at it:

import pandas as pd
df = pd.read_html("https://www.urlvoid.com/scan/bbc.co.uk/")[0]
df_transpose = df.T

Now you have the transposed data you need. If you like, you can drop the columns you don't want. After that, all you have to do is join it with your existing dataset. Assuming you can load your dataset as a pandas dataframe, you can simply use the concat function (axis=1 concatenates as columns):

pd.concat([df_transpose, existing_dataset], axis=1)

See the pandas documentation on merging/concatenation: http://pandas.pydata.org/pandas-docs/stable/merging.html
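If you need this for several URLs, the same read_html idea can go in a loop, stacking one row per site before attaching the result to your dataset. A rough sketch, assuming the urls list below and that the first table on each scan page is parsed as two unlabeled columns (label, value), as in the transpose above:

import pandas as pd

urls = ['bbc.co.uk', 'stackoverflow.com']  # assumed list of domains to scan

frames = []
for u in urls:
    # first table on the scan page: one (label, value) pair per row
    t = pd.read_html('https://www.urlvoid.com/scan/' + u + '/')[0]
    # turn the label column into column names, leaving a single row of values
    frames.append(t.set_index(0).T)

scraped = pd.concat(frames, ignore_index=True)
# then attach it to the existing dataset column-wise, as above:
# pd.concat([existing_dataset.reset_index(drop=True), scraped], axis=1)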
