我有一个这样的数据帧:
df1 = pd.DataFrame({
"index": ["EXEC sp_delete_job", "exec sp_add_job", "something else","exec sp_add_jobserver"],
"index1": ["NaN", "NaN", "NaN", "exec sp_delete_job"],
"index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
"index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})
df1.head()
index index1 index2 index3
0 EXEC sp_delete_job NaN EXEC sp_droplogin EXEC sp_droplogin
1 exec sp_add_job NaN EXEC sp_delete_job EXEC sp_delete_job
2 something else NaN NaN exec sp_add_job
3 exec sp_add_jobserver exec sp_delete_job something else exec sp_delete_job
我想要的是从这个网站映射列值及其描述https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-服务器-ver15
例如,这个值EXEC sp_droplogin
可以用这里的描述进行映射https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-服务器-ver15
所以输出看起来是这样的:
index
0 Removes a SQL Server login. This prevents access to an instance of SQL Server under
that login name.
1 EXEC sp_delete_job
2 exec sp_add_job
3 exec sp_delete_job
4 exec sp_add_jobserver
其他列值也必须如此。
执行此操作的最佳方式是什么?用美汤?
你能提供一些想法/方向/代码等吗?
您可以为每个index
条目调用一个函数,并将其替换为requests
beautifulsoup
查找的结果:
import pandas as pd
import requests
from bs4 import BeautifulSoup
def description(value):
name = value.split(' ')[1].replace('_', '-')
url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
div = soup.find('div', class_="content")
return [p.text for p in div.find_all('p')][3]
df = pd.DataFrame({
"index": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job","exec sp_add_jobserver"],
})
df['index'] = df['index'].map(description)
print(df)
这将更改您的数据帧如下:
index
0 Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1 Deletes a job.
2 Adds a new job executed by the SQL Agent service.
3 Deletes a job.
4 Targets the specified job at the specified server.
首先取
value
,例如EXEC sp_droplogin
,并在空间上拆分它。然后取第二部分sp_droplogin
,并用URL所需的-
替换任何_
。基于
name
创建一个合适的URL。使用
requests.get()
从Microsoft站点获取相应的HTML。找到包含说明的
<div class='content'>
。在该div中,找到所有
<p>
元素,并提取每个元素的文本。第四个条目包含所需的文本。归还。
如果有None
值,则需要对此进行测试并返回合适的值:
def description(value):
if value:
.........existing code......
else:
return "Not found"
对于您更新的示例,我建议您使用字典来保存每个请求的结果,以避免多次查找相同的值。
您可以使用.applymap()
为数据帧中的所有项运行函数。
最后,如果value
不是以exec
开头,则只返回未更改的值(或您喜欢的任何值)
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
procedures = {} # cache of results
def description(value):
if value.lower().startswith("exec "):
name = value.lower().split(' ')[1].replace('_', '-')
if name in procedures: # already seen?
return procedures[name]
else:
url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
div = soup.find('div', class_="content")
text = [p.text for p in div.find_all('p')][3]
procedures[name] = text
return text
else:
return value
df = pd.DataFrame({
"index": ["EXEC sp_delete_job", "exec sp_add_job", "something else", "exec sp_add_jobserver"],
"index1": ["NaN", "NaN", "NaN", "exec sp_delete_job"],
"index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
"index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})
df = df.applymap(description)
print(df)