根据网站上下文映射不同的列值



我有一个这样的数据帧:

df1 = pd.DataFrame({   
"index": ["EXEC sp_delete_job",  "exec sp_add_job", "something else","exec sp_add_jobserver"],
"index1": ["NaN",  "NaN",  "NaN", "exec sp_delete_job"],
"index2": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "NaN", "something else"],
"index3": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job"]
})
df1.head()
index                 index1                  index2          index3
0   EXEC sp_delete_job       NaN                EXEC sp_droplogin   EXEC sp_droplogin
1   exec sp_add_job          NaN                EXEC sp_delete_job  EXEC sp_delete_job
2   something else           NaN                  NaN                   exec sp_add_job
3   exec sp_add_jobserver    exec sp_delete_job   something else    exec sp_delete_job

我想要的是从这个网站映射列值及其描述https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-服务器-ver15

例如,这个值EXEC sp_droplogin可以用这里的描述进行映射https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-服务器-ver15

所以输出看起来是这样的:

index
0   Removes a SQL Server login. This prevents access to an instance of SQL Server under 
that login name.
1   EXEC sp_delete_job
2   exec sp_add_job
3   exec sp_delete_job
4   exec sp_add_jobserver

其他列值也必须如此。

执行此操作的最佳方式是什么?用美汤?

你能提供一些想法/方向/代码等吗?

您可以为每个index条目调用一个函数,并将其替换为requestsbeautifulsoup查找的结果:

import pandas as pd
import requests
from bs4 import BeautifulSoup
def description(value):   
name = value.split(' ')[1].replace('_', '-')
url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
div = soup.find('div', class_="content")
return [p.text for p in div.find_all('p')][3]

df = pd.DataFrame({   
"index": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job","exec sp_add_jobserver"],
})

df['index'] = df['index'].map(description)
print(df)

这将更改您的数据帧如下:

                            index
0  Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1                                                                                        Deletes a job.
2                                                     Adds a new job executed by the SQL Agent service.
3                                                                                        Deletes a job.
4                                                    Targets the specified job at the specified server.
  1. 首先取value,例如EXEC sp_droplogin,并在空间上拆分它。然后取第二部分sp_droplogin,并用URL所需的-替换任何_

  2. 基于name创建一个合适的URL。

  3. 使用requests.get()从Microsoft站点获取相应的HTML。

  4. 找到包含说明的<div class='content'>

  5. 在该div中,找到所有<p>元素,并提取每个元素的文本。第四个条目包含所需的文本。归还。

如果有None值,则需要对此进行测试并返回合适的值:

def description(value):
if value:
.........existing code......
else:
return "Not found"

对于您更新的示例,我建议您使用字典来保存每个请求的结果,以避免多次查找相同的值。

您可以使用.applymap()为数据帧中的所有项运行函数。

最后,如果value不是以exec开头,则只返回未更改的值(或您喜欢的任何值)

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
procedures = {}     # cache of results
def description(value):   
if value.lower().startswith("exec "):
name = value.lower().split(' ')[1].replace('_', '-')

if name in procedures:  # already seen?
return procedures[name]
else:
url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
div = soup.find('div', class_="content")
text = [p.text for p in div.find_all('p')][3]
procedures[name] = text
return text
else:
return value

df = pd.DataFrame({   
"index": ["EXEC sp_delete_job",  "exec sp_add_job", "something else", "exec sp_add_jobserver"],
"index1": ["NaN", "NaN",  "NaN", "exec sp_delete_job"],
"index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
"index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})
df = df.applymap(description)
print(df)

最新更新