TypeError: XXXXX got an unexpected keyword argument "XXXXXX"



I get an "unexpected keyword argument" error when running code from this tutorial: https://sempioneer.com/python-for-seo/how-to-extract-text-from-multiple-webpages-in-python/ Can anyone help? Thanks.

Running the following code:

single_url = 'https://understandingdata.com/'
text = extract_text_from_single_web_page(url=single_url)
print(text)

gives the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~AppDataLocalTemp/ipykernel_10260/3606377172.py in <module>
1 single_url = 'https://understandingdata.com/'
----> 2 text = extract_text_from_single_web_page(url=single_url)
3 print(text)
~AppDataLocalTemp/ipykernel_10260/850098094.py in extract_text_from_single_web_page(url)
42     try:
43         a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments = False,
---> 44                             date_extraction_params={'extensive_search': True, 'original_date': True})
45     except AttributeError:
46         a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
TypeError: extract() got an unexpected keyword argument 'json_output'

def extract_text_from_single_web_page(url):
    downloaded_url = trafilatura.fetch_url(url)
    try:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments=False,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    if a:
        json_output = json.loads(a)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # This line will handle for any failures in both the Trafilatura and BeautifulSoup4 functions:
                return np.nan
        # Handling for any URLs that don't have the correct protocol
        except MissingSchema:
            return np.nan

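As suggested in my comment, the best option is to find a tutorial that doesn't use trafilatura, since it appears to be broken. That said, modifying this particular function to avoid it is very simple: just use the fallback on its own: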

import numpy as np
import requests
from requests.exceptions import MissingSchema

# beautifulsoup_extract_text_fallback is the helper defined earlier in the tutorial.
def extract_text_from_single_web_page(url):
    try:
        resp = requests.get(url)
        # We will only extract the text from successful requests:
        if resp.status_code == 200:
            return beautifulsoup_extract_text_fallback(resp.content)
        else:
            # This line will handle for any failures in the BeautifulSoup4 function:
            return np.nan
    # Handling for any URLs that don't have the correct protocol
    except MissingSchema:
        return np.nan

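Besides agreeing with Samwise that you should try to use standard, well-supported Python modules, I think there is also a lesson here about version management.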

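The tutorial you linked simply installs the latest version of each package. That is usually not good practice. Especially in a production environment, you want to control the versions so that your code doesn't break because someone else changed one of your dependencies.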
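One way to make a pin explicit is to install an exact version (for example `pip install trafilatura==0.7.0`) and then have the script confirm what is actually installed before it runs. The following is only a minimal sketch of that check using the standard library's `importlib.metadata` (Python 3.8+); the helper names `installed_version` and `require_version` are made up for illustration and are not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def require_version(package, expected):
    """Raise RuntimeError unless `package` is installed at exactly `expected`."""
    found = installed_version(package)
    if found != expected:
        raise RuntimeError(f"{package}=={expected} required, found {found!r}")


# Report which trafilatura version, if any, is installed in this environment.
print(installed_version("trafilatura"))
```

Calling `require_version("trafilatura", "0.7.0")` at the top of the notebook would then fail early with a clear message instead of a `TypeError` deep inside the function.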

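In your case, trafilatura version 0.7.0 still supported the json_output keyword argument, but later versions have removed it, including the latest version at the time of writing, 0.9.3.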
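If you cannot pin the version, a defensive option is to check whether a function still accepts a keyword before passing it. Below is a rough sketch using the standard `inspect` module; `supports_kwarg` is a hypothetical helper, and `extract_old` / `extract_new` are stand-in functions invented for the demonstration, not trafilatura's real signatures.

```python
import inspect


def supports_kwarg(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    by_keyword = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    for param in inspect.signature(func).parameters.values():
        # A **kwargs catch-all accepts any keyword.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        # Otherwise the name must match a parameter that can be passed by keyword.
        if param.name == name and param.kind in by_keyword:
            return True
    return False


# Hypothetical stand-ins for an older and a newer release of a library function.
def extract_old(html, json_output=False):
    return "json" if json_output else "text"


def extract_new(html, output_format="txt"):
    return output_format


print(supports_kwarg(extract_old, "json_output"))
print(supports_kwarg(extract_new, "json_output"))
```

With the real library you would pass `trafilatura.extract` as `func`. Note that the `except AttributeError` in the original function never catches this failure, because a removed keyword raises `TypeError`.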
