HTML from a web page does not display foreign-language characters correctly



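Apologies if the title is misleading.

I am trying to find the language of a given song by querying a lyrics website and then using CLD2 to check the language of the lyrics. However, for some songs (such as the example below), the foreign-language characters are not decoded correctly, so CLD2 fails with the following error: input contains invalid UTF-8 around byte 2121 (of 32761)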

import re

import requests
from bs4 import BeautifulSoup
import cld2


def checklang(lyrics):
    # Defined before the loop below so that it exists when it is called.
    try:
        isReliable, textBytesFound, details = cld2.detect(lyrics)
    except Exception as exc:  # e.g. "input contains invalid UTF-8 around byte ..."
        print(exc)
        return
    language = re.search("ENGLISH", str(details))

    if language is None:
        print("foreign lang")

    if len(re.findall("Unknown", str(details))) < 2:
        print("foreign lang")

    if language is not None:
        print("english")


response = requests.get("https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html")
soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:  # the 21st <div> is taken to be the lyrics block
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

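It is also worth mentioning that this is not limited to non-Latin characters; it sometimes happens with apostrophes or other punctuation as well.

Can anyone explain why this happens, or what I can do to fix it?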

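Requests makes an educated guess about the encoding of the response based on the HTTP headers.

Unfortunately, in the given example, response.encoding reports ISO-8859-1, while the page itself (response.content) declares <meta charset="utf-8">. As a result, response.text decodes the UTF-8 bytes with the wrong codec.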
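To see what that mismatch does to the text, here is a minimal offline sketch. The sample string is made up for illustration (it is not taken from the actual page); it contains a curly apostrophe and a few Hangul syllables, and the variable names are my own.

```python
# Hypothetical sample standing in for a line of lyrics:
# a curly apostrophe (U+2019) followed by Korean text.
original = "don’t 뚜두뚜두"

# What the server sends over the wire: UTF-8 bytes.
raw = original.encode("utf-8")

# Roughly what response.text gives when response.encoding is ISO-8859-1:
# every byte becomes its own character, so multi-byte sequences fall apart.
wrong = raw.decode("iso-8859-1")

# What response.text gives once response.encoding is set to 'utf-8'.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

This would also explain why apostrophes are affected: a typographic apostrophe is a three-byte sequence in UTF-8, so it is mangled in the same way as the Korean characters.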

Here is my solution, based on the Response Content section of the requests documentation.

import requests
import re
from bs4 import BeautifulSoup
# import cld2
import pycld2 as cld2


def checklang(lyrics):
    #try:
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    # language = re.search("ENGLISH", str(details))
    for detail in details:
        print(detail)


response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')
print(response.encoding)
response.encoding = 'utf-8'                         ### key change ###
soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

Output of SO65630066.py:

ISO-8859-1
('ENGLISH', 'en', 74, 833.0)
('Korean', 'ko', 20, 3575.0)
('Unknown', 'un', 0, 0.0)
Lyrics found!
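Hardcoding 'utf-8' works for this page, but it may not hold for every site. As a rough sketch of one alternative, the helper below (sniff_meta_charset is a hypothetical name, not part of requests or BeautifulSoup) reads the charset declared in the page's own <meta> tag from the raw bytes and falls back to a default if none is found. The sample HTML is invented so that the example runs offline.

```python
import re


def sniff_meta_charset(content: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found.

    Hypothetical helper for illustration; only the first 2048 bytes are scanned.
    """
    head = content[:2048]
    match = re.search(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', head, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default


# Invented sample page standing in for response.content.
sample = '<html><head><meta charset="utf-8"></head><body>뚜두</body></html>'.encode("utf-8")
print(sniff_meta_charset(sample))
```

With a real response, this could be used as response.encoding = sniff_meta_charset(response.content) before reading response.text. Requests also offers response.apparent_encoding, which guesses the encoding from the body, though such a guess is not guaranteed to be right.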
