How can I use Python to determine whether a particular string or text is American English or British English?



I would like to implement something like this:

input_text = "The body is burnt"
output = "en-uk"
input_text = "The body is burned" 
output = "en-us"

Try TextBlob, which requires the NLTK package and uses Google:

from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
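Side note: this uses the Google Translate API, so it requires an internet connection. Recent TextBlob releases deprecated and then removed detect_language(), so this may only work with older versions.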

Similar to this answer, you could use the word lists from the American-British English Translator.

import re
import requests
url = "https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/"
# The two dictionaries differ slightly so we import both
uk_to_us = requests.get(url + "british_spellings.json").json()    
us_to_uk = requests.get(url + "american_spellings.json").json()   
us_only = requests.get(url + "american_only.json").json()
uk_only = requests.get(url + "british_only.json").json()
# Save these word lists in a local text file if you want to avoid requesting the data every time
uk_words = set(uk_to_us) | set(uk_only)
us_words = set(us_to_uk) | set(us_only)
uk_phrases = {w for w in uk_words if len(w.split()) > 1}
us_phrases = {w for w in us_words if len(w.split()) > 1}
uk_words -= uk_phrases
us_words -= us_phrases
max_length = max(len(word.split()) for word in uk_phrases | us_phrases)
def get_dialect(s):
    words = re.findall(r"([a-z]+)", s.lower())  # list of lowercase words only
    uk = 0
    us = 0
    # Check for multi-word phrases first, removing them if they are found
    for length in range(max_length, 1, -1):
        i = 0
        while i + length <= len(words):
            phrase = " ".join(words[i:i+length])
            if phrase in uk_phrases:
                uk += length
                words = words[:i] + words[i + length:]
            elif phrase in us_phrases:
                us += length
                words = words[:i] + words[i + length:]
            else:
                i += 1

    # Add single words
    uk += sum(word in uk_words for word in words)
    us += sum(word in us_words for word in words)
    print("Scores", uk, us)
    if uk > us:
        return "en-uk"
    if us > uk:
        return "en-us"
    return "Unknown"
print(get_dialect("The color of the ax"))  # en-us
print(get_dialect("The colour of the axe"))  # en-uk
print(get_dialect("I opened my brolly on the zebra crossing"))  # en-uk
print(get_dialect("The body is burnt"))  # Unknown
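The comment in the code above suggests saving the word lists locally to avoid downloading them on every run. One possible way to do that is sketched below. The helper name load_cached, the cache path, and the stand-in fake_fetch function are all hypothetical and not part of the original answer; in real use the fetch argument would wrap the requests.get(...).json() call.

```python
import json
import tempfile
from pathlib import Path

def load_cached(path, fetch):
    """Load JSON from `path`; on a cache miss, call `fetch()` and save the result."""
    path = Path(path)
    if path.exists():
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

# Demo with a stand-in fetch function (hypothetical) so the example runs offline
calls = []

def fake_fetch():
    calls.append("fetched")
    return {"colour": "color"}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache" / "british_spellings.json"
    first = load_cached(cache_file, fake_fetch)   # cache miss: calls fake_fetch
    second = load_cached(cache_file, fake_fetch)  # cache hit: reads the saved file

print(first == second, len(calls))
```

With the variables from the answer, a call might look like load_cached("cache/british_spellings.json", lambda: requests.get(url + "british_spellings.json").json()).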

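This only tests at the level of individual words. It cannot check differences in how words are used in grammatical context (for example, some words are used only as adjectives in one dialect but can also be used as past-tense verbs in the other).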

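The us_only and uk_only lists do not seem to contain different forms of the same word (e.g. "abseil" is there, but "abseiled", "abseiling", etc. are not), so ideally you would convert your text to stems first.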
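As a rough illustration of reducing words to a base form before the lookup, here is a naive suffix-stripping sketch. It is not a real stemmer: the base_form helper, the suffix list, and the tiny demo vocabulary are assumptions made for this example, and a proper stemmer or lemmatizer (for instance the ones shipped with NLTK) would likely be more robust.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def base_form(word, vocabulary):
    """Return `word`, or a suffix-stripped variant of it found in `vocabulary`."""
    if word in vocabulary:
        return word
    for suffix in SUFFIXES:
        # Require a few remaining letters so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Try the bare stem, then the stem with a restored final "e"
            for candidate in (stem, stem + "e"):
                if candidate in vocabulary:
                    return candidate
    return word  # no known base form: keep the word unchanged

# Tiny stand-in vocabulary; in practice this would be uk_words | us_words
demo_vocab = {"abseil", "colour", "apologise"}

for w in ("abseiling", "abseiled", "apologised", "table"):
    print(w, "->", base_form(w, demo_vocab))
```

Inside get_dialect, each entry of words could be passed through such a function before the single-word scoring step.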
