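How can I most efficiently identify and count the abbreviations that follow a group of words, and write that count to a new column, but only when the abbreviation is correct?
Expected output: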
|-------Name----------------------------||-Count-|
This is Ante Meridian (AM) not included || 3 |
This is Ante Meridian (AM) included     || 3 |
This is Ante Meridian (AM) not included || 3 |
Extra module with Post Meridian (PM)    || 1 |
Post Meridian (PO) is not available     || 0 | #Mismatch
First, use a regular expression to check whether the letters inside the () match the initials of the two words in front of it.
import numpy as np
import pandas as pd

#get two words before (
wordsbefore = df['Name'].str.extract(r'(\w+) (\w+) (?=\()')
#get first letter of both words and make it what it should be in ()
check = wordsbefore[0].str.extract(r'(^.)', expand=False) + wordsbefore[1].str.extract(r'(^.)', expand=False)
#get the letters that are actually inside ()
found = df['Name'].str.extract(r"\((.*)\)", expand=False)
#keep the letters in () where they match our check, otherwise 0
df['count'] = np.where(found == check, found, 0)
Now you have a df where the acronym sits in its own column, or 0 where it does not match. All that is left is to replace each acronym with its count.
df['count'] = df['count'].map(dict(df[df['count']!=0]['count'].value_counts())).fillna(0)
                                      Name  count
0  This is Ante Meridian (AM) not included    3.0
1      This is Ante Meridian (AM) included    3.0
2  This is Ante Meridian (AM) not included    3.0
3     Extra module with Post Meridian (PM)    1.0
4      Post Meridian (PO) is not available    0.0
If a row has no () at all, it also ends up as 0.
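For anyone who wants to try this end to end, here is a minimal sketch on the sample rows from the question, plus one made-up row without parentheses to show the 0 case. It is a slight variation rather than the exact code above: it uses `Series.where` instead of `np.where`, and the names `expected`, `found` and `valid` are only illustrative. It assumes the words before the parentheses are capitalized the same way as the abbreviation.

```python
import pandas as pd

# Sample rows from the question, plus one hypothetical row without ()
df = pd.DataFrame({'Name': [
    'This is Ante Meridian (AM) not included',
    'This is Ante Meridian (AM) included',
    'This is Ante Meridian (AM) not included',
    'Extra module with Post Meridian (PM) ',
    'Post Meridian (PO) is not available',
    'No abbreviation here',
]})

# Two words directly before "(" and the initials they should produce
words = df['Name'].str.extract(r'(\w+) (\w+) (?=\()')
expected = words[0].str[0] + words[1].str[0]

# Letters actually written inside the parentheses
found = df['Name'].str.extract(r'\((.*)\)', expand=False)

# Keep the abbreviation only where it matches; everything else becomes NaN
valid = found.where(found == expected)

# Replace each valid abbreviation with how often it occurs; NaN becomes 0
df['count'] = valid.map(valid.value_counts()).fillna(0).astype(int)

print(df['count'].tolist())
```

The mismatched (PO) row and the row without parentheses both get 0, because NaN never compares equal to anything.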
To handle three-letter abbreviations as well (and more, if you simply follow the same pattern), use a loop:
import re
import numpy as np
import pandas as pd

acy = re.compile(r"\((.*)\)")
twoWords = re.compile(r'(\w+) (\w+) (?=\()')
threeWords = re.compile(r'(\w+) (\w+) (\w+) (?=\()')
firstLet = re.compile(r'(^.)')
acyList = []
#Pull the first letters out of the words before ()
for index, value in df['Name'].items():
    #get letters in () to check whether we need to look at 2 or 3 words
    getAcy = acy.search(value)
    try:
        #check if length of letters in () is 2
        if len(getAcy[1]) == 2:
            #search for two words
            words = twoWords.search(value)
            #get first letter of two words before () and add phrase to list
            acyList.append(firstLet.search(words[1])[1] + firstLet.search(words[2])[1])
        #check if length of letters in () is 3
        elif len(getAcy[1]) == 3:
            #search for three words
            words = threeWords.search(value)
            #get first letter of three words before () and add phrase to list
            acyList.append(firstLet.search(words[1])[1] + firstLet.search(words[2])[1] + firstLet.search(words[3])[1])
        else:
            #any other length is not handled; append NaN so the list stays aligned with the rows
            acyList.append(np.nan)
    except TypeError:
        #no () in this row, or not enough words before it
        acyList.append(np.nan)
#letters inside (), compared row by row against the expected initials
found = df['Name'].str.extract(r"\((.*)\)", expand=False)
df['count'] = np.where(found == pd.Series(acyList, index=df.index), found, 0)
df['count'] = df['count'].map(dict(df[df['count']!=0]['count'].value_counts())).fillna(0)
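If you would rather not add one compiled pattern per length, the same idea can be sketched for any abbreviation length by taking as many preceding words as there are letters in the (). This is only a rough sketch under some assumptions: `validated_acronym` is a hypothetical helper name, only the first (...) in each row is inspected, the abbreviation is assumed to consist of word characters, and the comparison is case-sensitive. The ASAP rows are made-up sample data.

```python
import re
import pandas as pd

def validated_acronym(text):
    """Return the abbreviation in () if it equals the initials of the
    words right before it, otherwise None. (Hypothetical helper.)"""
    m = re.search(r'\((\w+)\)', text)
    if m is None:
        return None  # no parentheses in this row
    acronym = m.group(1)
    # All words that appear before the opening parenthesis
    preceding = re.findall(r'\w+', text[:m.start()])
    n = len(acronym)
    if len(preceding) < n:
        return None  # not enough words to build the initials
    initials = ''.join(word[0] for word in preceding[-n:])
    return acronym if initials == acronym else None

df = pd.DataFrame({'Name': [
    'This is Ante Meridian (AM) not included',
    'Extra module with Post Meridian (PM)',
    'Post Meridian (PO) is not available',
    'Sent As Soon As Possible (ASAP) today',
    'Another As Soon As Possible (ASAP) item',
]})

# Valid abbreviation per row (None where missing or mismatched)
valid = df['Name'].map(validated_acronym)
# Replace each valid abbreviation with its number of occurrences
df['count'] = valid.map(valid.value_counts()).fillna(0).astype(int)

print(df['count'].tolist())
```

Because the word count is derived from the abbreviation itself, no extra branch is needed for four or more letters.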