Checking for acronyms in a pandas DataFrame column



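What is the most efficient way to identify the acronyms that follow the words they abbreviate, count them, and write the count to a new column, but only when the acronym is correct?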

Expected output:

|-------Name---------------------------||-Count-|
This is Ante Meridian (AM) not included||   3   |         
This is Ante Meridian (AM)     included||   3   |     
This is Ante Meridian (AM) not included||   3   |     
Extra module with Post Meridian (PM)   ||   1   |     
Post Meridian (PO) is not available    ||   0   |  #Mismatch   

First, use a regular expression to check whether the letters inside ( ) match the initials of the two words before them.

import numpy as np

#get two words before (
wordsbefore = df['Name'].str.extract(r'(\w+) (\w+) (?=\()')
#get first letter of both words and make it what it should be in ()
check = wordsbefore[0].str[0] + wordsbefore[1].str[0]
#get the letters inside () as a Series (expand=False)
found = df['Name'].str.extract(r"\((.*)\)", expand=False)
#check if letters in () matches our check
df['count'] = np.where(found == check, found, 0)
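To see what the two patterns capture, here is a minimal sketch on a single string using Python's built-in re module. The names two_words and in_parens are illustrative and not part of the answer's code; the sample string comes from the question.

```python
import re

# Same patterns as in the answer, with the regex escapes written out
two_words = re.compile(r'(\w+) (\w+) (?=\()')   # two words right before "("
in_parens = re.compile(r'\((.*)\)')             # whatever sits inside "( )"

text = "Post Meridian (PO) is not available"

words = two_words.search(text)
acronym = in_parens.search(text)[1]

# Build the acronym we would expect from the initials of the two words
expected = words[1][0] + words[2][0]

print(words.groups())       # the two captured words
print(acronym, expected)    # acronym found vs. acronym expected
print(acronym == expected)  # False here, so this row counts as a mismatch
```

The lookahead `(?=\()` only asserts that an opening parenthesis follows; it does not consume it, so the captured groups contain just the two words.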

Now you have a df where the acronym sits in its own column, or 0 if it does not match. All that is left is to replace each acronym with its count.

df['count'] = df['count'].map(dict(df[df['count']!=0]['count'].value_counts())).fillna(0)
    Name                                      count
0   This is Ante Meridian (AM) not included   3.0
1   This is Ante Meridian (AM) included       3.0
2   This is Ante Meridian (AM) not included   3.0
3   Extra module with Post Meridian (PM)      1.0
4   Post Meridian (PO) is not available       0.0

A row with no ( ) also ends up as 0.
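The two steps can be put together in a self-contained sketch. It rebuilds the sample data from the question and adds one extra row without parentheses (that row is an assumption added here to illustrate the 0 case). As a variant, it uses Series.where so that mismatches become NaN rather than 0 before counting; this is one possible way to write it, not the only one.

```python
import pandas as pd

df = pd.DataFrame({"Name": [
    "This is Ante Meridian (AM) not included",
    "This is Ante Meridian (AM) included",
    "This is Ante Meridian (AM) not included",
    "Extra module with Post Meridian (PM)",
    "Post Meridian (PO) is not available",
    "No acronym in this row",            # hypothetical row without ( )
]})

# Two words directly before "(" and the acronym they should produce
words = df["Name"].str.extract(r"(\w+) (\w+) (?=\()")
expected = words[0].str[0] + words[1].str[0]

# Letters actually found inside the parentheses (NaN if there are none)
found = df["Name"].str.extract(r"\((.*)\)", expand=False)

# Keep the acronym only where it matches; everything else becomes NaN
valid = found.where(found == expected)

# Replace each valid acronym with how often it occurs; NaN rows get 0
df["count"] = valid.map(valid.value_counts().to_dict()).fillna(0).astype(int)

print(df)
```

Because NaN never compares equal to anything, rows without parentheses and rows with a wrong acronym both fall through to 0 without special handling.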


For three words, and adjustably more, just follow the same pattern in a loop:

import re
import numpy as np
import pandas as pd

acy = re.compile(r"\((.*)\)")
twoWords = re.compile(r'(\w+) (\w+) (?=\()')
threeWords = re.compile(r'(\w+) (\w+) (\w+) (?=\()')
firstLet = re.compile(r'(^.)')
acyList = []
#Pull the first letters out of the words before ()
for index, value in df['Name'].items():
    #get letters in () to inspect whether we need to check 2 or 3 words
    getAcy = acy.search(value)
    try:
        #check if length of letters in () is 2
        if len(getAcy[1]) == 2:
            #search for two words
            words = twoWords.search(value)
            #get first letter of two words before () and add phrase to list
            acyList.append(firstLet.search(words[1])[1] + firstLet.search(words[2])[1])
        #check if length of letters in () is 3
        elif len(getAcy[1]) == 3:
            #search for three words
            words = threeWords.search(value)
            #get first letter of three words before () and add phrase to list
            acyList.append(firstLet.search(words[1])[1] + firstLet.search(words[2])[1] + firstLet.search(words[3])[1])
        else:
            #any other length is not handled; keep acyList aligned with df
            acyList.append(np.nan)
    except TypeError:
        #a search returned None: no () or too few words before it
        acyList.append(np.nan)
#letters inside () as a Series, compared with the expected acronyms
found = df['Name'].str.extract(r"\((.*)\)", expand=False)
df['count'] = np.where(found == pd.Series(acyList, index=df.index), found, 0)
df['count'] = df['count'].map(dict(df[df['count']!=0]['count'].value_counts())).fillna(0)
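Rather than adding one compiled pattern per acronym length, the same idea can be sketched as a single helper that takes as many preceding words as the acronym has letters. The function name valid_acronym and the sample rows below are hypothetical, and the helper assumes the parentheses contain only word characters (it uses `\w+` instead of `.*`), so treat it as an illustration under those assumptions.

```python
import re
import pandas as pd

PARENS = re.compile(r"\((\w+)\)")

def valid_acronym(text):
    """Return the acronym in ( ) if it equals the initials of the words
    directly before it, otherwise None."""
    match = PARENS.search(text)
    if match is None:
        return None                      # no ( ) in this row
    acronym = match[1]
    # All words that appear before the opening parenthesis
    before = re.findall(r"\w+", text[:match.start()])
    if len(before) < len(acronym):
        return None                      # not enough words to compare
    # Initials of the last len(acronym) words
    initials = "".join(word[0] for word in before[-len(acronym):])
    return acronym if initials == acronym else None

df = pd.DataFrame({"Name": [
    "Read the Frequently Asked Questions (FAQ) first",
    "See the Frequently Asked Questions (FAQ)",
    "This is Ante Meridian (AM) included",
    "Post Meridian (PO) is not available",
]})

valid = df["Name"].map(valid_acronym)
df["count"] = valid.map(valid.value_counts().to_dict()).fillna(0).astype(int)
print(df)
```

The comparison is case-sensitive, as in the loop version, so an acronym such as (am) would not match "Ante Meridian".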
