A made-up df named KW looks like this:
    Group   Subgroup  Word
    orange  zebra     keys
    green   lion      mouse
    blue    horse     captain
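The code I currently have takes each word in the "Word" column and replaces certain letters with other letters from a dictionary, one at a time. It then builds a list of all of these misspellings. So, using the KW df: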
    kw = df[['Word', "Group", "Subgroup"]]
    words = kw.to_dict()["Word"].values()
    md = {"m": "w", "o": "z"}
    md = {k: v.split(',') for k, v in md.items()}
    newwords = []
    for word in words:
        newwords.append(word)
        for c in md:
            occ = word.count(c)
            pos = 0
            for _ in range(occ):
                pos = word.find(c, pos)
                for r in md[c]:
                    tmp = word[:pos] + r + word[pos+1:]
                    newwords.append(tmp)
                pos += 1
This returns:
Word
keys
mouse
wouse
mzuse
captain
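What I would like to do is assign each misspelling to the Group/Subgroup of the original word it was derived from. So ideally, instead of producing a separate list of misspellings, the result would look like this: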
    Group   Subgroup  Word
    orange  zebra     keys
    green   lion      mouse
    green   lion      wouse
    green   lion      mzuse
    blue    horse     captain
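The new words need to be associated with the original words somehow. You can do this by storing 2-tuples, such as ('mouse', 'wouse'), in newwords. You can then convert newwords to a DataFrame and use pd.merge to merge newwords with kw by joining on the original word: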
    import pandas as pd

    # Read the whitespace-delimited file; the separator is the regex \s+
    df = pd.read_table('data', sep=r'\s+')
    kw = df[['Word', "Group", "Subgroup"]]
    words = df['Word']
    md = {"m": "w", "o": "z"}
    md = {k: v.split(',') for k, v in md.items()}
    newwords = []
    for word in words:
        # Save both the original word and the new word
        newwords.append((word, word))
        for c in md:
            occ = word.count(c)
            pos = 0
            for _ in range(occ):
                pos = word.find(c, pos)
                for r in md[c]:
                    tmp = word[:pos] + r + word[pos+1:]
                    newwords.append((word, tmp))
                # Move past this occurrence so the next find() locates the following one
                pos += 1
    newwords = pd.DataFrame(newwords, columns=['Word', 'New'])
    # Merge on the original Word
    result = pd.merge(newwords, kw, on='Word', how='left')
    result = result[['Group', 'Subgroup', 'New']]
    result.columns = ['Group', 'Subgroup', 'Word']
    print(result)
which yields
        Group Subgroup     Word
    0  orange    zebra     keys
    1   green     lion    mouse
    2   green     lion    wouse
    3   green     lion    mzuse
    4    blue    horse  captain
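As a possible alternative to the merge, the same regrouping can be sketched with `DataFrame.explode`, which repeats the Group/Subgroup values for every element of a list-valued column. This is only a sketch under some assumptions: the sample frame is built inline instead of being read from the `data` file, the helper name `variants` is hypothetical and not part of the original code, and `ignore_index=True` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Sample data copied from the question (built inline rather than read from a file).
kw = pd.DataFrame({
    "Group": ["orange", "green", "blue"],
    "Subgroup": ["zebra", "lion", "horse"],
    "Word": ["keys", "mouse", "captain"],
})

# Same substitution table as above, already in list form.
md = {"m": ["w"], "o": ["z"]}


def variants(word, substitutions):
    """Hypothetical helper: return the word followed by each single-letter substitution."""
    out = [word]
    for old, replacements in substitutions.items():
        for pos, ch in enumerate(word):
            if ch == old:
                for new in replacements:
                    out.append(word[:pos] + new + word[pos + 1:])
    return out


# Replace each word with its list of variants, then give every variant its own row.
result = kw.assign(Word=kw["Word"].map(lambda w: variants(w, md)))
result = result.explode("Word", ignore_index=True)
print(result)
```

Because each row keeps its own Group and Subgroup while the list is expanded, no join is needed, and duplicate words in the Word column would not produce extra rows the way a merge on Word could.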