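I am working on a reusable script for processing CSV files. Right now I use the code below to normalize the column names in a CSV file, so that every file ends up with the same column names.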
import pandas as pd

df = pd.read_csv('Crokis.csv', index_col=0, encoding="ISO-8859-1", low_memory=False)
# each list holds the known spellings of one column; all are renamed to one canonical name
genCol = ['Genus', 'genus', 'ngenus', 'genera']
df.rename(columns={typo: 'Genus' for typo in genCol}, inplace=True)
spCol = ['species', 'sp', 'Species']
df.rename(columns={typo: 'species' for typo in spCol}, inplace=True)
chromCol = ['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome']
df.rename(columns={typo: 'chromosome' for typo in chromCol}, inplace=True)
del chromCol, spCol, genCol
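This works, but there are two problems.
Sometimes a header is missing from my lists because of different capitalization, or because extra characters were added before or after the name. Is there a way to use regex or something similar to handle these variations? The code also repeats the same pattern three times, so I think there should be a way to simplify it.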
You can do this with Python's re module.
Below is an example that replaces anything matching 'genus.*' with 'Genus'. It will match and replace, for example, 'genUS', 'GENUS', and 'Genus_666'.
import pandas as pd
import re
df = pd.read_csv('Crokis.csv', index_col=0, encoding = "ISO-8859-1", low_memory=False)
# 'Genus' column renaming
f = lambda x: re.sub('genus.*','Genus', x, flags = re.IGNORECASE)
df.rename(columns = f, inplace = True)
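The same idea could be extended to all three columns at once. Below is a minimal sketch, not a definitive solution: the patterns and the helper name build_rename_map are my own assumptions and would need tuning for real headers. It uses fullmatch on the whole header instead of re.sub, because re.sub only replaces the matched part (with the lambda above, 'ngenus' would become 'nGenus').

```python
import re

# Hypothetical patterns, one per canonical column name (assumptions, tune for your data).
PATTERNS = {
    'Genus': re.compile(r'.*(genus|genera).*', re.IGNORECASE),
    'species': re.compile(r'\s*(species|sp)\W*', re.IGNORECASE),
    'chromosome': re.compile(r'.*(chromosome|cytology|2n).*', re.IGNORECASE),
}

def build_rename_map(columns):
    """Return {original_header: canonical_name} for headers that match a pattern."""
    rename_map = {}
    for col in columns:
        for canonical, pattern in PATTERNS.items():
            if pattern.fullmatch(col):
                rename_map[col] = canonical
                break  # first matching pattern wins
    return rename_map

mapping = build_rename_map(['ngenus', 'GENUS_666', 'Sp.', 'Chromosome count', 'locality'])
```

The resulting dict can be passed straight to pandas, e.g. df.rename(columns=build_rename_map(df.columns), inplace=True). Headers that match no pattern are left untouched.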
I would approach the problem like this:
# use a single dict to hold the mapping
name_map = {'Genus': ['Genus', 'genus', 'ngenus', 'genera'],
            'species': ['species', 'sp', 'Species'],
            'chromosome': ['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome']}
col_translate = {}
for c in df.columns:
    for canonical_name, alias_names in name_map.items():
        for alias_name in alias_names:
            # case-insensitive exact match against the alias
            if c.lower() == alias_name.lower():
                col_translate[c] = canonical_name
            # if you want to check prefix or suffix...
            elif c.startswith(alias_name) or c.endswith(alias_name):
                col_translate[c] = canonical_name
            # ... any additional, more complicated test
            ...
# apply the collected mapping in one call
df.rename(columns=col_translate, inplace=True)
This may be useful in cases where re seems too difficult.
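The alias-list idea can also be written as a reverse lookup that is built once, which avoids the three nested loops. This is only a sketch under one assumption: that ignoring case, whitespace and punctuation is enough to tell your headers apart. The names normalize, ALIAS_LOOKUP and translate_columns are made up for illustration.

```python
# Canonical name -> known aliases (same idea as name_map above).
NAME_MAP = {
    'Genus': ['Genus', 'ngenus', 'genera'],
    'species': ['species', 'sp'],
    'chromosome': ['Chromosome count', 'chromosome', 'Cytology', '2n'],
}

def normalize(name):
    """Lowercase the header and keep only letters and digits."""
    return ''.join(ch for ch in name.lower() if ch.isalnum())

# Reverse lookup built once: normalized alias -> canonical name.
ALIAS_LOOKUP = {
    normalize(alias): canonical
    for canonical, aliases in NAME_MAP.items()
    for alias in aliases
}

def translate_columns(columns):
    """Return {original_header: canonical_name} for headers with a known alias."""
    return {
        c: ALIAS_LOOKUP[normalize(c)]
        for c in columns
        if normalize(c) in ALIAS_LOOKUP
    }

col_translate = translate_columns(['GENUS', ' Sp. ', 'chromosome_count', 'notes'])
```

As before, the result can be handed to df.rename(columns=col_translate, inplace=True). Because every header is normalized first, case variants such as 'Genus' and 'genus' no longer need separate entries in the alias lists.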