I'm working with a large dataset, but the problem can be summarized with the small dataset below:
import pandas as pd
df = pd.DataFrame({"Filename":["fileName1_uniqueTag1", "fileName2_uniqueTag1", "fileName3_uniqueTag1", "fileName1_uniqueTag2", "fileName2_uniqueTag2", "fileName3_uniqueTag2"],
"measurement":[1336.564888, 1090.852579, 990.320323, 1202.522612, 1098.045258, 923.600277],})
print(df)
>>>
               Filename  measurement
0  fileName1_uniqueTag1  1336.564888
1  fileName2_uniqueTag1  1090.852579
2  fileName3_uniqueTag1   990.320323
3  fileName1_uniqueTag2  1202.522612
4  fileName2_uniqueTag2  1098.045258
5  fileName3_uniqueTag2   923.600277
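There are three distinct file names, and each appears with two unique tags in the "Filename" column. The goal is to compute, for each file, the ratio of the uniqueTag2 measurement to the uniqueTag1 measurement. The result should look like this: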
    Filename  uniqueTag2/uniqueTag1
0  fileName1               0.899711
1  fileName2               1.006593
2  fileName3               0.932627
I can build lists of the three distinct file names and the two distinct tags:
nameList = df["Filename"].tolist()
fileNames = []   # empty list to fill with the distinct base file names
uniqueTags = []  # empty list to fill with the unique tags
for name in nameList:  # iterate through the list of full file names
    subStrings = name.split("_")  # split each full file name at the underscore
    if subStrings[0] not in fileNames:  # if the base file name isn't already in the list...
        fileNames.append(subStrings[0])  # ...append it
    if subStrings[1] not in uniqueTags:  # if the unique tag isn't already in the list...
        uniqueTags.append(subStrings[1])  # ...append it
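I think I could do this by making the file names the index and using df.at to access the individual measurements, but that seems very messy, and I'm sure there must be a better way using pandas' built-in functionality. Any suggestions?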
You can use str.split():
df[['Filename','uniquetag']] = df['Filename'].str.split('_', expand=True)
tag1 = df.loc[df['uniquetag'] == 'uniqueTag1'].set_index('Filename')['measurement']
tag2 = df.loc[df['uniquetag'] == 'uniqueTag2'].set_index('Filename')['measurement']
tag2 / tag1
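To get the output in the shape the question asks for, the resulting Series can be renamed and turned back into a DataFrame. This is a minimal sketch, assuming every name contains exactly one underscore and each file name occurs once per tag; the column label `uniqueTag2/uniqueTag1` is taken from the desired output in the question, and the variable name `ratio` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Filename": ["fileName1_uniqueTag1", "fileName2_uniqueTag1", "fileName3_uniqueTag1",
                 "fileName1_uniqueTag2", "fileName2_uniqueTag2", "fileName3_uniqueTag2"],
    "measurement": [1336.564888, 1090.852579, 990.320323,
                    1202.522612, 1098.045258, 923.600277],
})

# Split "fileName1_uniqueTag1" into the base name and the tag
# (assumption: exactly one underscore per name).
df[["Filename", "uniquetag"]] = df["Filename"].str.split("_", expand=True)

# One Series per tag, indexed by base file name.
tag1 = df.loc[df["uniquetag"] == "uniqueTag1"].set_index("Filename")["measurement"]
tag2 = df.loc[df["uniquetag"] == "uniqueTag2"].set_index("Filename")["measurement"]

# The division aligns on the Filename index, so row order does not matter.
ratio = (tag2 / tag1).rename("uniqueTag2/uniqueTag1").reset_index()
print(ratio)
```

Because the two Series are aligned by index rather than by position, a file that is missing one of its tags would show up as NaN instead of silently being paired with the wrong row.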
Try
df[['one','two']] = df.Filename.str.split("_",expand=True)
and then groupby on those two columns.
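The groupby step is not spelled out in this answer, so the following is only one possible reading of it. It groups on both new columns, takes the single measurement in each group, and uses `unstack` (my addition, not part of the answer) to put the tags side by side. It assumes one measurement per (file name, tag) pair; the names `wide` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Filename": ["fileName1_uniqueTag1", "fileName2_uniqueTag1", "fileName3_uniqueTag1",
                 "fileName1_uniqueTag2", "fileName2_uniqueTag2", "fileName3_uniqueTag2"],
    "measurement": [1336.564888, 1090.852579, 990.320323,
                    1202.522612, 1098.045258, 923.600277],
})

# "one" holds the base file name, "two" holds the tag.
df[["one", "two"]] = df["Filename"].str.split("_", expand=True)

# Group on both columns; first() picks the only value in each group
# (assumption: exactly one row per file name and tag).
wide = df.groupby(["one", "two"])["measurement"].first().unstack("two")

# wide has one row per file name and one column per tag.
result = (
    (wide["uniqueTag2"] / wide["uniqueTag1"])
    .rename("uniqueTag2/uniqueTag1")
    .rename_axis("Filename")
    .reset_index()
)
print(result)
```

If a (file name, tag) pair could occur more than once, `first()` would need to be replaced by whatever aggregation makes sense for the data, for example `mean()`.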