I have a pandas DataFrame with a text column, like this:
A
0 61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2
1 58.97% Area_3 ; 41.03% no_label
2 100% no_label
3 80.49% Area_1 ; 14.63% Area_3
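For each row, I need to get the largest percentage together with its area name or, if the largest percentage belongs to 'no_label', the second-largest percentage and its area name. For the example above, that would be: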
A
0 32.22% Area_1
1 58.97% Area_3
2 100% no_label
3 80.49% Area_1
The result could also go into a second column, it doesn't matter:
                                                A              B
0  61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2  32.22% Area_1
1                 58.97% Area_3 ; 41.03% no_label  58.97% Area_3
2                                   100% no_label  100% no_label
3                   80.49% Area_1 ; 14.63% Area_3  80.49% Area_1
Any ideas?
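We can try the following. The code uses pandas.Series.apply to apply a function to every cell in the column. Each cell holds a string. We split each string on the delimiter ; and, if there is more than one part, sort the parts from lowest to highest by their percentage. The sorting is done with a key function that takes the text up to the % character and converts it to a float.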
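To make the sort key concrete, here is the split-and-sort step applied to a single cell in plain Python. This is only an illustration: the names cell, percentage and ordered are made up for this sketch, and it assumes every part starts with a number followed by %.

```python
# A single cell from column A, used here purely as an illustration.
cell = "61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2"

# Split on the ";" delimiter and trim the whitespace around each part.
parts = [p.strip() for p in cell.split(";")]


def percentage(part):
    """Hypothetical helper: the number before the % sign, as a float."""
    return float(part.split("%")[0])


# Sort from lowest to highest percentage.
ordered = sorted(parts, key=percentage)
print(ordered)
# ['5.56% Area_2', '32.22% Area_1', '61.11% no_label']
```

The largest entry ends up last, so the last and second-to-last items are the only ones that need to be inspected.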
import pandas as pd

data = """
61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2
58.97% Area_3 ; 41.03% no_label
100% no_label
80.49% Area_1 ; 14.63% Area_3"""

# One row per line. strip() drops the leading blank line, and splitlines()
# keeps the trailing newlines out of the cells.
df = pd.DataFrame({"A": data.strip().splitlines()})


def get_largest_or_second_largest_percentage(string):
    """Given a string of data, return the largest percentage,
    the second-largest if the largest has label 'no_label',
    the only percentage if there is one data point,
    or an empty string if there is no data.
    """
    if not string:
        return ""
    parts = [p.strip() for p in string.split(";")]
    # If there is only one item, don't bother sorting.
    if len(parts) == 1:
        return parts[0]
    # Sort from lowest to highest percentage.
    # Assumes that the percentage is before the % symbol.
    parts.sort(key=lambda s: float(s.split("%")[0]))
    # Return second-largest if the largest has label 'no_label'.
    if "no_label" in parts[-1]:
        return parts[-2]
    return parts[-1]  # return largest otherwise.


df.loc[:, "B"] = df.loc[:, "A"].apply(get_largest_or_second_largest_percentage)
print(df.head())

The output is

                                                A              B
0  61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2  32.22% Area_1
1                 58.97% Area_3 ; 41.03% no_label  58.97% Area_3
2                                   100% no_label  100% no_label
3                   80.49% Area_1 ; 14.63% Area_3  80.49% Area_1
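
The full sort is not strictly needed. As a possible variant, we could drop the 'no_label' entries first and take the maximum of what is left, falling back to 'no_label' only when nothing else is in the cell. This is a sketch rather than a drop-in replacement: pick_label and its skip parameter are names invented here, and it assumes each cell contains at most one 'no_label' entry, in which case it gives the same result as get_largest_or_second_largest_percentage.

```python
def pick_label(cell, skip="no_label"):
    """Return the entry with the largest percentage, ignoring entries
    labelled `skip` unless they are the only ones in the cell.

    Hypothetical helper; assumes each entry looks like '<number>% <label>'.
    """
    # Split on ";" and discard empty pieces (e.g. from an empty cell).
    parts = [p.strip() for p in cell.split(";") if p.strip()]
    if not parts:
        return ""
    # Entries that carry a real label, i.e. anything except `skip`.
    labelled = [p for p in parts if skip not in p]
    # Use the labelled entries when there are any, otherwise all entries.
    candidates = labelled or parts
    return max(candidates, key=lambda s: float(s.split("%")[0]))


print(pick_label("61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2"))
# 32.22% Area_1
```

It can be applied in the same way as before, with df["A"].apply(pick_label).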