Split a Pandas DataFrame cell by condition and keep specific words



I have a pandas DataFrame with a text column that looks like this:

A
0  61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2
1  58.97% Area_3 ; 41.03% no_label
2  100% no_label
3  80.49% Area_1 ; 14.63% Area_3

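For each row, I need to extract the largest percentage together with its area name or, if the largest percentage belongs to 'no_label', the second-largest percentage and its area name. For the example above, the result would be: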

A
0  32.22% Area_1
1  58.97% Area_3
2  100% no_label
3  80.49% Area_1

The result can also go into a second column, it doesn't matter:

A                                          B
0  61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2           32.22% Area_1
1  58.97% Area_3 ; 41.03% no_label                          58.97% Area_3
2  100% no_label                                            100% no_label
3  80.49% Area_1 ; 14.63% Area_3                            80.49% Area_1

Any ideas?

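We can try the following. The code uses pandas.Series.apply to apply a function to every cell in the column. Each cell holds a string. We split each string on the separator ; and, if there is more than one part, sort the parts from lowest to highest percentage. The sorting is driven by a key function that takes the text up to the % character and converts it to a float.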
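To make the sort key concrete, here is a minimal sketch of that step in isolation. The helper name `percentage_of` is an assumption introduced for illustration only; it assumes every part starts with its percentage.

```python
def percentage_of(part):
    """Return the number before the % sign as a float."""
    return float(part.split("%")[0])


cell = "61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2"

# Split on the separator and trim the surrounding spaces.
parts = [p.strip() for p in cell.split(";")]

# sorted() orders the parts from lowest to highest percentage.
ordered = sorted(parts, key=percentage_of)
print(ordered)
# ['5.56% Area_2', '32.22% Area_1', '61.11% no_label']
```

After sorting, the last element is the largest entry and the one before it is the second largest.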

import pandas as pd

data = """
61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2
58.97% Area_3 ; 41.03% no_label
100% no_label
80.49% Area_1 ; 14.63% Area_3"""

# One row per line. strip() drops the leading blank line and splitlines()
# removes the newline characters, so no "\n" ends up in the cells.
df = pd.DataFrame({"A": data.strip().splitlines()})


def get_largest_or_second_largest_percentage(string):
    """Given a string of data, return the largest percentage,
    the second-largest if the largest has label 'no_label',
    the only percentage if there is one data point,
    or an empty string if there is no data.
    """
    if not string:
        return ""
    parts = [p.strip() for p in string.split(";")]
    # If there is only one item, don't bother sorting.
    if len(parts) == 1:
        return parts[0]
    # Sort from lowest to highest percentage.
    # Assumes that the percentage is before the % symbol.
    parts.sort(key=lambda s: float(s.split("%")[0]))
    # Return second-largest if the largest has label 'no_label'.
    if "no_label" in parts[-1]:
        return parts[-2]
    return parts[-1]  # return largest otherwise.


# Apply the function to each cell of column A and store the result in B.
df["B"] = df["A"].apply(get_largest_or_second_largest_percentage)
print(df.head())

The output is:

                                                A              B
0  61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2  32.22% Area_1
1                 58.97% Area_3 ; 41.03% no_label  58.97% Area_3
2                                   100% no_label  100% no_label
3                   80.49% Area_1 ; 14.63% Area_3  80.49% Area_1
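
As a variation, the same rule can be written without sorting: take the maximum over the entries that are not 'no_label', and fall back to the 'no_label' entry only when nothing else is present. This is a sketch rather than the method used above. The function name `pick_area` is hypothetical, and it assumes that each cell holds at most one 'no_label' entry and that every entry starts with its percentage.

```python
def pick_area(cell):
    """Return the highest-percentage entry that is not 'no_label'.

    Falls back to the 'no_label' entry if it is the only one,
    and returns an empty string for an empty cell.
    """
    parts = [p.strip() for p in cell.split(";") if p.strip()]
    if not parts:
        return ""
    # Prefer labelled areas; use all parts only if none are labelled.
    labelled = [p for p in parts if "no_label" not in p]
    candidates = labelled or parts
    return max(candidates, key=lambda s: float(s.split("%")[0]))


cells = [
    "61.11% no_label ; 32.22% Area_1 ; 5.56% Area_2",
    "58.97% Area_3 ; 41.03% no_label",
    "100% no_label",
    "80.49% Area_1 ; 14.63% Area_3",
]
results = [pick_area(c) for c in cells]
print(results)
# ['32.22% Area_1', '58.97% Area_3', '100% no_label', '80.49% Area_1']
```

With a DataFrame, the function can be applied to the column in the same way as before, for example df["B"] = df["A"].apply(pick_area).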
