Recursive loop over a pandas DataFrame



Input:

| Company | Employee Number |
|---------|-----------------|
| 1       | 12              |
| 2       | 34, 12          |
| 3       | 56, 34, 78      |
| 4       | 90              |

Goal:

Find all employee numbers that belong to the same employee across all companies.

Expected result:

| Company | Employee Number |
|---------|-----------------|
| 1       | 12, 34, 56, 78  |
| 2       | 12, 34, 56, 78  |
| 3       | 12, 34, 56, 78  |
| 4       | 90              |

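Notice in the result above that the first three rows are the same employee. We know this because the first employee number, "12", also appears in row 2, and "34" appears in both row 2 and row 3. Rows 1, 2 and 3 therefore all belong to the same employee, so we combine the distinct employee numbers and show them on each of those rows.

Note: a row can have anywhere from 0 to N employee numbers.

Is there a recursive way to do this? If not, what solution can you think of?

Here is how I would approach it (explanations in the comments):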

    # Replace NaN in df["Employee Number"] with an empty string
    df["Employee Number"] = df["Employee Number"].fillna("")
    # Add a column of sets holding the individual employee numbers
    df["EN_Sets"] = df["Employee Number"].str.findall(r"\d+").apply(set)
    # Build the maximal, mutually disjoint employee number sets
    en_sets = []
    for en_set in df.EN_Sets:
        union_sets = []
        keep_sets = []
        for s in en_sets:
            if s.isdisjoint(en_set):
                keep_sets.append(s)
            else:
                union_sets.append(s)
        en_sets = keep_sets + [en_set.union(*union_sets)]
    # Build a dictionary with the replacement strings as keys and the distinct
    # sets as values
    en_sets = {", ".join(sorted(s)): s for s in en_sets}
    # Apply-function to replace the original employee number strings
    def setting_en_numbers(s):
        for en_set_str, en_set in en_sets.items():
            if not s.isdisjoint(en_set):
                return en_set_str
        # Rows without any employee number match no set; keep them empty
        return ""
    # Apply the function to df["Employee Number"]
    df["Employee Number"] = df.EN_Sets.apply(setting_en_numbers)
    df = df[["Company", "Employee Number"]]

Result

df:

       Company Employee Number
    0        1              12
    1        2          34, 12
    2        3      56, 34, 78
    3        4              90
    4        5             NaN

       Company Employee Number
    0        1  12, 34, 56, 78
    1        2  12, 34, 56, 78
    2        3  12, 34, 56, 78
    3        4              90
    4        5
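
The set merging above amounts to finding connected groups of employee numbers, where numbers that share a row are linked. As an alternative, here is a minimal sketch using a union-find (disjoint-set) structure instead of the answer's nested loop. This is a swapped-in technique, not the method from the answer. The names `merge_employee_numbers`, `find`, `union` and `rows` are hypothetical, and sorting the numbers numerically rather than as strings is an assumption. It works on a plain list of strings so it runs without pandas.

```python
import re


def merge_employee_numbers(rows):
    """Return, for each row, the comma-joined group of linked employee numbers."""
    parent = {}  # union-find forest: employee number -> parent number

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Missing values (None, NaN) are not strings and yield no numbers.
    parsed = [re.findall(r"\d+", r) if isinstance(r, str) else [] for r in rows]

    for numbers in parsed:
        for n in numbers:
            parent.setdefault(n, n)
        # Numbers that share a row belong to the same employee.
        for n in numbers[1:]:
            union(numbers[0], n)

    # Collect the members of every group under its root.
    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)

    # Assumption: sort numerically so that "9" comes before "10".
    labels = {
        root: ", ".join(sorted(members, key=int))
        for root, members in groups.items()
    }
    return [labels[find(nums[0])] if nums else "" for nums in parsed]


rows = ["12", "34, 12", "56, 34, 78", "90", None]
result = merge_employee_numbers(rows)
print(result)
```

With a DataFrame, something like `df["Employee Number"] = merge_employee_numbers(df["Employee Number"].tolist())` should give the same grouping, since NaN values are not strings and are treated as rows without numbers.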
