Input:
| Company | Employee Number |
|---------|-----------------|
| 1 | 12 |
| 2 | 34, 12 |
| 3 | 56, 34, 78 |
| 4 | 90 |
Goal:
Find all employee numbers that belong to the same employee across all companies.
Final result:
| Company | Employee Number |
|---------|-----------------|
| 1 | 12, 34, 56, 78 |
| 2 | 12, 34, 56, 78 |
| 3 | 12, 34, 56, 78 |
| 4 | 90 |
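Notice in the result above that the first three rows are the same employee. We know this because the first employee number, "12", also appears in row 2, and "34" appears in both row 2 and row 3. Rows 1, 2, and 3 therefore all belong to the same employee, so we concatenate the distinct employee numbers and display the result shown above.
Note: a row can have 0 or N employee numbers.
Is there a recursive way to do this? If not, what solution can you come up with?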
Here is how I would approach it (explanations in the comments):

    # Replace NaN in df["Employee Number"] with empty string
    df["Employee Number"] = df["Employee Number"].fillna("")

    # Add a column with sets that contain the individual employee numbers
    df["EN_Sets"] = df["Employee Number"].str.findall(r"\d+").apply(set)

    # Build the maximal distinct employee number sets: every set that
    # overlaps the current row's set is merged with it, the rest are kept
    en_sets = []
    for en_set in df.EN_Sets:
        union_sets = []
        keep_sets = []
        for s in en_sets:
            if s.isdisjoint(en_set):
                keep_sets.append(s)
            else:
                union_sets.append(s)
        en_sets = keep_sets + [en_set.union(*union_sets)]

    # Build a dictionary with the replacement strings as keys and the
    # distinct sets as values
    en_sets = {", ".join(sorted(s)): s for s in en_sets}

    # Apply-function to replace the original employee number strings
    def setting_en_numbers(s):
        for en_set_str, en_set in en_sets.items():
            if not s.isdisjoint(en_set):
                return en_set_str

    # Apply the function to df["Employee Number"]
    df["Employee Number"] = df.EN_Sets.apply(setting_en_numbers)
    df = df[["Company", "Employee Number"]]

The result for `df`:

       Company  Employee Number
    0        1               12
    1        2           34, 12
    2        3       56, 34, 78
    3        4               90
    4        5              NaN

is

       Company  Employee Number
    0        1   12, 34, 56, 78
    1        2   12, 34, 56, 78
    2        3   12, 34, 56, 78
    3        4               90
    4        5
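
On the question of a recursive alternative: the grouping can also be seen as finding connected components, where two employee numbers are linked whenever they appear in the same row. Below is a minimal sketch of that idea using a union-find (disjoint-set) structure in plain Python. It is a different technique from the set-merging loop above, not a rewrite of it. The function name `merge_employee_numbers` is hypothetical, and the sketch assumes each row is a comma-separated string of numeric IDs, with `None` standing in for a missing value.

```python
def merge_employee_numbers(rows):
    """Return, for each row, the merged employee numbers of its group.

    Assumption: `rows` holds comma-separated strings of digits, or None
    for a missing value. Rows sharing any number end up in one group.
    """
    parent = {}

    def find(x):
        # Register x on first sight, then walk up to the root while
        # shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Attach the root of a's group to the root of b's group.
        parent[find(a)] = find(b)

    parsed = []
    for row in rows:
        numbers = [n.strip() for n in (row or "").split(",") if n.strip()]
        parsed.append(numbers)
        for n in numbers:
            find(n)
        # All numbers in one row belong to the same employee.
        for other in numbers[1:]:
            union(numbers[0], other)

    # Collect the members of every group under its root.
    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)

    result = []
    for numbers in parsed:
        if not numbers:
            result.append("")  # row without any employee number
        else:
            members = groups[find(numbers[0])]
            result.append(", ".join(sorted(members, key=int)))
    return result


rows = ["12", "34, 12", "56, 34, 78", "90", None]
print(merge_employee_numbers(rows))
```

With a DataFrame, the returned list could be assigned back to the `"Employee Number"` column after converting NaN values to `None`. Sorting with `key=int` orders the numbers numerically, which only works under the assumption that every ID consists of digits.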