如何根据其他行中组合的重复值删除行?



给定8列的数据集,我想检查是否有一行匹配基于其他行的特定值并将其删除,所有这些都在同一数据集中。

下面是一个例子:

<表类> IP_Src_X IP_Dst_X Port_Src_X Port_Dst_X IP_Src_Y IP_Dst_Y Port_Src_Y Port_Dst_Y tbody><<tr>10.00.000.0090.00.000.001000300090.00.000.0010.00.000.003000100060.50.500.0030.000.300.008000200030.000.300.0060.50.500.002000800066.00.000.0010.00.000.005000700010.00.000.0066.00.000.007000500090.00.000.0010.00.000.003000100010.00.000.0090.00.000.001000300010.00.000.0066.00.000.007000500066.00.000.0010.00.000.0050007000

使用您提供的数据框架:

import pandas as pd
df = pd.DataFrame(
{
"IP_Src_X": [
"100000000",
"605050000",
"660000000",
"900000000",
"100000000",
],
"IP_Dst_X": [
"900000000",
"3000030000",
"100000000",
"100000000",
"660000000",
],
"Port_Src_X": [1000, 8000, 5000, 3000, 7000],
"Port_Dst_X": [3000, 2000, 7000, 1000, 5000],
"IP_Src_Y": [
"900000000",
"3000030000",
"100000000",
"100000000",
"660000000",
],
"IP_Dst_Y": [
"100000000",
"605050000",
"660000000",
"900000000",
"100000000",
],
"Port_Src_Y": [3000, 2000, 7000, 1000, 5000],
"Port_Dst_Y": [1000, 8000, 5000, 3000, 7000],
}
)

下面是使用Pandas concat的一种方法:

# Stack X values onto Y values and remove duplicates
new_df = pd.concat(
[
df[[f"IP_Src_{x}", f"IP_Dst_{x}", f"Port_Src_{x}", f"Port_Dst_{x}"]].rename(
columns={
f"IP_Src_{x}": "IP_Src",
f"IP_Dst_{x}": "IP_Dst",
f"Port_Src_{x}": "Port_Src",
f"Port_Dst_{x}": "Port_Dst",
}
)
for x in ["X", "Y"]
]
)
new_df = df.drop_duplicates(keep="first")

# Stack first half of new_df onto switched second half
first_half = df.iloc[: int(df.shape[0] / 2), :]
first_half.columns = [i for i in range(first_half.shape[1])]
second_half = df.iloc[int(df.shape[0] / 2) :, :].reindex(
["IP_Dst", "IP_Src", "Port_Dst", "Port_Src"], axis=1
)
second_half.columns = [i for i in range(second_half.shape[1])]
# Filter df with remaining non duplicated rows
rows_to_keep = pd.concat([first_half, second_half]).drop_duplicates(keep="first").index
df = df[df.index.isin(rows_to_keep)]

:

print(df)
# Output
IP_Src_X    IP_Dst_X  Port_Src_X  Port_Dst_X    IP_Src_Y   IP_Dst_Y  
0  100000000   900000000        1000        3000   900000000  100000000   
1  605050000  3000030000        8000        2000  3000030000  605050000   
2  660000000   100000000        5000        7000   100000000  660000000   
Port_Src_Y  Port_Dst_Y  
0        3000        1000  
1        2000        8000  
2        7000        5000  

最新更新