我有两个.csv文件,其中第一列是某个名称,其他列描述该名称。我想在第二个文件中只留下两个文件中第一列相同的行来处理这些文件。此外,应删除重复项以进行进一步处理。两个文件中除第0列以外的其他列中的数据可能不同。
test1.csv
cat,house,guest
dog,environment
mouse,broom
mouse,broom
test2.csv
cat,house,boy
dog,environment
mouse,broom,hole
path,leave
cat,house,boy
我想要接收的:
1_processed.csv
cat,house,guest
dog,environment
mouse,broom
2_processed.csv
cat,house,boy
dog,environment
mouse,broom,hole
我用pandas尝试了以下代码来实现它,但它并没有像预期的那样工作:
import pandas as pd
file1 = "test1.csv"
file2 = "test2.csv"
if __name__ == '__main__':
print("Getting columns from file")
df1 = pd.read_csv(file1, header=None, sep="delimiter", engine="python")
print("Getting columns from file")
df2 = pd.read_csv(file2, header=None, sep="delimiter", engine="python")
print("Leave only rows where first column is same in both files")
df2.drop(df2[~df2[0].isin(df1[0])], inplace=True)
print("Remove duplicated rows")
df1.drop_duplicates(subset=None, inplace=True)
df2.drop_duplicates(subset=None, inplace=True)
print("Write processed data to files")
df1.to_csv("1_processed", index=False, header=False)
df2.to_csv("2_processed", index=False, header=False)
输出:
1_processed.csv
"cat,house,guest"
"dog,environment"
"mouse,broom"
2_processed.csv
"dog,environment"
"mouse,broom,hole"
"path,leave" #this value should not appear here
"cat,house,boy"
尝试pandas.DataFrame.eq
:
import pandas as pd
file1 = "test1.csv"
file2 = "test2.csv"
if __name__ == '__main__':
print("Getting columns from file1")
df1 = pd.read_csv(file1, header=None)
print("Getting columns from file2")
df2 = pd.read_csv(file2, header=None)
print("Remove duplicated rows")
df1 = df1.drop_duplicates()
df2 = df2.drop_duplicates()
print("Leave only rows where first column is same in both files")
df2 = df2[df1.eq(df2)[0]]
print("Write processed data to files")
df1.to_csv("1_processed.csv", index=False, header=False)
df2.to_csv("2_processed.csv", index=False, header=False)
#输出:
print(df1)
0 1 2
0 cat house guest
1 dog environment NaN
2 mouse broom NaN
print(df2)
0 1 2
0 cat house boy
1 dog environment NaN
2 mouse broom hole