我发现了三个问题,但我无法为这项工作创建代码,我希望得到一些帮助,因为我真的无法创建,我认为它将来对其他人非常有用。我将留下我试图解决的三个问题:
根据另一个csv文件过滤csv文件中的行,并将过滤后的数据保存在新文件中
Python如何根据另一个csv - pandas在一个csv中搜索值?
Python:从其他CSV中过滤带有条件的CSV
数据说明:
CSV
文件名"Main.csv"
:
label,value,market
,,
team1 v team2,match1,market1
team3 v team4,match2,market2
team5 v team6,match3,market3
CSV
文件名"parameter.csv"
:用作过滤器的参数:
time,goals,label,value
,,,
15,4,team1 v team2,match1
10,3,team5 v team6,match3
规则:如果在一行中发现
label
和value
的值恰好在一起,那么在CSV
中出现最后一个!
CSV
预期和创建后的过滤器:
label,value,market
,,
team1 v team2,match1,market1
team5 v team6,match3,market3
用Pandas试试下面的代码(我添加了一些注释来解释):
import pandas as pd
#load the files to dataframes
main = pd.read_csv("Main.csv")
par = pd.read_csv("parameter.csv")
#transform the dataframes to lists of dictionaries
main_dict=main.to_dict(orient='records')
par_dict=par.to_dict(orient='records')
#create a list of dictionaries that use only 'label' and 'value' as keys
par_dict = [{'label':i['label'], 'value':i['value']} for i in par_dict]
#search for records in main that the pair of label-value exists in the previous list
result = [i for i in main_dict if {'label':i['label'], 'value':i['value']} in par_dict]
#change back to dataframe and save to csv
result=pd.DataFrame(result)
result.to_csv('resut.csv', index=False)
通过将公共列设置为数据框的索引,可以很容易地使用pandas实现这一点。由于您不关心parameter.csv
中的其他列,因此可以删除它们。
import pandas as pd
common_index = ["label", "value"]
main = pd.read_csv("Main.csv").dropna().set_index(common_index)
param = (pd.read_csv("parameter.csv", usecols=common_index)
.dropna().set_index(common_index))
result = main[main.index.isin(param.index)]
print(result)
结果
market
label value
team1 v team2 match1 market1
team5 v team6 match3 market3
这也可以在使用csv
模块的标准库中完成。从parameters文件中感兴趣的列创建一个集合,并在读取主文件时将其用作过滤器。
import csv
with open("parameter.csv", newline="") as p_file:
reader = csv.reader(p_file)
next(reader)
param_set = {tuple(row[2:4]) for row in reader if row[2]}
with open("Main.csv") as m_file:
reader = csv.reader(m_file)
next(reader)
result = [row for row in reader if tuple(row[0:2]) in param_set]
print(result)