合并100+ CSV文件时,如果"usecol"中不存在nan,如何在列中填写nan?



考虑到我有CSV文件,它看起来大致像这个

df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
'ColB': [80, 75, 70, 65]})

我正在使用以下脚本,这是在这里建议的

import pandas as pd
import glob
path = r'path/' # use your path
all_files = glob.glob(path + "/*.csv")
fields = ['ColA', 'ColB', 'ColC']
first_one = True
for filename in all_files:
if not first_one: # if it is not the first csv file then skip the header row (row 0) of that file
skip_row = [0]
else:
skip_row = []
# works with this version: '1.3.4'
# combine into one
mode = "w"
header = True
for filename in all_files:
with pd.read_csv(
filename,
engine="python",
iterator=True,
chunksize=10_000,
usecols = fields
) as reader:
for df in reader:
filename = os.path.basename(filename)
df["username"] = filename
df.to_csv("New_File.csv", index=False, mode=mode, header=header)
mode = "a"
header = False

大多数文件都有这三列,而很少有文件没有ColC。这将产生如下错误(可以理解(:

ValueError: Usecols do not match columns, columns expected but not found: ['ColC']

如何在保持columns列表不变的情况下将nan放入ColC

如果不需要使用"usecols",可以使用.rendex((而不是像那样使用"usecols"来实现这一点

columns = ['Col1','ColB','ColC']
df = pd.read_csv('test.csv').reindex(columns=columns)

输出

Col1  ColB  ColC
0    A    80   NaN
1    B    75   NaN
2    C    70   NaN
3    D    65   NaN

这里有一个预先检查列的替代方案:

# (...)
for filename in all_files:
# Check available columns first
cols = pd.read_csv(filename, engine='python', nrows=0, header=0).columns
fields_ = cols.intersection(fields)
missed = [i for i in fields if i not in cols]

with pd.read_csv(
filename,
engine="python",
iterator=True,
chunksize=10_000,
header=0,
usecols = fields_  # Use the "dynamic" one
) as reader:
for df in reader:
# Manually append missed cols
if missed:
for col in missed:
df[col] = np.nan
# Make sure the order is kept
df = df[fields]
# (proceed...)
filename = os.path.basename(filename)
df["username"] = filename
# (...)

更改了示例列列表,使其缺少多个列。但在不更改解决方案中的的情况下,使用增强的示例文件。

import pandas as pd
import re
columns = ['Col1','ColB','ColC','ColD']
try:
df = pd.read_csv('test.csv',usecols=columns)
except ValueError as e:
if 'Usecols' not in str(e): raise e
missing = re.findall(r"'(.*?)'", str(e))
df = pd.read_csv('test.csv', usecols=set(columns) - set(missing))
df[missing] = np.nan
df

输出

Col1  ColB  ColC  ColD
0    A    80   NaN   NaN
1    B    75   NaN   NaN
2    C    70   NaN   NaN
3    D    65   NaN   NaN

创建具有额外和缺失列的示例csv文件

import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'B', 'C', 'D'],
'ColB': [80, 75, 70, 65],
'ColE': [100, 20, 1, 23]})
df.to_csv('test.csv', index=False)

test.csv

Col1,ColB,ColE
A,80,100
B,75,20
C,70,1
D,65,23

最新更新