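I have Pandas v0.24+ and I'm looking closely at: "Keeping array type as integer while having a NaN value".
When I try to read an integer column that contains NaN values, I get the usual ValueError:
Pandas: ValueError: Integer column has NA values in column 33
This happens because the plain integer type can't hold NA values. The problem is that I don't actually know the data types of the csv in advance; I still want pandas to "infer" what they are. Is there a way to do this while defaulting to Int64 rather than int64, so that it doesn't stop partway through and complain about NA values?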
EDIT: The call is just
df = pd.read_csv(file)
and then
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/christopherturnbull/DATA_SCIENCE/PointTopic/access_test_v3.py", line 18, in <module>
df = mdb.read_table(rdb_file,'v31a_v8_oct20_point_topic_availability_deliverable_201118')
File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas_access/__init__.py", line 127, in read_table
return pd.read_csv(proc.stdout, *args, **kwargs)
File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 460, in _read
data = parser.read(nrows)
File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 1198, in read
ret = self._engine.read(nrows)
File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 2157, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1104, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1198, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Integer column has NA values in column 33
But df = pd.read_csv(file, header = None)
seems to work, although now I don't get the data types.
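As far as I know, you need to specify the dtype when reading the csv. The pandas 0.24 documentation on nullable integers (the passage has since been removed from the stable version) says the following:
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series.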
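To make the quoted passage concrete, here is a minimal sketch of passing the nullable dtype explicitly. The sample data and the names csv_text, inferred, explicit and arr are made up for illustration; the string alias "Int64" (capital I) names the same dtype as pd.Int64Dtype().

```python
import io

import pandas as pd

# Hypothetical sample data: the second row has no value for "col".
csv_text = "val,col\nhello,1\nworld,\n"

# Without an explicit dtype, the missing value makes pandas infer float64.
inferred = pd.read_csv(io.StringIO(csv_text))

# Passing the nullable integer dtype explicitly keeps "col" integral.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"col": "Int64"})

# The same dtype can be passed directly to pd.array() (or pd.Series).
arr = pd.array([1, None], dtype="Int64")

print(inferred["col"].dtype)
print(explicit["col"].dtype)
print(arr.dtype)
```

This only helps when you already know which columns are integers, which is exactly the limitation the question runs into.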
As an alternative, you can let read_csv infer the types and then call convert_dtypes:
import pandas as pd
import io
s = """val,col
hello,1
world,nan"""
df = pd.read_csv(io.StringIO(s))
res = df.convert_dtypes()
print(res.dtypes)
Output
val    string
col     Int64
dtype: object
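The documentation of convert_dtypes states:
convert_integer : bool, default True. Whether, if possible, conversion can be done to integer extension types.
Note that in the example above, the original dtype of the column was float: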
print(df.dtypes)
Output (for the df produced by read_csv)
val     object
col    float64
dtype: object
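UPDATE
It looks like something is throwing off the inference engine, but since the problem is in column 33, you can specify the dtype for that column. Try: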
df = pd.read_csv(file, dtype={33: pd.Int64Dtype()})
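The error message itself can be reproduced on a small made-up sample, which may help confirm the diagnosis. The sketch below assumes that the failing column was requested as plain int64 somewhere along the call chain; that is an assumption, since the traceback does not show the keyword arguments that reached read_csv. The names csv_text, error_message and fixed are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample data: "col" is integer-like but has one missing value.
csv_text = "val,col\nhello,1\nworld,\n"

# Asking for a plain (non-nullable) int64 column fails on the missing value.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"col": "int64"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Asking for the nullable Int64 extension dtype reads the same data fine.
fixed = pd.read_csv(io.StringIO(csv_text), dtype={"col": pd.Int64Dtype()})

print(error_message)
print(fixed.dtypes)
```

This sketch keys the dtype dict by column name; if your file has a header, using the name of the offending column is the less ambiguous choice.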
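The reason that
df = pd.read_csv(file, header=None)
works is that it turns the header row into ordinary column values. Because those values are strings, every column is interpreted as dtype object, as in: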
import pandas as pd
import io
s = """val,col,bad
hello,1,1.5
world,,2.3"""
df = pd.read_csv(io.StringIO(s), header=None)
print(df)
Output
       0    1    2
0    val  col  bad
1  hello    1  1.5
2  world  NaN  2.3
As you can see, the header ends up as the first row of values.
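To see the effect on the dtypes rather than on the values, you can compare the two calls side by side. This is a small sketch on the same made-up sample; with_header and no_header are hypothetical names.

```python
import io

import pandas as pd

# Same hypothetical sample as above, with a header row.
csv_text = "val,col,bad\nhello,1,1.5\nworld,,2.3"

# Default: the first line is the header, so numeric columns are inferred.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the header line becomes data, so no column is numeric.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.dtypes)
print(no_header.dtypes)
```

So header=None avoids the error only because nothing is parsed as a number any more, which is why the data types are lost.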