更好的解决方案是处理从相关数据库导出的数据帧对象的丢失数据

几天前，我发布了一个关于"如何使pandas HDFStore‘put’操作更快"的问题，感谢Jeff的回答，我找到了一种更有效的方法来从数据库中提取数据并将其存储到hdf5文件中。

但通过这种方式，我必须根据每一列的类型填充每一列缺失的数据，并对每一个表进行这些工作(在大多数情况下，这项工作是重复的)。否则，当我将数据帧放入hdf5文件时，数据帧中的None对象将导致性能问题。

有更好的方法做这项工作吗？

我刚刚读到这个问题"ENH:sql提供NaN/NaT转换">

NaT是否适用于其他类型？(日期时间64除外)
在将数据帧存储到hdf5文件中时，我可以使用它来替换数据帧中的所有None对象而不担心性能问题吗

更新1

pd版本：0.10.1
我现在正在使用np.nan来填充缺失的数据。但我遇到了两个问题。
- 同时具有np.nan和datetime.datetime对象的列无法转换为"datetime64[ns]"类型，并且在将它们放入hdfstore时会引发Exceptions

In[155]：len(df_bugs.lastdiffed[df_bugs.lastdiffed.isnull()])输出[155]:150在[156]中：len(df_bugs.lastdiffed)输出[156]：1003387在[158]中：df_bugs.lastdiffed.astype(df_bugs.creation_ts.dtype)---------------------------------------------------------------------------ValueError Traceback(最近一次调用最后一次)在()---->1 df_bugs.lastdiffed.astype(df_bugs.creation_ts.dtype)/astype中的usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/core/series.pyc(self，dtype)777参见numpy.ndarray.astype778"-->779 casted=com._astype_nansafe(self.value，dtype)780返回自我_构造函数(casted，index=self.index，name=self.name)781/_astype_nansafe中的usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/core/common.pyc(arr，dtype)1047 elif arr.dtype==np.object_和np.issubdtype(dtype.type，np.integer)：1048#关于NumPy brokennes的工作，#1987->1049返回lib.astype_intsafe(arr.rave()，dtype).reform(arr.shape)10501051返回arr.astype(数据类型)/pandas.lib.astype_intsafe中的usr/local/lib/python2.6/dist-packages/pandas-0.110.1-py2.6-linux-x86_64.egg/pands/lib.so(pandas/lib.c:111886)()/util.set_value_at中的usr/local/lib/python2.6/dist-packages/pandas-0.110.1-py2.6-linux-x86_64.egg/pands/lib.so(pandas/lib.c:44436)()ValueError:必须是datetime.date或datetime.datetime对象#df_bugs_sample1=df_bugs.ix[：10000]在[147]中：%prun store.put('df_bugs_sample1'，df_bugs_ample1，table=True)/usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pands/io/pytables.pyc in put(self，key，value，table，append，**kwargs)456表格457"-->458自我_write_to_group(键，值，table=table，append=append，**kwargs)459460 def remove(self，key，where=None，start=None、stop=None)：/_write_to_group中的usr/local/lib/python2.6/dist-packages/pandas-0.110.1-py2.6-linux-x86_64.egg/pands/io/pytables.pyc(self、key、value、index、table、append、complb、**kwargs)786引发ValueError("非表上不支持压缩")787-->788 s.write(obj=值，append=追加，complb=complb，**kwargs)789如果s.is_table和索引：790 s.create_index(列=索引)/usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pands/io/pytables.pyc写入(self、obj、axes、append、complb、complevel、fletcher32、min_itemsize、chunksize、expectedrows、**kwargs)2489#创建轴2490 self.create_axes(axes=axes，obj=obj，validate=append，->2491 min_itemsize=min_itemsize，**kwargs)24922493如果不是self.is_exists：/在create_axes(self、axes、obj、validate、nan_rep、data_columns、min_itemsize、**kwargs)中的/usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/标准/io/pytables.pyc2252升高2253除外(例外)，详细信息：->2254引发异常("找不到正确的原子类型->[dtype->%s，items->%s]%s"%(b.dtype.name，b.items，str(detail))2255 j+=12256异常：找不到正确的原子类型->[dtype->object，items->Index([bug_file_loc，bug_verity，bug_status，cf_branch，cf_bug_source，cf_eta，cf_public_severity，cf_public_summary，cf_regression，cf_reported_by，cf_type，guest_op_sys，host_op_sys、keywords，lastdiffed，priority，rep_platform，resolution，short_desc，status_ebracket，target_milestone]，dtype=object)]类型的对象"datetime.datetime"没有len()

而另一个df似乎无法被完全放入数据帧中，如下面的示例所示，条目数量为13742515，但在我将数据帧放入hdfstore并将其取出后，条目数量变为1041998

In[123]：df_bugs_activity输出[123]：Int64Index:13742515个条目，0到13742514数据列：添加了13111366个非零值attach_id 1041998非空值bug_id 13742515非null值错误_当13742515个非空值fieldid 13742515非空值id 13742515非空值删除了13612258个非空值who 13742515个非零值数据类型：datetime64[ns](1)，float64(1)、int64(4)、object(2)在[121]中：%time store.put('df_bugs_activity2'，df_bugs_active，table=True)CPU时间：用户35.31秒，系统4.23秒，总计39.54秒壁时间：39.65秒在[122]中：%time store.get('df_bugs_activity2')CPU时间：用户7.56秒，系统0.26秒，总计7.82秒壁时间：7.84秒输出[122]：Int64索引：1041998个条目，2012年至13354656数据列：添加了1041981个非空值attach_id 1041998非空值bug_id 1041998非null值错误_当1041998非空值fieldid 1041998非空值id 1041998非空值删除了1041991个非null值who 1041998非零值dtypes:datetime64[ns](1)，float64(1)、int64(4)、object(2)

更新2

用于创建数据帧的代码：

def grab_data(table_name，size_of_page=20000)："从数据库表获取数据size_of_page：sql的limit子类的第二个参数"cur.execute('select count(*)from%s'%table_name)records_number=当前fetchone()[0]loop_number=记录数/大小_页码+1打印'*****\n开始抓取%s\n*****\nrecords_number:%s\nloop_number:%s'%(表名、记录s_number、循环_number)起始位置=0df=DataFrame()#警告：这个数据帧对象将包含一个表的所有记录，所以要小心MEMORY用法！对于范围(0，loop_number)中的i：sql_export="从%s限制%s中选择*，%s"%(表名、起始位置、大小_of_page)df=df.append(psql.read_frame(sql_export，conn)，verify_filterity=False，ignore_index=True)起始位置+=页面大小打印'start_position:%s'%start_position返回dfdf_bugs=grab_data('bugs')df_bugs=df_bugs.fillna(np.nan)df_bugs=df_bugs.convert_objects()

df_bugs的结构：

Int64Index:10003387个条目，0到1003386数据列：别名0非空值assigned_to 1003387个非空值bug_fileloc 498160非null值bug_id 1003387非null值bug_verity 1003387非null值bug_status 1003387非null值category_id 1003387非空值cclist_accessible 1003387非空值cf_attempted 102160个非空值cf_branch 691834非空值cf_bug_source 1003387非空值cf_build 357920非null值cf_change 324933非空值cf_doc_impact 1003387非空值cf_eta 7223非空值cf_failed 102123非空值cf_i18n_impact 1003387非空值cf_on_hold 1003387非零值cf_public_severity 1003387非空值cf_public_summary 587944非空值cf_regression 1003387非空值cf_reported_by 1003387非空值cf_reviewer 1003387非空值cf_security 1003387非空值cf_test_id 13475个非空值cf_type 1003387非空值cf_viss 1423非空值component_id 1003387非空值creation_ts 1003387非空值截止日期0非空值delta_ts 1003387非空值估计时间1003387个非零值everconfirmed 1003387个非空值found_in_phase_id 1003387非空值found_in_product_id 1003387非空值found_in_version_id 1003387非空值guest_op_sys 1003387非空值host_op_sys 1003387非空值关键字1003387非空值lastdiffed 1003237非空值优先级1003387非空值product_id 1003387非空值qa_contact 1003387非空值remaining_time 1003387个非空值rep_platform 1003387非空值报告器1003387非空值reporter_accessible 1003387个非空值决议1003387非零值short_desc 1003387非空值状态_白板1003387非空值target_milestone 1003387非空值投票1003387个非空值dtypes:datetime64[ns](2)，float64(10)，int64(19)，object(21)

更新3

写入csv并从csv读取：

In[184]：df_bugs.to_csv('df_bugs.sv')在[185]中：df_bugs_from_scv=pd.read_csv('df_bugs.sv')在[186]中：df_bugs_from_scv输出[186]：Int64Index:1003387个条目，0到1003386数据列：未命名：0 1003387非null值别名0非空值assigned_to 1003387个非空值错误文件日志0非null值bug_id 1003387非null值bug_verity 1003387非null值bug_status 1003387非null值category_id 1003387非空值cclist_accessible 1003387非空值cf_attempted 102160个非空值cf_branch 345133非空值cf_bug_source 1003387非空值cf_build 357920非null值cf_change 324933非空值cf_doc_impact 1003387非空值cf_eta 7223非空值cf_failed 102123非空值cf_i18n_impact 1003387非空值cf_on_hold 1003387非零值cf_public_severity 1003387非空值cf_public_summary 588个非空值cf_regression 1003387非空值cf_reported_by 1003387非空值cf_reviewer 1003387非空值cf_security 1003387非空值cf_test_id 13475个非空值cf_type 1003387非空值cf_viss 1423非空值component_id 1003387非空值creation_ts 1003387非空值截止日期0非空值delta_ts 1003387非空值估计时间1003387个非零值everconfirmed 1003387个非空值found_in_phase_id 1003387非空值found_in_product_id 1003387非空值found_in_version_id 1003387非空值guest_op_sys 805088非空值host_op_sys 806344非空值关键字532941非空值lastdiffed 1003237非空值优先级1003387非空值product_id 1003387非空值qa_contact 1003387非空值remaining_time 1003387个非空值rep_platform 424213非空值报告器1003387非空值reporter_accessible 1003387个非空值决议922282非零值short_desc 1003287非空值状态_白板0非空值target_milestone 423276非空值投票1003387个非空值dtypes:float64(12)，int64(20)，object(21)

我自己回答，感谢jeff的帮助。

首先，更新1中的第二个问题("df似乎无法完全放入数据帧")已经修复。

而且，我遇到的最大问题是处理那些同时包含python的datetimeobj和Noneobj的列。幸运的是，由于0.11-dev，熊猫提供了一种更方便的方式。我在我的项目中使用了下面的代码，我为一些行添加了注释，希望它能帮助其他人：)

cur.execute('select * from table_name')
result = cur.fetchall()
# For details: http://www.python.org/dev/peps/pep-0249/#description
db_description = cur.description
columns = [col_desc[0] for col_desc in db_description]
# As the pandas' doc said, `coerce_float`: Attempt to convert values to non-string, non-numeric objects (like decimal.Decimal) to floating point
df = DataFrame(result, columns=columns, coerce_float=True)
# dealing the missing data
for column_name in df.columns:
# Currently, calling function `fillna(np.nan) on a `datetime64[ns]` column will cause an exception
if df[column_name].dtype.str != '<M8[ns]':
df[column_name].fillna(np.nan)
# convert the type of columns which both have np.nan and datetime obj from 'object' to 'datetime64[ns]'(short as'<M8[ns]')
# find the table columns whose type is Date or Datetime
column_name_type_tuple = [column[:2] for column in db_description if column[1] in (10, 12)]
# check whose type is 'object'
columns_need_conv = [column_name for column_name, column_type in column_name_type_tuple if str(df[column_name].dtype) == 'object']
# do the type converting
for column_name in columns_need_conv:
df[column_name] = Series(df[column_name].values, dtype='M8[ns]')
df = df.convert_objects()

之后，df应该适合存储在h5文件中，不再需要"pickle"。

ps:

一些配置文件：
complib:'lzo'，complevel:1
表1，7810561条记录，包含2个int cols和1个datetime cols，投放操作成本49s

表2，1008794条记录，4个日期时间列，4个浮动64列，19个int列，24个对象(字符串)列，放入操作成本170s

更新1

更新2

更新3

相关内容

最新更新

热门标签：