Reading a data file with columns separated by uneven whitespace using pandas



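I have a data file whose columns are separated by varying amounts of whitespace, and missing values are also represented by whitespace. Here is an excerpt from the file: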

52.75655753231478     34.68953070087977  221090730122510976 156.51972931061138       -17.572643739467043     1.3798599125392499 0.03571818775179444     -1.6014718579307516 0.056195796241959035    -1.2045525174577807 0.04408155757633151                                                    -0.39499593     0.58434874   -0.104220375    -0.09171637     -0.4853293    0.058708116     0.19899127    -0.04585829    -0.14105034      0.0810716  142 13.05808       0.9137573 0.8     AAfrt55          
52.73026256069882    34.341460673892975  221065132117493376 156.72043573419933        -17.86431195452588     1.6550088566259955 0.08398022438361424     -2.4547997015391947 0.07126912301702777     -1.9981359966222954 0.056926420123759786                                                   -0.18282509       0.857106     0.03257091     0.30434817    -0.22865324    0.036517035   -0.045696758     0.21635033     0.31341392     0.17989494  145 14.77572        1.280653 0.6     AAfrt55          
52.68434549160837     34.00359963856372  220988819138630016 156.90150379516513        -18.15747083558349     1.5876225837952878 0.0701188913290275      -2.4370594533812264 0.08793195992635172     -0.9494780100116795 0.0673739642789701                                                     -0.22557606      0.6524007    -0.18271385     0.12871169    -0.35870972     0.19210297     0.21133502     0.08772008  -0.0059730127      0.2099215  179 10.883165      0.7241144 0.1     AAfrt55          
52.676106438399984     34.73368685714245  221094367958697088 156.4364320462988         -17.57682950993822     1.5585619924217913 0.12270341264787946     -1.5320253890594473 0.20488197077169218     -1.5579812500350152 0.15149103725242027                                                    -0.38029686      0.5423903    0.039901607   -0.089358106    -0.47867253    -0.01811142     0.27065852   0.0061952006    -0.08777855   -0.018889245  140 16.879848      1.8379726 0.7     AAfrt55          
52.71209959964432    34.578003905054665  221083888238423552 156.55904659440634       -17.683874357669215      1.417315200916427 0.2902256449155021      -1.4031086014124174 0.4445107873794677      -1.2418716061247421 0.32415010404829614                                                    -0.32437858      0.5030531    0.007634173   -0.024154684    -0.61248195    0.007697445     0.16414167      0.0348852   -0.035160106      0.2062511  147 17.979982        2.28545 0.6     AAfrt55          
52.976769045904504      34.2043690118208  221013214552826112 156.97805068738012       -17.851469355301145     1.3517371901145452 0.05227677174066813     -1.6998011106704118 0.06393145316119067     -1.5066879492609095 0.05135957095097931                                                     -0.2882013       0.625621    -0.09970278     0.17078118    -0.54429597    0.024621502   0.0029883361     0.18108386     0.14520298     0.29568258  157 14.129068      1.2771263 0.6     AAfrt55          
52.725581800207     34.81203178094687  221096708716972032 156.42146313010704       -17.489732979976864     1.3637027483770132 0.06681005036392006     -1.2969967438574475 0.12798297569657666     -0.8700543685880945 0.07589931821548508                                                    -0.47563803     0.54669774     0.21084175    -0.13976684     -0.3919695    -0.15758438     0.29368496    0.073126756   -0.018135564    -0.23391931  117 15.063394      1.2766619 0.5     AAfrt55          
52.7985530101187     34.36596277011468  221065956751203840 156.75238845448155       -17.810891786683083     1.3354235835420394 0.07108435274492615     -1.5037101757150335 0.08349724817882194     -1.4176519719616552 0.06716461390950008                                                    -0.23177823     0.71539974    -0.14281902     0.16131751      -0.350566     0.20656587     0.13758425     0.04726053     0.02584095     0.20465821  167 12.012547      0.6679964 0.7     AAfrt55          
52.33478707580404     34.07755876601224  221033692956948224 156.61051515401235         -18.2713588192272     1.4506434823877878 0.042935822550938335     -1.711588054551144 0.08021024777572577      -1.428438652911233 0.052394146676305285                                                   -0.46284133     0.39245856    -0.36800838    0.020843003     -0.4325589     0.18059099      0.1403661     0.06952838    -0.13496804   -0.017565649  141 14.258665      1.1050224 1.0     AAfrt55          
53.236890326502746     34.60616842868128  221121379009541760 156.90327214337495       -17.401554596271392      1.670779589883186 0.06462934312116957     -1.3870861271226225 0.11007633727034921     -1.5354788621698447 0.06476830152651208                                                     -0.2876506     0.50366104     0.06364631     0.17200021    -0.43980953    0.101488166     0.13044362    0.098199986     0.07485064    0.014440983  139 15.197788       1.420907 0.5     AAfrt55          
53.36099004421408     35.02299915490862  221182058307431168 156.7245486656545         -17.00787802432042      1.421888821514962 0.0663340915531341      -1.8461295859228135 0.09336911688161233     -1.6670146028961077 0.0714590019457503    -30.271748692621472      10.746389544584678      -0.39191365      0.6037745    -0.32697898     0.29790965    -0.38904026       0.317259    -0.07794135     0.04330315    0.042706273   -0.004154171  146 12.469966     0.95415306 0.7     AAfrt55          
53.52872681724522      35.0477570978552  221159415239971968 156.82317128998952       -16.905034510408548      1.421141411397051 0.05109522675079716      -1.632919678940039 0.06302630418197872     -1.2685055408322787 0.05099118144236511                                                    -0.30175194      0.6874048    -0.17188887    0.065116994    -0.45524448     0.13649681     0.10539054    0.019577736   -0.022071656     0.24296543  155 14.162856      1.0796404 0.9     AAfrt55          
53.54751972550367     34.96103948561606  221156696524334848 156.89093588909614        -16.96465396429354      1.294728492354405 0.18934469304374682     -1.1449588491523415 0.24877251454435825      -0.583647366938886 0.200549970304947                                                      -0.37570783      0.5953056    -0.19945645      0.1845328     -0.5173996      0.1606674     0.08281563     0.10032614     0.04386764     0.13707627  157 17.58886       2.0971222 0.2     AAfrt55          
52.885088571817256    34.989840585617124  221193431380869120 156.41969263893048       -17.268992175403003     1.4476557809261963 0.07281647540924233     -1.9124556059186422 0.12004352922185996     -1.4703545228732189 0.0790687307010274                                                     -0.31885538     0.37405393     -0.4350186     0.06781138    -0.48610708     0.19664243     0.18642318    0.121886626    -0.13234825     0.15937497  157 15.75981       1.5027304 0.7     AAfrt55          
53.54336695375678     35.54122701584878  221234525628153216 156.52106112078582       -16.505126454354887     1.2002583233942294 0.06023879613593315     -1.8575899983935311 0.08916397207630337     -1.4987391120861537 0.05451125116477068                                                     -0.2842391      0.7054797    0.033138227     -0.0997483    -0.32082146    -0.14847293     0.06646432    0.055499673    0.012825485    0.025693728  148 14.743197      1.1996746 0.2     AAfrt55          
52.89661447038126     35.01501845115683  221193637539294720 156.41180216188386       -17.243179777505176     1.4365003873992706 0.04152001288165847     -1.5960560447477898 0.06942606759430278     -1.2600472896481731 0.04452615169143268                                                    -0.31485322     0.41341513    -0.44437912    0.016873619    -0.45052707     0.16905014     0.14041857     0.11158022   -0.111745454      0.1811417  167 14.112262      1.0594835 0.9     AAfrt55          

I also have the start and end byte of each column:

1- 21
23- 43
45- 63
65- 85
87-108
110-131
133-152
154-175
177-196
198-219
221-240
242-265
267-287
289-302
304-317
319-332
334-347
349-362
364-377
379-392
394-407
409-422
424-437
439-442
444-453
455-467
469-475
477-493

How can I use this information to read the file with pandas?

Thanks to Quang Hoang for suggesting the pd.read_fwf function, which I did not know about.

It is as simple as:

import pandas as pd
data = pd.read_fwf(data_file)

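The function can even detect the column boundaries on its own; by default it infers them from the first 100 rows (the infer_nrows parameter). Otherwise, the column positions can be passed explicitly through colspecs: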

# colspecs takes 0-based, half-open (start, end) intervals, so each byte range
# listed above (assumed to be 1-based and inclusive) becomes (start - 1, end).
data = pd.read_fwf(data_file,
    colspecs=[(0, 21), (22, 43), (44, 63), (64, 85), (86, 108), (109, 131),
              (132, 152), (153, 175), (176, 196), (197, 219), (220, 240), (241, 265),
              (266, 287), (288, 302), (303, 317), (318, 332), (333, 347), (348, 362),
              (363, 377), (378, 392), (393, 407), (408, 422), (423, 437), (438, 442),
              (443, 453), (454, 467), (468, 475), (476, 493)])
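
Typing 28 tuples by hand is error-prone, so the byte list can also be converted programmatically. The following is only a minimal sketch: it assumes the listed ranges are 1-based and inclusive, while the colspecs argument of pd.read_fwf expects 0-based, half-open intervals. The helper name to_colspecs and the short demo row are made up for illustration; the row is synthetic, padded to match the first three byte ranges, and is not copied from the file.

```python
import re

# First three byte ranges from the question (1-based, both ends inclusive).
# The full list can be pasted here unchanged.
BYTE_RANGES = """
1- 21
23- 43
45- 63
"""


def to_colspecs(text):
    """Turn 'start- end' lines into 0-based, half-open (start, end) tuples."""
    colspecs = []
    for line in text.splitlines():
        match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", line)
        if match is None:
            continue  # skip blank or unrelated lines
        start, end = int(match.group(1)), int(match.group(2))
        # 1-based inclusive [start, end] is 0-based half-open [start - 1, end)
        colspecs.append((start - 1, end))
    return colspecs


colspecs = to_colspecs(BYTE_RANGES)

# Synthetic row laid out to match the three ranges above (hypothetical values).
row = f"{'52.75':<21} {'34.68':<21} {'221090730122510976':>19}"

# Slicing with the converted spans keeps every field intact.
fields = [row[a:b].strip() for a, b in colspecs]

# Using the raw byte numbers as-is drops the first character of each field.
naive = [row[a:b].strip() for a, b in [(1, 21), (23, 43), (45, 63)]]

print(colspecs)
print(fields)
print(naive)
```

The resulting list can then be passed as colspecs=colspecs to pd.read_fwf. If the file has no header row, as the excerpt suggests, passing header=None as well should keep the first data row from being used as column names.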
