How to read .data format data with Python



I downloaded a dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/. As you can see, it is in .data format. How can I read it into a pandas DataFrame in Python?

I tried this, but it does not work:

with open("arrhythmia.data", "r") as f:
    arryth_df = pd.DataFrame(f.read())

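It raises ValueError: DataFrame constructor not properly called!

You can pass the file's URL straight to read_csv, because this .data file is in CSV format. It has no header row, so add header=None: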

import pandas as pd

# Optional: show all columns when printing
pd.options.display.max_columns = None
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None)
print(df.head())
0    1    2    3    4    5    6    7    8    9   10   11  12  13  14   15   
0   75    0  190   80   91  193  371  174  121  -16  13   64  -2   ?  63    0   
1   56    1  165   64   81  174  401  149   39   25  37  -17  31   ?  53    0   
2   54    0  172   95  138  163  386  185  102   96  34   70  66  23  75    0   
3   55    0  175   94  100  202  380  179  143   28  11   -5  20   ?  71    0   
4   75    0  190   80   88  181  360  177  103  -16  13   61   3   ?   ?    0   
16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   
0   52   44    0    0   32    0    0    0    0    0    0    0   44   20   36   
1   48    0    0    0   24    0    0    0    0    0    0    0   64    0    0   
2   40   80    0    0   24    0    0    0    0    0    0   20   56   52    0   
3   72   20    0    0   48    0    0    0    0    0    0    0   64   36    0   
4   48   40    0    0   28    0    0    0    0    0    0    0   40   24    0   
...
...
...

If you also want the ? placeholders converted to missing values (NaN), add the na_values='?' parameter:

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data'
df = pd.read_csv(url, header=None, na_values='?')
print (df.head())
0    1    2    3    4    5    6    7    8    9     10    11    12    13   
0   75    0  190   80   91  193  371  174  121  -16  13.0  64.0  -2.0   NaN   
1   56    1  165   64   81  174  401  149   39   25  37.0 -17.0  31.0   NaN   
2   54    0  172   95  138  163  386  185  102   96  34.0  70.0  66.0  23.0   
3   55    0  175   94  100  202  380  179  143   28  11.0  -5.0  20.0   NaN   
4   75    0  190   80   88  181  360  177  103  -16  13.0  61.0   3.0   NaN   
14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   
0  63.0    0   52   44    0    0   32    0    0    0    0    0    0    0   44   
1  53.0    0   48    0    0    0   24    0    0    0    0    0    0    0   64   
2  75.0    0   40   80    0    0   24    0    0    0    0    0    0   20   56   
3  71.0    0   72   20    0    0   48    0    0    0    0    0    0    0   64   
4   NaN    0   48   40    0    0   28    0    0    0    0    0    0    0   40  
...
...
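To see the effect of na_values without downloading anything, here is a minimal sketch on a made-up sample. The three rows in `sample` are hypothetical and only mimic the file's shape (comma-separated, no header row, `?` for a missing value). They are not taken from the real dataset.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row sample in the same shape as the .data file:
# comma-separated, no header row, '?' marking a missing value.
sample = StringIO("75,0,190,?\n56,1,165,64\n54,0,?,95\n")

df = pd.read_csv(sample, header=None, na_values="?")

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Columns that contain NaN are read as float, the others stay integer.
print(df.dtypes)
```

Columns that contained a ? come back as floating-point columns, which matches the 13.0 and 64.0 values in the output above.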

Alternatively, read the local file through StringIO, like this:

from io import StringIO
import pandas as pd

with open("arrhythmia.data", "r") as f:
    data = StringIO(f.read())

# The file has no header row, so pass header=None
arryth_df = pd.read_csv(data, header=None)
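For a file that is already on disk, read_csv also accepts the path itself, so opening the file and wrapping it in StringIO is optional. The sketch below assumes a small stand-in file. The name `sample.data` and its two rows are hypothetical and are written to a temporary directory only so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a downloaded arrhythmia.data file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.data")
with open(path, "w") as f:
    f.write("75,0,190,?\n56,1,165,64\n")

# read_csv takes the path directly: no header row, '?' means missing.
arryth_df = pd.read_csv(path, header=None, na_values="?")
print(arryth_df.shape)
```

With the real file, replacing `path` with "arrhythmia.data" should be enough, provided the file is in the working directory.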
