Pandas DataFrame read_csv produces hidden characters in values



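I am trying to read the CSVs provided at this link, which are commonly used for building recommender systems. I used BX-Books.csv and BX-Book-Ratings.csv. Here is a sample of BX-Books.csv: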

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"

and of BX-Book-Ratings.csv:

"User-ID";"ISBN";"Book-Rating"
"276725";"034545104X";"0"
"276726";"0155061224";"5"
"276727";"0446520802";"0"

I tried to read both files with the following code:

import pandas as pd
# Note: error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' is the replacement there.
Books = pd.read_csv(r"C:\Users\Yosafat VS\PycharmProjects\Recomendation_KNN\Data\BX-Books.csv", sep=';', error_bad_lines=False, encoding="latin-1")
Ratings = pd.read_csv(r"C:\Users\Yosafat VS\PycharmProjects\Recomendation_KNN\Data\BX-Book-Ratings.csv", sep=';', error_bad_lines=False, encoding="latin-1")

When I inspected the data, I found that some ISBNs in both CSVs were loaded incorrectly (the values do not look like ISBN codes), like this:

count
ISBN
_____________   
0330299891  2
0375404120  2
0586045007  1
9022906116  2
9032803328  1
...     ...
cn113107    1
ooo7156103  1
§423350229  1
´3499128624     1
Ô½crosoft   1

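But when I checked the CSVs themselves, I found no problem with the data, and the ISBNs look correct. However, every value in both CSVs is wrapped in double quotes, and BX-Books.csv is encoded as ANSI rather than UTF-8, while BX-Book-Ratings.csv is encoded as UTF-8.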

As a result, when I try to map the data from BX-Book-Ratings onto BX-Books, I get this error:

KeyError: "None of [Index([' 0330299891', ' 0375404120', ' 9022906116', '*0452281903',\n       '+0451197399', '0 7336 1053 6', '0 907 062 008', '0*708880258',\n       '00000000', '000000000',\n       ...\n       'O77O428452', 'O786001690', 'O805063196', 'O9088446X', 'O971880107',\n       'SBN425037452', 'X000000000', 'XXXXXXXXXX', 'ZR903CX0003',\n       '`3502103682'],\n      dtype='object', name='ISBN', length=135794)] are in the [index]"

even though those keys actually exist in both CSVs.

Sometimes we run into puzzling encoding problems. Most of the time we can ignore them, because the affected rows are rare.

We can open the file with errors='ignore' to handle this.

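import io
import pandas as pd

file = 'BX-Books.csv'
# Drop bytes that cannot be decoded instead of raising an error.
with open(file, errors='ignore') as fr:
    data = fr.read()
# on_bad_lines='skip' replaces error_bad_lines=False in pandas >= 1.3.
df = pd.read_csv(io.StringIO(data), sep=';', on_bad_lines='skip')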
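
To make the trade-off concrete, here is a minimal sketch of what errors='ignore' does to a byte that is not valid UTF-8, compared with decoding the same bytes as latin-1. The sample row and the "Café" title are made up for illustration, and passing dtype=str is my own assumption to keep leading zeros in the ISBN column. It is not part of the code above.

```python
import io
import pandas as pd

# Hypothetical two-line file; \xe9 is "é" in latin-1 but invalid on its own in UTF-8.
raw = b'"ISBN";"Book-Title"\n"0195153448";"Caf\xe9 Mythology"\n'

# errors="ignore" silently drops the undecodable byte.
text_ignore = raw.decode("utf-8", errors="ignore")
# latin-1 maps every byte to a character, so nothing is lost.
text_latin1 = raw.decode("latin-1")

# dtype=str keeps ISBNs as text so leading zeros survive.
df_ignore = pd.read_csv(io.StringIO(text_ignore), sep=";", dtype=str)
df_latin1 = pd.read_csv(io.StringIO(text_latin1), sep=";", dtype=str)

print(df_ignore.loc[0, "Book-Title"])
print(df_latin1.loc[0, "Book-Title"])
print(df_ignore.loc[0, "ISBN"])
```

So errors='ignore' avoids decode errors at the cost of losing characters, which is usually acceptable only when few rows are affected.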

Result:

# BX-Book-Ratings.csv  (1149780, 3) -> filelines 1149781
# BX-Books.csv         (271360, 8)  -> filelines 271380

Columns:

# BX-Book-Ratings.csv
['User-ID', 'ISBN', 'Book-Rating']

# BX-Books.csv 
['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
'Image-URL-S', 'Image-URL-M', 'Image-URL-L']
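
Once both frames are loaded, one way to avoid the KeyError from the question is to check which rating ISBNs actually occur in the books table before looking them up. This is only a sketch: the tiny frames below reuse a few ISBNs from the samples, but the pairing of ratings with books and the leading space are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
books = pd.DataFrame({
    "ISBN": ["0195153448", "0002005018"],
    "Book-Title": ["Classical Mythology", "Clara Callan"],
})
ratings = pd.DataFrame({
    "User-ID": ["276725", "276726", "276727"],
    "ISBN": ["034545104X", "0195153448", " 0002005018"],
    "Book-Rating": ["0", "5", "0"],
})

# Remove stray whitespace around the keys before comparing.
ratings["ISBN"] = ratings["ISBN"].str.strip()

# Boolean mask: True where the rating's ISBN exists in the books table.
known = ratings["ISBN"].isin(books["ISBN"])

# ISBNs that would have triggered a KeyError on a direct lookup.
missing = ratings.loc[~known, "ISBN"].unique()

# Join only the rows whose key is present.
merged = ratings[known].merge(books, on="ISBN", how="left")
print(missing)
print(merged)
```

Inspecting missing first shows whether the mismatches are real gaps in the data or just formatting noise such as spaces.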

Another solution for this particular example:

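import pandas as pd

file = 'BX-Books.csv'
with open(file, errors='ignore') as fr:
    data = fr.read()
# Split into lines, then split each line on the '";"' field separator.
data_list = data.strip().split('\n')
obj = pd.Series(data_list)
obj = obj.str.strip('"')
dfn = obj.str.split('";"', expand=True)
# Use the first row as the header, then drop it from the data.
dfn.columns = dfn.iloc[0]
dfn.drop(0, inplace=True)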

Shape of dfn:

# BX-Books.csv         (271379, 8)  -> filelines 271380(with a header row)
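
Even with every line kept, some ISBN values in the source data are simply malformed (the question lists values such as 'cn113107' and 'ooo7156103'). A rough way to flag them is a format check. The sketch below assumes ISBN-10 style keys (nine digits followed by a digit or X) and does not verify the checksum. The pattern and variable names are my own.

```python
import pandas as pd

# A few values taken from the question's output, plus one valid key ending in X.
isbns = pd.Series(
    ["0330299891", "034545104X", "cn113107", "ooo7156103", "§423350229", "O77O428452"],
    name="ISBN",
)

# Assumed format: nine ASCII digits, then a digit or the letter X.
ISBN10_PATTERN = r"[0-9]{9}[0-9X]"

# fullmatch requires the whole string to fit the pattern.
is_valid = isbns.str.fullmatch(ISBN10_PATTERN)

valid_isbns = isbns[is_valid].tolist()
suspect_isbns = isbns[~is_valid].tolist()
print(valid_isbns)
print(suspect_isbns)
```

Rows flagged as suspect can then be dropped or reviewed by hand before joining the two tables.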
