There is a CSV file in a GitHub repository, and the developers read it with the read_csv function, as shown below:
import pandas as pd

ref = pd.read_csv('/path/to/file.csv').fillna(0)
for _, line in ref.iterrows():
    # print(line)
    pid = line['pid']
    cls = line.cls
    scan_id = line.scan_id
    is_seg = line.is_seg
I tried to do the same thing, but I get the following error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 103, saw 2
I think the cause is that the file actually contains HTML: that is what I see when I open it in Visual Studio Code.
Beginning of the file:
<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">
<head>
<div class="markdown-body">
<table class="js-csv-data csv-data js-file-line-container">
<thead>
<tr id="LC1" class="js-file-line">
<td id="L1" class="blob-num js-line-number" data-line-number="1"></td>
<th>cls</th>
<th>pid</th>
<th>scan_id</th>
<th>is_seg</th>
<th>is_repeat</th>
<th>repeat_ids</th>
<th>re_order</th>
</tr>
</thead>
</div>
What is the correct way to read files like this?
That is definitely not a conventional CSV file. See https://en.wikipedia.org/wiki/Comma-separated_values
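A likely explanation, though only a guess from the markup shown in the question, is that the saved file is GitHub's rendered page for the CSV rather than the raw data. The sketch below shows one way to check whether a file's contents start like an HTML page, and how a github.com page URL might be rewritten to its raw.githubusercontent.com form. The helper names looks_like_html and github_blob_to_raw, and the example URL, are hypothetical and not taken from the question.

```python
def looks_like_html(text: str) -> bool:
    """Return True if the text starts like an HTML page rather than CSV data."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com '/blob/' page URL to its raw.githubusercontent.com form.

    URLs that do not match the expected pattern are returned unchanged.
    """
    prefix = "https://github.com/"
    if url.startswith(prefix) and "/blob/" in url:
        owner_repo, rest = url[len(prefix):].split("/blob/", 1)
        return f"https://raw.githubusercontent.com/{owner_repo}/{rest}"
    return url


# Hypothetical URL, for illustration only.
page_url = "https://github.com/someuser/somerepo/blob/main/data/file.csv"
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

If the check returns True for your file, downloading the raw URL again and passing that file to pd.read_csv may be the simplest fix.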
Why not try reading it as an HTML table with pandas.read_html? See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
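Here is a minimal sketch of that suggestion. It assumes the table looks like the header in the question; the data rows are invented for illustration and are not from the real file. pandas.read_html returns a list of DataFrames, one per table found, and needs an HTML parser such as lxml (or bs4 with html5lib) to be installed.

```python
from io import StringIO

import pandas as pd

# Invented sample shaped like the header in the question; the values are made up.
html = """
<table class="js-csv-data csv-data js-file-line-container">
  <thead>
    <tr><td></td><th>cls</th><th>pid</th><th>scan_id</th><th>is_seg</th></tr>
  </thead>
  <tbody>
    <tr><td></td><td>1</td><td>100</td><td>7</td><td>0</td></tr>
    <tr><td></td><td>2</td><td>101</td><td>8</td><td>1</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table>; take the first one.
tables = pd.read_html(StringIO(html))
ref = tables[0]

# Drop columns that are entirely empty (such as a blank line-number column),
# then fill remaining gaps with 0 as in the original snippet.
ref = ref.dropna(axis="columns", how="all").fillna(0)

for _, line in ref.iterrows():
    print(line["pid"], line["cls"], line["scan_id"], line["is_seg"])
```

For the real file, passing the path of the saved page to pd.read_html instead of the StringIO object should work the same way, assuming the page contains the table shown in the question.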