Reading a CSV file that is actually HTML: "ParserError: Error tokenizing data. C error: Expected 1 fields in line 103, saw 2"



There is a CSV file on GitHub, which the developers read with the read_csv function as shown below:

import pandas as pd
ref = pd.read_csv('/path/to/file.csv').fillna(0)
for _, line in ref.iterrows():
    # print(line)
    pid = line['pid']
    cls = line.cls
    scan_id = line.scan_id
    is_seg = line.is_seg

I tried to do the same thing, but I get the following error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 103, saw 2

I think the reason is that the file is really HTML, which is what I see when I open it in Visual Studio Code.

The beginning of the file:

<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">
<head>
<div class="markdown-body">
<table class="js-csv-data csv-data js-file-line-container">
<thead>
<tr id="LC1" class="js-file-line">
<td id="L1" class="blob-num js-line-number" data-line-number="1"></td>
<th>cls</th>
<th>pid</th>
<th>scan_id</th>
<th>is_seg</th>
<th>is_repeat</th>
<th>repeat_ids</th>
<th>re_order</th>
</tr>
</thead>
</div> 

What is the correct way to read files like this?

This is definitely not a conventional CSV file. See https://en.wikipedia.org/wiki/Comma-separated_values
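To see why read_csv fails with this particular message, here is a minimal sketch. The three-line `html_text` string is a made-up stand-in for the downloaded page, and `try_read_csv` and `looks_like_html` are hypothetical helper names, not part of pandas. read_csv takes the column count from the first line, which contains no comma, so it expects one field per row and fails on the first later line that does contain a comma.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded page: the first comma is on line 3.
html_text = "<!DOCTYPE html>\n<html>\n<p>a, b</p>\n"


def try_read_csv(text):
    """Return the ParserError message if read_csv fails, otherwise None."""
    try:
        pd.read_csv(io.StringIO(text))
    except pd.errors.ParserError as exc:
        return str(exc)
    return None


def looks_like_html(text):
    """Rough heuristic: does the text begin with an HTML doctype or <html> tag?"""
    head = text.lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# The HTML text triggers the tokenizing error; genuine CSV text does not.
message = try_read_csv(html_text)
```

A check like `looks_like_html` can be run on the first few hundred characters of a file before deciding which reader to use. It is only a heuristic and will miss HTML fragments that do not start with a doctype or an `<html>` tag.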

Why not try reading it as an HTML table with pandas.read_html? See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
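A rough sketch of that suggestion follows. `SAMPLE_HTML` is a shortened, invented imitation of the table markup in the question (the data rows and their values are made up), and `read_github_table` is a hypothetical helper name. pandas.read_html needs an HTML parser such as lxml, or BeautifulSoup with html5lib, to be installed. The class names in the question suggest the file is GitHub's rendered page for the CSV, so downloading the file through GitHub's "Raw" link and using read_csv as in the original code is probably the simpler route.

```python
import io

import pandas as pd

# Invented imitation of the markup from the question; the values are made up.
SAMPLE_HTML = """
<table class="js-csv-data csv-data js-file-line-container">
  <thead>
    <tr id="LC1" class="js-file-line">
      <td id="L1" class="blob-num js-line-number" data-line-number="1"></td>
      <th>cls</th>
      <th>pid</th>
    </tr>
  </thead>
  <tbody>
    <tr id="LC2" class="js-file-line">
      <td id="L2" class="blob-num js-line-number" data-line-number="2"></td>
      <td>A</td>
      <td>10</td>
    </tr>
    <tr id="LC3" class="js-file-line">
      <td id="L3" class="blob-num js-line-number" data-line-number="3"></td>
      <td>B</td>
      <td>20</td>
    </tr>
  </tbody>
</table>
"""


def read_github_table(html_text):
    """Parse the first <table> in html_text into a DataFrame.

    The line-number cells are empty, so they produce a column with no
    values; columns that are entirely empty are dropped.
    """
    tables = pd.read_html(io.StringIO(html_text))  # one DataFrame per <table>
    return tables[0].dropna(axis=1, how="all")
```

For a file on disk, the text can be read first and then passed in, for example `read_github_table(open('/path/to/file.csv', encoding='utf-8').read())`. The resulting DataFrame can be used with `fillna(0)` and `iterrows()` just like the one in the question.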