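How to read a CSV into pandas that uses quote delimiters but also contains unescaped left and right double quotes that need to be ignored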

I have some CSV files I need to read in. Some rows read in fine, but others look like this:

2022-10-04, "", some data in col3, moredata, "data in quotes, like the size for this thing is 23'' x 28\" with this description, moredata
2022-10-05, "", some data in col3, moredata, "data in quotes, like the size for this thing is 23“ x 28\" with this description, moredata

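The problem I can't solve is this: it is a CSV, so the comma is the delimiter, and it uses double quotes around values that contain commas which should not be read as delimiters. Fine, I found how to account for that in the pandas read_csv options.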

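However, in some of the quote-delimited fields, where there are measurements in inches, they use all 4 of the following:

Escaped double quotes: \"

Double single quotes: ''

and left or right double quotes, such as an unescaped “, which I think may be getting misread as quote delimiters, and I don't know how to ignore them.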

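I can't figure out how to get the CSV to read correctly in pandas or by any other method. Many rows of data use these left and right double quotes without escaping them, so if a row looks like: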

something, "one value with 23'', 25\", 20“, ...", val 3, val_4

it has 4 values, and the quoted field should be read in as 1 value.

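But every option I have tried either skips these rows, reads them into the wrong columns, or just raises an error and fails to read the data into a DataFrame.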
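For reference, the four-value expectation above can be checked outside pandas. This is only a sketch using Python's standard csv module. It assumes the inch mark after 25 is written as \" in the raw file, which may not hold for every row. The curly quote “ (U+201C) is a different character from the ASCII quote, so the parser should leave it alone.

```python
import csv
import io

# Hypothetical sample line; assumes the inch mark after 25 is stored as \" in the file.
line = r"""something, "one value with 23'', 25\", 20“, ...", val 3, val_4"""

# escapechar: a backslash makes the following character literal.
# skipinitialspace: ignore the space after each comma so that the
# opening quote is recognised as the start of a quoted field.
reader = csv.reader(io.StringIO(line), escapechar="\\", skipinitialspace=True)
fields = next(reader)

print(len(fields))
print(fields[1])
```

pandas' read_csv accepts parameters with the same names (escapechar, skipinitialspace), though its parser is not guaranteed to behave identically in every edge case.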

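Edit: as requested by BeRT2me, here is a better example of a row from the CSV with "actual" data. (I can't provide any "actual" values, so I have put in fake data in the same format.)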

Header: start_date,end_date,product_code,available,category_rank,brand,name,category,price

Data row in the CSV:

2022-10-05,2022-10-10,3716372837,1.0,"",brand1,"Puzzle map of the world, 300 pieces, 23” x 15\", great for all ages",Games,39.99

Given test.txt:

start_date, end_date, product_code, available, category_rank, brand,name,category,price
2022-10-05,2022-10-10,3716372837,1.0,"",brand1,"Puzzle, 300'' p, 23” x 15\", great",Games,39.99

Doing:

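import pandas as pd
# A backslash escapes the next character, so \" inside a quoted field is a literal quote.
df = pd.read_csv('test.txt', escapechar='\\')
print(df)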

Output:

start_date    end_date   product_code   available   category_rank   brand                               name category  price
0  2022-10-05  2022-10-10     3716372837         1.0             NaN  brand1  Puzzle, 300'' p, 23” x 15", great    Games  39.99
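To see why the escape character matters, the same row can be parsed with and without it. The sketch below uses the standard library csv module rather than pandas, so it illustrates the parsing rule and is not a drop-in replacement for read_csv. The parse helper is a hypothetical name introduced here, and the data assumes the inch mark after 15 is stored as \" in the file.

```python
import csv
import io

# Same fake row as in test.txt above; assumes the inch mark after 15 is stored as \".
data = r"""start_date,end_date,product_code,available,category_rank,brand,name,category,price
2022-10-05,2022-10-10,3716372837,1.0,"",brand1,"Puzzle, 300'' p, 23” x 15\", great",Games,39.99
"""


def parse(text, **kwargs):
    """Return the data row (the second line) parsed by csv.reader."""
    rows = list(csv.reader(io.StringIO(text), **kwargs))
    return rows[1]


plain = parse(data)                     # no escape character: \" ends the quoted field early
escaped = parse(data, escapechar="\\")  # backslash makes the following quote literal

print(len(plain), len(escaped))
print(escaped[6])
```

Without an escape character the name field is cut at the inch mark and the row gains an extra column. With it, the row keeps the nine columns of the header. The curly quote ” is never treated as a quote character in either case.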
