Scraping and decoding a string into a pandas DataFrame



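I am web scraping and would like to end up with a pandas DataFrame of the scraped content. I can get a UTF-8 string that I want to read as a pandas DataFrame, but I'm not sure how to do that, and I'd like to avoid writing it out to a CSV file and reading it back in. How can I do this?

For example: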

string='term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

I am using

fcsv_content=[x.split(',') for x in string.split("\r\n")]

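But this doesn't work, because some fields contain commas inside them. What can I do? Can I change the decoding to work around this? For some context, I am using robobrowser to decode the web page.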

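You can use Python's csv module to read and parse your CSV. It handles things like commas inside quoted strings and knows not to split those fields. Below is a small example using your input string. As you can see in the output, the field protein stabilization, positive is not split into separate columns, because it is a quoted string.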

import csv

string = 'term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

# splitlines() breaks the string on the \r\n line endings;
# csv.reader then splits each line on commas, respecting quoted fields
csv_reader = csv.reader(string.splitlines())
for record in csv_reader:
    print(f'number of fields: {len(record)}, Record: {record}')

Output

number of fields: 10, Record: ['term_ID', 'description', 'frequency', 'plot_X', 'plot_Y', 'plot_size', 'uniqueness', 'dispensability', 'representative', 'eliminated']
number of fields: 10, Record: ['GO:0006468', 'protein phosphorylation', '4.137%', ' 4.696', ' 0.927', '5.725', '0.430', '0.000', '6468', '0']
number of fields: 10, Record: ['GO:0050821', 'protein stabilization, positive', '0.045%', '-4.700', ' 0.494', '3.763', '0.413', '0.000', '50821', '0']
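
Since the question asks for a pandas DataFrame without a round trip through a file on disk, here is a minimal sketch of one way that could look: wrap the string in an in-memory text buffer with io.StringIO and pass it to pandas.read_csv. This assumes the string uses \r\n line endings as in the question; the skipinitialspace=True argument is my own choice to drop the blank after some commas (as in ", 4.696"), not something the question requires.

```python
import io

import pandas as pd

string = 'term_ID,description,frequency,plot_X,plot_Y,plot_size,uniqueness,dispensability,representative,eliminated\r\nGO:0006468,"protein phosphorylation",4.137%, 4.696, 0.927,5.725,0.430,0.000,6468,0\r\nGO:0050821,"protein stabilization, positive",0.045%,-4.700, 0.494,3.763,0.413,0.000,50821,0\r\n'

# io.StringIO gives read_csv a file-like object backed by the string,
# so nothing is written to disk. Quoted fields containing commas are
# kept together, just as with csv.reader.
df = pd.read_csv(io.StringIO(string), skipinitialspace=True)

print(df.shape)
print(df["description"].tolist())
```

Note that the frequency column still holds strings such as 4.137% because of the percent sign, so it would need a separate conversion if you want numbers there.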
