Troubleshooting encoding/decoding in CSV and JSON files with Python

I originally dumped a file containing a particular sentence using:

with open(labelFile, "wb") as out:
    json.dump(result, out, indent=4)

In the JSON, the sentence looks like this:

"-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth \u00c3 cents \u00c2 $ \u00c2 `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .",

I then loaded it via:

with open(sys.argv[1]) as sentenceFile:
    sentenceFile = json.loads(sentenceFile.read())

processed it, and then wrote it to a CSV using:

with open(sys.argv[2], 'wb') as csvfile:
    fieldnames = ['x', 'y', 'z']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for sentence in sentence2locations2values:
        # Python 2: the csv module expects byte strings, so encode to UTF-8
        sentence = unicode(sentence['parsedSentence']).encode("utf-8")
        writer.writerow({'x': sentence})

This made the sentence in the CSV file look like this when opened in Excel for Mac:

-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth Ãƒ cents Ã‚ $ Ã‚ `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .

I then moved it from Excel for Mac to Google Sheets, where it became:

-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth Ã cents Â $ Â `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .

Note the very slight difference: Â has taken the place of Ã.
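One plausible explanation for the two renderings, offered as a sketch rather than a confirmed diagnosis: the CSV holds UTF-8 bytes, Google Sheets decodes them as UTF-8, and Excel for Mac decodes them as Windows-1252 (`cp1252`). The choice of `cp1252` is an assumption that happens to reproduce the characters shown above. The snippet is Python 3, unlike the Python 2 code in the question.

```python
# Assumption: the file on disk is UTF-8 and the spreadsheet app guesses cp1252.
original = "\u00c3 cents \u00c2 $ \u00c2"   # the characters the JSON escapes stand for
utf8_bytes = original.encode("utf-8")        # what the CSV writer stored on disk

misread = utf8_bytes.decode("cp1252")        # wrong codec: each character turns into two
correct = utf8_bytes.decode("utf-8")         # right codec: the original text comes back

print(misread)
print(correct)
```

If that assumption holds, the text in the file was never damaged; only the codec used to display it differed between the two applications.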

I then labelled it and brought it back into Excel for Mac, at which point it turned back into:

-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth Ã cents Â $ Â `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .

How can I go about reading in the CSV, which contains a sentence such as:

-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating NUMBER_SLOT per year , is a significant contributor to its population growth Ãƒ cents Ã‚ $ Ã‚ `` a daily quota of 150 Mainland Chinese with family ties in LOCATION_SLOT are granted a `` one way permit '' .

and converting it to

"-LSB- 97 -RSB- However , the influx of immigrants from mainland China , approximating 45,000 per year , is a significant contributor to its population growth \u00c3 cents \u00c2 $ \u00c2 `` a daily quota of 150 Mainland Chinese with family ties in Hong Kong are granted a `` one way permit '' .",

so that it matches the content of the original JSON dump shown at the start of the question?

EDIT

I checked and saw that the encoding that maps \u00c3 to Ã, the form shown in Google Sheets, is actually Latin 8.

EDIT

I ran enca and saw that the original dumped file is 7-bit ASCII, while my CSV is Unicode. So do I need to load it as Unicode and convert it to 7-bit ASCII?
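A short sketch of why the dump would be reported as 7-bit ASCII: `json.dumps` (and `json.dump`) default to `ensure_ascii=True`, which writes every non-ASCII character as a `\uXXXX` escape. The sample sentence is a shortened stand-in for the real data, and the snippet is Python 3.

```python
import json

sentence = "growth \u00c3 cents \u00c2 $ \u00c2"   # contains non-ASCII characters

dumped = json.dumps(sentence)                  # ensure_ascii=True is the default
is_ascii = all(ord(ch) < 128 for ch in dumped) # every character fits in 7 bits
restored = json.loads(dumped)                  # escapes become real characters again

print(dumped)
print(is_ascii)
```

So the ASCII file and the Unicode CSV can hold the same text; the escapes are just a different serialization, and no explicit conversion to ASCII should be needed when comparing the loaded values.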

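I found a solution. The fix is to decode the CSV file from its original format (identified as UTF-8), after which the sentence becomes the original sentence again. So: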

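import csv
import sys

with open(sys.argv[1], 'rb') as csvfile:
    fieldnames = ("x", "y", "z")
    reader = csv.DictReader(csvfile, fieldnames)
    next(reader)  # skip the header row
    for i, row in enumerate(reader):
        # Python 2: csv yields byte strings, so decode them to unicode
        row['x'] = row['x'].decode("utf-8")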
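For readers on Python 3, the same round trip might look like the sketch below. There the `csv` module works on text, so the encoding is passed to `open` instead of calling `.encode`/`.decode` by hand. The file name and the sample sentence are hypothetical.

```python
import csv
import os
import tempfile

sentence = "population growth \u00c3 cents \u00c2 $ \u00c2"
fieldnames = ["x", "y", "z"]

# Hypothetical file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "sentences.csv")

# Text mode with an explicit encoding; newline="" lets csv control line endings.
with open(path, "w", encoding="utf-8", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({"x": sentence})   # missing keys 'y' and 'z' are written as empty

# DictReader takes the field names from the header row when none are given.
with open(path, encoding="utf-8", newline="") as csvfile:
    rows = list(csv.DictReader(csvfile))

round_tripped = rows[0]["x"]
print(round_tripped == sentence)
```

Stating the encoding on both the write and the read side avoids relying on the platform default, which differs between systems.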

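The very strange thing is that whenever I edit the CSV file in Excel for Mac and save it, it seems to be converted to a different encoding each time. I want to warn other users, because this is a huge headache.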
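If a file has already been saved in its garbled form, one possible repair is to reverse the wrong decoding step. This is a sketch under the assumption that the damage was exactly one round of "UTF-8 bytes read as cp1252"; `repair_mojibake` is a hypothetical helper, not a library function, and it will not recover text that was mangled in some other way.

```python
def repair_mojibake(text, wrong_codec="cp1252"):
    """Undo one round of UTF-8 bytes decoded with wrong_codec.

    Returns the text unchanged when the reversal does not apply.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The garbled form as displayed by Excel for Mac, written with escapes
garbled = "growth \u00c3\u0192 cents \u00c3\u201a $ \u00c3\u201a"
repaired = repair_mojibake(garbled)
print(repaired)
```

Text that is already correct is left alone here because re-encoding it does not yield valid UTF-8, so the helper falls through to the `except` branch.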
