I am having trouble parsing HTTP requests. My data is in the .txt file at this link:
https://drive.google.com/open?id=1RSyCYgxBCJnxAXDInyIs1cOp_3EoUyqG
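I am trying to convert this data to CSV format, but special characters such as ";" split the data into new columns.
Example: the data in the "Accept" column should look like this: text/xml;q=0.6, application/rtf;q=0.7, image/*
But when I run my code, this column contains only text/xml, and q=0.6 ends up in a new column.
One solution I found was to convert single-quoted strings to double-quoted ones before storing them, but that did not work.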
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import urllib.parse
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import io
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
import os
import json
import csv
from itertools import islice
import numpy as np
import pandas as pd
# Column names of the output file, in output order
fields = [
    'Start - Id', 'class', 'Method', 'Url', 'Protocol',
    'Content-Length', 'Content-Language', 'Content-Encoding', 'Content-Location',
    'Content-MD5', 'Content-Type', 'Expires', 'Last-Modified', 'Host', 'Connection',
    'Accept', 'Accept-Charset', 'Accept-Encoding', 'Accept-Language', 'Cache-Control',
    'Client-ip', 'Cookie', 'Cookie2', 'Date', 'ETag', 'Expect', 'From',
    'If-Modified-Since', 'If-Unmodified-Since', 'If-Match', 'If-None-Match', 'If-Range',
    'Max-Forwards', 'MIME-Version', 'Pragma', 'Proxy-Authorization', 'Authorization',
    'Range', 'Referer', 'TE', 'Trailer', 'User-Agent', 'UA-CPU', 'UA-Disp', 'UA-OS',
    'UA-Color', 'UA-Pixels', 'Via', 'Transfer-Encoding', 'Upgrade', 'Warning',
    'X-Forwarded-For', 'X-Serial-Number', '~~~~~', '----',
]
# Maps each column name to the value parsed for the current request
listofKeys = dict.fromkeys(fields)

def init(file_out):
    # Create the output file and write the header row (tab-separated)
    with open(file_out, 'w', encoding="utf-8", newline='') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter="\t")
        csvwriter.writerow(fields)


def write(file_out, lines):
    # Append one request (the lines between "Start" and "End") as one row.
    # Note: listofKeys is module-level, so a value from a previous request
    # stays in place unless the current request overwrites it.
    with open(file_out, 'a', encoding="utf-8", newline='') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter="\t")
        row = []
        N = len(lines)
        foundP = False
        for i in range(N - 1):
            line = lines[i].strip()
            if len(line) > 0:
                if i == 2:
                    # Third line of the block: the request line,
                    # i.e. "<method> <url> <protocol>"
                    parts = line.split(" ")
                    listofKeys['Method'] = parts[0]
                    listofKeys['Url'] = parts[1]
                    listofKeys['Protocol'] = parts[2]
                    if line.startswith(("PUT", "POST")):
                        foundP = True
                elif i == N - 3:
                    # For PUT/POST, this line (presumably the request body)
                    # is appended to the URL
                    if foundP:
                        listofKeys['Url'] += line
                else:
                    # Header line: split at the first ':' only, so the value
                    # keeps any ':' or ';' it contains
                    index = line.find(':')
                    key = line[0:index].strip()
                    value = line[index + 1:].strip()
                    listofKeys[key] = str(value)
        for keys in fields:
            row.append(listofKeys[keys])
        print(type(row))
        print(row)
        csvwriter.writerow(row)


def convertText2Csv(file_in, file_out):
    init(file_out)
    with open(file_in, 'r') as infile:
        lines = []
        count = 0
        for line in infile:
            if line.startswith("Start"):
                count += 1
                print("-" * 20, "Request #", count, "-" * 20)
                lines.append(line)
            elif line.startswith("End"):
                # End of the current request: write it out and start over
                lines.append(line)
                write(file_out, lines)
                lines = []
            else:
                lines.append(line)


csvFile = 'test.csv'
textFile = 'test.txt'
convertText2Csv(textFile, csvFile)
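The result I get is at this link: https://drive.google.com/open?id=1rLPdbuZkS6pcDQqHZZP6ck9H8XbnMPWM
I just want to convert the data to a CSV file in which every column holds its own value, including any special characters it contains.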
Your CSV file is perfectly correct.
Here is what the Accept column contains when the file is loaded in LibreOffice Calc with "\t" specified as the only separator:
Accept
*/*
*/*
*/*
text/xml;q=0.6, application/rtf;q=0.7, image/*
Your real problem is that the program you use to open the CSV file also treats ; as a separator.
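One way to check this without any spreadsheet program is to read the file back with Python's csv module, using the tab as the only delimiter. What follows is only a minimal sketch: the in-memory sample rows and the helper name read_column are made up for illustration and are not taken from your test.csv.

```python
import csv
import io

# Hypothetical sample mimicking two columns of the generated tab-separated file
sample = (
    "Method\tAccept\n"
    "GET\t*/*\n"
    "POST\ttext/xml;q=0.6, application/rtf;q=0.7, image/*\n"
)


def read_column(text, column, delimiter="\t"):
    """Return the values of one column, splitting rows on `delimiter` only."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [row[column] for row in reader]


accept_values = read_column(sample, "Accept")
print(accept_values)
```

If the semicolons survive in the printed values, the file itself is fine and the splitting happens in the viewer.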
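TL;DR: you are simply trying to display a correct CSV file with a dumb spreadsheet program (Excel, perhaps? Excel is a very nice tool, except that when it comes to CSV files it is just crap).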
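As you suggested in your comment, the quoting=csv.QUOTE_ALL option, which should be unnecessary here, may be enough to make that program understand that it should ignore a separator that appears inside a field...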
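For reference, here is a minimal sketch of what that option changes in the output. It writes to an in-memory buffer instead of a real file, and the row contents and the helper name to_tsv are illustrative assumptions, not part of your code.

```python
import csv
import io

# Illustrative row: a method and an Accept value containing semicolons
row = ["GET", "text/xml;q=0.6, application/rtf;q=0.7, image/*"]


def to_tsv(values, quoting):
    """Serialize one row as tab-separated text with the given quoting mode."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", quoting=quoting,
                        lineterminator="\n")
    writer.writerow(values)
    return buffer.getvalue()


minimal = to_tsv(row, csv.QUOTE_MINIMAL)  # default: quote only where required
quoted = to_tsv(row, csv.QUOTE_ALL)       # wrap every field in double quotes

print(minimal)
print(quoted)

# The quoted form parses back to the same row with tab as the only delimiter
parsed = next(csv.reader(io.StringIO(quoted), delimiter="\t"))
```

Whether a given spreadsheet program then keeps the quoted field together depends on its import settings, so treat this as something to try rather than a guarantee.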