How to read a non-standard CSV file into a dataframe with Python or Scala



I have a sample of the dataset below, which can be processed with Python or Scala:

FWD,13032009:09:01,10.56| FWD,13032009:10:53,11.23| FWD,13032009:15:40,23.20
SPOT,13032009:09:04,11.56| FWD,13032009:11:45,11.23| SPOT,13032009:12:30,23.20
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20| FWD,13032009:14:340,56.00
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20

Each line has to be split into several smaller strings, which can be split further.

I am looking for an efficient way to generate an RDD or a dataframe containing the following:

FWD,13032009:09:01,10.56 
FWD,13032009:10:53,11.23
FWD,13032009:15:40,23.20
SPOT,13032009:09:04,11.56
FWD,13032009:11:45,11.23
SPOT,13032009:12:30,23.20
FWD,13032009:08:01,10.56
SPOT,13032009:12:30,11.23
FWD,13032009:13:20,23.20
FWD,13032009:14:340,56.00
FWD,13032009:08:01,10.56
SPOT,13032009:12:30,11.23
FWD,13032009:13:20,23.20

Note that the more efficient the better, since the total number of rows in production could be in the millions.

Thanks.

Assuming you are reading from a CSV file, you can read each line into a list, flatten the values and then process them as individual rows.

Read the file into a list - 1 million rows should not be too much to handle:

import csv
import itertools
import pandas as pd

# Split on '|' so each file line becomes a list of "TYPE,timestamp,value" strings
with open('test.csv', 'r') as f:
    reader = csv.reader(f, delimiter='|')
    rows = list(reader)

Flatten and split from a single list - the excellent itertools module in the Python standard library returns a generator, which is memory-friendly and efficient.

flat_rows = itertools.chain.from_iterable(rows)         # lazily flatten the nested lists into one sequence
list_rows = [i.strip().split(',') for i in flat_rows]   # drop the stray leading space, then split each element on ','

The nested list list_rows now gives you a clean, well-formatted list, which you can pass to pandas if you want to create a dataframe:

list_rows
>>
[['FWD', '13032009:09:01', '10.56'],
['FWD', '13032009:10:53', '11.23'],
['FWD', '13032009:15:40', '23.20'],
['SPOT', '13032009:09:04', '11.56'],
['FWD', '13032009:11:45', '11.23'],
['SPOT', '13032009:12:30', '23.20'],
['FWD', '13032009:08:01', '10.56'],
['SPOT', '13032009:12:30', '11.23'],
['FWD', '13032009:13:20', '23.20'],
['FWD', '13032009:14:340', '56.00'],
['FWD', '13032009:08:01', '10.56'],
['SPOT', '13032009:12:30', '11.23'],
['FWD', '13032009:13:20', '23.20']]
df = pd.DataFrame(list_rows)
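
If you want named, typed columns rather than the default 0/1/2 labels, you can pass column names and convert the other columns afterwards - a small sketch, where the names 'type', 'timestamp' and 'price' are just placeholders for illustration:

df = pd.DataFrame(list_rows, columns=['type', 'timestamp', 'price'])  # placeholder column names
df['price'] = df['price'].astype(float)  # values arrive as strings, convert to numeric
# timestamps look like DDMMYYYY:HH:MM; coerce malformed ones (e.g. 13032009:14:340) to NaT
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d%m%Y:%H:%M', errors='coerce')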

Python solution: if you get the text as a string, you can replace() the "| " sequences with newline characters ("\n") and then read it as a dataframe:

import pandas as pd
from io import StringIO
data_set = """FWD,13032009:09:01,10.56| FWD,13032009:10:53,11.23| FWD,13032009:15:40,23.20
SPOT,13032009:09:04,11.56| FWD,13032009:11:45,11.23| SPOT,13032009:12:30,23.20
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20| FWD,13032009:14:340,56.00
FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20
"""
data_set *= 100000  # Make it over a million elements to ensure performance is adequate
data_set = data_set.replace("| ", "\n")  # put every element on its own line
data_set_stream = StringIO(data_set)  # Pandas needs to read a file-like object, so turn our string into a buffer
df = pd.read_csv(data_set_stream, header=None)  # header=None, since the data has no header row
print(df)  # df is our desired DataFrame
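
The same replace() trick works when the data sits in a file rather than a string - a minimal sketch, assuming the sample lives in test.csv and using placeholder column names:

with open('test.csv', 'r') as f:
    raw = f.read()
df = pd.read_csv(StringIO(raw.replace("| ", "\n")), header=None,
                 names=['type', 'timestamp', 'price'])  # placeholder column names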

If you are interested, here is the Scala way:

val rdd1 = sc.parallelize(List("FWD,13032009:09:01,10.56| FWD,13032009:10:53,11.23| FWD,13032009:15:40,23.20", "SPOT,13032009:09:04,11.56| FWD,13032009:11:45,11.23| SPOT,13032009:12:30,23.20","FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20| FWD,13032009:14:340,56.00","FWD,13032009:08:01,10.56| SPOT,13032009:12:30,11.23| FWD,13032009:13:20,23.20"))

val rdd2 = rdd1.flatMap(l => l.replaceAll(" ", "").split("\\|"))
val rds = rdd2.toDS

val df = spark.read.csv(rds)

df.show(false)

+----+---------------+-----+
|_c0 |_c1            |_c2  |
+----+---------------+-----+
|FWD |13032009:09:01 |10.56|
|FWD |13032009:10:53 |11.23|
|FWD |13032009:15:40 |23.20|
|SPOT|13032009:09:04 |11.56|
|FWD |13032009:11:45 |11.23|
|SPOT|13032009:12:30 |23.20|
|FWD |13032009:08:01 |10.56|
|SPOT|13032009:12:30 |11.23|
|FWD |13032009:13:20 |23.20|
|FWD |13032009:14:340|56.00|
|FWD |13032009:08:01 |10.56|
|SPOT|13032009:12:30 |11.23|
|FWD |13032009:13:20 |23.20|
+----+---------------+-----+
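
And if you want the same flatMap-and-split approach from Python, here is a rough PySpark sketch - it assumes an active SparkSession named spark and that the data sits in test.csv; the column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# one record per '|'-separated element, with the stray spaces stripped
elements = sc.textFile('test.csv').flatMap(lambda line: [p.strip() for p in line.split('|')])
df = spark.createDataFrame(elements.map(lambda s: s.split(',')),
                           schema=['type', 'timestamp', 'price'])  # placeholder column names
df.show(truncate=False)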
