Split a csv file into multiple files based on a pattern



I have a csv file with the following structure:

time,magnitude
0,13517
292.5669,370
620.8469,528
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498
21768.2139,647
21881.2049,194
0,3566
292.5669,370
504.1510,712
1639.4800,287
46709.1749,365
46803.4400,500

I want to split this csv file into separate csv files, like this:

File 1:

time,magnitude
0,13517
292.5669,370
620.8469,528

File 2:

time,magnitude
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498

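And so on...

I have read several similar posts (for example this one, this one, or this one), but they all search for specific values in one column and save each group of values to a separate file. In my case, however, the values in the time column do not repeat in a way I can group on directly. I want to split on a single condition: if time = 0, save that row and all subsequent rows to a new file, until the next row where time = 0.

Can anyone tell me how to do this?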

With pandas, you can use groupby on the cumulative sum of a boolean mask:

# pip install pandas
import pandas as pd

df = pd.read_csv("input_file.csv", sep=",")  # <- change the sep if needed

# Every row with time == 0 starts a new group; cumsum() numbers the groups 1, 2, 3, ...
for n, g in df.groupby(df["time"].eq(0).cumsum()):
    g.to_csv(f"file_{n}.csv", index=False, sep=",")

Output:

time  magnitude   # <- file_1.csv
0.0000      13517
292.5669        370
620.8469        528
time  magnitude # <- file_2.csv
0.0000        377
832.3269      50187
5633.9419       3088
20795.0950       2922
21395.6879       2498
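
To see why this grouping works, here is a small sketch of the intermediate values. It builds the first six rows of the question's data in memory instead of reading a file; the variable names `is_start`, `group_id` and `group_sizes` are only illustrative.

```python
import pandas as pd

# First six rows of the data from the question, built in memory
df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419],
    "magnitude": [13517, 370, 528, 377, 50187, 3088],
})

is_start = df["time"].eq(0)   # True on every row that begins a new block
group_id = is_start.cumsum()  # running count of block starts seen so far

print(group_id.tolist())      # one label per row

# Each distinct label becomes one file in the groupby loop
group_sizes = df.groupby(group_id).size().to_dict()
print(group_sizes)
```

Note that if the file does not begin with a time = 0 row, the rows before the first zero get the label 0, so the first file would be file_0.csv.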

datasplit.awk

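#!/usr/bin/awk -f
BEGIN {
    filename = "output_file_"
    fileext = ".csv"
    FS = ","
    c = 0
    file = filename c fileext
    getline              # read the first input line, which is the header
    header = $0
}
{
    if ($1 == 0) {
        # time == 0: close the previous output file and start a new one
        if (c > 0) close(file)
        c = c + 1
        file = filename c fileext
        print header > file
        print $0 > file
    } else {
        # ">" truncates only when the file is first opened; later prints append
        print > file
    }
}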

Make the file executable:

chmod +x datasplit.awk

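Then run it from the folder where the output files should be written (if the script is not on your PATH, prefix it with its path, e.g. ./datasplit.awk):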

datasplit.awk datafile
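
For readers who prefer Python, the same line-by-line idea can be sketched with only the standard library. This is not part of the awk answer; the function names `iter_blocks` and `split_csv` and the `output_file_` prefix are assumptions chosen to mirror the script above. Because the values are never parsed into a DataFrame, they are written back exactly as they appear in the input.

```python
import csv
import io

def iter_blocks(lines, time_column="time"):
    """Yield (header, rows); a new block starts at every row with time == 0."""
    reader = csv.reader(lines)
    header = next(reader)
    col = header.index(time_column)
    block = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if float(row[col]) == 0 and block:
            yield header, block
            block = []
        block.append(row)
    if block:
        yield header, block

def split_csv(input_path, prefix="output_file_"):
    """Write each block to <prefix><n>.csv in the current directory."""
    with open(input_path, newline="") as src:
        for n, (header, block) in enumerate(iter_blocks(src), start=1):
            with open(f"{prefix}{n}.csv", "w", newline="") as dst:
                writer = csv.writer(dst, lineterminator="\n")
                writer.writerow(header)
                writer.writerows(block)

# Small in-memory demonstration, so nothing is written to disk here
sample = io.StringIO(
    "time,magnitude\n0,13517\n292.5669,370\n0,377\n832.3269,50187\n"
)
blocks = [rows for _header, rows in iter_blocks(sample)]
print(len(blocks))
```

Calling `split_csv("datafile")` would then produce output_file_1.csv, output_file_2.csv, and so on.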

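I took the liberty of creating some data similar to what you provided in order to test my solution. Also, instead of reading an input csv file, I built a DataFrame directly. Here is my solution: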

import pandas as pd

# Create a sample DataFrame
data = {
    'time': [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419, 20795.0950, 21395.6879, 0, 230.5678, 456.8468, 0, 784.3265, 5445.9452, 20345.0980, 21095.6898],
    'magnitude': [13517, 370, 528, 377, 50187, 3088, 2922, 2498, 13000, 369, 527, 376, 50100, 3087, 2921, 2497]
}
df = pd.DataFrame(data)

# Function to split a DataFrame based on a pattern
def split_dataframe_by_pattern(df, output_prefix):
    file_count = 1
    current_group = pd.DataFrame(columns=df.columns)  # Initialize the current group
    for index, row in df.iterrows():
        # If time = 0 and the current group is not empty, write it out and start a new one
        if row['time'] == 0 and not current_group.empty:
            output_file = f'{output_prefix}_{file_count}.csv'
            # Save the current group to the new file
            current_group.to_csv(output_file, index=False)
            current_group = pd.DataFrame(columns=df.columns)  # Reset the current group
            file_count += 1
        # Use pandas.concat to append the row to the current group
        current_group = pd.concat([current_group, row.to_frame().T], ignore_index=True)
    # Save the last group to a file
    current_group.to_csv(f'{output_prefix}_{file_count}.csv', index=False)

# Example usage:
output_prefix = 'output_file'
split_dataframe_by_pattern(df, output_prefix)

My output is four csv files:

output_file_1.csv

time,magnitude
0.0,13517.0
292.5669,370.0
620.8469,528.0

output_file_2.csv

time,magnitude
0.0,377.0
832.3269,50187.0
5633.9419,3088.0
20795.095,2922.0
21395.6879,2498.0

output_file_3.csv

time,magnitude
0.0,13000.0
230.5678,369.0
456.8468,527.0

output_file_4.csv

time,magnitude
0.0,376.0
784.3265,50100.0
5445.9452,3087.0
20345.098,2921.0
21095.6898,2497.0
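
The `.0` suffix on the magnitude values above appears because `iterrows()` returns each row as a single Series, which upcasts the integer column to float alongside the float time column. The sketch below, on a three-row sample, shows this and contrasts it with a groupby-based split, which keeps the integer dtype; the names `first_row`, `pieces` and `csv_text` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [0, 292.5669, 0],
    "magnitude": [13517, 370, 377],
})

# iterrows() packs each row into one Series, so int and float columns
# end up sharing a single float dtype
_, first_row = next(df.iterrows())
print(first_row.dtype)

# Splitting with groupby hands back sub-DataFrames with the original dtypes
pieces = [g for _, g in df.groupby(df["time"].eq(0).cumsum())]
print(pieces[0].dtypes)

# to_csv() without a path returns the csv text instead of writing a file
csv_text = pieces[0].to_csv(index=False)
print(csv_text)
```

If the integer formatting matters for downstream tools, a split that avoids `iterrows()` may be the safer choice.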

You can do this quite easily in pandas:

import pandas as pd

df = pd.read_csv("mydata.csv")

last_idx = 0  # position where the current block starts
file_idx = 0
for i, time in enumerate(df.time):
    # A zero anywhere except the first row closes the previous block
    if time == 0 and i != 0:
        df.iloc[last_idx:i].to_csv(f"mydata_{file_idx}.csv", index=False)
        file_idx += 1
        last_idx = i
# Write the final block
df.iloc[last_idx:].to_csv(f"mydata_{file_idx}.csv", index=False)
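
As a variation on the same slicing idea, the block boundaries can be computed up front and the pieces checked before anything is written. This is a sketch under assumptions: `split_positions` is a hypothetical helper name, and the sample data is built in memory rather than read from mydata.csv.

```python
import numpy as np
import pandas as pd

def split_positions(df, column="time"):
    """Return the DataFrame slices that start at each row where column == 0."""
    starts = np.flatnonzero(df[column].to_numpy() == 0).tolist()
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any rows that precede the first zero
    bounds = starts + [len(df)]
    return [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419],
    "magnitude": [13517, 370, 528, 377, 50187, 3088],
})

pieces = split_positions(df)
print([len(p) for p in pieces])

# Sanity check: putting the pieces back together gives the original frame
print(pd.concat(pieces).equals(df))
```

Each piece could then be saved with `piece.to_csv(f"mydata_{n}.csv", index=False)` inside an `enumerate` loop.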