I have a CSV file with the following structure:
time,magnitude
0,13517
292.5669,370
620.8469,528
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498
21768.2139,647
21881.2049,194
0,3566
292.5669,370
504.1510,712
1639.4800,287
46709.1749,365
46803.4400,500
I want to split this CSV file into separate CSV files, like this:
File 1:
time,magnitude
0,13517
292.5669,370
620.8469,528
File 2:
time,magnitude
0,377
832.3269,50187
5633.9419,3088
20795.0950,2922
21395.6879,2498
and so on…
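I have read several similar posts (for example, this one, this one, or this one), but they all search for a specific value in a column and save each group of values to a separate file. In my case, however, the values in the time column are not the same from one group to the next. I want to split on a single condition: if time = 0, save that row and all subsequent rows in a new file until the next row where time = 0.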
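With pandas, you can use groupby and boolean indexing. The expression df["time"].eq(0).cumsum() increases by one at every row where time is 0, so each block of rows gets its own group key: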
#pip install pandas
import pandas as pd

df = pd.read_csv("input_file.csv", sep=",") # <- change the sep if needed

for n, g in df.groupby(df["time"].eq(0).cumsum()):
    g.to_csv(f"file_{n}.csv", index=False, sep=",")
Output:
time magnitude # <- file_1.csv
0.0000 13517
292.5669 370
620.8469 528
time magnitude # <- file_2.csv
0.0000 377
832.3269 50187
5633.9419 3088
20795.0950 2922
21395.6879 2498
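To see why the cumulative sum works as a group key, here is a minimal sketch that prints the labels it produces. The DataFrame holds a subset of the rows from the question, and the variable names `starts` and `labels` are only illustrative.

```python
import pandas as pd

# A subset of the rows from the question, built in memory for illustration.
df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419, 0, 292.5669, 504.1510],
    "magnitude": [13517, 370, 528, 377, 50187, 3088, 3566, 370, 712],
})

# True wherever a new segment starts (time == 0).
starts = df["time"].eq(0)

# The running total of True values rises by one at each start,
# so every row in a segment shares the same integer label.
labels = starts.cumsum()

print(labels.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Note that if the file did not begin with a time of 0, the rows before the first zero would all get the label 0 and end up in a file of their own.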
datasplit.awk
#!/usr/bin/awk -f
BEGIN {
    filename = "output_file_"
    fileext = ".csv"
    FS = ","
    c = 0
    file = filename c fileext
    # Read the first line of the input and keep it as the header
    getline
    header = $0
}
{
    if ($1 == 0) {
        # A row with time 0 starts a new output file
        close(file)
        c = c + 1
        file = filename c fileext
        print header > file
        print $0 > file
    } else {
        print > file
    }
}
Make the file executable:
chmod +x datasplit.awk
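Run it from the folder where the output files should be written (give the script's path, for example ./datasplit.awk, if it is not on your PATH):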
datasplit.awk datafile
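For readers who prefer Python, the same line-by-line idea can be sketched with the standard csv module instead of awk. This is an illustration under assumptions, not a translation of the script above: the function name `split_on_zero`, its parameters, and the in-memory `sample` are hypothetical, and unlike the awk version it also writes the header into a file opened for rows that appear before the first zero.

```python
import csv
import os
import tempfile

def split_on_zero(lines, out_dir, prefix="output_file_"):
    """Stream CSV lines and start a new output file at every row with time == 0.

    Returns the number of files written. The name and signature are illustrative.
    """
    reader = csv.reader(lines)
    header = next(reader)          # first line is the header
    count = 0
    out = None
    writer = None
    try:
        for row in reader:
            if not row:
                continue           # skip blank lines
            if writer is None or float(row[0]) == 0:
                if out is not None:
                    out.close()
                count += 1
                path = os.path.join(out_dir, f"{prefix}{count}.csv")
                out = open(path, "w", newline="")
                writer = csv.writer(out, lineterminator="\n")
                writer.writerow(header)
            writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return count

# Small in-memory sample so the sketch runs without an input file.
sample = ["time,magnitude", "0,13517", "292.5669,370", "0,377", "832.3269,50187"]
with tempfile.TemporaryDirectory() as tmp:
    n_files = split_on_zero(sample, tmp)
    with open(os.path.join(tmp, "output_file_2.csv")) as f:
        second = f.read()

print(n_files)
print(second)
```

With a real file, an open file object can be passed as `lines`, for example `with open("datafile", newline="") as f: split_on_zero(f, ".")`.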
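I took the liberty of creating some data similar to what you provided in order to test my solution. Also, instead of reading an input CSV file, I built a DataFrame directly. Here is my solution: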
import pandas as pd

# Create a sample DataFrame
data = {
    'time': [0, 292.5669, 620.8469, 0, 832.3269, 5633.9419, 20795.0950, 21395.6879, 0, 230.5678, 456.8468, 0, 784.3265, 5445.9452, 20345.0980, 21095.6898],
    'magnitude': [13517, 370, 528, 377, 50187, 3088, 2922, 2498, 13000, 369, 527, 376, 50100, 3087, 2921, 2497]
}
df = pd.DataFrame(data)

# Function to split a DataFrame based on a pattern
def split_dataframe_by_pattern(df, output_prefix):
    file_count = 1
    current_group = pd.DataFrame(columns=df.columns)  # Initialize the current group
    for index, row in df.iterrows():
        # If time = 0 and the current group is not empty, write the group and start a new one
        if row['time'] == 0 and not current_group.empty:
            output_file = f'{output_prefix}_{file_count}.csv'
            # Save the current group to the new file
            current_group.to_csv(output_file, index=False)
            current_group = pd.DataFrame(columns=df.columns)  # Reset the current group
            file_count += 1
        # Use pandas.concat to append the row to the current group
        current_group = pd.concat([current_group, row.to_frame().T], ignore_index=True)
    # Save the last group to a file
    current_group.to_csv(f'{output_prefix}_{file_count}.csv', index=False)

# Example usage:
output_prefix = 'output_file'
split_dataframe_by_pattern(df, output_prefix)
My output is four CSV files:
output_file_1.csv
time,magnitude
0.0,13517.0
292.5669,370.0
620.8469,528.0
output_file_2.csv
time,magnitude
0.0,377.0
832.3269,50187.0
5633.9419,3088.0
20795.095,2922.0
21395.6879,2498.0
output_file_3.csv
time,magnitude
0.0,13000.0
230.5678,369.0
456.8468,527.0
output_file_4.csv
time,magnitude
0.0,376.0
784.3265,50100.0
5445.9452,3087.0
20345.098,2921.0
21095.6898,2497.0
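The magnitude values in these files appear as 13517.0 rather than 13517. A likely reason is that iterrows() returns each row as a single Series, which has one dtype for all of its values, so the integer column is upcast to float. The short sketch below, on a three-row sample, shows the upcast and that writing a slice of the original DataFrame keeps the integers; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469],
    "magnitude": [13517, 370, 528],
})

# In the DataFrame itself, magnitude is stored as integers.
print(df["magnitude"].dtype)

# iterrows() yields each row as one Series, which needs a single dtype
# for both columns, so the integer magnitude is upcast to float.
_, first_row = next(df.iterrows())
print(first_row.dtype)
print(first_row["magnitude"])

# Writing a slice of the original frame keeps the integer column as integers.
csv_text = df.iloc[0:3].to_csv(index=False)
print(csv_text)
```

If integer output matters, slicing or grouping the original DataFrame avoids the conversion.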
You can do this quite easily in pandas:
import pandas as pd

df = pd.read_csv("mydata.csv")
last_idx = 0
file_idx = 0
for i, time in enumerate(df.time):
    if time == 0 and i != 0:
        df.iloc[last_idx:i].to_csv(f"mydata_{file_idx}.csv", index=False)
        file_idx += 1
        last_idx = i
df.iloc[last_idx:].to_csv(f"mydata_{file_idx}.csv", index=False)
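The slice boundaries that the loop tracks can also be computed in one step. The sketch below uses NumPy's flatnonzero to find the start positions; it is a variation under assumptions rather than the method above, uses an in-memory sample instead of mydata.csv, and drops any rows that come before the first zero (the loop above would keep them in the first file).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": [0, 292.5669, 620.8469, 0, 832.3269, 0, 292.5669],
    "magnitude": [13517, 370, 528, 377, 50187, 3566, 370],
})

# Positions of the rows where a new segment begins.
starts = np.flatnonzero(df["time"].to_numpy() == 0)

# Each segment ends where the next one starts; the last one ends at len(df).
ends = np.append(starts[1:], len(df))

segments = [df.iloc[s:e] for s, e in zip(starts, ends)]
for file_idx, segment in enumerate(segments):
    # To write files, replace the print with:
    # segment.to_csv(f"mydata_{file_idx}.csv", index=False)
    print(file_idx, len(segment))
```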