I have this dataframe:
Metric ProcId TimeStamp Value
CPU proce_123 Mar-11-2022 11:00:00 1.4453125
CPU proce_126 Mar-11-2022 11:00:00 0.058320373
CPU proce_123 Mar-11-2022 11:00:00 0.095274389
CPU proce_000 Mar-11-2022 11:00:00 0.019654088
CPU proce_144 Mar-11-2022 11:00:00 0.019841269
CPU proce_1 Mar-11-2022 11:00:00 0.234741792
CPU proce_100 Mar-11-2022 11:00:00 5.32945776
CPU proce_57777 Mar-11-2022 11:00:00 0.25390625
CPU proce_0000 Mar-11-2022 11:00:00 0.019349845
CPU proce_123 Mar-11-2022 11:00:00 0.019500781
CPU proce_123 Mar-11-2022 11:00:00 2.32421875
CPU proce_123 Mar-11-2022 11:00:00 68.3903656
CPU proce_123 Mar-11-2022 11:00:00 0.057781201
CPU proce_123 Mar-11-2022 11:00:00 0.416666627
This is just a sample dataframe; the real one has thousands of rows. I need to iterate over this dataframe in chunks of the ProcId column, and on each iteration build a string that combines the ProcIds in that chunk.
For example, with a chunk size of 3, the strings need to look like this:
proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
proce_123%22%2C%2proce_126%22%2C%2proce_111%22%29
Note that after the third ProcId in a chunk we need to append %22%29, and after the first and second we need to append %22%2C%2.
I can do something like this to print out the chunks:
n = 3  # size of chunks
chunks = []  # list of chunks
for i in range(0, len(ids), n):  # ids: the list of ProcId strings
    chunks.append(ids[i:i + n])
I'm not sure how to combine the 3 items into one string and add the extra strings at the end.
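To illustrate the desired result, here is a minimal pure-Python sketch, assuming `ids` is a plain list of the ProcId strings (`str.join` handles the separators between items, and the closing marker is appended per chunk):

```python
n = 3  # chunk size
sep_mid, sep_end = '%22%2C%2', '%22%29'

# sample ProcIds, taken from the question's data
ids = ['proce_000', 'proce_144', 'proce_1',
       'proce_100', 'proce_57777', 'proce_0000']

chunk_strings = []
for i in range(0, len(ids), n):
    chunk = ids[i:i + n]
    # join the ids within the chunk with the separator, then close the chunk
    chunk_strings.append(sep_mid.join(chunk) + sep_end)

print(chunk_strings[0])  # proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
```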
Avoid iterating over the dataframe in a for loop. It is almost guaranteed to perform worse than a combination of groupby, merge, shift, and other array-oriented NumPy or Pandas operations.
Create a chunk id by integer division of the dataframe index (assuming the index values are sequential):
chunk_size = 3
df['ChunkId'] = df.index // chunk_size
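To see the integer-division trick in isolation, here is a minimal sketch with a toy frame (the ProcId values are just sample ids from the question):

```python
import pandas as pd

chunk_size = 3
df = pd.DataFrame({'ProcId': ['proce_000', 'proce_144', 'proce_1',
                              'proce_100', 'proce_57777', 'proce_0000',
                              'proce_123']})
# consecutive rows fall into the same chunk id
chunk_ids = df.index // chunk_size
print(chunk_ids.tolist())  # [0, 0, 0, 1, 1, 1, 2]
```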
Add a suffix to each ProcId to create a new column ProcEnds, then join those values within each group:
df['ProcEnds'] = (df.ProcId + '%22%2C%2').where(
    df.index % chunk_size != chunk_size - 1,
    df.ProcId + '%22%29')
# note: DataFrame.where replaces values with `other` where cond is False
df['ChunkString'] = df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
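As a quick sanity check of the suffix logic: Series.where keeps values where the condition holds and substitutes the second argument elsewhere. A toy example (not the OP's data):

```python
import pandas as pd

chunk_size = 3
s = pd.Series(['a', 'b', 'c', 'd'])
# every third element (0-based index 2, 5, ...) gets the closing suffix
suffixed = (s + '%22%2C%2').where(s.index % chunk_size != chunk_size - 1,
                                  s + '%22%29')
print(suffixed.tolist())  # ['a%22%2C%2', 'b%22%2C%2', 'c%22%29', 'd%22%2C%2']
```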
Optionally, drop the ChunkId and ProcEnds columns to get output with only the extra ChunkString column:
df = df.drop(columns=['ChunkId', 'ProcEnds'])
df
This now outputs:
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2
Update
A Google Colab notebook showing the output with sample data: https://colab.research.google.com/drive/1f9ZHXE2ATZXD2qWsoATxEWABIBt0tMRN?usp=sharing
Update 2
The OP asked:
Quick question. Can we group based on df['Metric']? For example, it would be CPU, Memory. I need the ChunkString based on CPU or Memory.
To apply this transformation within each metric group, the easiest way is to wrap the transformation logic in a function and apply it to the data. Some extra care is needed to preserve the original index.
def transform(frame):
    _df = frame.reset_index(drop=True)
    _df['ChunkId'] = _df.index // chunk_size
    _df['ProcEnds'] = (_df.ProcId + '%22%2C%2').where(
        _df.index % chunk_size != chunk_size - 1,
        _df.ProcId + '%22%29')
    _df['ChunkString'] = _df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
    return _df.drop(columns=['ChunkId', 'ProcEnds'])

idx = df.index
df.groupby('Metric').apply(transform).set_index(idx)
This produces the same output as before, omitted for brevity.
You can use Python integer division (//) on the index to form groups of N:
N = 3
df['ChunkString'] = df.groupby(df.index // N)['ProcId'].transform(
    lambda x: '%22%2C%2'.join(x.tolist() + [''] * (N - len(x)))
              + ('%22%29' if len(x) == N else ''))
Note:
x.tolist() + [''] * (N - len(x)) just converts x to a list and pads it with empty items until it reaches length N.
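The padding expression can be checked on its own in plain Python, using a hypothetical final chunk that is shorter than N:

```python
N = 3
x = ['proce_123', 'proce_123']  # a final, short chunk of only 2 items
padded = x + [''] * (N - len(x))  # -> ['proce_123', 'proce_123', '']
# incomplete chunks get no closing '%22%29', matching the output below
s = '%22%2C%2'.join(padded) + ('%22%29' if len(x) == N else '')
print(s)  # proce_123%22%2C%2proce_123%22%2C%2
```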
Output:
>>> df
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2
With N = 5:
>>> df
Metric ProcId TimeStamp Value ChunkString
0 CPU proce_123 2022-03-11 11:00:00 1.445312 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
1 CPU proce_126 2022-03-11 11:00:00 0.058320 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
2 CPU proce_123 2022-03-11 11:00:00 0.095274 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
3 CPU proce_000 2022-03-11 11:00:00 0.019654 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
4 CPU proce_144 2022-03-11 11:00:00 0.019841 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
5 CPU proce_1 2022-03-11 11:00:00 0.234742 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
6 CPU proce_100 2022-03-11 11:00:00 5.329458 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
7 CPU proce_57777 2022-03-11 11:00:00 0.253906 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
8 CPU proce_0000 2022-03-11 11:00:00 0.019350 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
9 CPU proce_123 2022-03-11 11:00:00 0.019501 proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
10 CPU proce_123 2022-03-11 11:00:00 2.324219 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
11 CPU proce_123 2022-03-11 11:00:00 68.390366 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
12 CPU proce_123 2022-03-11 11:00:00 0.057781 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
13 CPU proce_123 2022-03-11 11:00:00 0.416667 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
chunk_size = 3
# First, collect the ProcIds in a list (the original answer guessed at this step;
# taking them straight from the dataframe column is one way)
list_of_proc_ids = df['ProcId'].tolist()

final_str = ''
# Then enumerate through that list, adding a distinct ending after every third item
for index, obj in enumerate(list_of_proc_ids):
    final_str += str(obj)
    if (index + 1) % chunk_size == 0:  # divisible by 3, accounting for the 0-based index
        final_str += '%22%29'
    else:
        final_str += '%22%2C%2'
If your goal is to process a large amount of data, Pandas has a great page about this, describing an efficient way to create chunks: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
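For example, pandas' read_csv can stream a large file in fixed-size chunks via its chunksize parameter. A minimal sketch, using an in-memory CSV to stand in for a large file on disk:

```python
import io
import pandas as pd

# a small in-memory CSV standing in for a large file on disk
csv_data = io.StringIO(
    "ProcId,Value\n"
    "proce_000,1.0\n"
    "proce_144,2.0\n"
    "proce_1,3.0\n"
    "proce_100,4.0\n"
)

# chunksize makes read_csv yield dataframes of at most that many rows
joined = []
for chunk in pd.read_csv(csv_data, chunksize=3):
    joined.append('%22%2C%2'.join(chunk['ProcId']) + '%22%29')

print(joined)
```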
My suggestion would be to use the Dask implementation to process the chunks efficiently (locally, or, for even larger datasets, by distributing the computation over a cluster). Dask comes with a DataFrame API that looks like Pandas's, so it feels comfortable to use.
import dask.dataframe as dd

df1 = dd.read_csv('myfile1.csv')
df2 = dd.read_csv('myfile2.csv')
# do something with your dataframes
# combine them (concatenate -- but you could also join etc.)
final_df = dd.concat([df1, df2])
You'll find many good examples here:
- https://tutorial.dask.org/04_dataframe.html
- https://towardsdatascience.com/pandas-with-dask-for-an-ultra-fast-notebook-e2621c3769f
You can use GroupBy.apply:
N = 3
out = (df.groupby(df.index // N)['ProcId']
         .apply(lambda x: '%22%2C%2'.join(x) + '%22%29')
)
Output (as a Series):
0 proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1 proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
2 proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
3 proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
4 proce_123%22%2C%2proce_123%22%29
Name: ProcId, dtype: object
Or, as a list:
# out.to_list()
['proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29',
'proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29',
'proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29',
'proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29',
'proce_123%22%2C%2proce_123%22%29']
Old answer, from before the question was updated
For efficiency, use vectorized methods:
import numpy as np

N = 3
# pick the suffix for each ProcId: the last one in each chunk of N gets '%22%29'
s = np.where(np.arange(len(df)) % N < N - 1, '%22%2C%2', '%22%29')
# concatenate the strings
out = (df['ProcId'] + '_' + s).str.cat()
Output: 'proce_123_%22%2C%2proce_126_%22%2C%2proce_123_%22%29proce_000_%22%2C%2proce_144_%22%2C%2proce_1_%22%29proce_100_%22%2C%2proce_57777_%22%2C%2proce_0000_%22%29proce_123_%22%2C%2proce_123_%22%2C%2proce_123_%22%29proce_123_%22%2C%2proce_123_%22%2C%2'
I guess you want to apply .rolling, but on a string column. This is a rolling sum, slightly modified to work with strings.
def create_chunk_string(dff):
    # concatenate each row's ProcId with the next two (shifted) values
    res_df = pd.concat(
        [dff['ProcId'].shift(-i).fillna('') for i in range(3)],
        axis=1
    ).sum(axis=1)
    # drop the trailing rows whose window runs past the end of the group
    res_df = res_df.iloc[:-3 + 1].to_frame(name='ChunkStr')
    return dff.join(res_df, how='left')

print(df.groupby('Metric').apply(create_chunk_string))
Metric ProcId TimeStamp Value ChunkStr
0 CPU proce_123 Mar-11-2022 11:00:00 1.445312 proce_123proce_126proce_123
1 CPU proce_126 Mar-11-2022 11:00:00 0.058320 proce_126proce_123proce_000
2 CPU proce_123 Mar-11-2022 11:00:00 0.095274 proce_123proce_000proce_144
3 CPU proce_000 Mar-11-2022 11:00:00 0.019654 proce_000proce_144proce_1
4 CPU proce_144 Mar-11-2022 11:00:00 0.019841 proce_144proce_1proce_100
5 CPU proce_1 Mar-11-2022 11:00:00 0.234742 NaN
6 CPU proce_100 Mar-11-2022 11:00:00 5.329458 NaN
7 GPU proce_57777 Mar-11-2022 11:00:00 0.253906 proce_57777proce_0000proce_123
8 GPU proce_0000 Mar-11-2022 11:00:00 0.019350 proce_0000proce_123proce_123
9 GPU proce_123 Mar-11-2022 11:00:00 0.019501 proce_123proce_123proce_123
10 GPU proce_123 Mar-11-2022 11:00:00 2.324219 proce_123proce_123proce_123
11 GPU proce_123 Mar-11-2022 11:00:00 68.390366 proce_123proce_123proce_123
12 GPU proce_123 Mar-11-2022 11:00:00 0.057781 NaN
13 GPU proce_123 Mar-11-2022 11:00:00 0.416667 NaN
You may need to sort the dataframe before doing this. Adding the strings between the ids and at the end of ChunkStr should be simple.
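As a hedged sketch of how those separators could be added to the shift-and-join approach above (toy data, a window of 3 assumed; Series.str.cat with a sep inserts the separator between the shifted columns, and rows whose window runs past the end become NaN and are dropped):

```python
import pandas as pd

df = pd.DataFrame({'ProcId': ['proce_000', 'proce_144', 'proce_1',
                              'proce_100', 'proce_57777']})

# each row's id paired with the next two (shifted) ids
parts = [df['ProcId'].shift(-i) for i in range(3)]
# join across the window with the separator; incomplete windows become NaN
joined = parts[0].str.cat(parts[1:], sep='%22%2C%2')
# drop incomplete windows, then append the closing marker
chunk_str = joined.dropna() + '%22%29'
print(chunk_str.tolist())
```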