How do you loop through DataFrame values in chunks and combine them together?



I have this DataFrame:

Metric  ProcId  TimeStamp               Value
CPU     proce_123   Mar-11-2022 11:00:00    1.4453125
CPU     proce_126   Mar-11-2022 11:00:00    0.058320373
CPU     proce_123   Mar-11-2022 11:00:00    0.095274389
CPU     proce_000   Mar-11-2022 11:00:00    0.019654088
CPU     proce_144   Mar-11-2022 11:00:00    0.019841269
CPU     proce_1     Mar-11-2022 11:00:00    0.234741792
CPU     proce_100   Mar-11-2022 11:00:00    5.32945776
CPU     proce_57777 Mar-11-2022 11:00:00    0.25390625
CPU     proce_0000  Mar-11-2022 11:00:00    0.019349845
CPU     proce_123   Mar-11-2022 11:00:00    0.019500781
CPU     proce_123   Mar-11-2022 11:00:00    2.32421875
CPU     proce_123   Mar-11-2022 11:00:00    68.3903656
CPU     proce_123   Mar-11-2022 11:00:00    0.057781201
CPU     proce_123   Mar-11-2022 11:00:00    0.416666627

This is just a sample DataFrame; the real one has thousands of rows. I need to iterate over this DataFrame in chunks of the ProcId column, and on each iteration I need to build a string that combines the ProcIds of that chunk.

For example, given a chunk size of 3, the strings need to look like this:

proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
proce_123%22%2C%2proce_126%22%2C%2proce_111%22%29

Note that after the 3rd ProcId in each chunk we need to append %22%29, and after the 1st and 2nd we need to append %22%2C%2.

I can do something like this to print out the chunks:

n = 3  # size of chunks
chunks = []  # list of chunks
for i in range(0, len(id), n):
    chunks.append(id[i:i + n])

I'm not sure how to combine the 3 items into a single string and append the extra strings at the end.

Avoid looping over the DataFrame with a for loop. It is almost guaranteed to perform worse than a combination of groupby, merge, shift and other array-oriented NumPy or Pandas operations.

Create a chunk ID from integer division of the DataFrame's index (this assumes the index values are sequential):

chunk_size = 3
df['ChunkId'] = df.index // chunk_size
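
For intuition, this is what the integer division produces for the 14-row sample above (a minimal sketch, assuming the default RangeIndex):

import pandas as pd

# rows 0-2 land in chunk 0, rows 3-5 in chunk 1, and so on:
print(list(pd.RangeIndex(14) // 3))
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4]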

Add a suffix to each ProcId to create a new column, ProcEnds, then join those strings within each group:

df['ProcEnds'] = (df.ProcId + '%22%2C%2').where(
    df.index % chunk_size != chunk_size - 1,
    df.ProcId + '%22%29')
# note DataFrame.where replaces values with other when cond is False
df['ChunkString'] = df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
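
To make the .where condition concrete, this is what ProcEnds holds for the first chunk of the sample data (the values follow directly from the code above):

# index 0: 'proce_123%22%2C%2'   (0 % 3 != 2 -> separator suffix)
# index 1: 'proce_126%22%2C%2'   (1 % 3 != 2 -> separator suffix)
# index 2: 'proce_123%22%29'     (2 % 3 == 2 -> closing suffix)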

Optionally, drop the ChunkId and ProcEnds columns to get an output with only the extra ChunkString column:

df = df.drop(columns=['ChunkId', 'ProcEnds'])

df now outputs:

Metric       ProcId           TimeStamp      Value                                           ChunkString
0     CPU    proce_123 2022-03-11 11:00:00   1.445312     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1     CPU    proce_126 2022-03-11 11:00:00   0.058320     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2     CPU    proce_123 2022-03-11 11:00:00   0.095274     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3     CPU    proce_000 2022-03-11 11:00:00   0.019654       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4     CPU    proce_144 2022-03-11 11:00:00   0.019841       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5     CPU      proce_1 2022-03-11 11:00:00   0.234742       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6     CPU    proce_100 2022-03-11 11:00:00   5.329458  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7     CPU  proce_57777 2022-03-11 11:00:00   0.253906  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8     CPU   proce_0000 2022-03-11 11:00:00   0.019350  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9     CPU    proce_123 2022-03-11 11:00:00   0.019501     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10    CPU    proce_123 2022-03-11 11:00:00   2.324219     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11    CPU    proce_123 2022-03-11 11:00:00  68.390366     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12    CPU    proce_123 2022-03-11 11:00:00   0.057781                    proce_123%22%2C%2proce_123%22%2C%2
13    CPU    proce_123 2022-03-11 11:00:00   0.416667                    proce_123%22%2C%2proce_123%22%2C%2

Update

A Google Colab notebook showing the output with the sample data: https://colab.research.google.com/drive/1f9ZHXE2ATZXD2qWsoATxEWABIBt0tMRN?usp=sharing

Update 2

The OP asked:

Quick question. Can we do the grouping based on df['Metric']? For example, it would be CPU, Memory. I need the ChunkString based on CPU or Memory.

To apply this transformation within each Metric group, the simplest way is to wrap the transformation logic in a function and apply it to each group.

A little extra care is needed to preserve the original index.

def transform(frame):
    _df = frame.reset_index(drop=True)
    _df['ChunkId'] = _df.index // chunk_size
    _df['ProcEnds'] = (_df.ProcId + '%22%2C%2').where(
        _df.index % chunk_size != chunk_size - 1,
        _df.ProcId + '%22%29')
    _df['ChunkString'] = _df.groupby('ChunkId').ProcEnds.transform(lambda x: x.str.cat())
    return _df.drop(columns=['ChunkId', 'ProcEnds'])

idx = df.index
df.groupby('Metric').apply(transform).set_index(idx)

This generates the same output as before, omitted here for brevity.

You can form groups of N rows using Python integer division (//) on the index:

N = 3
df['ChunkString'] = df.groupby(df.index//N)['ProcId'].transform(
    lambda x: '%22%2C%2'.join(x.tolist() + ['']*(N-len(x)))
              + ('%22%29' if len(x) == N else ''))

Notes:

  • x.tolist() + ['']*(N-len(x)) simply converts x to a list and pads it with empty items until it reaches length N (see the sketch below)
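
For instance, for the final partial chunk of the sample (two rows, N = 3), the padding and join work out as follows (a small sketch with hard-coded values):

x = ['proce_123', 'proce_123']       # a partial chunk of 2 items
padded = x + [''] * (3 - len(x))     # ['proce_123', 'proce_123', '']
print('%22%2C%2'.join(padded))       # proce_123%22%2C%2proce_123%22%2C%2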

Output

>>> df
Metric       ProcId           TimeStamp      Value                                           ChunkString
0     CPU    proce_123 2022-03-11 11:00:00   1.445312     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1     CPU    proce_126 2022-03-11 11:00:00   0.058320     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
2     CPU    proce_123 2022-03-11 11:00:00   0.095274     proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
3     CPU    proce_000 2022-03-11 11:00:00   0.019654       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
4     CPU    proce_144 2022-03-11 11:00:00   0.019841       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
5     CPU      proce_1 2022-03-11 11:00:00   0.234742       proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
6     CPU    proce_100 2022-03-11 11:00:00   5.329458  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
7     CPU  proce_57777 2022-03-11 11:00:00   0.253906  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
8     CPU   proce_0000 2022-03-11 11:00:00   0.019350  proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
9     CPU    proce_123 2022-03-11 11:00:00   0.019501     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
10    CPU    proce_123 2022-03-11 11:00:00   2.324219     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
11    CPU    proce_123 2022-03-11 11:00:00  68.390366     proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
12    CPU    proce_123 2022-03-11 11:00:00   0.057781                    proce_123%22%2C%2proce_123%22%2C%2
13    CPU    proce_123 2022-03-11 11:00:00   0.416667                    proce_123%22%2C%2proce_123%22%2C%2

N = 5

>>> df
Metric       ProcId           TimeStamp      Value                                                                           ChunkString
0     CPU    proce_123 2022-03-11 11:00:00   1.445312   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
1     CPU    proce_126 2022-03-11 11:00:00   0.058320   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
2     CPU    proce_123 2022-03-11 11:00:00   0.095274   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
3     CPU    proce_000 2022-03-11 11:00:00   0.019654   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
4     CPU    proce_144 2022-03-11 11:00:00   0.019841   proce_123%22%2C%2proce_126%22%2C%2proce_123%22%2C%2proce_000%22%2C%2proce_144%22%29
5     CPU      proce_1 2022-03-11 11:00:00   0.234742  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
6     CPU    proce_100 2022-03-11 11:00:00   5.329458  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
7     CPU  proce_57777 2022-03-11 11:00:00   0.253906  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
8     CPU   proce_0000 2022-03-11 11:00:00   0.019350  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
9     CPU    proce_123 2022-03-11 11:00:00   0.019501  proce_1%22%2C%2proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%2C%2proce_123%22%29
10    CPU    proce_123 2022-03-11 11:00:00   2.324219                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
11    CPU    proce_123 2022-03-11 11:00:00  68.390366                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
12    CPU    proce_123 2022-03-11 11:00:00   0.057781                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
13    CPU    proce_123 2022-03-11 11:00:00   0.416667                  proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2proce_123%22%2C%2
A plain-Python approach, building a list of the IDs first and then appending the right suffix while enumerating it:

chunk_size = 3
list_of_proc_ids = []
# First, generate a list of the procIds
for obj in range(0, len(id)):
    list_of_proc_ids.append(procId)  # Not sure how you're appending this, guessing you use a slice on the string line?

final_str = ''
# Then enumerate through that list, adding a unique ending at every third
for index, obj in enumerate(list_of_proc_ids):
    final_str += str(obj)
    if (index + 1) % chunk_size == 0:  # Checks if divisible by 3, accounting for 0 index
        final_str += '%22%29'
    else:
        final_str += '%22%2C%2'
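
The loop above builds one long concatenated string; if you want one string per chunk, as in the question's expected output, here is a minimal variant (assuming list_of_proc_ids has been populated as above):

chunk_strings = []
for i in range(0, len(list_of_proc_ids), chunk_size):
    chunk = list_of_proc_ids[i:i + chunk_size]
    # join the items with the separator code, then close the chunk with the terminator
    chunk_strings.append('%22%2C%2'.join(chunk) + '%22%29')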

If your goal is to process large amounts of data, Pandas has a great page about this that describes an efficient way to create chunks: https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html
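
For illustration, the chunked-reading pattern from that guide looks roughly like this ('data.csv' and the per-chunk processing step are placeholders):

import pandas as pd

# read a large CSV in manageable pieces instead of loading it all at once
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    print(len(chunk))  # replace with your per-chunk processing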

My suggestion is to use a Dask implementation to process the chunks efficiently (locally, or to prepare for larger datasets by distributing the computation on a cluster). Dask comes with a DataFrame API that looks like Pandas's, so it should feel comfortable to use.

import dask
import dask.dataframe as dd

df1 = dd.read_csv('myfile1.csv')
df2 = dd.read_csv('myfile2.csv')
# do something with your dataframes
# combine them (concatenate -- but you could also join etc.)
final_df = dd.concat([df1, df2])
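
Keep in mind that Dask is lazy: nothing is actually read or computed until you materialize the result.

result = final_df.compute()  # triggers execution and returns a regular pandas DataFrame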

You'll find many good examples here:

  • https://tutorial.dask.org/04_dataframe.html
  • https://towardsdatascience.com/pandas-with-dask-for-an-ultra-fast-notebook-e2621c3769f

You can use GroupBy.apply:

N = 3
out = (df.groupby(df.index//N)['ProcId']
         .apply(lambda x: '%22%2C%2'.join(x)+'%22%29')
      )
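
If you also want the chunk string repeated on every row of df, as in the other answers, one option (a sketch assuming the default RangeIndex) is to map the chunk numbers through out:

df['ChunkString'] = (df.index // N).map(out)  # Index.map accepts a Series as a lookup table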

Output (as a Series):

0       proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29
1         proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29
2    proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29
3       proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29
4                        proce_123%22%2C%2proce_123%22%29
Name: ProcId, dtype: object

Or, output as a list:

# out.to_list()
['proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29',
 'proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29',
 'proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29',
 'proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29',
 'proce_123%22%2C%2proce_123%22%29']

Old answer (before the question was updated)

For efficiency, use vectorized methods:

import numpy as np
N = 3
# pick the closing code for every Nth ProcId, the separator code otherwise
s = np.where(np.arange(len(df)) % N < N-1, '%22%2C%2', '%22%29')
# concatenate the strings
out = (df['ProcId'] + s).str.cat()

Output: 'proce_123%22%2C%2proce_126%22%2C%2proce_123%22%29proce_000%22%2C%2proce_144%22%2C%2proce_1%22%29proce_100%22%2C%2proce_57777%22%2C%2proce_0000%22%29proce_123%22%2C%2proce_123%22%2C%2proce_123%22%29proce_123%22%2C%2proce_123%22%2C%2'
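
If separate per-chunk strings are preferred over one long string, here is a minimal sketch building on the same s array:

parts = df['ProcId'] + s
chunk_strings = parts.groupby(np.arange(len(df)) // N).apply(''.join).tolist()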

I guess you want something like .rolling, but applied to a string column. This adapts a rolling-sum approach, modified a bit for strings.

import pandas as pd

def create_chunk_string(dff):
    # concatenate each ProcId with the next two (a rolling window of width 3)
    res_df = pd.concat(
        [dff['ProcId'].shift(-i).fillna('') for i in range(3)],
        axis=1
    ).sum(axis=1)
    res_df = res_df.iloc[:-3+1].to_frame(name='ChunkStr')
    return dff.join(res_df, how='left')

print(df.groupby('Metric').apply(create_chunk_string))
Metric       ProcId            TimeStamp      Value                        ChunkStr
0     CPU    proce_123  Mar-11-202211:00:00   1.445312     proce_123proce_126proce_123
1     CPU    proce_126  Mar-11-202211:00:00   0.058320     proce_126proce_123proce_000
2     CPU    proce_123  Mar-11-202211:00:00   0.095274     proce_123proce_000proce_144
3     CPU    proce_000  Mar-11-202211:00:00   0.019654       proce_000proce_144proce_1
4     CPU    proce_144  Mar-11-202211:00:00   0.019841       proce_144proce_1proce_100
5     CPU      proce_1  Mar-11-202211:00:00   0.234742                             NaN
6     CPU    proce_100  Mar-11-202211:00:00   5.329458                             NaN
7     GPU  proce_57777  Mar-11-202211:00:00   0.253906  proce_57777proce_0000proce_123
8     GPU   proce_0000  Mar-11-202211:00:00   0.019350    proce_0000proce_123proce_123
9     GPU    proce_123  Mar-11-202211:00:00   0.019501     proce_123proce_123proce_123
10    GPU    proce_123  Mar-11-202211:00:00   2.324219     proce_123proce_123proce_123
11    GPU    proce_123  Mar-11-202211:00:00  68.390366     proce_123proce_123proce_123
12    GPU    proce_123  Mar-11-202211:00:00   0.057781                             NaN
13    GPU    proce_123  Mar-11-202211:00:00   0.416667                             NaN

You may want to sort the DataFrame before doing this. Adding the separator strings between the pieces and at the end of ChunkStr should be straightforward.
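
For example, a minimal sketch of one way to weave the codes into create_chunk_string above (same window width of 3; the separator code goes after the first two pieces and the terminator after the last):

res_df = pd.concat(
    [(dff['ProcId'].shift(-i) + ('%22%2C%2' if i < 2 else '%22%29')).fillna('')
     for i in range(3)],
    axis=1
).sum(axis=1)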
