How can I rearrange rows in pandas based on a string-match condition on an unnamed column?



We have a pandas DataFrame like this:

Unnamed:0               T1    T2    T3   ...  T120
0  cheetah Running         x1    x2    x1   ...   x3
1  Running Jaguar          x1    x10   x3   ...   x7
2  Cougar Running          x1    x2    x1   ...   x3
3  Bengal Tiger Running    x5    x2    x4   ...   x17
4  Sleeping Bali Tiger     x55   x61   x11  ...   x31
5  Javan Leopard Sleeping  x42   x67   x17  ...   x34
6  Leopard Running         x2    x5    x2   ...   x3
7  Bengal Tiger Running    x5    x2    x4   ...   x17
..      ...                ...   ...   ...  ...   ...
199 Florida Panther Eating  x71   x80   x101 ...   x94
200 Running Eastern Cougar  x5    x1     x2  ...   x3
201 Congo Lion Sleeping     x57   x61    x14  ...  x38

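We are trying to reorganize this DataFrame as shown below. The first column in the DataFrame above is unnamed. We want to check that unnamed column for known common behaviors such as "Running", "Sleeping", etc., and rearrange the rows so that rows with the same behavior sit together: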

Unnamed:0               T1    T2    T3   ...  T120
0  cheetah Running         x1    x2    x1   ...   x3
1  Running Jaguar          x1    x10   x3   ...   x7
2  Cougar Running          x1    x2    x1   ...   x3
3  Bengal Tiger Running    x5    x2    x4   ...   x17
4  Running Eastern Cougar  x5    x1     x2  ...   x3
5  Bengal Tiger Running    x5    x2    x4   ...   x17
6  Leopard Running         x2    x5    x2   ...   x3
4  Sleeping Bali Tiger     x55   x61   x11  ...   x31
5  Javan Leopard Sleeping  x42   x67   x17  ...   x34
6  Congo Lion Sleeping     x57   x61    x14  ...  x38  
..      ...                ...   ...   ...  ...   ...
201 Florida Panther Eating  x71   x80   x101 ...   x94 

I tried the approach below, but it only works after I give the column a name. I also tried df[df.columns.str.contains('^Unnamed')], but had no luck.

import pandas as pd

df = pd.read_csv('a_behav_cat.csv')
# "name" is the label added by hand to the first (unnamed) column.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
df_new = pd.concat([
    df[df["name"].str.contains("Running")],
    df[df["name"].str.contains("Sleeping")],
])
print(df_new.to_string())

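Question 1: I think there should be a better, more Pythonic way to do this. Any help is appreciated. Also, this checks for an exact (case-sensitive) string match, which is not ideal because the dataset may contain variants such as a plain "running." or "sleeping;:)". I tried .lower(), but it did not work.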
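For the case-sensitivity part of Question 1, one possible sketch is shown here. It assumes a shortened copy of the sample data in which two rows were changed to hypothetical lowercase, punctuated variants ("running." and "sleeping;"), and it selects the unnamed column by position rather than by label. The `behaviours` list is also an assumption.

```python
import io
import pandas as pd

# Shortened copy of the sample CSV; rows 2 and 4 were altered to
# hypothetical lowercase/punctuated variants for illustration.
CSV = """,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Sleeping Bali Tiger,x55,x61,x11,x31
Leopard running.,x2,x5,x2,x3
Florida Panther Eating,x71,x80,x101,x94
Congo Lion sleeping;,x57,x61,x14,x38
"""
df = pd.read_csv(io.StringIO(CSV))

# read_csv labels a column whose header cell is empty "Unnamed: 0";
# selecting by position avoids depending on that label.
names = df.iloc[:, 0]

# case=False makes the substring match case-insensitive.
behaviours = ["running", "sleeping"]  # assumed list of known behaviours
parts = [df[names.str.contains(b, case=False)] for b in behaviours]

# Matched rows first, grouped by behaviour; everything else goes last.
matched = pd.concat(parts)
rest = df.drop(matched.index)
df_new = pd.concat([matched, rest])
print(df_new.to_string())
```

Note that `.lower()` is a plain `str` method; on a Series the equivalent is `names.str.lower()`, which may be why the original attempt failed.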

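The purpose of all this is to determine how many distinct x categories occur for a single observation. Here T1, T2, T3, …, T120 are the observations. We need to determine how often each value occurs in each observation. For example, for T1 and "Running" there are 3 'x1', 3 'x5' and 1 'x2'.

To do this, our first idea was to rearrange the DataFrame as described above.

However, we are not sure whether this rearrangement is actually necessary. In addition, the output looks multidimensional: for T1 and "Running", we need to store how many x1, x3, x5, etc. there are. The same has to be done for the other behaviors, such as "Eating", "Sleeping", and so on.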

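Question 2: What is the best way to achieve this? Is there a data structure suited to this purpose? Is there a better way to do it without rearranging the DataFrame?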
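To make the goal of Question 2 concrete, here is a minimal sketch of one way the counts could be stored without reordering any rows: a Series with a three-level index (behavior, observation, value). The behavior keywords in the regex and the names `behaviour`, `tidy` and `counts` are assumptions chosen for illustration.

```python
import io
import re
import pandas as pd

CSV = """,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38
"""
df = pd.read_csv(io.StringIO(CSV))

# Pull the behaviour word out of the unnamed first column, ignoring case.
# The keyword list is an assumption; extend it to cover the real data.
behaviour = (
    df.iloc[:, 0]
    .str.extract(r"\b(running|sleeping|eating)\b", flags=re.IGNORECASE, expand=False)
    .str.lower()
)

# Long format: one row per (behaviour, observation, value).
tidy = (
    df.drop(columns=df.columns[0])
    .assign(behaviour=behaviour)
    .melt(id_vars="behaviour", var_name="observation", value_name="value")
)

# Number of occurrences of each value per behaviour and observation.
counts = tidy.groupby(["behaviour", "observation", "value"]).size()

# Counts of each x value for "running" in observation T1.
print(counts.loc[("running", "T1")])
```

If a rectangular table is easier to work with, `counts.unstack(fill_value=0)` turns the values into columns.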

Here is a sample CSV if you want to test:

,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38

IIUC, you can use a dictionary mapping to assign the right category to each row:

# Your list of patterns
MAPPING = {'S': ['sleep', 'sleeping'],
           'R': ['run', 'running'],
           'E': ['eat', 'eating']}
# Reverse the mapping (swap keys and values)
rev = {v: k for k, l in MAPPING.items() for v in l}
# Create the regex pattern (\b marks a word boundary)
pat = fr"\b({'|'.join(rev)})\b"
# Extract from data
df['CAT'] = df['Unnamed: 0'].str.lower().str.extract(pat, expand=False).map(rev)

Output:

>>> df
                Unnamed: 0   T1   T2    T3   T4 CAT
0          cheetah Running   x1   x2    x1   x3   R
1           Running Jaguar   x1  x10    x3   x7   R
2           Cougar Running   x1   x2    x1   x3   R
3     Bengal Tiger Running   x5   x2    x4  x17   R
4      Sleeping Bali Tiger  x55  x61   x11  x31   S
5   Javan Leopard Sleeping  x42  x67   x17  x34   S
6          Leopard Running   x2   x5    x2   x3   R
7     Bengal Tiger Running   x5   x2    x4  x17   R
8   Florida Panther Eating  x71  x80  x101  x94   E
9   Running Eastern Cougar   x5   x1    x2   x3   R
10     Congo Lion Sleeping  x57  x61   x14  x38   S
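
Once the CAT column exists, the reordering asked about in the question could be done with a single sort. This is a sketch only: it condenses the mapping to one keyword per category, and the group order R, S, E is an assumption taken from the desired output in the question.

```python
import io
import pandas as pd

CSV = """,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38
"""
df = pd.read_csv(io.StringIO(CSV))

# Same idea as the mapping above, condensed to one keyword per category.
rev = {"running": "R", "sleeping": "S", "eating": "E"}
pat = fr"\b({'|'.join(rev)})\b"
df["CAT"] = df["Unnamed: 0"].str.lower().str.extract(pat, expand=False).map(rev)

# An ordered categorical fixes the group order (R, then S, then E);
# a stable sort keeps the original row order inside each group.
df["CAT"] = pd.Categorical(df["CAT"], categories=["R", "S", "E"], ordered=True)
df_sorted = df.sort_values("CAT", kind="stable")
print(df_sorted.to_string())
```

Passing `ignore_index=True` to `sort_values` would renumber the rows from 0 after sorting.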
