From R to Python pandas: creating an id key for a series, ordered by repeated items

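I want to create an id key for a sequence of strings that repeats within a column. The first ten rows should be id #1, the next ten rows id #2, and so on (the small example below uses five rows per id). In R this is easy, and I got the expected result with dplyr.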

R：

library(tidyverse)
question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
answer <- c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
df <- tibble(question, answer)  # data_frame() is deprecated in favour of tibble()
# A tibble: 15 x 2
question answer
<chr>    <chr> 
1 q1       a1    
2 q2       a2    
3 q3       a3    
4 q4       a4    
5 q5       a5    
6 q1       a1    
7 q2       a2    
8 q3       a3    
9 q4       a4    
10 q5       a5    
11 q1       a1    
12 q2       a2    
13 q3       a3    
14 q4       a4    
15 q5       a5

If we just run a group_by and a mutate to add a key for the series, we get what I want:

df2 <- df %>% 
group_by(question) %>% 
mutate(id = row_number())
# A tibble: 15 x 3
# Groups:   question [5]
question answer    id
<chr>    <chr>  <int>
1 q1       a1         1
2 q2       a2         1
3 q3       a3         1
4 q4       a4         1
5 q5       a5         1
6 q1       a1         2
7 q2       a2         2
8 q3       a3         2
9 q4       a4         2
10 q5       a5         2

I finish by spreading the result into wide format:

df2 <- df %>% 
group_by(question) %>% 
mutate(id = row_number()) %>% 
spread(question, answer) 
# final table:
# A tibble: 3 x 6
id    q1    q2    q3    q4    q5   
<int> <chr> <chr> <chr> <chr> <chr>
1     1 a1    a2    a3    a4    a5   
2     2 a1    a2    a3    a4    a5   
3     3 a1    a2    a3    a4    a5

Python:

Now, I can't figure out how to get the same result in pandas. I have tried groupby and merge, with no luck.

import pandas as pd
data = {'question': ['question one', 'question two', 
'question three', 'question four', 
'question five', 'question one', 
'question two', 'question three', 
'question four', 'question five', 
'question one', 'question two', 
'question three', 'question four', 'question five'], 
'answer':['answer one', 'answer two', 'answer three', 
'answer four', 'answer five', 'answer one', 
'answer two', 'answer three', 'answer four', 
'answer five', 'answer one', 'answer two', 
'answer three', 'answer four', 'answer five']}
df = pd.DataFrame(data)

Using merge and reset_index(), the rows get reordered and the id is assigned on the new order, which is not what I want:

df2 = df.merge(df.drop_duplicates('question').reset_index(), on='question')
question      answer_x  index      answer_y
0     question one    answer one      0    answer one
1     question one    answer one      0    answer one
2     question one    answer one      0    answer one
3     question two    answer two      1    answer two
4     question two    answer two      1    answer two
5     question two    answer two      1    answer two

With groupby, I get a jumble that is not what I want either:

df['id'] = df.groupby('question').ngroup()
question        answer  id
0     question one    answer one   2
1     question two    answer two   4
2   question three  answer three   3
3    question four   answer four   1
4    question five   answer five   0
5     question one    answer one   2
6     question two    answer two   4
7   question three  answer three   3
8    question four   answer four   1
9    question five   answer five   0

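How do I get the same output as with dplyr? Edit: to add more detail, I need the output to look exactly like what dplyr gives me, because this is part of an automated system.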

ngroup is the number of the group, not the number within the group. As the documentation explains, its complement is cumcount.
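To make the difference concrete, here is a minimal sketch on a shortened version of the data (three questions repeated twice, which is my own cut-down example, not the asker's frame). ngroup labels each row with the number of its group, numbered in sorted key order, while cumcount gives the 0-based position of each row inside its group; adding 1 gives the same numbering as dplyr's row_number().

```python
import pandas as pd

# Shortened, hypothetical version of the question's data.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3"] * 2,
    "answer": ["a1", "a2", "a3"] * 2,
})

grouped = df.groupby("question")

# Number OF the group: identical for every row that shares a question.
group_number = grouped.ngroup()

# Number WITHIN the group: 0 for the first occurrence, 1 for the second, ...
position = grouped.cumcount()

# dplyr's row_number() is 1-based, so shift by one to match it.
row_number = position + 1

print(pd.DataFrame({"ngroup": group_number, "row_number": row_number}))
```

The ngroup column repeats 0, 1, 2 down the frame, whereas row_number stays at 1 for the first block and 2 for the second, which is the id the question asks for.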

Roughly speaking, you can use assign for mutate, groupby/cumcount for row_number, and pivot for your spread:

In [306]: df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
Out[306]: 
question  q1  q2  q3  q4  q5
id                          
0         a1  a2  a3  a4  a5
1         a1  a2  a3  a4  a5

If you want id to be a column, just throw in a reset_index().
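Putting those pieces together, a sketch of the whole pipeline might look like the following. It assumes the short q1..q5 / a1..a5 data from the R example, adds 1 to cumcount so the ids start at 1 as in dplyr, and passes keyword arguments to pivot because recent pandas versions no longer accept them positionally. Clearing the columns name at the end is purely cosmetic and my own choice.

```python
import pandas as pd

# Same data as the R example: five questions repeated three times.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4", "q5"] * 3,
    "answer": ["a1", "a2", "a3", "a4", "a5"] * 3,
})

wide = (
    df.assign(id=df.groupby("question").cumcount() + 1)       # mutate(id = row_number())
      .pivot(index="id", columns="question", values="answer")  # spread(question, answer)
      .reset_index()                                           # turn id back into a column
)

# pivot leaves "question" as the name of the columns axis; drop it for a flat header.
wide.columns.name = None

print(wide)
```

The result has one row per id and one column per question, the same shape as the 3 x 6 tibble in the question.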

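Unfortunately, if we want to truly match the expected output, we have to guarantee the column order. There are several open tickets on GitHub about how inconvenient the automatic sorting is, but we can do it manually. Switching back to the version of the data with the longer English text: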

In [327]: d2 = df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
In [328]: d2.reindex(df.question.drop_duplicates(), axis=1)
Out[328]: 
question question one question two question three question four question five
id                                                                           
0          answer one   answer two   answer three   answer four   answer five
1          answer one   answer two   answer three   answer four   answer five
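For a self-contained illustration of why the reindex step matters, here is a small sketch with only three of the questions (a cut-down example of mine). pivot returns the new columns sorted alphabetically, so "question three" lands before "question two"; reindexing on the order of first appearance restores the original sequence.

```python
import pandas as pd

# Cut-down, hypothetical version of the long-text data.
df = pd.DataFrame({
    "question": ["question one", "question two", "question three"] * 2,
    "answer": ["answer one", "answer two", "answer three"] * 2,
})

# Order in which each question first appears in the original frame.
first_seen = df["question"].drop_duplicates().tolist()

wide = (
    df.assign(id=df.groupby("question").cumcount() + 1)
      .pivot(index="id", columns="question", values="answer")
)

# Columns as pivot returns them: sorted, not in order of appearance.
sorted_cols = list(wide.columns)

# Put the columns back in first-appearance order.
wide = wide.reindex(columns=first_seen)

print(sorted_cols)
print(list(wide.columns))
```

If the order has to be stable for an automated system, deriving it from the input like this avoids hard-coding the question names.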

With datar, you can replicate it as easily as in R:

>>> from datar.all import c, f, tibble, group_by, mutate, row_number, pivot_wider
>>> 
>>> question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
>>> answer = c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
>>> 
>>> df = tibble(question, answer)
>>> df
question answer
0        q1     a1
1        q2     a2
2        q3     a3
3        q4     a4
4        q5     a5
5        q1     a1
6        q2     a2
7        q3     a3
8        q4     a4
9        q5     a5
10       q1     a1
11       q2     a2
12       q3     a3
13       q4     a4
14       q5     a5
>>> df2 = (df >>
...   group_by(f.question) >>
...   mutate(id = row_number()))
>>> 
>>> df2
question answer  id
0        q1     a1   1
1        q2     a2   1
2        q3     a3   1
3        q4     a4   1
4        q5     a5   1
5        q1     a1   2
6        q2     a2   2
7        q3     a3   2
8        q4     a4   2
9        q5     a5   2
10       q1     a1   3
11       q2     a2   3
12       q3     a3   3
13       q4     a4   3
14       q5     a5   3
[Groups: ['question'] (n=5)]
>>> df2 = (df >>
...   group_by(f.question) >>
...   mutate(id = row_number()) >>
...   pivot_wider(names_from=f.question, values_from=f.answer))
>>> 
>>> df2
id  q1  q2  q3  q4  q5
0   1  a1  a2  a3  a4  a5
1   2  a1  a2  a3  a4  a5
2   3  a1  a2  a3  a4  a5

I am the author of the package. Feel free to submit an issue if you have any questions.

I know the question is about how to get a solution in Python; nevertheless, I will leave this solution using data.table and reshape2.

library(data.table)
library(reshape2)
setDT(df)[,new := (1:.N), by = question]
dcast(df, new ~ question, value.var = "answer")
new q1 q2 q3 q4 q5
1:   1 a1 a2 a3 a4 a5
2:   2 a1 a2 a3 a4 a5
3:   3 a1 a2 a3 a4 a5
