From R to Python pandas: create an id key for a series based on the order of duplicates



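I want to create an id key for a sequence of strings that repeats within one column. The first ten rows should get id #1, the next ten rows id #2, and so on (the example below uses blocks of five rows). In R this is simple, and I got the expected result with dplyr.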

R:

library(tidyverse)
question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
answer <- c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
df <- tibble(question, answer)  # data_frame() is deprecated in favor of tibble()
# A tibble: 15 x 2
question answer
<chr>    <chr> 
1 q1       a1    
2 q2       a2    
3 q3       a3    
4 q4       a4    
5 q5       a5    
6 q1       a1    
7 q2       a2    
8 q3       a3    
9 q4       a4    
10 q5       a5    
11 q1       a1    
12 q2       a2    
13 q3       a3    
14 q4       a4    
15 q5       a5 

If we just run a group_by and a mutate to add a key to the series, we get what I want:

df2 <- df %>% 
group_by(question) %>% 
mutate(id = row_number())
# A tibble: 15 x 3
# Groups:   question [5]
question answer    id
<chr>    <chr>  <int>
1 q1       a1         1
2 q2       a2         1
3 q3       a3         1
4 q4       a4         1
5 q5       a5         1
6 q1       a1         2
7 q2       a2         2
8 q3       a3         2
9 q4       a4         2
10 q5       a5         2

I finish with a spread:

df2 <- df %>% 
group_by(question) %>% 
mutate(id = row_number()) %>% 
spread(question, answer) 
# final table:
# A tibble: 3 x 6
id    q1    q2    q3    q4    q5   
<int> <chr> <chr> <chr> <chr> <chr>
1     1 a1    a2    a3    a4    a5   
2     2 a1    a2    a3    a4    a5   
3     3 a1    a2    a3    a4    a5 

Python:

Now, I don't know how to get the same result in pandas. I have tried groupby and merge, with no luck.

import pandas as pd
data = {'question': ['question one', 'question two', 
'question three', 'question four', 
'question five', 'question one', 
'question two', 'question three', 
'question four', 'question five', 
'question one', 'question two', 
'question three', 'question four', 'question five'], 
'answer':['answer one', 'answer two', 'answer three', 
'answer four', 'answer five', 'answer one', 
'answer two', 'answer three', 'answer four', 
'answer five', 'answer one', 'answer two', 
'answer three', 'answer four', 'answer five']}
df = pd.DataFrame(data)

Using merge and reset_index(), the rows get reordered and the id is assigned on the new order, which is not what I want:

df2 = df.merge(df.drop_duplicates('question').reset_index(), on='question')
          question      answer_x  index      answer_y
0     question one    answer one      0    answer one
1     question one    answer one      0    answer one
2     question one    answer one      0    answer one
3     question two    answer two      1    answer two
4     question two    answer two      1    answer two
5     question two    answer two      1    answer two

Using groupby, I get a mess that is not what I want either:

df['id'] = df.groupby('question').ngroup()
          question        answer  id
0     question one    answer one   2
1     question two    answer two   4
2   question three  answer three   3
3    question four   answer four   1
4    question five   answer five   0
5     question one    answer one   2
6     question two    answer two   4
7   question three  answer three   3
8    question four   answer four   1
9    question five   answer five   0

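How can I get the same output as with dplyr? Edit: to add more detail, I need the output to look like what dplyr gives me, because this is part of an automated system.

ngroup gives the number of the group, not the number within the group. As the documentation explains, the complement to it is cumcount.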
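To make the difference concrete, here is a minimal sketch on a small made-up frame (the six-row data is an illustration, not the question's full table). It assumes the default `sort=True` behavior of `groupby`, under which group numbers follow the sorted group keys.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, shorter than the question's table).
df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q1", "q2", "q3"],
    "answer": ["a1", "a2", "a3", "a1", "a2", "a3"],
})

grouped = df.groupby("question")

# ngroup: one number per group, shared by every row that belongs to that group.
group_number = grouped.ngroup()

# cumcount: 0-based position of each row within its own group.
position_in_group = grouped.cumcount()

print(df.assign(ngroup=group_number, cumcount=position_in_group))
```

The `cumcount` column is the one that plays the role of dplyr's `row_number()` inside a `group_by`, except that it starts at 0 rather than 1.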

Roughly speaking, you can use assign for mutate, groupby/cumcount for row_number, and pivot for your spread:

In [306]: df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")  # keyword arguments are required by pivot in pandas 2.0+
Out[306]: 
question  q1  q2  q3  q4  q5
id                          
0         a1  a2  a3  a4  a5
1         a1  a2  a3  a4  a5

If you want id to be a column, throw in a reset_index().
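As a sketch of that last step, the snippet below rebuilds the short q1..q5 data from the R example, adds 1 to `cumcount` so the ids start at 1 as dplyr's `row_number()` does, and moves `id` out of the index. Clearing `columns.name` is an optional cosmetic step added here for illustration; it is not part of the answer above.

```python
import pandas as pd

# Same shape as the R example: five questions repeated three times.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4", "q5"] * 3,
    "answer": ["a1", "a2", "a3", "a4", "a5"] * 3,
})

wide = (
    # +1 because cumcount is 0-based while dplyr's row_number() is 1-based.
    df.assign(id=df.groupby("question").cumcount() + 1)
      .pivot(index="id", columns="question", values="answer")
      .reset_index()  # turn the id index into an ordinary column
)

# Optional: drop the leftover "question" label that pivot puts on the columns.
wide.columns.name = None

print(wide)
```

With this data the column order already matches the R output, because q1..q5 happen to sort alphabetically in their original order.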

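Unfortunately, to really match the expected output we have to guarantee the column order. There are several open tickets on GitHub about how inconvenient the automatic sorting is, but we can restore the order manually. Switching back to the full-text data from the question: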

In [327]: d2 = df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
In [328]: d2.reindex(df.question.drop_duplicates(), axis=1)
Out[328]: 
question question one question two question three question four question five
id                                                                           
0          answer one   answer two   answer three   answer four   answer five
1          answer one   answer two   answer three   answer four   answer five
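Putting the pieces together, the following is one possible sketch of the whole pipeline wrapped in a function. The name `spread_in_order` and its parameters are hypothetical, chosen only for this illustration; the function assumes each key appears the same number of times and that the frame has no existing `id` column.

```python
import pandas as pd


def spread_in_order(df, key="question", value="answer"):
    """Hypothetical helper: widen `value` by `key`, keeping first-appearance column order."""
    # 1-based running count within each key, like dplyr's row_number() after group_by().
    ids = df.groupby(key).cumcount() + 1
    wide = df.assign(id=ids).pivot(index="id", columns=key, values=value)
    # pivot sorts the new columns; reindex restores the order in which keys first appear.
    wide = wide.reindex(columns=df[key].drop_duplicates().tolist())
    wide.columns.name = None
    return wide.reset_index()


labels = ["one", "two", "three", "four", "five"]
df = pd.DataFrame({
    "question": [f"question {w}" for w in labels] * 3,
    "answer": [f"answer {w}" for w in labels] * 3,
})

result = spread_in_order(df)
print(result)
```

Without the `reindex` step the columns would come out alphabetically (five, four, one, three, two), which is why the manual reordering matters for the full-text labels.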

With datar, you can replicate this as easily as in R:

>>> from datar.all import c, f, tibble, group_by, mutate, row_number, pivot_wider
>>> 
>>> question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
>>> answer = c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
>>> 
>>> df = tibble(question, answer)
>>> df
   question answer
0        q1     a1
1        q2     a2
2        q3     a3
3        q4     a4
4        q5     a5
5        q1     a1
6        q2     a2
7        q3     a3
8        q4     a4
9        q5     a5
10       q1     a1
11       q2     a2
12       q3     a3
13       q4     a4
14       q5     a5
>>> df2 = (df >>
...   group_by(f.question) >>
...   mutate(id = row_number()))
>>> 
>>> df2
   question answer  id
0        q1     a1   1
1        q2     a2   1
2        q3     a3   1
3        q4     a4   1
4        q5     a5   1
5        q1     a1   2
6        q2     a2   2
7        q3     a3   2
8        q4     a4   2
9        q5     a5   2
10       q1     a1   3
11       q2     a2   3
12       q3     a3   3
13       q4     a4   3
14       q5     a5   3
[Groups: ['question'] (n=5)]
>>> df2 = (df >>
...   group_by(f.question) >>
...   mutate(id = row_number()) >>
...   pivot_wider(names_from=f.question, values_from=f.answer))
>>> 
>>> df2
   id  q1  q2  q3  q4  q5
0   1  a1  a2  a3  a4  a5
1   2  a1  a2  a3  a4  a5
2   3  a1  a2  a3  a4  a5

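I am the author of the package. Feel free to submit an issue if you have any questions.

I know the question asks for a solution in Python; nevertheless, I'll leave this solution using data.table and reshape2.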

library(data.table)
library(reshape2)
setDT(df)[,new := (1:.N), by = question]
dcast(df, new ~ question, value.var = "answer")
   new q1 q2 q3 q4 q5
1:   1 a1 a2 a3 a4 a5
2:   2 a1 a2 a3 a4 a5
3:   3 a1 a2 a3 a4 a5
