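I want to create an id key for a sequence of strings that repeats in a column. The first ten rows should be id #1, the next ten rows id #2, and so on. In R this is easy, and I get the expected result with dplyr.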
R:
library(tidyverse)
question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
answer <- c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
df <- tibble(question, answer)
# A tibble: 15 x 2
question answer
<chr> <chr>
1 q1 a1
2 q2 a2
3 q3 a3
4 q4 a4
5 q5 a5
6 q1 a1
7 q2 a2
8 q3 a3
9 q4 a4
10 q5 a5
11 q1 a1
12 q2 a2
13 q3 a3
14 q4 a4
15 q5 a5
If we just run a group_by and a mutate to add a key to the series, we get what I want:
df2 <- df %>%
group_by(question) %>%
mutate(id = row_number())
# A tibble: 15 x 3
# Groups: question [5]
question answer id
<chr> <chr> <int>
1 q1 a1 1
2 q2 a2 1
3 q3 a3 1
4 q4 a4 1
5 q5 a5 1
6 q1 a1 2
7 q2 a2 2
8 q3 a3 2
9 q4 a4 2
10 q5 a5 2
I finish with:
df2 <- df %>%
group_by(question) %>%
mutate(id = row_number()) %>%
spread(question, answer)
# final table:
# A tibble: 3 x 6
id q1 q2 q3 q4 q5
<int> <chr> <chr> <chr> <chr> <chr>
1 1 a1 a2 a3 a4 a5
2 2 a1 a2 a3 a4 a5
3 3 a1 a2 a3 a4 a5
Python:
Now, I don't know how to get the same result in pandas. I have tried groupby and merge, but with no luck.
import pandas as pd
data = {'question': ['question one', 'question two',
'question three', 'question four',
'question five', 'question one',
'question two', 'question three',
'question four', 'question five',
'question one', 'question two',
'question three', 'question four', 'question five'],
'answer':['answer one', 'answer two', 'answer three',
'answer four', 'answer five', 'answer one',
'answer two', 'answer three', 'answer four',
'answer five', 'answer one', 'answer two',
'answer three', 'answer four', 'answer five']}
df = pd.DataFrame(data)
Using merge and reset_index(), the rows get reordered and the id is assigned on the new order, which is not what I want:
df2 = df.merge(df.drop_duplicates('question').reset_index(), on='question')
question answer_x index answer_y
0 question one answer one 0 answer one
1 question one answer one 0 answer one
2 question one answer one 0 answer one
3 question two answer two 1 answer two
4 question two answer two 1 answer two
5 question two answer two 1 answer two
Using groupby, I get a jumble that is not what I want either:
df['id'] = df.groupby('question').ngroup()
question answer id
0 question one answer one 2
1 question two answer two 4
2 question three answer three 3
3 question four answer four 1
4 question five answer five 0
5 question one answer one 2
6 question two answer two 4
7 question three answer three 3
8 question four answer four 1
9 question five answer five 0
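How can I get the same output as with dplyr? Edit: To add more detail, I need the output to look like what dplyr gives me, because this is part of an automated system.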
ngroup is the number of the group, not the number within the group. As the documentation explains, the complement to this is cumcount.
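To make the difference concrete, here is a minimal sketch on a shortened, made-up version of the question's data (three questions repeated twice). It only assumes the standard pandas groupby methods ngroup and cumcount.

```python
import pandas as pd

# Shortened illustrative data: three questions, each appearing twice.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q1", "q2", "q3"],
    "answer": ["a1", "a2", "a3", "a1", "a2", "a3"],
})

grouped = df.groupby("question")

# ngroup: which group a row belongs to (groups are numbered in sorted key order).
group_number = grouped.ngroup().tolist()

# cumcount: 0-based position of a row inside its own group.
position_in_group = grouped.cumcount().tolist()

print(group_number)       # [0, 1, 2, 0, 1, 2]
print(position_in_group)  # [0, 0, 0, 1, 1, 1]
```

The second list is the one that corresponds to row_number() in dplyr, except that it starts at 0 instead of 1.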
Roughly speaking, you can use assign for mutate, groupby/cumcount for row_number, and pivot for your spread:
In [306]: df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
Out[306]:
question q1 q2 q3 q4 q5
id
0 a1 a2 a3 a4 a5
1 a1 a2 a3 a4 a5
If you want id to be a column, throw in a reset_index().
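As a rough sketch of that last step, the snippet below adds reset_index() and also shifts the id by one so that it starts at 1, as row_number() does in dplyr. The shortened data and the variable name wide are made up for illustration; clearing the columns name is an optional cosmetic step, not something the answer above requires.

```python
import pandas as pd

# Shortened illustrative data: three questions, each appearing twice.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3"] * 2,
    "answer": ["a1", "a2", "a3"] * 2,
})

wide = (
    # cumcount is 0-based; add 1 to mimic dplyr's row_number().
    df.assign(id=df.groupby("question").cumcount() + 1)
      .pivot(index="id", columns="question", values="answer")
      .reset_index()  # turn the id index into an ordinary column
)

# pivot leaves "question" as the name of the columns axis; drop it for a flat table.
wide.columns.name = None

print(wide)
```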
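Unfortunately, to really match the expected output we have to guarantee the order. There are several open tickets on GitHub about how inconvenient the automatic sorting is, but we can do it manually. Switching back to the data with English text: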
In [327]: d2 = df.assign(id=df.groupby("question").cumcount()).pivot(index="id", columns="question", values="answer")
In [328]: d2.reindex(df.question.drop_duplicates(), axis=1)
Out[328]:
question question one question two question three question four question five
id
0 answer one answer two answer three answer four answer five
1 answer one answer two answer three answer four answer five
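For an automated pipeline it may be convenient to wrap these steps in one function. The sketch below is only one possible arrangement: spread_in_order is a hypothetical helper name, not a pandas function, and it assumes the frame has no existing column called id.

```python
import pandas as pd


def spread_in_order(df, key, value):
    """Hypothetical helper: pivot to wide form, keeping first-appearance column order."""
    # 1-based position of each row within its key group (like dplyr's row_number()).
    ids = df.groupby(key).cumcount() + 1
    wide = df.assign(id=ids).pivot(index="id", columns=key, values=value)
    # pivot sorts the new columns; restore the order in which keys first appear.
    wide = wide.reindex(columns=df[key].drop_duplicates().tolist())
    wide = wide.reset_index()
    wide.columns.name = None
    return wide


questions = ["question one", "question two", "question three",
             "question four", "question five"]
answers = ["answer one", "answer two", "answer three",
           "answer four", "answer five"]
df = pd.DataFrame({"question": questions * 3, "answer": answers * 3})

result = spread_in_order(df, "question", "answer")
print(result)
```

With the question's data this gives one row per id (1, 2, 3) and the question columns in their original order rather than alphabetical order.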
With datar, you can replicate it as easily as in R:
>>> from datar.all import c, f, tibble, group_by, mutate, row_number, pivot_wider
>>>
>>> question = c('q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5', 'q1', 'q2', 'q3', 'q4', 'q5')
>>> answer = c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a3', 'a4', 'a5')
>>>
>>> df = tibble(question, answer)
>>> df
question answer
0 q1 a1
1 q2 a2
2 q3 a3
3 q4 a4
4 q5 a5
5 q1 a1
6 q2 a2
7 q3 a3
8 q4 a4
9 q5 a5
10 q1 a1
11 q2 a2
12 q3 a3
13 q4 a4
14 q5 a5
>>> df2 = (df >>
... group_by(f.question) >>
... mutate(id = row_number()))
>>>
>>> df2
question answer id
0 q1 a1 1
1 q2 a2 1
2 q3 a3 1
3 q4 a4 1
4 q5 a5 1
5 q1 a1 2
6 q2 a2 2
7 q3 a3 2
8 q4 a4 2
9 q5 a5 2
10 q1 a1 3
11 q2 a2 3
12 q3 a3 3
13 q4 a4 3
14 q5 a5 3
[Groups: ['question'] (n=5)]
>>> df2 = (df >>
... group_by(f.question) >>
... mutate(id = row_number()) >>
... pivot_wider(names_from=f.question, values_from=f.answer))
>>>
>>> df2
id q1 q2 q3 q4 q5
0 1 a1 a2 a3 a4 a5
1 2 a1 a2 a3 a4 a5
2 3 a1 a2 a3 a4 a5
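I am the author of the package. If you have any questions, feel free to submit an issue.
I know the question is about how to get a solution in Python; nevertheless, I will leave this solution using data.table and reshape2.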
library(data.table)
library(reshape2)
setDT(df)[,new := (1:.N), by = question]
dcast(df, new ~ question, value.var = "answer")
new q1 q2 q3 q4 q5
1: 1 a1 a2 a3 a4 a5
2: 2 a1 a2 a3 a4 a5
3: 3 a1 a2 a3 a4 a5