My dataset looks like this:
movie.id unknown Action Adventure rating
1 0 0 0 3.831461
2 0 1 1 3.416667
3 0 0 0 3.945946
4 0 1 0 2.894737
5 1 0 0 4.358491
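I want to compute the average rating for each genre. I could subset the data by hand for every genre, but I would like something more automatic.
update1: Each movie can belong to several genres. There is one column per genre, holding 1 if the movie belongs to that genre and 0 if it does not.
update2: So I want the mean rating of all movies that have a 1 in the Adventure column, then of all movies that have a 1 in the Action column, then of those with a 1 in the unknown column (unknown is also a genre), and so on.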
I think this works as well:
genres = names(DF)[2:4]
ret = lapply(genres, function(x) mean(DF[["rating"]][as.logical(DF[[x]])]))
cbind.data.frame(genres, means = unlist(ret)) # or whatever formatting manipulation
# genres means
#1 unknown 4.358491
#2 Action 3.155702
#3 Adventure 3.416667
where DF is:
DF = structure(list(movie.id = 1:5, unknown = c(0L, 0L, 0L, 0L, 1L
), Action = c(0L, 1L, 0L, 1L, 0L), Adventure = c(0L, 1L, 0L,
0L, 0L), rating = c(3.831461, 3.416667, 3.945946, 2.894737, 4.358491
)), .Names = c("movie.id", "unknown", "Action", "Adventure",
"rating"), class = "data.frame", row.names = c(NA, -5L))
Using the reshape2 and dplyr packages.
First install and load them:
> install.packages("reshape2")
> install.packages("dplyr")
> require(reshape2)
> require(dplyr)
Given a data frame m:
> m
id unknown Action Adventure rating
1 1 0 0 0 0.51391395
2 2 0 1 1 0.02915435
3 3 0 0 0 0.88752693
4 4 0 1 0 0.57660751
5 5 1 0 0 0.59169393
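Then it is a one-liner (current dplyr spells the pipe %>%; very old versions used %.%, which has since been removed):
> melt(m,measure=c("Action","Adventure","unknown")) %>% filter(value==1) %>% group_by(variable) %>% summarize(meanRating = mean(rating))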
Source: local data frame [3 x 2]
variable meanRating
1 Action 0.30288093
2 Adventure 0.02915435
3 unknown 0.59169393
To check, the only non-trivial one is Action, which averages over two movies:
> mean(m$rating[m$Action==1])
[1] 0.3028809
When you have many genres, set the measure= argument (a partial match for measure.vars) to the names of the genre columns.
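With many genres, typing the column names out gets tedious. A minimal sketch of deriving them instead, assuming every column other than the identifier and the rating is a genre flag (that assumption, and the name genre_cols, are mine, not part of the answer):

```r
# Sample data mirroring the layout of `m` above (values copied from the printout)
m <- data.frame(
  id        = 1:5,
  unknown   = c(0, 0, 0, 0, 1),
  Action    = c(0, 1, 0, 1, 0),
  Adventure = c(0, 1, 0, 0, 0),
  rating    = c(0.51391395, 0.02915435, 0.88752693, 0.57660751, 0.59169393)
)

# Assumption: anything that is not the id or the rating is a genre column
genre_cols <- setdiff(names(m), c("id", "rating"))

# genre_cols can then be passed on, e.g. melt(m, measure = genre_cols)
print(genre_cols)
```

If the data frame has other non-genre columns, they would need to be added to the exclusion vector.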
Rename the variable column to make the output nicer:
> melt(m,measure=c("Action","Adventure","unknown"),variable.name="genre") %>% filter(value==1) %>% group_by(genre) %>% summarize(meanRating = mean(rating))
Source: local data frame [3 x 2]
genre meanRating
1 Action 0.30288093
2 Adventure 0.02915435
3 unknown 0.59169393
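As a cross-check that needs neither package, the same per-genre means can be computed in base R by treating the 0/1 columns as weights. This is only a sketch of an alternative, not the method used above; the names flags and means are illustrative.

```r
# Same sample data as `m` above (values copied from the printout)
m <- data.frame(
  id        = 1:5,
  unknown   = c(0, 0, 0, 0, 1),
  Action    = c(0, 1, 0, 1, 0),
  Adventure = c(0, 1, 0, 0, 0),
  rating    = c(0.51391395, 0.02915435, 0.88752693, 0.57660751, 0.59169393)
)

# 0/1 membership matrix, one column per genre
flags <- as.matrix(m[c("Action", "Adventure", "unknown")])

# Sum of ratings of member movies divided by the number of member movies.
# The rating vector is recycled down each column of the matrix.
means <- colSums(flags * m$rating) / colSums(flags)
print(means)
```

A genre with no movies at all would come out as NaN (0 divided by 0), so such columns may need to be filtered first.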