How to clean a dataset imported from Excel with specific rules in R



I have a data cleaning problem. A short example of the dataset looks like this:

df <- data.frame(Text1 = c(NA, "Name","Jack","Jack"),
Text2 = c(NA, "District","Leon","Leon"),
Text3 = c(NA, "School","Black","Black"),
Text4 = c(NA, "Grade",1,1),
Scoring = c("Text", "Teacher","Ana","Ana"),
Item1 = c("11_ada", "/a/","1.0","0.0"),
Text7 = c(NA, "/c/","1.0","1.0"),
Text8 = c(NA, "/a/","1.0","1.0"),
Item2 = c("10_bada", "/x/","1.0","0.0"),
Text10 = c(NA, "/y/","0.0","1.0"),
Text11 = c(NA, "/z/","1.0","0.0"))
> df
  Text1    Text2  Text3 Text4 Scoring  Item1 Text7 Text8   Item2 Text10 Text11
1  <NA>     <NA>   <NA>  <NA>    Text 11_ada  <NA>  <NA> 10_bada   <NA>   <NA>
2  Name District School Grade Teacher    /a/   /c/   /a/     /x/    /y/    /z/
3  Jack     Leon  Black     1     Ana    1.0   1.0   1.0     1.0    0.0    1.0
4  Jack     Leon  Black     1     Ana    0.0   1.0   1.0     0.0    1.0    0.0

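The first part of the dataset contains demographic information. It also contains item information with correct (1.0) and incorrect (0.0) responses. I am reading this data from an Excel file. I need to restructure the first two rows as shown below. In particular, for the Item information, I need to rename the columns to 1.1 1.2 1.3 2.1 2.2 2.3. This is only part of the dataset. Each Item spans multiple columns, and there are many items.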

How can I clean this dataset so that it looks like the following?

df1

Name District School Grade Teacher    1.1   1.2   1.3     2.1    2.2    2.3
Jack     Leon  Black     1     Ana    1.0   1.0   1.0     1.0    0.0    1.0
Jack     Leon  Black     1     Ana    0.0   1.0   1.0     0.0    1.0    0.0

Also, how can I get the second desired dataset?

df2

Name District School Grade Teacher    1.1   1.2   1.3     2.1    2.2    2.3
NA       NA    NA       NA     NA    11_ada 11_ada 11_ada 10_bada 10_bada 10_bada
Jack     Leon  Black     1     Ana    1.0   1.0   1.0     1.0    0.0    1.0
Jack     Leon  Black     1     Ana    0.0   1.0   1.0     0.0    1.0    0.0
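library(janitor)
library(dplyr)
# Running item index per column: 0 before the first Item column, then 1, 2, ...
i1 <- cumsum(grepl("Item", names(df)))
# Position of the first item column
i2 <- match(1, i1)
# Unique names for the item columns: 1, 1.1, 1.2, 2, 2.1, 2.2
nm1 <- make.unique(as.character(i1[i1 > 0]))
# Promote row 2 to the header, de-duplicate the names, then rename the item columns
row_to_names(df, 2) %>%
  setNames(make.unique(names(.))) %>%
  rename_with(~nm1, i2:last_col())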

-output

  Name District School Grade Teacher   1 1.1 1.2   2 2.1 2.2
3 Jack     Leon  Black     1     Ana 1.0 1.0 1.0 1.0 0.0 1.0
4 Jack     Leon  Black     1     Ana 0.0 1.0 1.0 0.0 1.0 0.0
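
Note that the item names above come out as 1, 1.1, 1.2, 2, 2.1, 2.2 rather than the 1.1, 1.2, 1.3, 2.1, 2.2, 2.3 asked for in the question. The following is a minimal base R sketch of one way to get the requested numbering. It assumes that each item's columns sit next to each other, starting at a column whose name contains "Item", and that the header row is the one with "Name" in the first column. The names `item`, `pos`, `hdr` and `new_names` are only illustrative.

```r
# Example data from the question
df <- data.frame(
  Text1   = c(NA, "Name", "Jack", "Jack"),
  Text2   = c(NA, "District", "Leon", "Leon"),
  Text3   = c(NA, "School", "Black", "Black"),
  Text4   = c(NA, "Grade", 1, 1),
  Scoring = c("Text", "Teacher", "Ana", "Ana"),
  Item1   = c("11_ada", "/a/", "1.0", "0.0"),
  Text7   = c(NA, "/c/", "1.0", "1.0"),
  Text8   = c(NA, "/a/", "1.0", "1.0"),
  Item2   = c("10_bada", "/x/", "1.0", "0.0"),
  Text10  = c(NA, "/y/", "0.0", "1.0"),
  Text11  = c(NA, "/z/", "1.0", "0.0"),
  stringsAsFactors = FALSE
)

# Item index per column: 0 for the demographic columns, then 1, 2, ...
item <- cumsum(grepl("Item", names(df)))
is_item <- item > 0

# Position of each column within its item: 1, 2, 3, ...
pos <- ave(seq_along(item), item, FUN = seq_along)

# Take the demographic names from the header row, build "item.position" for the rest
hdr <- match("Name", df$Text1)
new_names <- unlist(df[hdr, ], use.names = FALSE)
new_names[is_item] <- paste(item[is_item], pos[is_item], sep = ".")

# Drop the header row and everything above it, then apply the new names
df1 <- df[-seq_len(hdr), ]
names(df1) <- new_names
rownames(df1) <- NULL
df1
```

Because the names are built from the item index and the within-item position, this should also cope with items that have different numbers of columns, as long as the contiguity assumption holds.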

For the second case

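# Item index, first item column, and item names, computed as before
i1 <- cumsum(grepl("Item", names(df)))
i2 <- match(1, i1)
nm1 <- make.unique(as.character(i1[i1 > 0]))
# Row holding the real header ("Name", "District", ...)
i3 <- match("Name", df$Text1)
# Overwrite the item cells of that row with the new names, then use the row as the header
df[i3, i2:ncol(df)] <- nm1
names(df) <- unlist(df[i3, ])
df <- df[-i3, ]
# In the label row (where Name is NA), fill each item label forward across its columns
i4 <- is.na(df$Name)
df[i4, i2:ncol(df)] <- t(apply(df[i4, i2:ncol(df)], 1,
  function(x) x[!is.na(x)][cumsum(!is.na(x))]))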

-output

> df
  Name District School Grade Teacher      1    1.1    1.2       2     2.1     2.2
1 <NA>     <NA>   <NA>  <NA>    Text 11_ada 11_ada 11_ada 10_bada 10_bada 10_bada
3 Jack     Leon  Black     1     Ana    1.0    1.0    1.0     1.0     0.0     1.0
4 Jack     Leon  Black     1     Ana    0.0    1.0    1.0     0.0     1.0     0.0
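
The fill-forward step inside `apply()` is compact, so here it is in isolation as a small sketch. It assumes the first element of the vector is not NA, since a leading NA would get a running count of 0 and be dropped from the result. The names `x`, `seen`, `idx` and `filled` are only illustrative.

```r
# One label row restricted to the item columns, as in the example data
x <- c("11_ada", NA, NA, "10_bada", NA, NA)

# TRUE where a label is present
seen <- !is.na(x)

# Running count of labels seen so far: 1 1 1 2 2 2
idx <- cumsum(seen)

# Index the non-missing labels by that count to repeat each one across its columns
filled <- x[seen][idx]
filled
```

Each NA is replaced with the most recent label to its left, which is how 11_ada and 10_bada end up repeated across their three columns in the output above.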
