对于我的硕士论文,我正在分析一所大学的课程。我有 1141 门课程(作为行(,有 4 个变量(作为列(。数据集的预览如下所示:
预览数据集
数据集可在此处下载。
在"描述"和"学习成果"列中,有过滤(没有停用词、标点符号等(文本。我想在虚拟变量中转换此文本的每个单词。总共有 10.493 个独立单词。例如,一个实例:
预期输出为 Excel 示例
到目前为止,我一直在努力使用 Tidyr 独立功能,但我没有从中获得输出。有人有解决这个问题的想法吗?
PS 要转换的数据集位于名为"database2"的最后一行
rm(list=ls());
#import database
database <- read_excel("/Volumes/GoogleDrive/My Drive/TU e Innovation Management /Thesis/testdatabasematrix.xlsx");
#name columns
colnames(database)[1] <- "Name";
colnames(database)[2] <- "Description";
colnames(database)[3] <- "LearningOutcomes";
# import packages
library(tidytext)
library(dplyr)
# here the text transformations for titles
title <- tibble(line = 1:nrow(database), text = database$Name) %>% # as tibble
unnest_tokens(word, text)%>% # remove punctuations, lowercase, put words in column
anti_join(stop_words, by = c("word" = "word")) %>% # remove stopwords
group_by(line) %>% summarise(title = paste(word,collapse =' ')) # now all in a row!
# here the text transformations for descriptions
description <- tibble(line = 1:nrow(database), text = database$Description) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = c("word" = "word")) %>%
group_by(line) %>% summarise(title = paste(word,collapse =' '))
# here the text transformations for learning outcomes
learningoutcomes <- tibble(line = 1:nrow(database), text = database$LearningOutcomes) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = c("word" = "word")) %>%
group_by(line) %>% summarise(title = paste(word,collapse =' '))
# now the full dataset
database2 <- title %>% left_join(description, by = 'line') %>% left_join(learningoutcomes, by = 'line')
colnames(database2) <- c("line","Name","Description","LearningOutcomes")
database2
# to do: stemming and remove numbers
#wordfreq_LearningOutcome <- data.frame(table(unlist(strsplit(tolower(database2$LearningOutcomes), " "))))
#wordfreq_Description <- data.frame(table(unlist(strsplit(tolower(database2$Description), " "))))
AEF 代码:
testdata <- database2
transformed_data <-
testdata %>%
## split the strings into a list of words
mutate_at(vars(Description, LearningOutcomes), funs(strsplit(., " "))) %>%
## in each row, concatenate the lists from description and outcomes
rowwise() %>%
mutate(words_used = list(unique(c(Description, LearningOutcomes)))) %>%
ungroup() %>%
## the old variables are no longer needed
select(-Description, -LearningOutcomes) %>%
## unnest the data to get a row for each course/word combination
unnest(words_used) %>%
## add a dummy variable that indicates that all combinations in the data are indeed present
mutate(present = 1) %>%
## use spread to convert from tall to wide format. Use 0 as filling for missing combinations
spread(words_used, present, fill=0)
transformed_data
我创建了一些测试数据,因为您的链接无法访问:
library(tidyverse)
testdata <-
tribble(~coursename, ~description, ~outcomes,
"Course1", "word1 word2 word3", "word7 word4 word5",
"Course2", "word3 word4", "word6 word1 word7",
"Course3", "word3 word1 word8 word9", "word2")
> testdata
# A tibble: 3 x 3
coursename description outcomes
<chr> <chr> <chr>
1 Course1 word1 word2 word3 word7 word4 word5
2 Course2 word3 word4 word6 word1 word7
3 Course3 word3 word1 word8 word9 word2
您可以使用 dplyr 分多个步骤转换数据:
transformed_data <-
testdata %>%
## split the strings into a list of words
mutate_at(vars(description, outcomes), funs(strsplit(., " "))) %>%
## in each row, concatenate the lists from description and outcomes
rowwise() %>%
mutate(words_used = list(unique(c(description, outcomes)))) %>%
ungroup() %>%
## the old variables are no longer needed
select(-description, -outcomes) %>%
## unnest the data to get a row for each course/word combination
unnest(words_used) %>%
## add a dummy variable that indicates that all combinations in the data are indeed present
mutate(present = 1) %>%
## use spread to convert from tall to wide format. Use 0 as filling for missing combinations
spread(words_used, present, fill=0)
这给出了:
> transformed_data
# A tibble: 3 x 10
coursename word1 word2 word3 word4 word5 word6 word7 word8 word9
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Course1 1 1 1 1 1 0 1 0 0
2 Course2 1 0 1 1 0 1 1 0 0
3 Course3 1 1 1 0 0 0 0 1 1