r语言 - 将包含文本的两列转换为虚拟变量,文本的每个单词表示一个新列



对于我的硕士论文,我正在分析一所大学的课程。我有 1141 门课程(作为行(,有 4 个变量(作为列(。数据集的预览如下所示:

预览数据集

数据集可在此处下载。

在"描述"和"学习成果"列中,有过滤(没有停用词、标点符号等(文本。我想在虚拟变量中转换此文本的每个单词。总共有 10.493 个独立单词。例如,一个实例:

预期输出为 Excel 示例

到目前为止,我一直在努力使用 Tidyr 独立功能,但我没有从中获得输出。有人有解决这个问题的想法吗?

PS 要转换的数据集位于名为"database2"的最后一行

rm(list=ls());
#import database                       
database <- read_excel("/Volumes/GoogleDrive/My Drive/TU e Innovation Management /Thesis/testdatabasematrix.xlsx");
#name columns
colnames(database)[1] <- "Name";
colnames(database)[2] <- "Description";
colnames(database)[3] <- "LearningOutcomes";

# import packages
library(tidytext)
library(dplyr)
# here the text transformations for titles
title <- tibble(line = 1:nrow(database), text = database$Name) %>%        # as tibble
unnest_tokens(word, text)%>%                                     # remove punctuations, lowercase, put words in column
anti_join(stop_words, by = c("word" = "word")) %>%               # remove stopwords
group_by(line) %>% summarise(title = paste(word,collapse =' '))  # now all in a row!
# here the text transformations for descriptions
description <- tibble(line = 1:nrow(database), text = database$Description) %>%
unnest_tokens(word, text) %>%  
anti_join(stop_words, by = c("word" = "word"))  %>%
group_by(line) %>% summarise(title = paste(word,collapse =' '))
# here the text transformations for learning outcomes
learningoutcomes <- tibble(line = 1:nrow(database), text = database$LearningOutcomes) %>% 
unnest_tokens(word, text) %>%
anti_join(stop_words, by = c("word" = "word"))  %>%
group_by(line) %>% summarise(title = paste(word,collapse =' '))
# now the full dataset
database2 <- title %>% left_join(description, by = 'line') %>% left_join(learningoutcomes, by = 'line')
colnames(database2) <- c("line","Name","Description","LearningOutcomes")
database2
# to do: stemming and remove numbers
#wordfreq_LearningOutcome <- data.frame(table(unlist(strsplit(tolower(database2$LearningOutcomes), " "))))
#wordfreq_Description <- data.frame(table(unlist(strsplit(tolower(database2$Description), " "))))

AEF 代码:

testdata <- database2
transformed_data <- 
testdata %>% 
## split the strings into a list of words
mutate_at(vars(Description, LearningOutcomes), funs(strsplit(., " "))) %>% 
## in each row, concatenate the lists from description and outcomes
rowwise() %>% 
mutate(words_used = list(unique(c(Description, LearningOutcomes)))) %>% 
ungroup() %>% 
## the old variables are no longer needed
select(-Description, -LearningOutcomes) %>% 
## unnest the data to get a row for each course/word combination
unnest(words_used) %>% 
## add a dummy variable that indicates that all combinations in the data are indeed present
mutate(present = 1) %>% 
## use spread to convert from tall to wide format. Use 0 as filling for missing combinations
spread(words_used, present, fill=0)
transformed_data

我创建了一些测试数据,因为您的链接无法访问:

library(tidyverse)
testdata <-
tribble(~coursename, ~description, ~outcomes,
"Course1", "word1 word2 word3", "word7 word4 word5",
"Course2", "word3 word4", "word6 word1 word7",
"Course3", "word3 word1 word8 word9", "word2")
> testdata
# A tibble: 3 x 3
coursename description             outcomes         
<chr>      <chr>                   <chr>            
1 Course1    word1 word2 word3       word7 word4 word5
2 Course2    word3 word4             word6 word1 word7
3 Course3    word3 word1 word8 word9 word2     

您可以使用 dplyr 分多个步骤转换数据:

transformed_data <- 
testdata %>% 
## split the strings into a list of words
mutate_at(vars(description, outcomes), funs(strsplit(., " "))) %>% 
## in each row, concatenate the lists from description and outcomes
rowwise() %>% 
mutate(words_used = list(unique(c(description, outcomes)))) %>% 
ungroup() %>% 
## the old variables are no longer needed
select(-description, -outcomes) %>% 
## unnest the data to get a row for each course/word combination
unnest(words_used) %>% 
## add a dummy variable that indicates that all combinations in the data are indeed present
mutate(present = 1) %>% 
## use spread to convert from tall to wide format. Use 0 as filling for missing combinations
spread(words_used, present, fill=0)

这给出了:

> transformed_data
# A tibble: 3 x 10
coursename word1 word2 word3 word4 word5 word6 word7 word8 word9
<chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Course1        1     1     1     1     1     0     1     0     0
2 Course2        1     0     1     1     0     1     1     0     0
3 Course3        1     1     1     0     0     0     0     1     1

最新更新