r - Improving detection of words like "she" and "her" in sentences to return "Female"



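I have a variable "bio_sentences" which, as the name suggests, holds four to five bio sentences per individual (extracted from the "bio" variable and split into sentences). I am trying to determine a person's gender using this logic...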

Femalew <- c("She", "Her")
Check <- str_extract_all(bio,Femalew)
Check <- Check[Check != "character(0)"]
Gender <- vector("character")
if(length(Check) > 0){
Gender[1] <- "Female"
}else{
Gender[1] <- "Male"
}
for(i in 1:length(bio_sentences)){
Gender[i] <- Gender[1]
} 

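I get a decent result (most people in my dataset are male), but there are a few misses: some females are not detected even though their sentences contain "she" or "her". Is there any way to improve the accuracy of this logic, or to use a different function such as grepl?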

Edit:

data1.Gender    A B C D E   data1.Description
1   Female  0   0   0   0   0   Ranjit Singh President of Boparan Holdings Limited Ranjit is President of Boparan Holdings Limited.
2   Female  0   0   0   NA  NA  He founded the business in 1993 and has more than 25 years’ experience in the food industry.
3   Female  0   0   0   NA  NA  Ranjit is particularly skilled at growing businesses, both organically and through acquisition.
4   Female  0   0   0   NA  NA  Notable acquisitions include Northern Foods and Brookes Avana in 2011.
5   Female  0   0   0   NA  NA  Ranjit and his wife Baljinder Boparan are the sole shareholders of Boparan Holdings, the holding company for 2 Sisters Food Group.
6   Female  0   0   0   NA  NA  s

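The above is one person from the data. My requirement is that the code reads all rows in "data1.Description" (in my code this happens inside a for loop, so it reads every sentence for each person). As you can see, this person is male and one of the sentences clearly contains a "He", yet applying the logic I wrote above classifies him as "Female".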

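As @Merijn van Tilborg said, you need to be very clear about what your sentences contain, because if there is more than one pronoun your approach cannot give the desired output.
However, you can handle those cases too. We can try the dplyr and tidytext packages, but we have to clean the data a bit first: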

# define the gender word lists explicitly
female <- c("She", "Her")
male <- c("He", "His")
# here your data, with several examples of cases
df <- data.frame(
line = c(1,2,3,4,5,6),
text = c("She is happy",            # female
"Her dog is happy",        # female (if we look at the subject, it's not female..)
"He is happy",             # male
"His dog is happy",        # male
"It is happy",             # ?
"She and he are happy"),   # both!
stringsAsFactors = FALSE ) # life saver

Now we can try something like this:

library(tidytext)
library(dplyr)
df %>%
unnest_tokens(word, text) %>%                                            # put words in rows
mutate(gender = ifelse(word %in% tolower(female),'female',
ifelse(word %in% tolower(male), 'male','unknown'))) %>%  # detect male and female, remember tolower!
filter(gender!='unknown') %>%                                            # remove the unknown
right_join(df) %>%                                                       # join with the original sentences keeping all of them
select(-word)                                                            # remove useless column
line gender                 text
1    1 female         She is happy
2    2 female     Her dog is happy
3    3   male          He is happy
4    4   male     His dog is happy
5    5   <NA>          It is happy
6    6 female She and he are happy
7    6   male She and he are happy

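You can see that sentences 1, 2, 3 and 4 meet your criteria, "it" stays undefined (NA), and where both a male and a female word appear the row is duplicated, so you can see why.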

Finally, you can collapse the result to one row per sentence by adding the following to the dplyr chain:

%>% group_by(text, line) %>% summarise(gender = paste(gender, collapse = ','))
# A tibble: 6 x 3
# Groups:   text [?]
text                  line gender     
<chr>                <dbl> <chr>      
1 He is happy              3 male       
2 Her dog is happy         2 female     
3 His dog is happy         4 male       
4 It is happy              5 NA         
5 She and he are happy     6 female,male
6 She is happy             1 female    

Edit: let's try it with your data:

data1 <- read.table(text="
data1.Gender    A B C D E   data1.Description
1   Female  0   0   0   0   0   'Ranjit Singh President of Boparan Holdings Limited Ranjit is President of Boparan Holdings Limited.'
2   Female  0   0   0   NA  NA  'He founded the business in 1993 and has more than 25 years’ experience in the food industry.'
3   Female  0   0   0   NA  NA  'Ranjit is particularly skilled at growing businesses, both organically and through acquisition.'
4   Female  0   0   0   NA  NA  'Notable acquisitions include Northern Foods and Brookes Avana in 2011.'
5   Female  0   0   0   NA  NA  'Ranjit and his wife Baljinder Boparan are the sole shareholders of Boparan Holdings, the holding company for 2 Sisters Food Group.'
6   Female  0   0   0   NA  NA  's'",stringsAsFactors = FALSE)

# define the gender word lists explicitly; in this case I've also added the names
female <- c("She", "Her","Baljinder")
male <- c("He", "His","Ranjit")
# clean the data
df <- data.frame(
line = rownames(data1),
text = data1$data1.Description,
stringsAsFactors = FALSE)
library(tidytext)
library(dplyr)
df %>%
unnest_tokens(word, text) %>%                                            # put words in rows
mutate(gender = ifelse(word %in% tolower(female),'female',
ifelse(word %in% tolower(male), 'male','unknown'))) %>%  # detect male and female, remember tolower!
filter(gender!='unknown') %>%                                            # remove the unknown
right_join(df) %>%                                                       # join with the original sentences keeping all of them
select(-word) %>% 
group_by(text, line) %>%
summarise(gender = paste(gender, collapse = ',')) 

Result:

Joining, by = "line"
# A tibble: 6 x 3
# Groups:   text [?]
text                                                            line  gender       
<chr>                                                           <chr> <chr>        
1 He founded the business in 1993 and has more than 25 years’ ex~ 2     male         
2 Notable acquisitions include Northern Foods and Brookes Avana ~ 4     NA           
3 Ranjit and his wife Baljinder Boparan are the sole shareholder~ 5     male,male,fe~
4 Ranjit is particularly skilled at growing businesses, both org~ 3     male         
5 Ranjit Singh President of Boparan Holdings Limited Ranjit is P~ 1     male,male    
6 s                                                               6     NA  

The real work is defining every word you can think of as "male" or "female".
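Since the question asks about grepl, the same word-list idea can also be sketched in base R. This is only a minimal sketch, assuming whole-word, case-insensitive matching is what is wanted; the helper names to_pattern and detect_gender are hypothetical and not part of any package.

```r
# Word lists (lower case); extend them as needed.
female <- c("she", "her")
male   <- c("he", "his", "him")

# Build one regex per gender. \\b marks a word boundary, so "he"
# does not match inside "she", "the" or "her".
to_pattern <- function(words) {
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

# Hypothetical helper: returns "female", "male", "both" or NA per sentence.
detect_gender <- function(text) {
  is_f <- grepl(to_pattern(female), text, ignore.case = TRUE, perl = TRUE)
  is_m <- grepl(to_pattern(male),   text, ignore.case = TRUE, perl = TRUE)
  ifelse(is_f & is_m, "both",
         ifelse(is_f, "female",
                ifelse(is_m, "male", NA_character_)))
}

result <- detect_gender(c("She is happy", "He is happy",
                          "It is happy", "She and he are happy"))
print(result)
```

Unlike a plain substring search, the word boundaries keep "He" from being found inside "She" or "Her", which is one common source of wrong matches with this kind of list.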

This is a lot more complicated, because context is key here. Look at the following three phrases...

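Susan has a great professor, his name is Adam. He taught his favorite student all there is to know... (Susan is not detected as female, but as male)

Susan has a great professor, his name is Adam. He taught her all there is to know... (okay, now we have a SHE and a HE)

Susan has a great professor named Adam. Adam taught her all there is to know... (okay, we have a SHE)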
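Running a plain pronoun count over the three phrases above makes the problem visible. This is a minimal base R sketch, assuming the English wording shown in the vector below; count_words is a hypothetical helper.

```r
phrases <- c(
  "Susan has a great professor, his name is Adam. He taught his favorite student all there is to know.",
  "Susan has a great professor, his name is Adam. He taught her all there is to know.",
  "Susan has a great professor named Adam. Adam taught her all there is to know."
)

# Hypothetical helper: lower-case, split on anything that is not a letter,
# then count how many tokens are in the given word list.
count_words <- function(text, words) {
  tokens <- strsplit(tolower(text), "[^a-z]+")
  vapply(tokens, function(tok) sum(tok %in% words), integer(1))
}

f <- count_words(phrases, c("she", "her"))
m <- count_words(phrases, c("he", "his", "him"))
print(data.frame(female = f, male = m))
```

All three phrases are about Susan, yet only the third one comes out as clearly female. The pronouns alone do not say who the text is about.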

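In addition to the answer already given, I would strongly recommend adding the most common female names to that list. They are easy to find online, for example as the 100 most popular female names in a country. I am sure that even adding about 500 of the most frequent names to the female list would give you a pretty decent start, and the same goes for male names.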
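As a rough sketch of that idea in base R: the two name vectors below are tiny placeholders standing in for a real downloaded list of popular names (an assumption, not actual reference data), and they are simply merged into the pronoun lists.

```r
# Placeholder samples; in practice these would come from a published name list.
female_names <- c("Susan", "Sandra", "Baljinder")
male_names   <- c("Adam", "Steve", "Ranjit")

# Merge names into the indicator lists, lower-cased and de-duplicated.
f.indicators <- unique(tolower(c("she", "her", female_names)))
m.indicators <- unique(tolower(c("he", "him", "his", male_names)))

sentence <- "Ranjit and his wife Baljinder Boparan are the sole shareholders"
tokens <- strsplit(tolower(sentence), "[^a-z]+")[[1]]

counts <- c(female = sum(tokens %in% f.indicators),
            male   = sum(tokens %in% m.indicators))
print(counts)
```

Names that are used for both genders would need to be left out of both lists or handled separately.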

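Furthermore, here is an example with some decision rules: how likely is it that the text is about a female or a male? One approach is simply to count the occurrences and compute the ratio. Based on that ratio you can make your own decisions. My choices are just an arbitrary example, with one line per decision (this could be coded more efficiently).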

library(data.table) ## just my personal preference above dplyr
library(stringr) ## just my personal favorite when I deal with strings
df = data.table(text = c("Because Sandra is a female name and we talk a few times about her, she is most likely a female he says.",
"Sandra is mentioned and the only references are about how she did everything to achieve her goals.", 
"Nothing is mentioned that reveals a gender.",
"She talks about him and he talks about her.",
"Sandra says: he is nice and she is nice too.",
"Adam is a male and we only talk about him"))
f.indicators = c("she", "her", "susan", "sandra")
m.indicators = c("he", "him", "his", "steve", "adam")
df[, f.count := sum(str_split(str_to_lower(text), "[[:space:]]|[[:punct:]]")[[1]] %in% f.indicators, na.rm = TRUE), by = text]
df[, m.count := sum(str_split(str_to_lower(text), "[[:space:]]|[[:punct:]]")[[1]] %in% m.indicators, na.rm = TRUE), by = text]
df[f.count != 0 | m.count != 0, gender_ratio_female := f.count / (f.count + m.count)]
df[, decision := "Unknown"]
df[gender_ratio_female == 1, decision := "Female, no male indications"]
df[gender_ratio_female == 0, decision := "Male, no female indicators"]
df[gender_ratio_female >= 0.4 & gender_ratio_female <= 0.6, decision := "Gender should be checked"] ## inclusive bounds, so ratios of exactly 0.4 or 0.6 do not stay "Unknown"
df[gender_ratio_female > 0.6 & gender_ratio_female < 1, decision := "Probably a Female"]
df[gender_ratio_female > 0 & gender_ratio_female < 0.4, decision := "Probably a Male"]

P.S. Sorry, I am struggling to format the output table here; I am new to this site.

 text f.count m.count   gender_ratio_female                    decision
1: Because Sandra is a female name and we talk a few times about her, she is most likely a female he says.       3       1              0.7500           Probably a Female
2:      Sandra is mentioned and the only references are about how she did everything to achieve her goals.       3       0              1.0000 Female, no male indications
3:                                                             Nothing is mentioned that reveals a gender.       0       0                  NA                     Unknown
4:                                                             She talks about him and he talks about her.       2       2              0.5000    Gender should be checked
5:                                                            Sandra says: he is nice and she is nice too.       2       1              0.6667           Probably a Female
6:                                                               Adam is a male and we only talk about him       0       2              0.0000  Male, no female indicators
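The decision rules can also be wrapped in one reusable function. This is a sketch in base R under my own assumptions: decide is a hypothetical name, the labels are copied from the output above, and ratios of exactly 0.4 or 0.6 are treated as "Gender should be checked".

```r
# Hypothetical helper: map female/male counts to a decision label.
decide <- function(f.count, m.count) {
  total <- f.count + m.count
  ratio <- ifelse(total > 0, f.count / total, NA_real_)  # NA when nothing matched
  known <- !is.na(ratio)

  out <- rep("Unknown", length(ratio))
  out[known & ratio < 0.4]                 <- "Probably a Male"
  out[known & ratio >= 0.4 & ratio <= 0.6] <- "Gender should be checked"
  out[known & ratio > 0.6]                 <- "Probably a Female"
  # The pure cases are assigned last so they override the "Probably" labels.
  out[known & ratio == 0]                  <- "Male, no female indicators"
  out[known & ratio == 1]                  <- "Female, no male indications"
  out
}

# Counts taken from the six example sentences above.
decisions <- decide(f.count = c(3, 3, 0, 2, 2, 0),
                    m.count = c(1, 0, 0, 2, 1, 2))
print(decisions)
```

Keeping the thresholds in one place makes it easier to tune them once you have checked a sample of your own data by hand.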
