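After trying to upload the dataset (as CSV) to H2O and finding that the FirstName column was converted to null/missing, I learned that the current version of H2O does not support string-like columns and that factors are limited to 65k unique values. So now I am looking for another way to approach this problem.
I would like to end up with a model that, given any FirstName, returns:
- the probability that the person is male/female (+1.0 to -1.0)
- if possible, the likely age of the person (mean, stdev)
Which R functions (or package::function) are suitable for this? Preferably well-documented packages/functions, so I can learn more as I go.
Here is a sample of the dataset in R. The column types are: numeric, factor, factor and numeric.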
> head(TrainingNames)
  Year FirstName Gender Freq
1 1880      Mary      F 7065
2 1880      Anna      F 2604
3 1880      Emma      F 2003
4 1880 Elizabeth      F 1939
5 1880    Minnie      F 1746
6 1880  Margaret      F 1578
> summary(TrainingNames)
      Year        FirstName          Gender          Freq
 Min.   :1880   Francis:    268   F:1062432   Min.   :    5.0
 1st Qu.:1948   James  :    268   M: 729659   1st Qu.:    7.0
 Median :1981   Jean   :    268               Median :   12.0
 Mean   :1972   Jesse  :    268               Mean   :  186.1
 3rd Qu.:2000   Jessie :    268               3rd Qu.:   32.0
 Max.   :2013   John   :    268               Max.   :99674.0
                (Other):1790483
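As a quick check of the factor limit mentioned at the top, the number of distinct names can be compared against it. This is only a sketch: `factor_limit` is an assumed numeric reading of "65k", and `first_names` is a made-up stand-in for `TrainingNames$FirstName`.

```r
# Assumed numeric value of the "65k unique values" limit quoted in the question
factor_limit <- 65000

# Made-up stand-in for TrainingNames$FirstName
first_names <- factor(c("Mary", "Anna", "Emma", "Mary", "John"))

# Count the levels that actually occur, then compare with the limit
n_unique <- nlevels(droplevels(first_names))
fits <- n_unique <= factor_limit

cat(n_unique, "unique names;",
    if (fits) "within" else "over", "the assumed limit\n")
```

On the real data, replacing `first_names` with `TrainingNames$FirstName` should show how far the column is from the limit.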
Below is the R code used to fetch and process the data source.
# Create data dir, download and extract data source
dir.create('Data Files', showWarnings = FALSE)
if (!file.exists('Data Files/names.zip')) {
  # mode = 'wb' keeps the zip archive intact on Windows
  download.file(url = 'http://www.ssa.gov/oact/babynames/names.zip', destfile = 'Data Files/names.zip', mode = 'wb', cacheOK = TRUE)
  unzip(zipfile = 'Data Files/names.zip', exdir = 'Data Files') # Extract without changing the working directory
}
FileList <- list.files(path = "Data Files/", pattern = "\\.txt$") # List of data files

# Create data-source of names for R/Tableau
munge <- function(f) { # Return data frame of single data file
  y <- as.numeric(gsub(pattern = '[^0-9]', replacement = "", x = f)) # Year taken from the digits in the file name
  # stringsAsFactors = TRUE keeps FirstName/Gender as factors (the default is FALSE since R 4.0)
  l <- read.csv(file = file.path("Data Files", f), header = FALSE, quote = "'", stringsAsFactors = TRUE)
  d <- cbind(y, l)
  colnames(d) <- c("Year", "FirstName", "Gender", "Freq")
  return(data.frame(d))
}

if (!file.exists('TrainingNames.csv')) {
  pb <- txtProgressBar(min = 1, max = length(FileList), style = 3) # Start progress bar
  TrainingNames <- munge(FileList[[1]]) # Munge first data file
  for (n in 2:length(FileList)) { # Munge remaining data files
    TrainingNames <- rbind(TrainingNames, munge(FileList[[n]]))
    setTxtProgressBar(pb, n)
  }
  close(pb) # Close progress bar
  rm(n, pb)
  write.table(x = TrainingNames, file = "TrainingNames.csv", sep = ";", row.names = FALSE, col.names = TRUE) # Write results to CSV file
} else {
  # Reuse the cached CSV so TrainingNames exists on later runs as well
  TrainingNames <- read.table(file = "TrainingNames.csv", header = TRUE, sep = ";", stringsAsFactors = TRUE)
}
summary(TrainingNames)
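To see what munge does to a single file without downloading anything, here is a small sketch that parses a few inline lines in the same name,gender,count layout. The file name `yob1880.txt` is an assumption about how the files in the archive are named, and `sample_text` simply repeats the first three rows of the sample shown earlier.

```r
# Assumed file name; only its digits matter for extracting the year
sample_file <- "yob1880.txt"

# Three rows in the headerless name,gender,count layout
sample_text <- "Mary,F,7065\nAnna,F,2604\nEmma,F,2003"

# Same steps as munge(), but reading from a string instead of a file
year <- as.numeric(gsub("[^0-9]", "", sample_file))
rows <- read.csv(text = sample_text, header = FALSE, stringsAsFactors = TRUE)
parsed <- cbind(year, rows)
colnames(parsed) <- c("Year", "FirstName", "Gender", "Freq")

str(parsed)
```

Each file contributes rows for exactly one year, which is why the year has to be recovered from the file name and bound on as an extra column.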
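Here I define a function, name_stats, that does what you asked for. You need to run the code from the question first to create TrainingNames before the function will work.
Feel free to edit anything you like to fit your specific needs.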
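name_stats <- function(name) {
  df <- subset(TrainingNames, FirstName == name)
  if (nrow(df) == 0) stop("No records found for name: ", name)
  # Total births per gender; a gender that never occurs for this name counts as 0
  male <- sum(df$Freq[df$Gender == 'M'])
  female <- sum(df$Freq[df$Gender == 'F'])
  prob_male <- male / (male + female)
  prob_female <- female / (male + female)
  # Births per year, and the years elapsed since each birth year (mortality is not accounted for)
  births <- tapply(df$Freq, df$Year, sum)
  ages <- as.numeric(format(Sys.Date(), '%Y')) - as.numeric(names(births))
  # Frequency-weighted mean and sample SD, equivalent to mean/sd over rep(ages, births)
  mean_age <- sum(ages * births) / sum(births)
  sd_age <- sqrt(sum(births * (ages - mean_age)^2) / (sum(births) - 1))
  cat('Probability', name, 'is male is', round(prob_male, 6), '\n',
      'Probability', name, 'is female is', round(prob_female, 6), '\n',
      'Mean age of', name, 'is', round(mean_age, 6), '\n',
      'SD age of', name, 'is', round(sd_age, 6), '\n')
}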
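If you need the numbers for many names, the same statistics can be computed once and stored in a lookup table instead of being printed per call. The following is only a sketch under a few assumptions: `build_name_table`, its `ref_year` parameter and the `toy` data frame are hypothetical names introduced here, the toy rows are made up, and the signed score assumes that +1.0 means "always male" and -1.0 means "always female".

```r
# Made-up stand-in for TrainingNames with the same four columns
toy <- data.frame(
  Year      = c(1950, 1950, 2000, 2000, 2000),
  FirstName = c("Jessie", "Jessie", "Jessie", "Mary", "John"),
  Gender    = c("F", "M", "F", "F", "M"),
  Freq      = c(30, 10, 60, 100, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of statistics per name
build_name_table <- function(d, ref_year = as.numeric(format(Sys.Date(), "%Y"))) {
  per_name <- lapply(split(d, as.character(d$FirstName)), function(df) {
    total <- sum(df$Freq)
    male  <- sum(df$Freq[df$Gender == "M"])
    ages  <- ref_year - df$Year                  # years since birth
    mean_age <- sum(ages * df$Freq) / total      # frequency-weighted mean
    sd_age <- if (total > 1) {
      sqrt(sum(df$Freq * (ages - mean_age)^2) / (total - 1))
    } else {
      NA_real_
    }
    data.frame(
      prob_male = male / total,
      score     = (2 * male - total) / total,    # assumed: +1 male, -1 female
      mean_age  = mean_age,
      sd_age    = sd_age
    )
  })
  do.call(rbind, per_name)                       # row names are the first names
}

name_table <- build_name_table(toy, ref_year = 2020)
print(name_table["Jessie", ])
```

With the real data you would call `build_name_table(TrainingNames)` once and then look names up by row name; splitting by name is simple but not tuned for speed, so expect it to take a while on the full dataset.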