r - Remove all special characters from an entire data frame while keeping the factor level definitions



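I am trying to strip special characters such as "-", "/", "(" and ")" from my entire data frame. However, the data frame contains only a single observation, because it is being fed into a model that will be used in production. I have explicitly defined the factor levels for the data frame.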

I have tried the following:

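library(magrittr)  # provides the %>% pipe

sanitize_string <- function(string) {
  gsub("\\s+", "_", string) %>%  # runs of whitespace; the backslash must be doubled in R
    gsub("[(]", "_", .) %>%
    gsub("[)]", "_", .) %>%
    gsub("[/]", "_", .) %>%
    gsub("[-]", "_", .)
}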

Then:

df <- as.data.frame(lapply(df, function(dataframe) sapply(dataframe, sanitize_string)), stringsAsFactors=FALSE)

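But when I do this, I lose my factor levels: every factor is treated as having only one level. This causes problems later when I try to get predictions from my model, because of the sparsity. model.matrix needs two or more levels for each factor, but in production only a single observation will ever be sent.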
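A minimal reproduction of the problem, as a sketch. The two-column data frame and its level names are made up for illustration, and sanitize_string is written here as a single-pattern equivalent of the function above so the snippet runs without extra packages:

```r
# Hypothetical one-row data frame with explicitly defined factor levels
df <- data.frame(
  fuel_type = factor("P", levels = c("D", "P")),
  vehicle_type = factor(
    "Mid Exec Saloon/Estate/Coupe",
    levels = c("Mid Exec Saloon/Estate/Coupe", "Small Hatchback")
  )
)

# Compact stand-in: whitespace runs, parentheses, slashes and hyphens become "_"
sanitize_string <- function(string) gsub("\\s+|[()/-]", "_", string)

# Sanitizing the values themselves turns every column into plain character data
cleaned <- as.data.frame(
  lapply(df, function(column) sapply(column, sanitize_string)),
  stringsAsFactors = FALSE
)

levels(df$fuel_type)                # the levels defined up front
class(cleaned$fuel_type)            # no longer a factor
nlevels(factor(cleaned$fuel_type))  # re-factoring only sees the observed value
```

Because only the single observed value survives, converting the columns back to factors cannot recover the original level set.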

Thanks.

Here is my data frame:

$ children_under16                : Factor w/ 2 levels "No","Yes": 1
$ ft_employment_status            : Factor w/ 5 levels "Employed","Full-Time Education(Student)",..: 1
$ fuel_type                       : Factor w/ 2 levels "D","P": 2
$ homeowner                       : Factor w/ 2 levels "FALSE","TRUE": 2
$ marital_status                  : Factor w/ 6 levels "Married","Separated",..: 1
$ overnight_loc                   : Factor w/ 7 levels "In a private Driveway",..: NA
$ usage_type                      : Factor w/ 3 levels "CLASS_1","SDPC",..: 1
$ licence_type                    : Factor w/ 3 levels "UK","European",..: 1
$ yad_relationship_to_policyholder: Factor w/ 8 levels "Spouse","No_YAD",..: 1
$ A                          : Factor w/ 7 levels "1","2","5","3",..: 1
$ B                          : Factor w/ 19 levels "C","E","Q","D",..: 1
$ C                           : Factor w/ 63 levels "11","19","58",..: 1
$ region                          : Factor w/ 12 levels "Yorkshire and The Humber",..: 1
$ D                      : Factor w/ 28 levels "Semi-Detached Suburbia",..: 27
$ E                   : Factor w/ 77 levels "Families in Terraces and Flats",..: 77
$ F                 : Factor w/ 9 levels "Suburbanites",..: 1
$ industry_band                   : Factor w/ 18 levels "13","14","15",..: 14
$ occ_band_goco                   : Factor w/ 17 levels "0","1","2","3",..: 2
$ transmission                    : Factor w/ 2 levels "A","M": 2
$ vehicle_make                    : Factor w/ 19 levels "OTHER","AUDI",..: 1
$ vehicle_type           : Factor w/ 17 levels "Mid Exec Saloon/Estate/Coupe",..: 1
$ rural_urban                     : Factor w/ 19 levels "Urban major conurbation",..: 2
$ water_company                   : Factor w/ 23 levels "Affinity Water",..: 23
$ seats                           : Factor w/ 6 levels "-99","2","4",..:

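You can sanitize the levels of each factor rather than the column itself. This preserves the order of the levels. It will cause problems, though, if your sanitization takes two different levels and makes them identical, because assigning duplicated levels merges them into one. I would just use a for loop: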

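for (i in seq_along(df)) {
  # Only factor columns carry level definitions
  if (is.factor(df[[i]])) {
    # Rename the levels in place; the observed values and level order are kept
    levels(df[[i]]) <- sanitize_string(levels(df[[i]]))
  }
}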
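As a rough, self-contained illustration of the loop, here is a sketch on a made-up one-row data frame. The column contents and the compact sanitize_string are assumptions for the example, and the anyDuplicated() check is an optional addition that stops before two levels get merged:

```r
# Hypothetical one-row data frame with explicitly defined factor levels
df <- data.frame(
  ft_employment_status = factor(
    "Employed",
    levels = c("Employed", "Full-Time Education(Student)")
  ),
  seats = factor("2", levels = c("-99", "2", "4"))
)

# Compact stand-in for the question's sanitize_string()
sanitize_string <- function(string) gsub("\\s+|[()/-]", "_", string)

for (i in seq_along(df)) {
  if (is.factor(df[[i]])) {
    new_levels <- sanitize_string(levels(df[[i]]))
    # Stop early if two distinct levels would collapse into one
    if (anyDuplicated(new_levels) > 0) {
      stop("Sanitizing merges levels in column: ", names(df)[i])
    }
    levels(df[[i]]) <- new_levels
  }
}

levels(df$ft_employment_status)  # sanitized names, same order
nlevels(df$seats)                # still all levels, despite one observation
```

The data frame keeps its single row, and each factor keeps its full set of levels, so model.matrix should still see more than one level per factor.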

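I can't test this on the structure you posted, but if you run into problems, please share some data with dput() so that I can copy/paste it (for example, dput(df[1:10, ]) or some other small subset that illustrates the problem), and I'll be happy to test and improve it.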
