Problem when replacing outliers with the mean in R



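I have an HR data frame containing information about an organization's employees, such as salary, department, ID, and so on.

What I want to do is replace the outliers (USD > 200000) in the column "salary_2018" for the "Sales" department with the mean of the column itself.

This is for a professional course I am taking. I was given the data frame and the code, which is: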

library(readxl)
df<-read_excel("C:/Media Mean Mode.xlsx")
df1<-df[df$department=="Sales",]
df2 = df1
df2[df2$salary_2018<200000,]<-mean(df2$salary_2018)

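In the video I am learning from, the instructor uses exactly the same data frame and exactly the same code, and it works. However, when I try the same thing, I get the following error: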

Errore: Assigned data `mean(df2$salary_2018)` must be compatible with existing data.
i Error occurred for column `department`.
x Can't convert <double> to <character>.

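I would understand the error if I were trying to replace values in the "department" column, since its data type is "character".

But given that I am working with "salary_2018", which is a "double", why does the error refer to "department"?

Do you know why this happens?

Thanks!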

Edit: Following Peter's suggestion, I have added the structure of the data frame below.
> dput(head(df, 5))
structure(list(age = c(41, 49, 37, 33, 27), department = c("Sales", 
"Research & Development", "Research & Development", "Research & Development", 
"Research & Development"), employee_number = c(1, 2, 4, 5, 7), 
gender = c("Female", "Male", "Male", "Female", "Male"), job_level = c(2, 
2, 1, 1, 1), marital_status = c("Single", "Married", "Single", 
"Married", "Married"), over_time = c("Yes", "No", "Yes", 
"Yes", "No"), performance_rating = c(3, 4, 3, 3, 3), totalW_working_years = c(8, 
10, 7, 8, 6), training_times_last_year = c(0, 3, 3, 3, 3), 
years_since_last_promotion = c(0, 1, 0, 3, 2), years_with_curr_manager = c(5, 
7, 0, 0, 2), monthly_income = c(5993, 5130, 2090, 2909, 3468
), salary_2017 = c(71916, 61560, 25080, 34908, 41616), salary_2018 = c(79826.76, 
75718.8, 28842, 38747.88, 46609.92), year_of_joining = c(2012, 
2008, 2018, 2010, 2016), last_role_change = c(2014, 2011, 
2018, 2011, 2016), percent_hike = c(11, 23, 15, 11, 12)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

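Even without the actual data, I can see that your code tries to overwrite every column with the mean salary, for the rows where the salary is below 200k (shouldn't that be above 200k?). This happens because you did not specify a column after the comma; leaving that position blank means all columns. That is also why the error mentions "department": it is a character column, and the numeric mean cannot be converted to character. Note the difference in this code: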

# all columns
mtcars[1:4, ]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# columns one, two and three
mtcars[1:4, 1:3]
#>                 mpg cyl disp
#> Mazda RX4      21.0   6  160
#> Mazda RX4 Wag  21.0   6  160
#> Datsun 710     22.8   4  108
#> Hornet 4 Drive 21.4   6  258
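
The same rule applies to assignment. The following is a minimal sketch on a hypothetical toy data frame (the `toy` object and its values are made up for illustration, they are not the real HR data). It uses a plain base R `data.frame`, which silently coerces the value instead of raising the error that a tibble raises, so the effect of the blank column index is visible:

```r
# Hypothetical toy data standing in for the Sales subset (not the real file)
toy <- data.frame(
  department  = c("Sales", "Sales", "Sales"),
  salary_2018 = c(50000, 60000, 250000),
  stringsAsFactors = FALSE
)

# Blank after the comma: the mean is written into EVERY column of the matching row
all_cols <- toy
all_cols[all_cols$salary_2018 > 200000, ] <- mean(all_cols$salary_2018)

# Row 3 of "department" no longer reads "Sales"; the number was coerced to text
print(all_cols)
```

A tibble, which is what `read_excel()` returns, refuses this coercion and reports the first incompatible column, which here is `department`.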

In your case, try:

df2[df2$salary_2018 > 200000, "salary_2018"] <- mean(df2$salary_2018, na.rm = TRUE)
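
As a self-contained sketch of the same idea, the snippet below applies the replacement to a hypothetical toy data frame (the `toy`, `salary_mean`, and `outlier_rows` names and the values are assumptions for illustration). It also wraps the comparison in `which()`, because a logical row index that contains `NA` is not allowed in data frame assignment, and a salary column that needs `na.rm = TRUE` may well contain missing values:

```r
# Hypothetical toy data; the NA is included on purpose
toy <- data.frame(
  department  = c("Sales", "Sales", "Sales", "Sales"),
  salary_2018 = c(50000, 60000, NA, 250000),
  stringsAsFactors = FALSE
)

# Compute the replacement value once, ignoring missing salaries
salary_mean <- mean(toy$salary_2018, na.rm = TRUE)

# which() drops NA comparisons, so only rows that really exceed the limit are selected
outlier_rows <- which(toy$salary_2018 > 200000)

# Name the column after the comma so only salary_2018 is modified
toy[outlier_rows, "salary_2018"] <- salary_mean

print(toy)
```

Note that the mean used here still includes the outlier values themselves, exactly as in the line above. Whether that is the intended replacement value depends on the course exercise.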
