I have the following data frame:
library(tidyverse)
ndf <- structure(list(experiment_status = c("Negative？", "Negative？",
"Negative", "Negative？", "Negative？", "Negative？"), id = 1:6), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
ndf
#> # A tibble: 6 x 2
#> experiment_status id
#> <chr> <int>
#> 1 Negative？ 1
#> 2 Negative？ 2
#> 3 Negative 3
#> 4 Negative？ 4
#> 5 Negative？ 5
#> 6 Negative？ 6
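What I want to do is filter the data so that only the rows without the question mark ？ are kept, i.e. after the pipe only row 3 should remain.
Why does this fail?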
ndf %>%
filter(!grepl("[?]", experiment_status))
What is the right way to do it?
ndf %>%
filter(!grepl(intToUtf8(65311), experiment_status))
# A tibble: 1 x 2
experiment_status id
<chr> <int>
1 Negative 3
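Another thing you may notice is that if you coerce the tibble to a data frame, it shows you the character's hex Unicode code point, which is <U+FF1F>. This is the fullwidth question mark, not the ASCII ? (U+003F), which is why "[?]" does not match it. You can also use the hex value to filter, i.e.: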
ndf %>%
filter(!grepl(intToUtf8(0xFF1F), experiment_status))
# A tibble: 1 x 2
experiment_status id
<chr> <int>
1 Negative 3
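If you are not sure which character you are dealing with, you can look at its code point directly. The following is only a minimal base R sketch on a small vector that mimics the column in the question (the vector `status` is made up for illustration, it is not the original data):

```r
# Hypothetical stand-in for ndf$experiment_status;
# "\uFF1F" is the fullwidth question mark, written as an escape.
status <- c("Negative\uFF1F", "Negative")

# Code point of the last character of each string
last_cp <- vapply(status, function(s) {
  cp <- utf8ToInt(s)
  cp[length(cp)]
}, integer(1), USE.NAMES = FALSE)
last_cp

# The ASCII question mark has a different code point
ascii_qm <- utf8ToInt("?")
ascii_qm

# Matching the fullwidth character literally keeps only the clean value
has_fullwidth_qm <- grepl("\uFF1F", status, fixed = TRUE)
status[!has_fullwidth_qm]
```

Here `last_cp` is 65311 (0xFF1F) for the first value, whereas `utf8ToInt("?")` gives 63, so the two characters only look alike.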
This kind of problem can show up when you import a csv file that was written on a non-English operating system.
> '？' == '?'
[1] FALSE
ndf %>% filter(!grepl('？', experiment_status))
# Trimming white space does not help
> trimws(ndf$experiment_status, 'both')
[1] "Negative？" "Negative？" "Negative" "Negative？" "Negative？" "Negative？"
# Change '？' to '?' using gsub
> gsub('？', '?', ndf$experiment_status)
[1] "Negative?" "Negative?" "Negative" "Negative?" "Negative?" "Negative?"
ndf %>% mutate(experiment_status_clean = gsub('？', '?', experiment_status))
# Now you are searching for a literal ?, so you need to escape it (written as \\ inside an R string)
ndf %>% mutate(experiment_status_clean = gsub('？', '?', experiment_status)) %>%
filter(!grepl('\\?', experiment_status_clean))
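Since the two characters look identical on screen, it can help to check after import which values contain anything outside ASCII at all. A rough base R sketch, again on a made-up vector (`status` is hypothetical, not the imported data):

```r
# Hypothetical values as they might look after importing the csv
status <- c("Negative\uFF1F", "Negative", "Positive")

# TRUE where a string contains at least one code point above 127 (non-ASCII)
has_non_ascii <- vapply(status, function(s) any(utf8ToInt(s) > 127L),
                        logical(1), USE.NAMES = FALSE)
has_non_ascii

# The distinct values that need cleaning
suspicious <- unique(status[has_non_ascii])
suspicious
```

This assumes the strings are valid UTF-8 and contain no NA values.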
ndf %>%
filter(!grepl("？", experiment_status, fixed = TRUE))
But in your example, I think filter(experiment_status == "Negative") would work as well.
Edit: or, since we could also have "Positive":
ndf %>%
filter(experiment_status %in% c("Negative", "Positive"))
To clean up your question marks, you can use stringi::stri_trans_general. I suggest applying it to your data as early as possible, to avoid nasty surprises.
library(stringi)
ndf %>%
mutate_at("experiment_status", stri_trans_general, "latin-ascii") %>%
filter(!grepl("[?]", experiment_status)) # or filter(!grepl("\\?$", experiment_status))
# A tibble: 1 x 2
# experiment_status id
# <chr> <int>
# 1 Negative 3
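Here you don't need to know anything about the problematic character, and you can clean up other unfortunate punctuation or look-alike characters in the same pass.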
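If you would rather not depend on stringi, something similar can be sketched in base R for the fullwidth case only. In Unicode, the fullwidth forms U+FF01 to U+FF5E mirror the ASCII characters U+0021 to U+007E at a fixed offset of 0xFEE0, so they can be shifted back. Note that `fullwidth_to_ascii` is a hypothetical helper written for this answer, not a function from any package, and it covers far less than the "latin-ascii" transform:

```r
# Hypothetical helper: map fullwidth ASCII variants (U+FF01..U+FF5E)
# to their plain ASCII counterparts. Assumes valid UTF-8 input without NA.
fullwidth_to_ascii <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    is_fw <- cp >= 0xFF01L & cp <= 0xFF5EL
    cp[is_fw] <- cp[is_fw] - 0xFEE0L   # shift back into the ASCII range
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

status <- c("Negative\uFF1F", "Negative")   # made-up example values
cleaned <- fullwidth_to_ascii(status)
cleaned

# After cleaning, the pattern from the question behaves as expected
kept <- cleaned[!grepl("[?]", cleaned)]
kept
```

As with the stringi version, you do not have to name the offending character up front, but anything outside the fullwidth block is left untouched.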