How do I extract part of a string from a column of a data frame in R?

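(Apologies if I've used the wrong terminology or formatting; this is my first post.)

I'm trying to extract a specific part of a string from a data frame. This is what a whole cell looks like: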

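{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}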

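I want to extract the part in bold (the sample collection device or method entry, between "water," and "salinity") and drop everything before and after it. I want to repeat this for every cell in the column. My initial plan was to use str_extract() to remove the preceding text up to and including "water,", and then use str_extract() again to remove everything from "salinity" onward. Below is my attempt; the result is that everything under Column1 is deleted and replaced with NA.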

df$Column1 <- str_extract(df$Column1, "(?<=water, )(\\w+)")

Thanks in advance, and apologies again for the formatting…

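Here is a base R approach: it splits the string at the commas, then selects the element of the resulting vector that starts with "sample collection device". Assume x is a single string from that column.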


grep("^sample collection device", unlist(strsplit(x, ",")), value = TRUE, perl = TRUE)
[1] "sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles"
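
To run the same idea over every cell rather than a single string x, one option is to wrap it in a small function and call it with vapply. This is only a sketch: the column name df$Column1 is taken from the question, the shortened example cell and the helper name extract_device are made up for illustration, and it assumes each cell has at most one entry starting with "sample collection device".

```r
# Hypothetical example data: two shortened cells in the same format as the question.
cell <- "{environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507}"
df <- data.frame(Column1 = c(cell, cell), stringsAsFactors = FALSE)

# Hypothetical helper: split one cell at the commas and keep the piece that
# starts with "sample collection device"; return NA if no piece matches.
extract_device <- function(x) {
  parts <- unlist(strsplit(x, ",", fixed = TRUE))
  hit <- grep("^sample collection device", parts, value = TRUE)
  if (length(hit) == 0) NA_character_ else hit[1]
}

# vapply guarantees exactly one character value per row.
df$Device <- vapply(df$Column1, extract_device, character(1), USE.NAMES = FALSE)
```

Writing the result to a new column (here Device) instead of overwriting Column1 keeps the original text available if the pattern needs adjusting.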

If all the strings have the same length and the part you want is at the same position in every row:

library(stringr)
stringyouwant <- str_sub(df$column, startingpositionofstringyouwant, endingpositionofstringyouwant)
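
As a concrete illustration of the position-based approach, here is a small sketch with made-up fixed-width strings (the vector ids and the positions 4 and 7 are assumptions for the example, not values from the question). In the question's data the numeric fields can vary in length from row to row, so fixed positions may well not line up there.

```r
library(stringr)

# Hypothetical fixed-width values: the year always occupies characters 4 to 7.
ids <- c("AB-2011-X", "CD-2012-Y")

# str_sub is vectorised, so it extracts the same positions from every element.
years <- str_sub(ids, 4, 7)
```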

I would do this with data.table, so that you can easily process every row.

library(data.table)
string<- "{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}}"
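# Two-row example data frame; keep the strings as character rather than factor
dat <- data.frame(String = c(string, string), stringsAsFactors = FALSE)
# Split each row at the commas and keep the 9th piece (the sample collection device entry)
dat$Substring <- unlist(lapply(dat$String, function(x) data.table::transpose(strsplit(x, ","))[9]))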
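
As for why the attempt in the question returns NA: the lookbehind "(?<=water, )" requires a space after the comma, but the cell contains "water,sample" with no space, so nothing matches. Even with that fixed, \\w+ would stop at the first space and return only "sample". A possible stringr version is sketched below; it assumes the wanted value itself contains no commas, and the example cell is shortened from the one in the question.

```r
library(stringr)

# Hypothetical example data: two shortened cells in the same format as the question.
cell <- "{environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507}"
df <- data.frame(Column1 = c(cell, cell), stringsAsFactors = FALSE)

# Lookbehind without the trailing space; [^,]+ takes everything up to the next comma.
df$Device <- str_extract(df$Column1, "(?<=water,)[^,]+")

# Alternative: anchor on the key itself and keep only the value after the colon.
df$DeviceValue <- str_extract(df$Column1, "(?<=sample collection device or method:)[^,]+")
```

Anchoring on the key name rather than on "water," is probably the more robust choice if the preceding field can take other values in other rows.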
