I have a data frame, read with Match <- read.table("Match.txt", sep="", fill =T, stringsAsFactors = FALSE, quote = "", header = F),
which looks like this:
> ab
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Inspecting sequence ID chr1:173244300-173244500 NA NA
2 V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg
3 V$ATF3_Q6 | 34 (-) | 0.788 | 0.655 | agggaaCGACAcag
4 V$ATF3_Q6 | 102 (+) | 0.738 | 0.685 | cccTGAGCttagga
5 V$CEBPB_01 | 24 (+) | 0.950 | 0.882 | ccatcagGGAAGgg
72 V$YY1_01 | 117 (+) | 0.996 | 0.984 | acttCCCATcttttaag
73 Inspecting sequence ID chr1:173244350-173244550 NA NA
74 V$ATF3_Q6 | 52 (+) | 0.738 | 0.685 | cccTGAGCttagga
75 V$ATF3_Q6 | 160 (+) | 0.862 | 0.687 | gtcTGACCtggaga
76 V$CEBPB_01 | 57 (+) | 0.966 | 0.958 | agcttagGAAACtt
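It contains millions of such repeated blocks. Each block starts with a line like Inspecting sequence ID chr1:173244300-173244500, followed by hit rows like the ones shown above. I want to process it with the following points in mind:
- Take the first line of each block and split it on : and -, so that I get three columns, e.g.: chr1 173244300 173244500
- The remaining columns should come from the first element (V1) of the rows that follow, split on $ and _, keeping only the second piece, e.g. ATF3. I have 30 such fixed values (let's call them names). In any given block (block 1 runs from row 1 to row 72, block 2 starts at row 73) some names will be observed and others will not.
- If a name appears in a block, its column gets the value B; if it does not, it gets the value U.
So, based on my input, I would like to get the following output: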
chr start stop ATF3 CEBPB YY1 ..(All which appear e.g from row 1 to 72, ignoring duplicates)
chr1 173244300 173244500 B B B
chr1 173244350 173244550 B B U
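I want the set of columns in the header to be fixed (I know there are 32 such names), so that B is assigned when a name appears in a block and U is assigned otherwise.
It would be a great help if someone could help me do this.
Here is the dput of this sample data frame: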
> ab <- dput(Match[c(1:5,72:76), ])
structure(list(V1 = c("Inspecting", "V$ATF3_Q6", "V$ATF3_Q6",
"V$ATF3_Q6", "V$CEBPB_01", "V$YY1_01", "Inspecting", "V$ATF3_Q6",
"V$ATF3_Q6", "V$CEBPB_01"), V2 = c("sequence", "|", "|", "|",
"|", "|", "sequence", "|", "|", "|"), V3 = c("ID", "19", "34",
"102", "24", "117", "ID", "52", "160", "57"), V4 = c("chr1:173244300-173244500",
"(-)", "(-)", "(+)", "(+)", "(+)", "chr1:173244350-173244550",
"(+)", "(+)", "(+)"), V5 = c("", "|", "|", "|", "|", "|", "",
"|", "|", "|"), V6 = c(NA, 0.877, 0.788, 0.738, 0.95, 0.996,
NA, 0.738, 0.862, 0.966), V7 = c("", "|", "|", "|", "|", "|",
"", "|", "|", "|"), V8 = c(NA, 0.622, 0.655, 0.685, 0.882, 0.984,
NA, 0.685, 0.687, 0.958), V9 = c("", "|", "|", "|", "|", "|",
"", "|", "|", "|"), V10 = c("", "aagtccCATCAggg", "agggaaCGACAcag",
"cccTGAGCttagga", "ccatcagGGAAGgg", "acttCCCATcttttaag", "",
"cccTGAGCttagga", "gtcTGACCtggaga", "agcttagGAAACtt")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10"), row.names = c(1L,
2L, 3L, 4L, 5L, 72L, 73L, 74L, 75L, 76L), class = "data.frame")
With the input file from your question saved as /c/tmp.txt
and this awk script saved as SO-38563400.awk:
BEGIN {
OFS="\t" # Set the output separator to a tab
i=0 # Just to init the counter and be sure to start at 1 later
}
{
#print $0
}
/Inspecting sequence ID/ { # New sequence: initialize a new entry with chr, start and end
split($4,arr,"[:-]") # Split the field on : and -
seq[++i,"chr"]=arr[1] # Increase the sequence counter first, then save the chr part
seq[i,"start"]=arr[2] # Save the start position
seq[i,"end"]=arr[3] # Save the end position
}
/V[$][^_]+_.*/ { # V line type,
split($1,arr,"[$_]") # Split on $ and underscore
seq[i,arr[2]]="B" # This has been seen, setting to B
seq[i,"print"]=1
names[arr[2]]++ # Save the name for output
# (and count occurrences, just for fun; well, mainly because an int is cheap to store)
# Main reason: it allows quicker access to the array keys in the END block
}
END {
head=sprintf("char%sstart%sstop",OFS,OFS)
for (h in names) {
head=sprintf("%s%s%s",head,OFS,h)
}
print(head)
for (l=1; l<=i; l++) { # loop over each line/sequence
line=sprintf("%s%s%s%s%s",seq[l,"chr"],OFS,seq[l,"start"],OFS,seq[l,"end"])
for (h in names) {
if (seq[l,h]=="B") line=sprintf("%s%s%s",line,OFS,"B")
else line=sprintf("%s%s%s",line,OFS,"U")
}
if (seq[l,"print"]) print line
}
}
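As a quick check of the two split() calls the script relies on, they can be tried in isolation on one header line and one hit line copied from the question. This is only a minimal sketch for illustration; the one-liners below are not part of the script itself.

```shell
# Header line: the 4th whitespace-separated field is split on ':' and '-'
echo 'Inspecting sequence ID chr1:173244300-173244500' |
    awk '{ n = split($4, arr, "[:-]"); print n, arr[1], arr[2], arr[3] }'
# prints: 3 chr1 173244300 173244500

# Hit line: the 1st field is split on '$' and '_'; the 2nd piece is the name
echo 'V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg' |
    awk '{ split($1, arr, "[$_]"); print arr[2] }'
# prints: ATF3
```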
Running this command:
awk -f SO-38563400.awk /c/tmp.txt > /c/Rtable.txt
gives:
$ cat /c/Rtable.txt
char start stop STAT3 ATF3 TEAD4 GATA3 JUND HNF4A FOXA2 MAX CEBPB SPI1 GABPA CMYC P300 E2F1 CTCF ATF2
chr22 16049850 16050050 B B U B U B B U U U U U B B U B
chr22 16049900 16050100 B B B B B B B B B B B B B B B B
which can then be read into R:
> x <- read.table("/c/Rtable.txt", sep="\t", stringsAsFactors = FALSE, header=T)
> x
char start stop STAT3 ATF3 TEAD4 GATA3 JUND HNF4A FOXA2 MAX CEBPB SPI1 GABPA CMYC P300 E2F1 CTCF ATF2
1 chr22 16049850 16050050 B B U B U B B U U U U U B B U B
2 chr22 16049900 16050100 B B B B B B B B B B B B B B B B
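Never mind the /c/ paths; this works on Windows or Linux (there are awk ports for Windows). Because of how the OS copes with file streams, I recommend Linux for large files.
We could save more memory by not reading the whole file before printing the results, but that requires a fixed set of "names". You were too lazy to extract the names yourself and only gave me a bunch of entries, so adapting it is left to you as an exercise: build the list in the BEGIN block, use it as the entries for each seq, and print the previous result on each new seq.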
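One possible shape for that streaming variant is sketched below. It is only a sketch under stated assumptions: the names list passed with -v holds just the three names visible in the question and stands in for the real fixed list, and the file names sample_match.txt and Rtable_stream.txt are hypothetical. Unlike the script above, it also prints a row for a sequence that has no hits at all.

```shell
# Hypothetical sample input, modeled on the rows shown in the question
cat > sample_match.txt <<'EOF'
Inspecting sequence ID   chr1:173244300-173244500
 V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg
 V$CEBPB_01 | 24 (+) | 0.950 | 0.882 | ccatcagGGAAGgg
 V$YY1_01 | 117 (+) | 0.996 | 0.984 | acttCCCATcttttaag
Inspecting sequence ID   chr1:173244350-173244550
 V$ATF3_Q6 | 52 (+) | 0.738 | 0.685 | cccTGAGCttagga
 V$CEBPB_01 | 57 (+) | 0.966 | 0.958 | agcttagGAAACtt
EOF

# "names" is an assumed stand-in for the full fixed list of names
awk -v names="ATF3 CEBPB YY1" '
function flush_seq(   k, line) {         # print the row for the sequence collected so far
    if (!have_seq) return                # nothing collected before the first header line
    line = chr OFS start OFS stop
    for (k = 1; k <= n; k++) line = line OFS ((name[k] in seen) ? "B" : "U")
    print line
    split("", seen)                      # portable way to empty the array
}
BEGIN {
    OFS = "\t"
    n = split(names, name, " ")          # fixed column order, known up front
    head = "chr" OFS "start" OFS "stop"
    for (k = 1; k <= n; k++) head = head OFS name[k]
    print head
}
/Inspecting sequence ID/ {               # new sequence: emit the previous one first
    flush_seq()
    split($4, arr, "[:-]")
    chr = arr[1]; start = arr[2]; stop = arr[3]
    have_seq = 1
    next
}
/^[ \t]*V[$][^_]+_/ {                    # hit line: remember that this name was seen
    split($1, arr, "[$_]")
    seen[arr[2]] = 1
}
END { flush_seq() }                      # do not forget the last sequence
' sample_match.txt > Rtable_stream.txt

cat Rtable_stream.txt
```

Only the current sequence is held in memory, so usage stays flat however many sequences the file contains, and the column order is fixed by the list rather than by awk's arbitrary for (h in names) iteration order.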
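I hope that next time you will take the time to ask a proper question. You will come to see that you have to make an effort for others to help you, especially after a string of comments asking you to improve your question.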
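Maybe not the best use of stringr or tidyr, but this can be done in a reasonably readable way in the hadleyverse...
The logic flow is:
- Use tidyr::fill and ifelse(V1=="Inspecting", rowname, NA) to identify the groups
- Change the fields into the ones you want
- Use reshaping (dcast) to get the desired format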
library(dplyr)
library(tidyr)
library(reshape2)
library(stringr)
is_in <- function(v1part) {
return(ifelse(length(v1part) > 0, "B", "U"))
}
ab1<- ab %>%
add_rownames() %>%
mutate(rowname = ifelse(V1=="Inspecting", rowname, NA),
V4a = ifelse(V4 == "(-)" | V4 == "(+)", NA, V4),
chr = str_extract_all(ab$V4, "^chr[^:]+", simplify = T)[,1],
chr = ifelse(chr=="", NA, chr),
start = str_split_fixed(V4a, ":|-", 3)[,2],
start = ifelse(start=="", NA, start),
stop = str_split_fixed(V4a, ":|-", 3)[,3],
stop = ifelse(stop=="", NA, stop),
V1part = str_split_fixed(V1, "\\$|_", 3)[,2]) %>%
fill(rowname, .direction="down") %>%
group_by(rowname) %>%
fill(chr, .direction="down") %>%
fill(start, .direction="down") %>%
fill(stop, .direction="down") %>%
dcast(chr+start+stop ~ V1part, fun.aggregate=is_in)
> ab1
chr start stop Var.4 ATF3 CEBPB YY1
1 chr1 173244300 173244500 B B B B
2 chr1 173244350 173244550 B B B U
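Not elegant, but it should work (your data has a column containing "|"... I have called the data frame df):
cond <- which(df$V2 != "|") # rows that start a new sequence (the "Inspecting" rows)
new_df <- data.frame(chr=character(length(cond)), start=character(length(cond)), stop=character(length(cond)), stringsAsFactors = FALSE)
for (i in seq_along(cond)) {
line <- df[cond[i], ]
var <- unlist(strsplit(line$V4, split = ":")) # chr part, then "start-stop"
var2 <- unlist(strsplit(var[2], split = "-")) # start, stop
new_df$chr[i] <- var[1]
new_df$start[i] <- var2[1]
new_df$stop[i] <- var2[2]
# The block ends just before the next header row, or at the last row of df
last <- if (i < length(cond)) cond[i + 1] - 1 else nrow(df)
for (k in seq_len(last - cond[i]) + cond[i]) {
# Your code using name <- df$V1 (Use strsplit again)
# df[i, name] <- ...
}
}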