R: using the separate and mutate functions



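I am new to R and am trying to split the following character variable, "Tax.Rate", into four separate columns (i.e. CGST, SGST, UTGST and IGST), each holding the rate that applies under that tax head. A sample of the dataset is below: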

df
# A tibble: 3 x 1
  Tax.Rate
1 "CGST 2.5% + SGST 2.5%"
2 "CGST 6% + UTGST 6%"
3 "IGST 12%"

I have tried using the "separate" and "mutate" functions, but with little success.

Any guidance would be greatly appreciated.

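I'm sure this could also be done concisely in base R, but here is a different approach: I first split the data into new rows at each plus sign, then trim the surplus whitespace, then split into two columns.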

library(tidyverse)
df <- data.frame(Tax.Rate = c("CGST 2.5% + SGST 2.5%", "CGST 6% + UTGST 6%", "IGST 12% "))
df %>%
  mutate(orig_row = row_number()) %>% # optional, for later tracking
  # "+" is a regex metacharacter, so it needs a double backslash in an R string
  separate_rows(Tax.Rate, sep = "\\+") %>%
  mutate(Tax.Rate = str_trim(Tax.Rate)) %>%
  # extra = "merge" keeps "2.5%" in one piece instead of also splitting at the "."
  separate(Tax.Rate, c("group", "rate"), extra = "merge", remove = FALSE)
# A tibble: 5 × 4
Tax.Rate  group rate  orig_row
<chr>     <chr> <chr>    <int>
1 CGST 2.5% CGST  2.5%         1
2 SGST 2.5% SGST  2.5%         1
3 CGST 6%   CGST  6%           2
4 UTGST 6%  UTGST 6%           2
5 IGST 12%  IGST  12%          3

This produces a "long"-shaped table, but if you want it "wide", with a separate column for each group (jurisdiction?), you can add the following:

[from the end of the "separate()" line] %>%
select(-Tax.Rate) %>%
pivot_wider(names_from = group, values_from = rate)

which gives this result:

# A tibble: 3 × 5
orig_row CGST  SGST  UTGST IGST 
<int> <chr> <chr> <chr> <chr>
1        1 2.5%  2.5%  NA    NA   
2        2 6%    NA    6%    NA   
3        3 NA    NA    NA    12%
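
The answer above mentions that base R could also do this. The following is only a minimal sketch of one way it might look, not the answerer's method: it assumes the four tax heads are known in advance and that each rate is written as a number followed by "%". The helper name extract_rate and the vector heads are hypothetical names introduced for illustration.

```r
# Sample data as used in the answer above
df <- data.frame(Tax.Rate = c("CGST 2.5% + SGST 2.5%", "CGST 6% + UTGST 6%", "IGST 12% "))

# Assumption: the full set of tax heads is known up front
heads <- c("CGST", "SGST", "UTGST", "IGST")

# Hypothetical helper: return the rate that follows `head` in each string,
# or NA when that head does not occur in the string
extract_rate <- function(x, head) {
  pattern <- paste0("\\b", head, "\\s+([0-9.]+%)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_, character(1))
}

# Build one column per tax head
wide <- as.data.frame(
  lapply(setNames(heads, heads), function(h) extract_rate(df$Tax.Rate, h)),
  stringsAsFactors = FALSE
)
wide
```

Because each head is matched independently, this skips the long-to-wide reshaping step entirely, at the cost of having to list the heads by hand.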

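We could:

  1. Use separate_rows to split at "+", escaping the special character as \\+
  2. Then use str_trim to remove leading spaces etc.
  3. separate the column by " "
  4. group_by and add an id to avoid nested output
  5. pivot_wider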
library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  # "+" is a regex metacharacter, so it needs a double backslash in an R string
  separate_rows(Tax.Rate, sep = "\\+") %>% 
  mutate(Tax.Rate = str_trim(Tax.Rate)) %>% 
  separate(Tax.Rate, c("name", "value"), sep = " ") %>% 
  group_by(name) %>% 
  # id counts occurrences within each tax head, not the original row number
  mutate(id = row_number()) %>% 
  pivot_wider(
    names_from = name, 
    values_from = value
  ) %>% 
  select(-id)
CGST  SGST  UTGST IGST 
<chr> <chr> <chr> <chr>
1 2.5%  2.5%  6%    12%  
2 6%    NA    NA    NA   
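
Note that the id above counts occurrences within each tax head, so the two output rows no longer line up with the three input rows. If each input row should stay on its own output row, one possible variation (a sketch, not part of the original answer) is to create the id before splitting and drop the group_by step:

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same sample data as in the "Data" section of this answer
df <- tibble(Tax.Rate = c("CGST 2.5% + SGST 2.5%", "CGST 6% + UTGST 6%", "IGST 12%"))

result <- df %>%
  mutate(id = row_number()) %>%               # tag each original row before splitting
  separate_rows(Tax.Rate, sep = "\\+") %>%    # one row per "head rate" pair
  mutate(Tax.Rate = str_trim(Tax.Rate)) %>%
  separate(Tax.Rate, c("name", "value"), sep = " ") %>%
  pivot_wider(names_from = name, values_from = value)

result
```

This should give three rows, one per input row, with NA wherever a tax head does not apply; the id column can be dropped afterwards with select(-id) if it is not needed.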

Data:

structure(list(Tax.Rate = c("CGST 2.5% + SGST 2.5%", "CGST 6% + UTGST 6%", 
"IGST 12%")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))
