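I'm new to R and am trying to mutate the following character variable "Tax.Rate" into four separate columns (i.e. CGST, SGST, UTGST and IGST), each holding the tax rate that applies under that heading. A sample of the dataset is as follows: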
df
# A tibble: 3 x 1
  Tax.Rate
1 "CGST 2.5% + SGST 2.5%"
2 "CGST 6% + UTGST 6%"
3 "IGST 12%"
I have tried using the "separate" and "mutate" functions, but with little success.
Any guidance would be greatly appreciated.
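I'm sure this could also be done concisely in base R, but here is a different approach: I first split the data into new rows at each plus sign, then trim the extra whitespace, and then split the result into two columns.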
library(tidyverse)
df <- data.frame(Tax.Rate = c("CGST 2.5% + SGST 2.5%", "CGST 6% + UTGST 6%", "IGST 12% "))
df %>%
  mutate(orig_row = row_number()) %>% # optional, for later tracking
  separate_rows(Tax.Rate, sep = "\\+") %>% # "+" is a regex metacharacter, so escape it
  mutate(Tax.Rate = str_trim(Tax.Rate)) %>%
  separate(Tax.Rate, c("group", "rate"), extra = "merge", remove = FALSE)
# A tibble: 5 × 4
  Tax.Rate   group rate  orig_row
  <chr>      <chr> <chr>    <int>
1 CGST 2.5%  CGST  2.5%         1
2 SGST 2.5%  SGST  2.5%         1
3 CGST 6%    CGST  6%           2
4 UTGST 6%   UTGST 6%           2
5 IGST 12%   IGST  12%          3
This produces a "long" table, but if you want it "wide", with a separate column for each group (jurisdiction?), you can add the following:
[from the end of the "separate()" line] %>%
select(-Tax.Rate) %>%
pivot_wider(names_from = group, values_from = rate)
which gives this result:
# A tibble: 3 × 5
  orig_row CGST  SGST  UTGST IGST
     <int> <chr> <chr> <chr> <chr>
1        1 2.5%  2.5%  NA    NA
2        2 6%    NA    6%    NA
3        3 NA    NA    NA    12%
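The answer above mentions that base R could also do this. As a rough sketch only (not the original author's code), one way is to pull out the rate that follows each tax name with a regular expression. The helper name extract_rate and the vector taxes are hypothetical and introduced here for illustration; the sketch also assumes every rate is written as a number followed by "%".

```r
# Sample data, copied from the answer above
df <- data.frame(
  Tax.Rate = c("CGST 2.5% + SGST 2.5%", "CGST 6% + UTGST 6%", "IGST 12% "),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the original answers): return the rate
# that follows a given tax name, or NA where that name does not appear.
extract_rate <- function(x, tax) {
  # Assumption: a rate is digits and dots followed by a percent sign
  pattern <- paste0("\\b", tax, "\\s+([0-9.]+%)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(full match, captured rate), or character(0) if no match
  vapply(
    hits,
    function(hit) if (length(hit) == 2) hit[2] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

# Assumption: these four names are the only tax headings in the data
taxes <- c("CGST", "SGST", "UTGST", "IGST")

# Build one column per tax name, keeping one row per original row
wide <- as.data.frame(
  setNames(lapply(taxes, function(tax) extract_rate(df$Tax.Rate, tax)), taxes),
  stringsAsFactors = FALSE
)
wide
```

Unlike the pivot_wider route, this sketch needs the tax names to be listed up front, so it suits data where the set of headings is known in advance.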
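We could:
1. Use separate_rows to split at +, using \\+ to escape the special character
2. Then use str_trim to remove leading whitespace etc.
3. separate the column by " "
4. group_by and add an id to avoid nested output from pivot_wider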
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate_rows(Tax.Rate, sep = "\\+") %>% # escape "+" for the regex
  mutate(Tax.Rate = str_trim(Tax.Rate)) %>%
  separate(Tax.Rate, c("name", "value"), sep = " ") %>%
  group_by(name) %>%
  mutate(id = row_number()) %>% # per-name counter so pivot_wider does not nest
  pivot_wider(
    names_from = name,
    values_from = value
  ) %>%
  select(-id)
  CGST  SGST  UTGST IGST
  <chr> <chr> <chr> <chr>
1 2.5%  2.5%  6%    12%
2 6%    NA    NA    NA
Data:
structure(list(Tax.Rate = c("CGST 2.5% + SGST 2.5%", "CGST 6% + UTGST 6%",
"IGST 12%")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))