我有名为my_data
的数据。数据量> 100000。示例输出如下所示:
id source
8166923397733625478 happimobiles
8166923397733625478 Springfit
7301100145962413274 Duroflex
6703062895304712434 happimobiles
6897156268457025524 themrphone
37564799155342281 Sangeetha Mobiles
1159098248970201145 Sangeetha Mobiles
链接到整个数据:
我希望代码运行的 https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0。
我想要一个输出,其中每个 id 都应该计入源下,并且重复该输出 以 5478 结尾的 Ex ID 同时属于happimobiles
和springfit
。所以happimobiles
有 id 8166923397733625478 和 6703062895304712434这使得它 2 和 1 在springfit
中很常见。
输出:
happimobiles Springfit Duroflex themrphone Sangeetha
happimobiles 2 1 0 0 0
Springfit 1 1 0 0 0
Duroflex 0 0 1 0 0
themrphone 0 0 0 1 0
Sangeetha 0 0 0 0 2
我也尝试过:
Pivot <- dcast(my_data,source~source,value.var = "id",function(x) length((x)))
这只给了我特定合作伙伴的唯一记录,但没有重叠。
我也试过:
crossprod(table(my_data))
但这并不能给出正确的答案。
任何其他可以让我获得这种输出的解决方案
不确定你的逻辑,但希望这能给你一些前进的想法:
library(data.table)
# set as data.table
setDT(df)
# get counts by source
df = df[,.N,source]
# create a duplicate column (a trick to make dcast work)
df$source2 <- df$source
# convert shape using dcast
dcast(df, source ~ source2, value.var="N", fun.aggregate=sum)
在基数 R 中,您可以在数据帧中获取唯一的源,使用lapply
查找id
的存在并使用table
计算它们的频率。
unique_source <- unique(df$source)
cbind.data.frame(unique_source, do.call(rbind, lapply(unique_source, function(x)
table(df$source[df$id %in% df$id[df$source == x]]))))
# unique_source Duroflex happimobiles SangeethaMobiles Springfit themrphone
#1 happimobiles 0 2 0 1 0
#2 Springfit 0 1 0 1 0
#3 Duroflex 1 0 0 0 0
#4 themrphone 0 0 0 0 1
#5 SangeethaMobiles 0 0 2 0 0
数据
df <- structure(list(id = c(8166923397733625856, 8166923397733625856,
7301100145962413056, 6703062895304712192, 6897156268457025536,
37564799155342280, 1159098248970201088), source = structure(c(2L,
4L, 1L, 2L, 5L, 3L, 3L), .Label = c("Duroflex", "happimobiles",
"SangeethaMobiles", "Springfit", "themrphone"), class = "factor")),
class = "data.frame", row.names = c(NA, -7L))
我认为您在使用crossprod()
时已经非常接近您的期望。这是一个基本的 R 解决方案,您可以尝试一下
r <- crossprod(table(df)[,match(unique(df$source),colnames(table(df)))])
这给了
> r
source
source happimobiles Springfit Duroflex themrphone SangeethaMobiles
happimobiles 2 1 0 0 0
Springfit 1 1 0 0 0
Duroflex 0 0 1 0 0
themrphone 0 0 0 1 0
SangeethaMobiles 0 0 0 0 2
这可以通过自连接和随后的重塑来解决:
library(data.table)
setDT(my_data)[, source := forcats::fct_inorder(source)]
my_data[my_data, on = "id"][
, dcast(.SD, source ~ i.source, value.var = "id")]
Aggregate function missing, defaulting to 'length' source happimobiles Springfit Duroflex themrphone Sangeetha Mobiles 1: happimobiles 2 1 0 0 0 2: Springfit 1 1 0 0 0 3: Duroflex 0 0 1 0 0 4: themrphone 0 0 0 1 0 5: Sangeetha Mobiles 0 0 0 0 2
调用forcats::fct_inorder()
只需要重现 OP 的预期结果,其中列按出现顺序而不是字母顺序排列。
OP 具有 10000 行的生产数据集在重塑为宽格式时将生成 85 列的结果。因此,将数据保留为长格式可能更方便:
library(data.table)
library(magrittr) # piping used to improve readability
googlesheets4::read_sheet(
"https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0",
col_types = "c") %>%
setDT() %>%
unique() %>%
.[., on = "FVID", allow.cartesian = TRUE] %>%
.[, .N, keyby = .(Partner, i.Partner)]
Partner i.Partner N 1: AOSmithIndia AOSmithIndia 21 2: Adityavision Adityavision 2 3: BajajElectronics BajajElectronics 128 4: BajajElectronics SangeethaMobiles 1 5: BajajElectronics happimobiles 3 6: BigC BigC 27 7: BigC BnewMobiles 1 8: BigC Celekt 2 9: BigC LotMobiles 2 10: BigC SangeethaMobiles 11 11: BigC Springfit 1 12: BigC happimobiles 3 13: Bluestar Bluestar 56 14: BnewMobiles BigC 1 15: BnewMobiles BnewMobiles 57 16: BnewMobiles LotMobiles 1 17: BnewMobiles SangeethaMobiles 2 18: Celekt BigC 2 19: Celekt Celekt 10 20: Celekt LotMobiles 1 21: Celekt SangeethaMobiles 1 22: Celekt happimobiles 1 23: ChennaiMobiles ChennaiMobiles 51 24: ChennaiMobiles LotMobiles 1 25: ChennaiMobiles SangeethaMobiles 12 26: ChennaiMobiles iPlanet 1 27: Creaticity Creaticity 7 28: Duroflex Duroflex 91 29: EdunGruru EdunGruru 2 30: FirefoxBikes FirefoxBikes 38 31: Greateastern Greateastern 12 32: ICA ICA 24 33: LG LG 2 34: LotMobiles BigC 2 35: LotMobiles BnewMobiles 1 36: LotMobiles Celekt 1 37: LotMobiles ChennaiMobiles 1 38: LotMobiles LotMobiles 30 39: LotMobiles SangeethaMobiles 14 40: LotMobiles happimobiles 3 41: NarayanNethralaya NarayanNethralaya 7 42: NipponPaint NipponPaint 4 43: Onida Onida 27 44: PoorvikaMobile PoorvikaMobile 2 45: Richfeel Richfeel 55 46: Richfeel VLCCWellness 1 47: SKMobile SKMobile 5 48: SKMobile mobilewalavadodara 1 49: SS_Communication SS_Communication 11 50: SS_Communication themrphone 1 51: SabkaDentist SabkaDentist 48 52: SangeethaMobiles BajajElectronics 1 53: SangeethaMobiles BigC 11 54: SangeethaMobiles BnewMobiles 2 55: SangeethaMobiles Celekt 1 56: SangeethaMobiles ChennaiMobiles 12 57: SangeethaMobiles LotMobiles 14 58: SangeethaMobiles SangeethaMobiles 1060 59: SangeethaMobiles happimobiles 12 60: SangeethaMobiles iPlanet 1 61: Springfit BigC 1 62: Springfit Springfit 123 63: Springfit happimobiles 1 64: TecQ TecQ 2 65: TheMaark TheMaark 31 66: TronxThings TronxThings 14 67: VLCCInstitute VLCCInstitute 36 68: VLCCWellness Richfeel 1 69: VLCCWellness VLCCWellness 27 70: VutsTV VutsTV 12 71: happimobiles BajajElectronics 3 72: happimobiles BigC 3 73: happimobiles Celekt 1 74: happimobiles LotMobiles 3 75: happimobiles SangeethaMobiles 12 76: happimobiles Springfit 1 77: happimobiles happimobiles 474 78: iPlanet ChennaiMobiles 1 79: iPlanet SangeethaMobiles 1 80: iPlanet iPlanet 15 81: mobilewalavadodara SKMobile 1 82: mobilewalavadodara mobilewalavadodara 3 83: poojaratele poojaratele 24 84: themrphone SS_Communication 1 85: themrphone themrphone 85 Partner i.Partner N
步骤如下:
- 从 Google 文档中读取生产数据集
- 胁迫
data.table
- 删除重复条目(假设这符合OP的意图(
- 执行自加入
- 计算重叠(以及非重叠(
FVID
的数量。keyby =
的用法要求结果按字母顺序排序。