变量中的数据并发

  • 本文关键字:并发 数据 变量 r
  • 更新时间 :
  • 英文 :


我有名为my_data的数据。数据量> 100000。示例输出如下所示:

id                      source
8166923397733625478 happimobiles
8166923397733625478 Springfit
7301100145962413274 Duroflex
6703062895304712434 happimobiles
6897156268457025524 themrphone
37564799155342281   Sangeetha Mobiles
1159098248970201145 Sangeetha Mobiles

链接到整个数据:

我希望代码运行的 https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0。

我想要一个输出,其中每个 id 都应该计入源下,并且重复该输出 以 5478 结尾的 Ex ID 同时属于happimobilesspringfit。所以happimobiles有 id 8166923397733625478 和 6703062895304712434这使得它 2 和 1 在springfit中很常见。

输出:

happimobiles   Springfit  Duroflex themrphone   Sangeetha    
happimobiles         2                1        0          0            0
Springfit            1                1        0          0            0
Duroflex             0                0        1          0            0  
themrphone           0                0        0          1            0
Sangeetha            0                0        0          0            2

我也尝试过:

Pivot <- dcast(my_data,source~source,value.var = "id",function(x) length((x)))

这只给了我特定合作伙伴的唯一记录,但没有重叠。

我也试过:

crossprod(table(my_data))

但这并不能给出正确的答案。

任何其他可以让我获得这种输出的解决方案

不确定你的逻辑,但希望这能给你一些前进的想法:

library(data.table)
# set as data.table
setDT(df)
# get counts by source
df = df[,.N,source]
# create a duplicate column (a trick to make dcast work)
df$source2 <- df$source
# convert shape using dcast
dcast(df, source ~ source2, value.var="N", fun.aggregate=sum)

在基数 R 中,您可以在数据帧中获取唯一的源,使用lapply查找id的存在并使用table计算它们的频率。

unique_source <- unique(df$source)
cbind.data.frame(unique_source, do.call(rbind, lapply(unique_source, function(x) 
table(df$source[df$id %in% df$id[df$source == x]]))))

#     unique_source Duroflex happimobiles SangeethaMobiles Springfit themrphone
#1     happimobiles        0            2                0         1          0
#2        Springfit        0            1                0         1          0
#3         Duroflex        1            0                0         0          0
#4       themrphone        0            0                0         0          1
#5 SangeethaMobiles        0            0                2         0          0

数据

df <- structure(list(id = c(8166923397733625856, 8166923397733625856, 
7301100145962413056, 6703062895304712192, 6897156268457025536, 
37564799155342280, 1159098248970201088), source = structure(c(2L, 
4L, 1L, 2L, 5L, 3L, 3L), .Label = c("Duroflex", "happimobiles", 
"SangeethaMobiles", "Springfit", "themrphone"), class = "factor")), 
class = "data.frame", row.names = c(NA, -7L))

我认为您在使用crossprod()时已经非常接近您的期望。这是一个基本的 R 解决方案,您可以尝试一下

r <- crossprod(table(df)[,match(unique(df$source),colnames(table(df)))])

这给了

> r
source
source             happimobiles Springfit Duroflex themrphone SangeethaMobiles
happimobiles                2         1        0          0                0
Springfit                   1         1        0          0                0
Duroflex                    0         0        1          0                0
themrphone                  0         0        0          1                0
SangeethaMobiles            0         0        0          0                2

这可以通过自连接和随后的重塑来解决:

library(data.table)
setDT(my_data)[, source := forcats::fct_inorder(source)]
my_data[my_data, on = "id"][
, dcast(.SD, source ~ i.source, value.var = "id")]
Aggregate function missing, defaulting to 'length'
source happimobiles Springfit Duroflex themrphone Sangeetha Mobiles
1:      happimobiles            2         1        0          0                 0
2:         Springfit            1         1        0          0                 0
3:          Duroflex            0         0        1          0                 0
4:        themrphone            0         0        0          1                 0
5: Sangeetha Mobiles            0         0        0          0                 2

调用forcats::fct_inorder()只需要重现 OP 的预期结果,其中列按出现顺序而不是字母顺序排列。


OP 具有 10000 行的生产数据集在重塑为宽格式时将生成 85 列的结果。因此,将数据保留为长格式可能更方便:

library(data.table)
library(magrittr) # piping used to improve readability
googlesheets4::read_sheet(
"https://docs.google.com/spreadsheets/d/1HUoRlVVf8EBedj1puXdgtTS6GGeFsXYqjVicUwbc5KE/edit#gid=0",
col_types = "c") %>% 
setDT() %>% 
unique() %>% 
.[., on = "FVID", allow.cartesian = TRUE] %>% 
.[, .N, keyby = .(Partner, i.Partner)]
Partner          i.Partner    N
1:       AOSmithIndia       AOSmithIndia   21
2:       Adityavision       Adityavision    2
3:   BajajElectronics   BajajElectronics  128
4:   BajajElectronics   SangeethaMobiles    1
5:   BajajElectronics       happimobiles    3
6:               BigC               BigC   27
7:               BigC        BnewMobiles    1
8:               BigC             Celekt    2
9:               BigC         LotMobiles    2
10:               BigC   SangeethaMobiles   11
11:               BigC          Springfit    1
12:               BigC       happimobiles    3
13:           Bluestar           Bluestar   56
14:        BnewMobiles               BigC    1
15:        BnewMobiles        BnewMobiles   57
16:        BnewMobiles         LotMobiles    1
17:        BnewMobiles   SangeethaMobiles    2
18:             Celekt               BigC    2
19:             Celekt             Celekt   10
20:             Celekt         LotMobiles    1
21:             Celekt   SangeethaMobiles    1
22:             Celekt       happimobiles    1
23:     ChennaiMobiles     ChennaiMobiles   51
24:     ChennaiMobiles         LotMobiles    1
25:     ChennaiMobiles   SangeethaMobiles   12
26:     ChennaiMobiles            iPlanet    1
27:         Creaticity         Creaticity    7
28:           Duroflex           Duroflex   91
29:          EdunGruru          EdunGruru    2
30:       FirefoxBikes       FirefoxBikes   38
31:       Greateastern       Greateastern   12
32:                ICA                ICA   24
33:                 LG                 LG    2
34:         LotMobiles               BigC    2
35:         LotMobiles        BnewMobiles    1
36:         LotMobiles             Celekt    1
37:         LotMobiles     ChennaiMobiles    1
38:         LotMobiles         LotMobiles   30
39:         LotMobiles   SangeethaMobiles   14
40:         LotMobiles       happimobiles    3
41:  NarayanNethralaya  NarayanNethralaya    7
42:        NipponPaint        NipponPaint    4
43:              Onida              Onida   27
44:     PoorvikaMobile     PoorvikaMobile    2
45:           Richfeel           Richfeel   55
46:           Richfeel       VLCCWellness    1
47:           SKMobile           SKMobile    5
48:           SKMobile mobilewalavadodara    1
49:   SS_Communication   SS_Communication   11
50:   SS_Communication         themrphone    1
51:       SabkaDentist       SabkaDentist   48
52:   SangeethaMobiles   BajajElectronics    1
53:   SangeethaMobiles               BigC   11
54:   SangeethaMobiles        BnewMobiles    2
55:   SangeethaMobiles             Celekt    1
56:   SangeethaMobiles     ChennaiMobiles   12
57:   SangeethaMobiles         LotMobiles   14
58:   SangeethaMobiles   SangeethaMobiles 1060
59:   SangeethaMobiles       happimobiles   12
60:   SangeethaMobiles            iPlanet    1
61:          Springfit               BigC    1
62:          Springfit          Springfit  123
63:          Springfit       happimobiles    1
64:               TecQ               TecQ    2
65:           TheMaark           TheMaark   31
66:        TronxThings        TronxThings   14
67:      VLCCInstitute      VLCCInstitute   36
68:       VLCCWellness           Richfeel    1
69:       VLCCWellness       VLCCWellness   27
70:             VutsTV             VutsTV   12
71:       happimobiles   BajajElectronics    3
72:       happimobiles               BigC    3
73:       happimobiles             Celekt    1
74:       happimobiles         LotMobiles    3
75:       happimobiles   SangeethaMobiles   12
76:       happimobiles          Springfit    1
77:       happimobiles       happimobiles  474
78:            iPlanet     ChennaiMobiles    1
79:            iPlanet   SangeethaMobiles    1
80:            iPlanet            iPlanet   15
81: mobilewalavadodara           SKMobile    1
82: mobilewalavadodara mobilewalavadodara    3
83:        poojaratele        poojaratele   24
84:         themrphone   SS_Communication    1
85:         themrphone         themrphone   85
Partner          i.Partner    N

步骤如下:

  1. 从 Google 文档中读取生产数据集
  2. 胁迫data.table
  3. 删除重复条目(假设这符合OP的意图(
  4. 执行自加入
  5. 计算重叠(以及非重叠(FVID的数量。keyby =的用法要求结果按字母顺序排序。

最新更新