Creating matching labels with values that restart periodically



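I have data like the data_df2 sample shown below. The code below creates the label column by comparing each row's Cleaned value with the value in the previous record: if the values match, the row gets the same letter, otherwise it gets a new one. The problem is that I want the letters chosen for the label column to start over with each new label_set_id, so the first label for label_set_id=2 would be A. The label_set_id increases by 1 every 20 records. Can anyone suggest how to modify the code below to do this? Or is there a slicker way to do it in pandas, for example with apply? This code does run rather slowly.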

Code:

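# Start with an empty label column
data_df2['label'] = ''

c = 65  # ASCII code for 'A'
data_df2.loc[0, 'label'] = chr(c)
c = c + 1
for i in range(1, len(data_df2)):
    if data_df2.loc[i, 'Cleaned'] == data_df2.loc[i - 1, 'Cleaned']:
        # Same value as the previous row: reuse its label
        data_df2.loc[i, 'label'] = data_df2.loc[i - 1, 'label']
    else:
        # New value: take the next letter
        data_df2.loc[i, 'label'] = chr(c)
        c = c + 1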

Input data:

print(data_df2[:30])
id                                    Source  
0    1                                ,O-PEN 2.0   
1    2  .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC   
2    3                           @BEEZLEEXTRACTS   
3    4                              @CALISIFTCO_   
4    5                @CALISIFTCO_ X @_ZKITTLEZ_   
5    6               @CALISIFTCO_ X @WONDERBRETT   
6    7              @CALISIFTCO_ X @WONDERBRETT_   
7    8                             @DNA_GENETICS   
8    9                          @EDENEXTRACTS_CA   
9   10           @EDENEXTRACTS_CA X @CALISIFTCO_   
10  11                         @FULLFLAVAEXTRACT   
11  12                                @GGSTRAINS   
12  13                            @SHERBINSKI415   
13  14          @STR8MECHANIC X @ICEDOUTEXTRACTS   
14  15              @STR8MECHANIC X @REZHEADS215   
15  16                             [SS] 710 LABS   
16  17                    [SS] ABSOLUTE EXTRACTS   
17  18                           [SS] BIG PETE'S   
18  19                          [SS] BLOOM FARMS   
19  20                           [SS] BLUE RIVER   
20  21                           [SS] BRITE LABS   
21  22                       [SS] BROTHERLY LOVE   
22  23              [SS] BROTHERLY LOVE [3 PACK]   
23  24                   [SS] CALIFORNIA DREAMIN   
24  25                             [SS] DIME BAG   
25  26                       [SS] EDEN INFUSIONS   
26  27                            [SS] EEL RIVER   
27  28                           [SS] GANJA GOLD   
28  29                       [SS] GLOWING BUDDHA   
29  30                                [SS] JETTY   
Cleaned  label_set_id label  
0                     O.PEN VAPE             1     A  
1                 BROTHERLY LOVE             1     B  
2                BEEZLE EXTRACTS             1     C  
3                   CALI SIFT CO             1     D  
4                   CALI SIFT CO             1     D  
5   @CALISIFTCO_ X @WONDERBRETT_             1     E  
6   @CALISIFTCO_ X @WONDERBRETT_             1     E  
7                   DNA GENETICS             1     F  
8                           EDEN             1     G  
9                   CALI SIFT CO             1     H  
10                       FLAV RX             1     I  
11                    GG STRAINS             1     J  
12                    SHERBINSKI             1     K  
13                 STR8 MECHANIC             1     L  
14                 STR8 MECHANIC             1     L  
15                      710 LABS             1     M  
16              ABSOLUTE XTRACTS             1     N  
17             BIG PETE'S TREATS             1     O  
18                   BLOOM FARMS             1     P  
19                    BLUE RIVER             1     Q  
20                    BRITE LABS             2     R  
21                BROTHERLY LOVE             2     S  
22                BROTHERLY LOVE             2     S  
23            CALIFORNIA DREAMIN             2     T  
24                      DIME BAG             2     U  
25                          EDEN             2     V  
26                     EEL RIVER             2     W  
27                    GANJA GOLD             2     X  
28                GLOWING BUDDHA             2     Y  
29                JETTY EXTRACTS             2     Z 

Desired output:

id                                    Source  
0    1                                ,O-PEN 2.0   
1    2  .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC   
2    3                           @BEEZLEEXTRACTS   
3    4                              @CALISIFTCO_   
4    5                @CALISIFTCO_ X @_ZKITTLEZ_   
5    6               @CALISIFTCO_ X @WONDERBRETT   
6    7              @CALISIFTCO_ X @WONDERBRETT_   
7    8                             @DNA_GENETICS   
8    9                          @EDENEXTRACTS_CA   
9   10           @EDENEXTRACTS_CA X @CALISIFTCO_   
10  11                         @FULLFLAVAEXTRACT   
11  12                                @GGSTRAINS   
12  13                            @SHERBINSKI415   
13  14          @STR8MECHANIC X @ICEDOUTEXTRACTS   
14  15              @STR8MECHANIC X @REZHEADS215   
15  16                             [SS] 710 LABS   
16  17                    [SS] ABSOLUTE EXTRACTS   
17  18                           [SS] BIG PETE'S   
18  19                          [SS] BLOOM FARMS   
19  20                           [SS] BLUE RIVER   
20  21                           [SS] BRITE LABS   
21  22                       [SS] BROTHERLY LOVE   
22  23              [SS] BROTHERLY LOVE [3 PACK]   
23  24                   [SS] CALIFORNIA DREAMIN   
24  25                             [SS] DIME BAG   
25  26                       [SS] EDEN INFUSIONS   
26  27                            [SS] EEL RIVER   
27  28                           [SS] GANJA GOLD   
28  29                       [SS] GLOWING BUDDHA   
29  30                                [SS] JETTY   
Cleaned  label_set_id label  
0                     O.PEN VAPE             1     A  
1                 BROTHERLY LOVE             1     B  
2                BEEZLE EXTRACTS             1     C  
3                   CALI SIFT CO             1     D  
4                   CALI SIFT CO             1     D  
5   @CALISIFTCO_ X @WONDERBRETT_             1     E  
6   @CALISIFTCO_ X @WONDERBRETT_             1     E  
7                   DNA GENETICS             1     F  
8                           EDEN             1     G  
9                   CALI SIFT CO             1     H  
10                       FLAV RX             1     I  
11                    GG STRAINS             1     J  
12                    SHERBINSKI             1     K  
13                 STR8 MECHANIC             1     L  
14                 STR8 MECHANIC             1     L  
15                      710 LABS             1     M  
16              ABSOLUTE XTRACTS             1     N  
17             BIG PETE'S TREATS             1     O  
18                   BLOOM FARMS             1     P  
19                    BLUE RIVER             1     Q  
20                    BRITE LABS             2     A  
21                BROTHERLY LOVE             2     B  
22                BROTHERLY LOVE             2     B  
23            CALIFORNIA DREAMIN             2     C  
24                      DIME BAG             2     D  
25                          EDEN             2     E  
26                     EEL RIVER             2     F  
27                    GANJA GOLD             2     G  
28                GLOWING BUDDHA             2     H  
29                JETTY EXTRACTS             2     I  

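IIUC, you can use groupby on label_set_id, check with shift where each row differs from the previous one, and then use cumsum to get an incrementing value within each group. Add 64 so that mapping chr over the result gives uppercase letters starting at A.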

import pandas as pd

# dummy example
df = pd.DataFrame({'Cleaned': list('abbcddeffijkllmn'),
                   'label_set_id': [1]*8 + [2]*8})
# create the label column: per-group counter (1, 2, ...) + 64 -> 'A', 'B', ...
df['label'] = list(map(chr, df.groupby('label_set_id')['Cleaned']
                              .apply(lambda x: x.ne(x.shift()).cumsum()) + 64))
print(df)
Cleaned  label_set_id label
0        a             1     A
1        b             1     B 
2        b             1     B #same Cleaned value as the previous row
3        c             1     C
4        d             1     D
5        d             1     D
6        e             1     E
7        f             1     F
8        f             2     A #restart at A for new label_set_id
9        i             2     B
10       j             2     C
11       k             2     D
12       l             2     E
13       l             2     E
14       m             2     F
15       n             2     G
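
To make the shift/cumsum step easier to follow, here is a minimal sketch that splits it into named pieces on the same dummy data. It swaps in `transform` for `apply` so the result stays aligned with the original row index; that swap, the helper name `run_number`, and the variable `run_id` are illustrative choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Cleaned": list("abbcddeffijkllmn"),
    "label_set_id": [1] * 8 + [2] * 8,
})

def run_number(s):
    # Hypothetical helper: numbers the runs of equal values inside one group.
    # `changed` is True where a value differs from the previous row of the
    # same group (the first row of a group is always True).
    changed = s.ne(s.shift())
    # The running total of the True values gives 1, 2, 3, ... per run.
    return changed.cumsum()

# transform keeps the original index, so the result lines up row by row.
run_id = (
    df.groupby("label_set_id")["Cleaned"]
      .transform(run_number)
      .astype(int)
)

# 1 -> 'A', 2 -> 'B', ... (64 + 1 is the code point of 'A').
df["label"] = (run_id + 64).map(chr)
print(df)
```

Because the counter is computed inside each label_set_id group, the labels go back to A at the first row of every group.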

EDIT: if the data is sorted by label_set_id, you can do it without the groupby:

df['label'] = df['Cleaned'].ne(df['Cleaned'].shift()).cumsum()
df['label'] = list(map(chr, df['label']
                       - df['label'].where(df['label_set_id'].ne(df['label_set_id'].shift()))
                                    .ffill().astype(int) + 65))
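
As a rough cross-check of the version without groupby, the sketch below names the intermediate series and compares the result with a plain loop that resets the letter at each new label_set_id. The names `run`, `new_set`, `base`, and `loop_labels` are hypothetical and only serve the illustration; the data is assumed to be sorted by label_set_id.

```python
import pandas as pd

df = pd.DataFrame({
    "Cleaned": list("abbcddeffijkllmn"),
    "label_set_id": [1] * 8 + [2] * 8,
})

# Global run counter over the whole column (ignores label_set_id).
run = df["Cleaned"].ne(df["Cleaned"].shift()).cumsum()
# True on the first row of each label_set_id block.
new_set = df["label_set_id"].ne(df["label_set_id"].shift())
# Counter value at the start of each block, carried forward over the block.
base = run.where(new_set).ffill().astype(int)
# Offset from the start of the block: 0 -> 'A', 1 -> 'B', ...
df["label"] = (run - base + 65).map(chr)

def loop_labels(frame):
    # Reference implementation: same rule written as an explicit loop.
    labels = []
    code = ord("A")
    for i in range(len(frame)):
        set_now = frame["label_set_id"].iloc[i]
        if i == 0 or set_now != frame["label_set_id"].iloc[i - 1]:
            code = ord("A")  # restart at each new label_set_id
        elif frame["Cleaned"].iloc[i] != frame["Cleaned"].iloc[i - 1]:
            code += 1        # new value within the set: next letter
        labels.append(chr(code))
    return labels

print(df["label"].tolist() == loop_labels(df))
```

The loop is mainly useful for checking a small sample; on a large frame the vectorized expression avoids the per-row indexing that makes the original code slow.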
