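I have data like the sample input data_df2 below. The code below creates the label column by comparing each value in the Cleaned column with the value in the previous record: if the values match it reuses the same letter, otherwise it assigns a new one. The problem is that I want the letters chosen for the label column to start over with each new label_set_id, so the first record with label_set_id=2 would get the label A. label_set_id increases by 1 every 20 records. Can anyone suggest how to modify the code below to do this? Or is there a slicker way to do it with pandas, for example with the apply function? This code runs rather slowly.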
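Code:
data_df2['label'] = ''
c = 65  # ASCII code for 'A'
data_df2.loc[0, 'label'] = chr(c)
c = c + 1
for i in range(1, len(data_df2)):
    if data_df2.loc[i, 'Cleaned'] == data_df2.loc[i-1, 'Cleaned']:
        # same value as the previous row: reuse its label
        data_df2.loc[i, 'label'] = data_df2.loc[i-1, 'label']
    else:
        # new value: take the next letter
        data_df2.loc[i, 'label'] = chr(c)
        c = c + 1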
Input data:
print(data_df2[:30])
id Source
0 1 ,O-PEN 2.0
1 2 .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC
2 3 @BEEZLEEXTRACTS
3 4 @CALISIFTCO_
4 5 @CALISIFTCO_ X @_ZKITTLEZ_
5 6 @CALISIFTCO_ X @WONDERBRETT
6 7 @CALISIFTCO_ X @WONDERBRETT_
7 8 @DNA_GENETICS
8 9 @EDENEXTRACTS_CA
9 10 @EDENEXTRACTS_CA X @CALISIFTCO_
10 11 @FULLFLAVAEXTRACT
11 12 @GGSTRAINS
12 13 @SHERBINSKI415
13 14 @STR8MECHANIC X @ICEDOUTEXTRACTS
14 15 @STR8MECHANIC X @REZHEADS215
15 16 [SS] 710 LABS
16 17 [SS] ABSOLUTE EXTRACTS
17 18 [SS] BIG PETE'S
18 19 [SS] BLOOM FARMS
19 20 [SS] BLUE RIVER
20 21 [SS] BRITE LABS
21 22 [SS] BROTHERLY LOVE
22 23 [SS] BROTHERLY LOVE [3 PACK]
23 24 [SS] CALIFORNIA DREAMIN
24 25 [SS] DIME BAG
25 26 [SS] EDEN INFUSIONS
26 27 [SS] EEL RIVER
27 28 [SS] GANJA GOLD
28 29 [SS] GLOWING BUDDHA
29 30 [SS] JETTY
Cleaned label_set_id label
0 O.PEN VAPE 1 A
1 BROTHERLY LOVE 1 B
2 BEEZLE EXTRACTS 1 C
3 CALI SIFT CO 1 D
4 CALI SIFT CO 1 D
5 @CALISIFTCO_ X @WONDERBRETT_ 1 E
6 @CALISIFTCO_ X @WONDERBRETT_ 1 E
7 DNA GENETICS 1 F
8 EDEN 1 G
9 CALI SIFT CO 1 H
10 FLAV RX 1 I
11 GG STRAINS 1 J
12 SHERBINSKI 1 K
13 STR8 MECHANIC 1 L
14 STR8 MECHANIC 1 L
15 710 LABS 1 M
16 ABSOLUTE XTRACTS 1 N
17 BIG PETE'S TREATS 1 O
18 BLOOM FARMS 1 P
19 BLUE RIVER 1 Q
20 BRITE LABS 2 R
21 BROTHERLY LOVE 2 S
22 BROTHERLY LOVE 2 S
23 CALIFORNIA DREAMIN 2 T
24 DIME BAG 2 U
25 EDEN 2 V
26 EEL RIVER 2 W
27 GANJA GOLD 2 X
28 GLOWING BUDDHA 2 Y
29 JETTY EXTRACTS 2 Z
Output data:
id Source
0 1 ,O-PEN 2.0
1 2 .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC
2 3 @BEEZLEEXTRACTS
3 4 @CALISIFTCO_
4 5 @CALISIFTCO_ X @_ZKITTLEZ_
5 6 @CALISIFTCO_ X @WONDERBRETT
6 7 @CALISIFTCO_ X @WONDERBRETT_
7 8 @DNA_GENETICS
8 9 @EDENEXTRACTS_CA
9 10 @EDENEXTRACTS_CA X @CALISIFTCO_
10 11 @FULLFLAVAEXTRACT
11 12 @GGSTRAINS
12 13 @SHERBINSKI415
13 14 @STR8MECHANIC X @ICEDOUTEXTRACTS
14 15 @STR8MECHANIC X @REZHEADS215
15 16 [SS] 710 LABS
16 17 [SS] ABSOLUTE EXTRACTS
17 18 [SS] BIG PETE'S
18 19 [SS] BLOOM FARMS
19 20 [SS] BLUE RIVER
20 21 [SS] BRITE LABS
21 22 [SS] BROTHERLY LOVE
22 23 [SS] BROTHERLY LOVE [3 PACK]
23 24 [SS] CALIFORNIA DREAMIN
24 25 [SS] DIME BAG
25 26 [SS] EDEN INFUSIONS
26 27 [SS] EEL RIVER
27 28 [SS] GANJA GOLD
28 29 [SS] GLOWING BUDDHA
29 30 [SS] JETTY
Cleaned label_set_id label
0 O.PEN VAPE 1 A
1 BROTHERLY LOVE 1 B
2 BEEZLE EXTRACTS 1 C
3 CALI SIFT CO 1 D
4 CALI SIFT CO 1 D
5 @CALISIFTCO_ X @WONDERBRETT_ 1 E
6 @CALISIFTCO_ X @WONDERBRETT_ 1 E
7 DNA GENETICS 1 F
8 EDEN 1 G
9 CALI SIFT CO 1 H
10 FLAV RX 1 I
11 GG STRAINS 1 J
12 SHERBINSKI 1 K
13 STR8 MECHANIC 1 L
14 STR8 MECHANIC 1 L
15 710 LABS 1 M
16 ABSOLUTE XTRACTS 1 N
17 BIG PETE'S TREATS 1 O
18 BLOOM FARMS 1 P
19 BLUE RIVER 1 Q
20 BRITE LABS 2 A
21 BROTHERLY LOVE 2 B
22 BROTHERLY LOVE 2 B
23 CALIFORNIA DREAMIN 2 C
24 DIME BAG 2 D
25 EDEN 2 E
26 EEL RIVER 2 F
27 GANJA GOLD 2 G
28 GLOWING BUDDHA 2 H
29 JETTY EXTRACTS 2 I
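IIUC, you can use groupby on label_set_id, check where each row differs from the previous one with shift, then use cumsum to get an incrementing value within each group. Add 64 and apply chr with map to turn the numbers into letters.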
import pandas as pd

#dummy example
df = pd.DataFrame({'Cleaned':list('abbcddeffijkllmn'),
                   'label_set_id':[1]*8+[2]*8})
#create the column label
df['label'] = list(map(chr, df.groupby('label_set_id')['Cleaned']
                              .apply(lambda x: x.ne(x.shift()).cumsum())+64))
print (df)
Cleaned label_set_id label
0 a 1 A
1 b 1 B
2 b 1 B #same Cleaned as previous row
3 c 1 C
4 d 1 D
5 d 1 D
6 e 1 E
7 f 1 F
8 f 2 A #restart at A for new label_set_id
9 i 2 B
10 j 2 C
11 k 2 D
12 l 2 E
13 l 2 E
14 m 2 F
15 n 2 G
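If the frame is not guaranteed to be sorted by label_set_id, one possible variant of the above uses transform in place of apply, since transform returns a result aligned to the original index and the labels can then be assigned without relying on row order. This is only a sketch on the same dummy data, not part of the original answer; the name run_no is made up for illustration, and it assumes each label_set_id has at most 26 distinct runs so the labels stay within A-Z.

```python
import pandas as pd

# Same dummy data as above
df = pd.DataFrame({'Cleaned': list('abbcddeffijkllmn'),
                   'label_set_id': [1]*8 + [2]*8})

# Within each label_set_id, number the runs of consecutive equal values 1, 2, 3, ...
# (run_no is a hypothetical helper name, not from the original post)
run_no = (df.groupby('label_set_id')['Cleaned']
            .transform(lambda s: s.ne(s.shift()).cumsum())
            .astype(int))

# 1 -> 'A', 2 -> 'B', ... (assumes at most 26 runs per label_set_id)
df['label'] = (run_no + 64).map(chr)
print(df)
```

On the dummy data this should give the same label column as the apply version.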
EDIT: if the data is sorted by label_set_id, you can do it without groupby:
df['label'] = df['Cleaned'].ne(df['Cleaned'].shift()).cumsum()
df['label'] = list(map(chr, df['label']
                            - df['label'].where(df['label_set_id'].ne(df['label_set_id'].shift()))
                                         .ffill().astype(int) + 65))
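To see why the subtraction resets the letters, it may help to split the one-liner into its intermediate steps. The sketch below does that on the same dummy data; the names run, is_start and base are introduced here for illustration only, and it assumes the rows are sorted by label_set_id with at most 26 runs per block.

```python
import pandas as pd

df = pd.DataFrame({'Cleaned': list('abbcddeffijkllmn'),
                   'label_set_id': [1]*8 + [2]*8})

# Global run number: increases by 1 each time Cleaned differs from the previous row
run = df['Cleaned'].ne(df['Cleaned'].shift()).cumsum()

# True on the first row of each label_set_id block
is_start = df['label_set_id'].ne(df['label_set_id'].shift())

# Run number at the start of each block, carried forward over the rest of the block
base = run.where(is_start).ffill().astype(int)

# Offset within the block: 0 -> 'A', 1 -> 'B', ...
df['label'] = (run - base + 65).map(chr)
print(df[['Cleaned', 'label_set_id', 'label']])
```

Because base is taken from the first row of each block, that row always gets offset 0 and therefore 'A', even when its Cleaned value equals the last value of the previous block (row 8 in the dummy data).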