想象一下,我有一个这样的数据集:
cust_id cust_state prod_title total_sales trans_timestamp
0 1001015 NY All Veggie Yummies 72.99 2021-03-30 10:06:05.108653
1 1001019 NJ Ball and String 18.95 2021-03-30 10:07:01.746945
2 1022098 NY Cat Cave 28.45 2021-03-30 10:10:41.387170
3 1022157 PA Chewie Dental 24.95 2021-03-30 10:20:11.571311
4 1022189 Illinois Chomp-a Plush 60.99 2021-03-30 10:33:11.289467
5 1002664 Kansas Feline Fix Mix 65.99 2021-03-30 10:37:55.446798
6 1002666 Florida Fetch Blaster 9.95 2021-03-30 10:39:36.488829
7 1002175 Hawaii Foozy Mouse 45.99 2021-03-30 10:43:21.490817
8 1002666 Florida Kitty Climber 35.99 2021-03-30 10:51:31.510563
9 1022189 Illinois Purr Mix 32.99 2021-03-30 11:01:50.082440
10 1011924 Indiana Fetch Blaster 19.90 2021-03-30 11:11:15.944726
11 1022236 AK Purr Mix 98.97 2021-03-30 11:15:39.390134
12 1022189 IL Cat Cave 56.90 2021-03-30 11:21:10.167505
13 1002137 Wisconsin Purrfect Puree 54.95 2021-03-30 11:27:51.133318
14 1002159 KY Foozy Mouse 91.98 2021-03-30 11:29:00.292890
15 1002175 Hawaii Reddy Beddy 21.95 2021-03-30 11:30:22.928818
16 1002108 Maine Cat Cave 85.83 2021-03-30 11:34:35.776578
17 1002186 MI Scratchy Post 48.95 2021-03-30 10:39:28.604007
18 1002175 HI Snack-em Fish 15.99 2021-03-30 10:39:40.349882
19 1002261 CO Snoozer Essentails 99.95 2021-03-30 10:51:31.510563
20 1002666 Florida Scratchy Post 48.95 2021-03-30 11:01:28.024109
21 1002159 Kentucky Purrfect Puree 219.80 2021-03-30 11:11:15.944726
22 1002680 SC Chewie Dental 49.90 2021-03-30 11:15:39.390134
23 1002678 Texas Reddy Beddy 65.85 2021-03-30 11:21:10.167505
24 1013769 OK The New Bone 71.96 2021-03-30 11:27:17.147159
25 1013859 KY Reddy Beddy 109.75 2021-03-30 11:28:48.669564
修复列cust_state
中的不一致。
因此,对于这一个,我为实际状态名称的所有缩写创建了一个字典,然后将其作为映射应用于实际的cust_state
列。然而,当我这样做的时候,它把所有的缩写都改成了NaN值。如何将所有状态值更改为一种样式?我会在下面发布我的代码。
state_to_abr = dict({'AL':'Alabama','AK':'Alaska','AZ':'Arizona','AR':'Arkansas',
'CA':'California','CO':'Colorado','CT':'Connecticut',
'DE':'Delaware',
'FL':'Florida',
'GA':'Georgia',
'HI':'Hawaii',
'ID':'Idaho','IL':'Illinois','IN':'Indiana','IA':'Iowa',
'KS':'Kansas','KY':'Kentucky',
'LA':'Louisiana',
'ME':'Maine','MD':'Maryland','MA':'Massachusetts','MI':'Michigan','MN':'Minnesota','MS':'Mississippi','MO':'Missouri','MT':'Montana',
'NE':'Nebraska','NV':'Nevada','NH':'New Hampshire','NJ':'New Jersey','NM':'New Mexico','NY':'New York','NC':'North Carolina','ND':'North Dakota',
'OH':'Ohio','OK':'Oklahoma','OR':'Oregon',
'PA':'Pennsylvania',
'RI':'Rhode Island',
'South Carolina':'SC',
'TN':'Tennessee','TX':'Texas',
'UT':'Utah',
'VT':'Vermont','VI':'Virgin Islands',
'WA':'Washington','WV':'West Virginia','WI':'Wisconsin','WY':'Wyoming'})
df_cleaned['cust_state'] = df_cleaned['cust_state'].map(state_to_abr)
df_cleaned
这将允许您映射到您编写的词典(不过我不得不更改SC的顺序(
state_to_abr = dict({'AL':'Alabama','AK':'Alaska','AZ':'Arizona','AR':'Arkansas',
'CA':'California','CO':'Colorado','CT':'Connecticut',
'DE':'Delaware',
'FL':'Florida',
'GA':'Georgia',
'HI':'Hawaii',
'ID':'Idaho','IL':'Illinois','IN':'Indiana','IA':'Iowa',
'KS':'Kansas','KY':'Kentucky',
'LA':'Louisiana',
'ME':'Maine','MD':'Maryland','MA':'Massachusetts','MI':'Michigan','MN':'Minnesota','MS':'Mississippi','MO':'Missouri','MT':'Montana',
'NE':'Nebraska','NV':'Nevada','NH':'New Hampshire','NJ':'New Jersey','NM':'New Mexico','NY':'New York','NC':'North Carolina','ND':'North Dakota',
'OH':'Ohio','OK':'Oklahoma','OR':'Oregon',
'PA':'Pennsylvania',
'RI':'Rhode Island',
'SC':'South Carolina',
'TN':'Tennessee','TX':'Texas',
'UT':'Utah',
'VT':'Vermont','VI':'Virgin Islands',
'WA':'Washington','WV':'West Virginia','WI':'Wisconsin','WY':'Wyoming'})
df_states = pd.DataFrame.from_dict(state_to_abr,orient='index', columns = ['State_Name'])
df['cust_state'] = np.where(df['cust_state'].str.len() <= 2, df['cust_state'].map(df_states['State_Name']), df['cust_state'])
df
您的缩写中似乎有一个空格,因此剥离这些空格将产生结果。其次,由于custrongtate既有缩写,也有全名,您可能希望在映射没有导致null值时进行映射,因此使用了掩码
df['cust_state']=df['cust_state'].mask(
df['cust_state'].str.strip().map(state_to_abr).notna(),
df['cust_state'].str.strip().map(state_to_abr))
df
或
检查长度,当它为两个时,做一个映射,否则保持值为
df['cust_state']=df['cust_state'].mask(
df['cust_state'].str.strip().str.len()==2,
df['cust_state'].str.strip().map(state_to_abr))
df
cust_id cust_state prod_title total_sales trans_timestamp
0 1001015 New York All Veggie Yummies 72.99 2021-03-30 10:06:05.108653
1 1001019 New Jersey Ball and String 18.95 2021-03-30 10:07:01.746945
2 1022098 New York Cat Cave 28.45 2021-03-30 10:10:41.387170
3 1022157 Pennsylvania Chewie Dental 24.95 2021-03-30 10:20:11.571311
4 1022189 Illinois Chomp-a Plush 60.99 2021-03-30 10:33:11.289467
5 1002664 Kansas Feline Fix Mix 65.99 2021-03-30 10:37:55.446798
6 1002666 Florida Fetch Blaster 9.95 2021-03-30 10:39:36.488829
7 1002175 Hawaii Foozy Mouse 45.99 2021-03-30 10:43:21.490817
8 1002666 Florida Kitty Climber 35.99 2021-03-30 10:51:31.510563
9 1022189 Illinois Purr Mix 32.99 2021-03-30 11:01:50.082440
10 1011924 Indiana Fetch Blaster 19.90 2021-03-30 11:11:15.944726
11 1022236 Alaska Purr Mix 98.97 2021-03-30 11:15:39.390134
12 1022189 Illinois Cat Cave 56.90 2021-03-30 11:21:10.167505
13 1002137 Wisconsin Purrfect Puree 54.95 2021-03-30 11:27:51.133318
14 1002159 Kentucky Foozy Mouse 91.98 2021-03-30 11:29:00.292890
15 1002175 Hawaii Reddy Beddy 21.95 2021-03-30 11:30:22.928818
16 1002108 Maine Cat Cave 85.83 2021-03-30 11:34:35.776578
17 1002186 Michigan Scratchy Post 48.95 2021-03-30 10:39:28.604007
18 1002175 Hawaii Snack-em Fish 15.99 2021-03-30 10:39:40.349882
19 1002261 Colorado Snoozer Essentails 99.95 2021-03-30 10:51:31.510563
20 1002666 Florida Scratchy Post 48.95 2021-03-30 11:01:28.024109
21 1002159 Kentucky Purrfect Puree 219.80 2021-03-30 11:11:15.944726
22 1002680 SC Chewie Dental 49.90 2021-03-30 11:15:39.390134
23 1002678 Texas Reddy Beddy 65.85 2021-03-30 11:21:10.167505
24 1013769 Oklahoma The New Bone 71.96 2021-03-30 11:27:17.147159
25 1013859 Kentucky Reddy Beddy 109.75 2021-03-30 11:28:48.669564