使一列中的所有值成为媒体(纽约 --> 纽约)



想象一下,我有一个这样的数据集:

cust_id       cust_state  prod_title            total_sales    trans_timestamp
0        1001015       NY          All Veggie Yummies    72.99           2021-03-30 10:06:05.108653 
1        1001019       NJ          Ball and String       18.95           2021-03-30 10:07:01.746945 
2        1022098       NY          Cat Cave              28.45           2021-03-30 10:10:41.387170    
3        1022157       PA          Chewie Dental         24.95           2021-03-30 10:20:11.571311 
4        1022189       Illinois    Chomp-a Plush         60.99           2021-03-30 10:33:11.289467 
5        1002664       Kansas      Feline Fix Mix        65.99           2021-03-30 10:37:55.446798      
6        1002666       Florida     Fetch Blaster         9.95            2021-03-30 10:39:36.488829 
7        1002175       Hawaii      Foozy Mouse           45.99           2021-03-30 10:43:21.490817
8        1002666       Florida     Kitty Climber         35.99           2021-03-30 10:51:31.510563      
9        1022189       Illinois    Purr Mix              32.99           2021-03-30 11:01:50.082440 
10       1011924       Indiana     Fetch Blaster         19.90           2021-03-30 11:11:15.944726 
11       1022236       AK          Purr Mix              98.97           2021-03-30 11:15:39.390134     
12       1022189       IL          Cat Cave              56.90           2021-03-30 11:21:10.167505 
13       1002137       Wisconsin   Purrfect Puree        54.95           2021-03-30 11:27:51.133318 
14       1002159       KY          Foozy Mouse           91.98           2021-03-30 11:29:00.292890 
15       1002175       Hawaii      Reddy Beddy           21.95           2021-03-30 11:30:22.928818      
16       1002108       Maine       Cat Cave              85.83           2021-03-30 11:34:35.776578 
17       1002186       MI          Scratchy Post         48.95           2021-03-30 10:39:28.604007 
18       1002175       HI          Snack-em Fish         15.99           2021-03-30 10:39:40.349882      
19       1002261       CO          Snoozer Essentails    99.95           2021-03-30 10:51:31.510563 
20       1002666       Florida     Scratchy Post         48.95           2021-03-30 11:01:28.024109 
21       1002159       Kentucky    Purrfect Puree        219.80          2021-03-30 11:11:15.944726     
22       1002680       SC          Chewie Dental         49.90           2021-03-30 11:15:39.390134 
23       1002678       Texas       Reddy Beddy           65.85           2021-03-30 11:21:10.167505 
24       1013769       OK          The New Bone          71.96           2021-03-30 11:27:17.147159      
25       1013859       KY          Reddy Beddy           109.75          2021-03-30 11:28:48.669564 

修复列cust_state中的不一致。

因此,对于这一个,我为实际状态名称的所有缩写创建了一个字典,然后将其作为映射应用于实际的cust_state列。然而,当我这样做的时候,它把所有的缩写都改成了NaN值。如何将所有状态值更改为一种样式?我会在下面发布我的代码。

state_to_abr = dict({'AL':'Alabama','AK':'Alaska','AZ':'Arizona','AR':'Arkansas',
'CA':'California','CO':'Colorado','CT':'Connecticut',
'DE':'Delaware',
'FL':'Florida',
'GA':'Georgia',
'HI':'Hawaii',
'ID':'Idaho','IL':'Illinois','IN':'Indiana','IA':'Iowa',
'KS':'Kansas','KY':'Kentucky',
'LA':'Louisiana',
'ME':'Maine','MD':'Maryland','MA':'Massachusetts','MI':'Michigan','MN':'Minnesota','MS':'Mississippi','MO':'Missouri','MT':'Montana',
'NE':'Nebraska','NV':'Nevada','NH':'New Hampshire','NJ':'New Jersey','NM':'New Mexico','NY':'New York','NC':'North Carolina','ND':'North Dakota',
'OH':'Ohio','OK':'Oklahoma','OR':'Oregon',
'PA':'Pennsylvania',
'RI':'Rhode Island',
'South Carolina':'SC',
'TN':'Tennessee','TX':'Texas',
'UT':'Utah',
'VT':'Vermont','VI':'Virgin Islands',
'WA':'Washington','WV':'West Virginia','WI':'Wisconsin','WY':'Wyoming'})
df_cleaned['cust_state'] = df_cleaned['cust_state'].map(state_to_abr)
df_cleaned

这将允许您映射到您编写的词典(不过我不得不更改SC的顺序(

state_to_abr = dict({'AL':'Alabama','AK':'Alaska','AZ':'Arizona','AR':'Arkansas',
'CA':'California','CO':'Colorado','CT':'Connecticut',
'DE':'Delaware',
'FL':'Florida',
'GA':'Georgia',
'HI':'Hawaii',
'ID':'Idaho','IL':'Illinois','IN':'Indiana','IA':'Iowa',
'KS':'Kansas','KY':'Kentucky',
'LA':'Louisiana',
'ME':'Maine','MD':'Maryland','MA':'Massachusetts','MI':'Michigan','MN':'Minnesota','MS':'Mississippi','MO':'Missouri','MT':'Montana',
'NE':'Nebraska','NV':'Nevada','NH':'New Hampshire','NJ':'New Jersey','NM':'New Mexico','NY':'New York','NC':'North Carolina','ND':'North Dakota',
'OH':'Ohio','OK':'Oklahoma','OR':'Oregon',
'PA':'Pennsylvania',
'RI':'Rhode Island',
'SC':'South Carolina',
'TN':'Tennessee','TX':'Texas',
'UT':'Utah',
'VT':'Vermont','VI':'Virgin Islands',
'WA':'Washington','WV':'West Virginia','WI':'Wisconsin','WY':'Wyoming'})
df_states = pd.DataFrame.from_dict(state_to_abr,orient='index', columns = ['State_Name'])
df['cust_state'] = np.where(df['cust_state'].str.len() <= 2, df['cust_state'].map(df_states['State_Name']), df['cust_state']) 
df

您的缩写中似乎有一个空格,因此剥离这些空格将产生结果。其次,由于custrongtate既有缩写,也有全名,您可能希望在映射没有导致null值时进行映射,因此使用了掩码

df['cust_state']=df['cust_state'].mask(
df['cust_state'].str.strip().map(state_to_abr).notna(), 
df['cust_state'].str.strip().map(state_to_abr))
df

检查长度,当它为两个时,做一个映射,否则保持值为

df['cust_state']=df['cust_state'].mask(
df['cust_state'].str.strip().str.len()==2, 
df['cust_state'].str.strip().map(state_to_abr))
df
cust_id     cust_state  prod_title  total_sales     trans_timestamp
0   1001015     New York    All Veggie Yummies  72.99   2021-03-30 10:06:05.108653
1   1001019     New Jersey  Ball and String     18.95   2021-03-30 10:07:01.746945
2   1022098     New York    Cat Cave    28.45   2021-03-30 10:10:41.387170
3   1022157     Pennsylvania    Chewie Dental   24.95   2021-03-30 10:20:11.571311
4   1022189     Illinois    Chomp-a Plush   60.99   2021-03-30 10:33:11.289467
5   1002664     Kansas  Feline Fix Mix  65.99   2021-03-30 10:37:55.446798
6   1002666     Florida     Fetch Blaster   9.95    2021-03-30 10:39:36.488829
7   1002175     Hawaii  Foozy Mouse     45.99   2021-03-30 10:43:21.490817
8   1002666     Florida     Kitty Climber   35.99   2021-03-30 10:51:31.510563
9   1022189     Illinois    Purr Mix    32.99   2021-03-30 11:01:50.082440
10  1011924     Indiana     Fetch Blaster   19.90   2021-03-30 11:11:15.944726
11  1022236     Alaska  Purr Mix    98.97   2021-03-30 11:15:39.390134
12  1022189     Illinois    Cat Cave    56.90   2021-03-30 11:21:10.167505
13  1002137     Wisconsin   Purrfect Puree  54.95   2021-03-30 11:27:51.133318
14  1002159     Kentucky    Foozy Mouse     91.98   2021-03-30 11:29:00.292890
15  1002175     Hawaii  Reddy Beddy     21.95   2021-03-30 11:30:22.928818
16  1002108     Maine   Cat Cave    85.83   2021-03-30 11:34:35.776578
17  1002186     Michigan    Scratchy Post   48.95   2021-03-30 10:39:28.604007
18  1002175     Hawaii  Snack-em Fish   15.99   2021-03-30 10:39:40.349882
19  1002261     Colorado    Snoozer Essentails  99.95   2021-03-30 10:51:31.510563
20  1002666     Florida     Scratchy Post   48.95   2021-03-30 11:01:28.024109
21  1002159     Kentucky    Purrfect Puree  219.80  2021-03-30 11:11:15.944726
22  1002680     SC  Chewie Dental   49.90   2021-03-30 11:15:39.390134
23  1002678     Texas   Reddy Beddy     65.85   2021-03-30 11:21:10.167505
24  1013769     Oklahoma    The New Bone    71.96   2021-03-30 11:27:17.147159
25  1013859     Kentucky    Reddy Beddy     109.75  2021-03-30 11:28:48.669564

最新更新