I have a DataFrame like the one shown below, and I want to insert a string based on the value in the sic2 column.
        conm                      sic2
115466  ALLEGION PLC              34.0
115471  AGILITY HEALTH INC        80.0
115473  NORDIC AMERICAN OFFSHORE  44.0
115474  AAD                       54.0
115477  DORIAN LPG LTD            44.0
115484  NOMAD FOODS LTD           20.0
115486  ATHENE HOLDING LTD        63.0
115490  MIDATECH PHARMA PLC       28.0
115495  MOTIF BIO PLC             28.0
The ranges for converting the numeric sic2 values to strings are as follows.
1-9 Agriculture, Forestry and Fishing
10-14 Mining
15-17 Construction
18-19 not used
20-39 Manufacturing
40-49 Transportation, Communications, Electric, Gas and Sanitary service
50-51 Wholesale Trade
52-59 Retail Trade
60-67 Finance, Insurance and Real Estate
70-89 Services
91-97 Public Administration
99-99 Nonclassifiable
0 -1 Agricultural Production-Crops
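I tried several pieces of conditional code, but they kept failing.
How can I apply this mapping across the whole large dataset to produce a pandas.DataFrame that looks like this?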
        conm                      sic2  industry
115466  ALLEGION PLC              34.0  Manufacturing
115471  AGILITY HEALTH INC        80.0  Services
115473  NORDIC AMERICAN OFFSHORE  44.0  Transportation, Communications, Electric, Gas and Sanitary service
115474  AAD                       54.0  Retail Trade
If you convert the SIC ranges into a dictionary, looking up the industry as needed is fairly simple:
Code:
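# Parse each line into a ["start-end", "industry name"] pair
sic = [x.strip().split(' ', 1) for x in """
    1-9 Agriculture, Forestry and Fishing
    10-14 Mining
    15-17 Construction
    18-19 not used
    20-39 Manufacturing
    40-49 Transportation, Communications, ...
    50-51 Wholesale Trade
    52-59 Retail Trade
    60-67 Finance, Insurance and Real Estate
    70-89 Services
    91-97 Public Administration
    99-99 Nonclassifiable
""".split('\n')[1:-1]]

# Expand each inclusive band into one dictionary entry per code.
# range() excludes its stop value, so add 1 to keep the upper bound (9, 14, ..., 99).
sic_dict = {}
for band, industry in sic:
    start, end = (int(y) for y in band.split('-'))
    for code in range(start, end + 1):
        sic_dict[code] = industry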
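Test code:
from io import StringIO
import pandas as pd

# read_fwf infers column boundaries from whitespace, so the columns must stay aligned.
# The header sits on the first line of the string, so the default header row applies.
df = pd.read_fwf(StringIO(u"""number  conm                      sic2
115466  ALLEGION PLC              34.0
115471  AGILITY HEALTH INC        80.0
115473  NORDIC AMERICAN OFFSHORE  44.0
115474  AAD                       54.0
115477  DORIAN LPG LTD            44.0
115484  NOMAD FOODS LTD           20.0
115486  ATHENE HOLDING LTD        63.0
115490  MIDATECH PHARMA PLC       28.0
115495  MOTIF BIO PLC             28.0"""))
# sic2 is read as float, so cast to int before the dictionary lookup
df['industry'] = df.sic2.apply(lambda x: sic_dict[int(x)])
print(df)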
Results:
   number  conm                      sic2  industry
0  115466  ALLEGION PLC              34.0  Manufacturing
1  115471  AGILITY HEALTH INC        80.0  Services
2  115473  NORDIC AMERICAN OFFSHORE  44.0  Transportation, Communications, ...
3  115474  AAD                       54.0  Retail Trade
4  115477  DORIAN LPG LTD            44.0  Transportation, Communications, ...
5  115484  NOMAD FOODS LTD           20.0  Manufacturing
6  115486  ATHENE HOLDING LTD        63.0  Finance, Insurance and Real Estate
7  115490  MIDATECH PHARMA PLC       28.0  Manufacturing
8  115495  MOTIF BIO PLC             28.0  Manufacturing
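One caveat with a plain dictionary lookup is that sic_dict[int(x)] raises KeyError for any code that falls outside the listed bands (68, 69, 90 and 98 are not covered by the table in the question). A minimal sketch of a more forgiving lookup follows, in plain Python. The names SIC_BANDS and build_lookup are illustrative and do not come from the answer, and the bands are copied from the question's table.

```python
# Illustrative names (SIC_BANDS, build_lookup); bands copied from the question.
SIC_BANDS = [
    (1, 9, "Agriculture, Forestry and Fishing"),
    (10, 14, "Mining"),
    (15, 17, "Construction"),
    (18, 19, "not used"),
    (20, 39, "Manufacturing"),
    (40, 49, "Transportation, Communications, Electric, Gas and Sanitary service"),
    (50, 51, "Wholesale Trade"),
    (52, 59, "Retail Trade"),
    (60, 67, "Finance, Insurance and Real Estate"),
    (70, 89, "Services"),
    (91, 97, "Public Administration"),
    (99, 99, "Nonclassifiable"),
]


def build_lookup(bands):
    """Expand inclusive (start, end, label) bands into a {code: label} dict."""
    return {
        code: label
        for start, end, label in bands
        for code in range(start, end + 1)  # +1 because range() excludes its stop
    }


sic_lookup = build_lookup(SIC_BANDS)

for code in (34.0, 99.0, 68.0):
    # .get() returns None for unlisted codes instead of raising KeyError
    print(code, sic_lookup.get(int(code)))
```

With pandas, the same dictionary could be applied as df.sic2.astype(int).map(sic_lookup), which leaves NaN for codes that are not in the dictionary rather than raising.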
import pandas as pd

# Save your mapping table to a data frame
df2 = pd.DataFrame({'id_end': {0: 9, 1: 14, 2: 17, 3: 19, 4: 39, 5: 49, 6: 51, 7: 59, 8: 67, 9: 89, 10: 97, 11: 99, 12: 1},
                    'id_start': {0: 1, 1: 10, 2: 15, 3: 18, 4: 20, 5: 40, 6: 50, 7: 52, 8: 60, 9: 70, 10: 91, 11: 99, 12: 0},
                    'industry': {0: 'Agriculture, Forestry and Fishing', 1: 'Mining', 2: 'Construction', 3: 'not used', 4: 'Manufacturing',
                                 5: 'Transportation, Communications, Electric, Gas and Sanitary service',
                                 6: 'Wholesale Trade', 7: 'Retail Trade', 8: 'Finance, Insurance and Real Estate', 9: 'Services',
                                 10: 'Public Administration', 11: 'Nonclassifiable', 12: 'Agricultural Production Crops'}})
# Sort by the upper bound so the first row with id_end >= code is the matching band
df2 = df2.sort_values(by='id_end')
Out[354]:
    id_end  id_start  industry
12  1       0         Agricultural Production Crops
0   9       1         Agriculture, Forestry and Fishing
1   14      10        Mining
2   17      15        Construction
3   19      18        not used
4   39      20        Manufacturing
5   49      40        Transportation, Communications, Electric, Gas ...
6   51      50        Wholesale Trade
7   59      52        Retail Trade
8   67      60        Finance, Insurance and Real Estate
9   89      70        Services
10  97      91        Public Administration
11  99      99        Nonclassifiable
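# Map sic2 number to industry names: take the first band whose id_end is >= the code.
# np.int was removed from NumPy, so use the built-in int instead.
df['industry'] = df['sic2'].astype(int).apply(lambda x: df2.loc[df2.id_end >= x, 'industry'].iloc[0])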
Out[352]:
        conm                      sic2  industry
115466  ALLEGION PLC              34.0  Manufacturing
115471  AGILITY HEALTH INC        80.0  Services
115473  NORDIC AMERICAN OFFSHORE  44.0  Transportation, Communications, Electric, Gas ...
115474  AAD                       54.0  Retail Trade
115477  DORIAN LPG LTD            44.0  Transportation, Communications, Electric, Gas ...
115484  NOMAD FOODS LTD           20.0  Manufacturing
115486  ATHENE HOLDING LTD        63.0  Finance, Insurance and Real Estate
115490  MIDATECH PHARMA PLC       28.0  Manufacturing
115495  MOTIF BIO PLC             28.0  Manufacturing
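The row-wise apply filters the mapping table once per row, which can get slow on a large dataset. The same rule, "first band whose id_end is at least the code", can be sketched with numpy.searchsorted, which does the lookup for the whole column at once. This is only a sketch under assumptions: the helper name label_codes and the bands frame are illustrative, the bands are sorted by id_end and do not overlap (the 0-1 row is left out for that reason), and sic2 has no missing values. It also checks id_start, so a code that falls between bands (such as 68) comes back as None rather than picking up the next band's label, and a code above 99 comes back as None rather than raising IndexError.

```python
import numpy as np
import pandas as pd

# Illustrative mapping table, copied from the question and sorted by upper bound.
bands = pd.DataFrame({
    "id_start": [1, 10, 15, 18, 20, 40, 50, 52, 60, 70, 91, 99],
    "id_end": [9, 14, 17, 19, 39, 49, 51, 59, 67, 89, 97, 99],
    "industry": [
        "Agriculture, Forestry and Fishing", "Mining", "Construction",
        "not used", "Manufacturing",
        "Transportation, Communications, Electric, Gas and Sanitary service",
        "Wholesale Trade", "Retail Trade",
        "Finance, Insurance and Real Estate", "Services",
        "Public Administration", "Nonclassifiable",
    ],
}).sort_values("id_end").reset_index(drop=True)


def label_codes(codes, bands):
    """Return an object array of industry names (None where no band matches)."""
    codes = np.asarray(codes).astype(int)
    starts = bands["id_start"].to_numpy()
    ends = bands["id_end"].to_numpy()
    # Index of the first band whose id_end >= code, clipped to a valid row
    pos = np.clip(np.searchsorted(ends, codes, side="left"), 0, len(bands) - 1)
    # A code only matches if it also lies at or above that band's start
    in_band = (starts[pos] <= codes) & (codes <= ends[pos])
    return np.where(in_band, bands["industry"].to_numpy()[pos], None)


df = pd.DataFrame({
    "conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "NORDIC AMERICAN OFFSHORE", "AAD"],
    "sic2": [34.0, 80.0, 44.0, 54.0],
})
df["industry"] = label_codes(df["sic2"], bands)
print(df)
```

Whether this is worth it over the apply version depends on the size of the data; for a few thousand rows either approach should be fine.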