How to strip and uppercase a column in a DataFrame read from an Excel file: avoiding the unicode/str error message



I have the following Excel sheet, which can be downloaded here.

I read it with pandas like this:

import pandas as pd
infile = "sample1_neu_input_deconv.xlsx"
outdf = pd.read_excel(infile)
outdf.head()

It looks like this:

In [8]: outdf.head()
Out[8]:
     ID_REF Gene.Symbol  GSM1711905  GSM1711906  GSM1711907
0  10344620     Gm10568      78.496      70.582      78.496
1  10344622     Gm10568      87.940      85.746      94.670
2  10344624      Lypla1     324.306     450.037     231.723
3  10344633       Tcea1     361.733     758.949     917.704
4  10344637     Atp6v1h     236.272     275.910     453.972

What I want to do now is strip the whitespace from the Gene.Symbol column and convert it to uppercase, using the following command:

outdf["Gene.Symbol"].map(str.strip).map(str.upper)

But it gives me the following error:

TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'

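What is the right way to do this?

The error occurs because, on Python 2, the cell values read from the Excel file are unicode objects, while str.strip and str.upper are methods of the str type and refuse a unicode argument. You can chain successive vectorized .str calls to achieve what you want: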

In [4]:
outdf['Gene.Symbol'] = outdf['Gene.Symbol'].str.strip().str.upper()
outdf['Gene.Symbol']
Out[4]:
0              GM10568
1              GM10568
2               LYPLA1
3                TCEA1
4              ATP6V1H
5                OPRK1
6               RB1CC1
7              FAM150A
8                 ST18
9               PCMTD1
10                RRS1
11              ADHFE1
12       3110035E14RIK
13                SGK3
14       6030422M02RIK
15               CSPP1
16               CSPP1
17               CSPP1
18               CSPP1
19               CSPP1
20               CSPP1
21               CSPP1
22               CSPP1
23               CSPP1
24               CSPP1
25               CSPP1
26               CSPP1
27               CSPP1
28               CSPP1
29               PREX2
             ...      
24649        LOC380994
24650     LOC100504530
24651            SSTY2
24652        LOC665698
24653        LOC380994
24654            SSTY2
24655     LOC100039147
24656        LOC665746
24657            SSTY2
24658        LOC665128
24659            SSTY2
24660           RBM31Y
24661     LOC100039753
24662            SSTY1
24663            SSTY1
24664            SSTY1
24665        LOC380994
24666     LOC100504530
24667     LOC100039753
24668             SRSY
24669              SLY
24670     LOC100504530
24671              SLY
24672     LOC100039753
24673            SSTY2
24674     LOC100042196
24675        LOC380994
24676     LOC100040235
24677     LOC100041704
24678            SSTY2
Name: Gene.Symbol, dtype: object
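
For readers without the Excel file, here is a minimal self-contained sketch of the same chained call. The small DataFrame, its padded values, and the missing entry are made up for illustration and are not taken from the original sheet. It also shows an element-wise alternative that calls the methods on each value instead of going through str.strip; that variant assumes Python 3, where every text value is a str.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data: padded symbols plus one missing value.
df = pd.DataFrame({
    "ID_REF": [10344620, 10344624, 10344633, 10344637],
    "Gene.Symbol": ["  Gm10568", "Lypla1 ", " Tcea1 ", None],
})

# Vectorized approach from the answer: the .str accessor applies the
# method to each string and leaves missing values as missing.
cleaned = df["Gene.Symbol"].str.strip().str.upper()

# Element-wise alternative: call the methods on the value itself so the
# concrete string type does not matter; non-string values pass through.
mapped = df["Gene.Symbol"].map(
    lambda s: s.strip().upper() if isinstance(s, str) else s
)

print(cleaned.tolist()[:3])
```

The vectorized form is usually preferable because it handles missing values without an explicit check.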
