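Below is what my input data looks like. I want to use pandas/python/regex to extract every string that starts with "Unit" into a new second column, on the same row where the word appears. Any help would be appreciated.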
Input:
A
MARYLAND
Unit6
Unit7
Unit8
NEW SECTOR
Unit1
Unit2
NORTH SECTOR
Unit1
Unit2
PVT SECTOR
PUBLIC SECTOR
Unit1
Unit2
CENTRAL SECTOR
THERMAL
SOUTH SECTOR
Unit1
Unit2
Unit3
ACCOUNT SECTOR
DOLBY DIGITAL
WASHINGTON
Output:
A B
MARYLAND
Unit6 Unit6
Unit7 Unit7
Unit8 Unit8
NEW SECTOR
Unit1 Unit1
Unit2 Unit2
NORTH SECTOR
Unit1 Unit1
Unit2 Unit2
PVT SECTOR
PUBLIC SECTOR
Unit1 Unit1
Unit2 Unit2
CENTRAL SECTOR
THERMAL
SOUTH SECTOR
Unit1 Unit1
Unit2 Unit2
Unit3 Unit3
ACCOUNT SECTOR
DOLBY DIGITAL
WASHINGTON
Finally, now that the "Unit" strings have been copied to the new column, I want to remove those values from column A:
A               B
MARYLAND
                Unit6
                Unit7
                Unit8
NEW SECTOR
                Unit1
                Unit2
NORTH SECTOR
                Unit1
                Unit2
PVT SECTOR
PUBLIC SECTOR
                Unit1
                Unit2
CENTRAL SECTOR
THERMAL
SOUTH SECTOR
                Unit1
                Unit2
                Unit3
ACCOUNT SECTOR
DOLBY DIGITAL
WASHINGTON
Use str.extract and fillna:
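# Capture "Unit" followed by digits at the start of each value; non-matching rows become NaN
df['B'] = df['A'].str.extract(r'(^Unit\d+)', expand=False)
# Blank out column A wherever a Unit value was extracted
df.loc[df['B'].notnull(), 'A'] = ''
# Replace the remaining NaN values in B with empty strings
df['B'] = df['B'].fillna('')
print(df)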
                 A      B
0         MARYLAND
1                   Unit6
2                   Unit7
3                   Unit8
4       NEW SECTOR
5                   Unit1
6                   Unit2
7     NORTH SECTOR
8                   Unit1
9                   Unit2
10      PVT SECTOR
11   PUBLIC SECTOR
12                  Unit1
13                  Unit2
14  CENTRAL SECTOR
15         THERMAL
16    SOUTH SECTOR
17                  Unit1
18                  Unit2
19                  Unit3
20  ACCOUNT SECTOR
21   DOLBY DIGITAL
22      WASHINGTON
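To try the str.extract approach end to end, here is a minimal self-contained sketch. It assumes the data lives in a single string column named A and uses only a few rows from the question; the digit class is written as `\d` inside a raw string so the backslash survives.

```python
import pandas as pd

# A few rows from the question's data (assumed to be one string column "A")
df = pd.DataFrame({"A": ["MARYLAND", "Unit6", "Unit7", "NEW SECTOR", "Unit1", "THERMAL"]})

# expand=False returns a Series; rows that do not match become NaN
df["B"] = df["A"].str.extract(r"^(Unit\d+)", expand=False)

# Clear column A where a Unit value was found, then replace NaN in B with ""
df.loc[df["B"].notna(), "A"] = ""
df["B"] = df["B"].fillna("")

print(df)
```

Assigning the result of fillna back to the column, rather than calling it with inplace=True on df['B'], avoids the chained-assignment warnings that recent pandas versions raise.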
Another approach, using column A with a boolean index:
df["B"] = df["A"][df['A'].str.contains('^Unit', regex=True)]
df["B"] = df["B"].fillna("")
A B
0 MARYLAND
1 Unit6 Unit6
2 Unit7 Unit7
3 Unit8 Unit8
4 NEW SECTOR
5 Unit1 Unit1
6 Unit2 Unit2
7 NORTH SECTOR
8 Unit1 Unit1
9 Unit2 Unit2
10 PVT SECTOR
11 PUBLIC SECTOR
12 Unit1 Unit1
13 Unit2 Unit2
14 CENTRAL SECTOR
15 THERMAL
16 SOUTH SECTOR
17 Unit1 Unit1
18 Unit2 Unit2
19 Unit3 Unit3
20 ACCOUNT SECTOR
21 DOLBY DIGITAL
22 WASHINGTON
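The output above still has the Unit values in column A. If they should also be removed, as the question asks, one possible variation on the same boolean-mask idea is sketched below. It swaps in Series.where and Series.mask, which are not part of the original answer, and the name is_unit is just illustrative.

```python
import pandas as pd

# A few rows from the question's data (assumed to be one string column "A")
df = pd.DataFrame({"A": ["PVT SECTOR", "PUBLIC SECTOR", "Unit1", "Unit2", "CENTRAL SECTOR"]})

# Boolean mask: True for rows whose value starts with "Unit"
is_unit = df["A"].str.startswith("Unit")

# Series.where keeps values where the mask is True and substitutes "" elsewhere
df["B"] = df["A"].where(is_unit, "")
# Series.mask is the inverse: it replaces values where the mask is True
df["A"] = df["A"].mask(is_unit, "")

print(df)
```

Because the mask is computed once before either column is changed, the order of the two assignments does not matter.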