Unable to extract a string using multiple special characters or patterns at once in Python



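I have a dataset and I am trying to extract the simple town names from the longer, messy versions shown here. Most of them are followed by a parenthesis (" (.*"), but some do not follow this pattern and end with ":" instead (see row 200). Finally, some have no parenthesis and instead separate the parts with a comma "," (see rows 240 and 246).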

'Region'
196    Boston (Boston University, Boston College, Bos...
197           Bridgewater (Bridgewater State College)[2]
198    Cambridge (Harvard University, Massachusetts I...
199                       Chestnut Hill (Boston College)
200                The Colleges of Worcester Consortium:
201                             Dudley (Nichols College)
240                     Faribault, South Central College
241    Mankato (Minnesota State University, Mankato),...
242    Marshall (Southwest Minnesota State University...
243    Moorhead (Minnesota State University, Moorhead...
244           Morris (University of Minnesota Morris)[2]
245    Northfield (Carleton College, St. Olaf College...
246                 North Mankato, South Central College
247    St. Cloud (St. Cloud State University, The Col...
248            St. Joseph (College of Saint Benedict)[2]
249             St. Peter (Gustavus Adolphus College)[2]

What I would ideally like to see is:

'RegionName'
196                                 Boston
197                            Bridgewater
198                              Cambridge
199                          Chestnut Hill
200   The Colleges of Worcester Consortium
201                                 Dudley
240                              Faribault
241                                Mankato
242                               Marshall
243                               Moorhead
244                                 Morris
245                             Northfield
246                          North Mankato
247                              St. Cloud
248                             St. Joseph
249                              St. Peter

My current code is:

df['RegionName'] = df['Region'].str.extract('(.*)[:(,]', expand=False)

But this gives me an odd result in which the parentheses are not handled correctly:

196    Boston (Boston University, Boston College, Bos...
197                                         Bridgewater 
198    Cambridge (Harvard University, Massachusetts I...
199                                       Chestnut Hill 
200                 The Colleges of Worcester Consortium
201                                              Dudley 
240                                         Faribault
241     Mankato (Minnesota State University, Mankato)
242                                         Marshall 
243    Moorhead (Minnesota State University, Moorhead
244                                           Morris 
245                      Northfield (Carleton College
246                                     North Mankato
247             St. Cloud (St. Cloud State University
248                                       St. Joseph 
249                                        St. Peter 

I have also tried:

df['RegionName'] = df['Region'].str.extract('(.*)[ (.*|:|,]', expand=False)

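I'm not sure how to extract the string using all three patterns at once. I would also be open to a two-line solution. Thanks (and apologies if the formatting is poor!)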

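You can simply extract zero or more characters other than :, ( and , at the start of the string:

df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)

If you are using Python 2.x, put (?u) at the start of the pattern so that the word boundary \b also matches at the correct positions in Unicode strings.

  • ^ - start of the string
  • ([^:(,]*) - Group 1: zero or more (*) consecutive occurrences of any character other than :, ( and , (the [^...] forms a negated character class).
  • \b - a word boundary, which makes the group backtrack past any trailing space.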
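To see why the original pattern misbehaves, here is a minimal sketch using the standard `re` module on three rows from the question. The greedy `.*` in `(.*)[:(,]` runs to the last delimiter in the string, while the negated character class stops at the first one. The variable names (`samples`, `greedy`, `negated`) are only illustrative.

```python
import re

# Three rows copied from the question's 'Region' column.
samples = [
    'Bridgewater (Bridgewater State College)[2]',
    'The Colleges of Worcester Consortium:',
    'Mankato (Minnesota State University, Mankato),...',
]

greedy = re.compile(r'(.*)[:(,]')      # pattern from the question
negated = re.compile(r'^([^:(,]*)\b')  # pattern from this answer

# Greedy .* backtracks only as far as the LAST ':', '(' or ','.
greedy_names = [greedy.search(s).group(1) for s in samples]
# The negated class stops at the FIRST delimiter; \b trims the trailing space.
negated_names = [negated.search(s).group(1) for s in samples]

for g, n in zip(greedy_names, negated_names):
    print(repr(g), '->', repr(n))
```

Note that `\b` needs at least one word character to anchor on, so an empty string would produce no match at all.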

See the Python 3 demo below:

>>> import pandas as pd
>>> item_list = ['Boston (Boston University, Boston College, Bos...','Bridgewater (Bridgewater State College)[2]','Cambridge (Harvard University, Massachusetts I...','Chestnut Hill (Boston College)','The Colleges of Worcester Consortium:','Dudley (Nichols College)','Faribault, South Central College','Mankato (Minnesota State University, Mankato),...','Marshall (Southwest Minnesota State University...','Moorhead (Minnesota State University, Moorhead...','Morris (University of Minnesota Morris)[2]','Northfield (Carleton College, St. Olaf College...','North Mankato, South Central College','St. Cloud (St. Cloud State University, The Col...','St. Joseph (College of Saint Benedict)[2]','St. Peter (Gustavus Adolphus College)[2]']
>>> df = pd.DataFrame(item_list, columns=['Region'])
>>> df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
>>> df['RegionName']
RegionName  
0                                 Boston  
1                            Bridgewater  
2                              Cambridge  
3                          Chestnut Hill  
4   The Colleges of Worcester Consortium  
5                                 Dudley  
6                              Faribault  
7                                Mankato  
8                               Marshall  
9                               Moorhead  
10                                Morris  
11                            Northfield  
12                         North Mankato  
13                             St. Cloud  
14                            St. Joseph  
15                             St. Peter  
>>> 

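Since you only have three possible delimiters, you can take advantage of chained split() calls: if the delimiter is not found, split() returns a one-element list containing the unmodified string, so taking element [0] is always safe.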

>>> s = """196    Boston (Boston University, Boston College, Bos...
... 197           Bridgewater (Bridgewater State College)[2]
... 198    Cambridge (Harvard University, Massachusetts I...
... 199                       Chestnut Hill (Boston College)
... 200                The Colleges of Worcester Consortium:
... 201                             Dudley (Nichols College)
... 240                     Faribault, South Central College
... 241    Mankato (Minnesota State University, Mankato),...
... 242    Marshall (Southwest Minnesota State University...
... 243    Moorhead (Minnesota State University, Moorhead...
... 244           Morris (University of Minnesota Morris)[2]
... 245    Northfield (Carleton College, St. Olaf College...
... 246                 North Mankato, South Central College
... 247    St. Cloud (St. Cloud State University, The Col...
... 248            St. Joseph (College of Saint Benedict)[2]
... 249             St. Peter (Gustavus Adolphus College)[2]"""
>>> for i in s.split('\n'):
...    number, text = i.split('(')[0].split(',')[0].split(':')[0].split(' ',1)
...    print('{} {}'.format(number, text.strip()))
...
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter

You can perform the same transformation on the strings with df.apply.
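As a rough sketch of what that could look like, the chained split can be wrapped in a small helper and applied to the column with `Series.apply`. The helper name `town_name` and the three-row DataFrame are made up for illustration; the rows themselves come from the question.

```python
import pandas as pd

def town_name(text):
    # Cut at the first '(', then ',', then ':'. If a delimiter is absent,
    # split() returns [text], so [0] leaves the string unchanged.
    return text.split('(')[0].split(',')[0].split(':')[0].strip()

df = pd.DataFrame({'Region': [
    'Chestnut Hill (Boston College)',
    'The Colleges of Worcester Consortium:',
    'Faribault, South Central College',
]})

# Apply the helper element-wise to the 'Region' column.
df['RegionName'] = df['Region'].apply(town_name)
print(df['RegionName'].tolist())
```

Unlike the REPL loop above, this version has no leading row number to peel off, so the final split(' ', 1) step is not needed.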

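Use the following regular expression:

([\w\s.]+)(?<!\s)

If you don't care about trailing whitespace, you can drop the negative lookbehind (?<!\s) at the end.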
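A quick check of this pattern with the standard `re` module, as a sketch. `re.search` is used here because `str.extract` also takes the first match anywhere in the string. The last sample, 'Winston-Salem (Example College)', is a made-up row (not from the question) included to show the pattern's limit: it only allows word characters, whitespace and periods, so a hyphenated name is cut short.

```python
import re

# Word characters, whitespace and '.', with a lookbehind that rejects
# a match ending in whitespace.
pattern = re.compile(r'([\w\s.]+)(?<!\s)')

samples = [
    'St. Cloud (St. Cloud State University, The Col...',
    'The Colleges of Worcester Consortium:',
    'North Mankato, South Central College',
    'Winston-Salem (Example College)',  # hypothetical row, not in the data
]

names = [pattern.search(s).group(1) for s in samples]
print(names)
```

If the real data contains hyphens or apostrophes in town names, the negated character class from the first answer is likely the safer choice.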
