Extracting a new feature from a sentence column - Python



I have two dataframes:

The city_state dataframe:

    city        state
0   huntsville  alabama
1   montgomery  alabama
2   birmingham  alabama
3   mobile      alabama
4   dothan      alabama
5   chicago     illinois
6   boise       idaho
7   des moines  iowa

and the sentence dataframe:

    sentence
0   marthy was born in dothan
1   michelle reads some books at her home
2   hasan is highschool student in chicago
3   hartford of the west is the nickname of des moines

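I want to extract a new feature, a column named city, for the sentence dataframe. The city column is derived from sentence: if a sentence contains the name of one of the cities in city_state['city'], the value is that city name; if it contains no city name, the value is Null.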

The expected new dataframe looks like this:

    sentence                                            city
0   marthy was born in dothan                           dothan
1   michelle reads some books at her home               Null
2   hasan is highschool student in chicago              chicago
3   hartford of the west is the nickname of des moines  des moines
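For anyone who wants to reproduce the problem, the two frames can be rebuilt as follows. This is only a setup sketch: the values are copied from the tables above, and the variable names `city_state` and `sentence` are the ones used in the question.

```python
import pandas as pd

# Sample data copied from the tables in the question.
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})

sentence = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})
```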

I ran this code:

sentence['city'] = {}
for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
        else:
            sentence['city'].append(None)

but the result of this code is:

ValueError: Length of values does not match length of index

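If you have feature-engineering experience with a similar case, could you give me some advice on how to write correct code that produces the expected result?

Thanks

Note: here is the full error log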

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
2 
3 for city in city_state.city:
4     for text in sentence.sentence:
5         words = text.split()
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3117         else:
3118             # set column
-> 3119             self._set_item(key, value)
3120 
3121     def _setitem_slice(self, key, value):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
3192 
3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
3195         NDFrame._set_item(self, key, value)
3196 
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
3389 
3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
3392             if not isinstance(value, (np.ndarray, Index)):
3393                 if isinstance(value, list) and len(value) > 0:
~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
3999 
4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
4002 
4003     if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
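
The arrow in the traceback points at line 1: `sentence['city'] = {}` tries to assign an empty dict (length 0) as a column of a four-row frame, which is the length mismatch being reported. Even with that fixed, looping over cities on the outside would produce one entry per city-sentence pair rather than one per sentence. A minimal corrected sketch, assuming a subset of the sample data (rebuilt here so the snippet runs on its own) and a helper list named `found` that is not in the original, collects exactly one value per sentence and assigns the column once. It tests for substrings instead of split words so that 'des moines' can match.

```python
import pandas as pd

city_state = pd.DataFrame({"city": ["dothan", "chicago", "des moines"],
                           "state": ["alabama", "illinois", "iowa"]})
sentence = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

found = []  # hypothetical helper: one entry per sentence
for text in sentence["sentence"]:
    for city in city_state["city"]:
        if city in text:      # substring test, so multi-word names match
            found.append(city)
            break
    else:                     # runs only if the inner loop never hit `break`
        found.append(None)

sentence["city"] = found      # len(found) == len(sentence), so the lengths agree
```

Because this is a plain substring test, a city name that happens to occur inside a longer word would also match.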

A quick-and-dirty approach; I have not tested it on large dataframes, so use it with caution. First define a function that extracts the city names:

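def ex_city(col, cities):
    # col is one sentence string; collect every city name that occurs in it
    output = []
    for w in cities:
        if w in col:  # substring test, so 'des moines' is found too
            output.append(w)
    # several matches are joined with commas; no match gives None
    return ','.join(output) if output else None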

Then apply it to the sentence dataframe:

city_list = city_state.city.unique().tolist()
sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
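
One caveat with the substring test `w in col` is that it also matches inside longer words ('mobile' inside 'automobiles', for instance). A possible alternative, sketched here with a regular expression and word boundaries rather than taken from the answer itself, is `Series.str.extract`. The 'automobiles' row is a made-up example, and unlike `ex_city` this keeps only the first match per sentence.

```python
import re
import pandas as pd

cities = ["mobile", "dothan", "chicago", "des moines"]
sentence = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "he repairs automobiles",  # hypothetical row: "mobile" occurs only inside a word
    "hartford of the west is the nickname of des moines",
]})

# Longest names first, so a longer name wins over a shorter one it contains.
ordered = sorted(cities, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(c) for c in ordered) + r")\b"

# With one capture group and expand=False the result is a Series;
# rows without a match come back as missing values.
sentence["city"] = sentence["sentence"].str.extract(pattern, expand=False)
```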

sdf = sentence dataframe, cdf = city_state dataframe

des moines will be a problem when using str.split, because its name contains a space.

First (or last; this needs testing), handle that city:

sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'

Then the rest:

def get_city(sentence, cities):
    # return the first word of the sentence that is a known city name
    for word in sentence.split(' '):
        if word in cities:
            return word
    return None

cities = cdf['city'].tolist()
sdf['city'] = sdf['sentence'].apply(lambda x: get_city(x, cities))
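
On the "first or last" question: the `apply` line assigns the whole `city` column, so running it after the `.loc` line would discard the 'des moines' value. A sketch of an order that works, using `word in cities` as the membership test and generalising the special case to any city whose name contains a space (that generalisation is an assumption, not part of the answer), could look like this:

```python
import pandas as pd

cdf = pd.DataFrame({"city": ["dothan", "chicago", "des moines"]})
sdf = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hartford of the west is the nickname of des moines",
]})

def get_city(sentence, cities):
    # Return the first whitespace-separated word that is a known city.
    for word in sentence.split():
        if word in cities:
            return word
    return None

cities = cdf["city"].tolist()

# Step 1: single-word cities. This assigns the whole column.
sdf["city"] = sdf["sentence"].apply(lambda s: get_city(s, cities))

# Step 2: multi-word cities, applied afterwards so they are not overwritten.
for name in (c for c in cities if " " in c):
    mask = sdf["sentence"].str.contains(name, regex=False)
    sdf.loc[mask, "city"] = name
```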

Something like this should work. I would try it myself, but I am on my phone.

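# use a set: `in` on a pandas Series tests the index labels, not the values
cities = set(city_state.city)
sentence_cities = []
for text in sentence.sentence:
    # keep the first word that is a known city, or None (one entry per sentence)
    matches = [word for word in text.split() if word in cities]
    sentence_cities.append(matches[0] if matches else None)
sentence['city'] = sentence_cities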
