I have two dataframes. The first is the city_state dataframe:
   city        state
0  huntsville  alabama
1  montgomery  alabama
2  birmingham  alabama
3  mobile      alabama
4  dothan      alabama
5  chicago     illinois
6  boise       idaho
7  des moines  iowa
and the second is the sentence dataframe:
   sentence
0  marthy was born in dothan
1  michelle reads some books at her home
2  hasan is highschool student in chicago
3  hartford of the west is the nickname of des moines
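I want to extract a new feature, named city, from the sentence dataframe. The city column is derived from sentence: if a sentence contains the name of one of the cities in the city_state['city'] column, the value is that city name; if it does not contain any of those names, the value is Null.
The expected new dataframe looks like this: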
   sentence                                      city
0  marthy was born in dothan                     dothan
1  michelle reads some books at her home         Null
2  hasan is highschool student in chicago        chicago
3  capital of dream is the motto of des moines   des moines
I have run this code:
sentence['city'] ={}
for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
        else:
            sentence['city'].append(None)
but the result of this code is:
ValueError: Length of values does not match length of index
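If you have experience with feature engineering for a similar case, could you give me some advice on how to write correct code to get the expected result?
Thanks.
Note: this is the full error log: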
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
2
3 for city in city_state.city:
4 for text in sentence.sentence:
5 words = text.split()
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3117 else:
3118 # set column
-> 3119 self._set_item(key, value)
3120
3121 def _setitem_slice(self, key, value):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
3192
3193 self._ensure_valid_index(value)
-> 3194 value = self._sanitize_column(key, value)
3195 NDFrame._set_item(self, key, value)
3196
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
3389
3390 # turn me into an ndarray
-> 3391 value = _sanitize_index(value, self.index, copy=False)
3392 if not isinstance(value, (np.ndarray, Index)):
3393 if isinstance(value, list) and len(value) > 0:
~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
3999
4000 if len(data) != len(index):
-> 4001 raise ValueError('Length of values does not match length of ' 'index')
4002
4003 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
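Here is a quick and dirty approach using apply. I have not tested it on large dataframes, so use it with caution. First define a function to extract the city names: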
def ex_city(col, cities):
    output = []
    for w in cities:
        if w in col:
            output.append(w)
    return ','.join(output) if output else None
Then apply it to the sentence dataframe:
city_list = city_state.city.unique().tolist()
sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
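For reference, here is the same approach as a self-contained sketch that rebuilds the two dataframes from the question, so it can be run as is. The data is copied from the question; nothing else is assumed. One caveat worth keeping in mind: `w in col` is a substring test, so a city such as mobile would also match inside a word like "automobile".

```python
import pandas as pd

# Sample data copied from the question.
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama"] * 5 + ["illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})


def ex_city(col, cities):
    # Collect every city name that occurs as a substring of the sentence.
    output = [w for w in cities if w in col]
    # Join multiple hits with commas; return None when nothing matched.
    return ",".join(output) if output else None


city_list = city_state["city"].unique().tolist()
sentence["city"] = sentence["sentence"].apply(lambda x: ex_city(x, city_list))
print(sentence)
```

Because the check works on the whole sentence string rather than on split words, multi-word names such as des moines are matched without special handling.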
Let sdf be the sentence dataframe and cdf be the city_state dataframe.
des moines will be a problem when using str.split, because it has a space in its name.
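Handle that city separately with str.contains. Note that the apply call further down assigns the whole city column, so this line has to run after it, or its result will be overwritten:
sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'
For the rest of the cities: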
def get_city(sentence, cities):
    # Return the first word of the sentence that is a known city name.
    for word in sentence.split(' '):
        if word in cities:
            return word
    return None
cities = cdf['city'].tolist()
sdf['city'] = sdf['sentence'].apply(lambda x: get_city(x, cities))
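If special-casing des moines feels fragile, one possible alternative (not part of the answer above, just a sketch) is to build a single regular expression from all city names and let `Series.str.extract` find the first match. The sample rows below are taken from the question, with a shortened city list; the variable names `names` and `pattern` are only illustrative.

```python
import re

import pandas as pd

# A reduced version of the question's data, for illustration only.
cdf = pd.DataFrame({"city": ["dothan", "chicago", "boise", "des moines"]})
sdf = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

# Longer names first, so that at a given position a longer name is tried
# before a shorter name it starts with.
names = sorted(cdf["city"].unique(), key=len, reverse=True)

# \b anchors keep a name from matching inside a longer word.
pattern = r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"

# expand=False returns a Series; rows without a match become NaN.
sdf["city"] = sdf["sentence"].str.extract(pattern, expand=False)
print(sdf)
```

This handles names with spaces in the same pass as single-word names, so the order of the two steps no longer matters. Sentences without a city get NaN rather than None.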
Something like this should work. I would try it myself, but I am on my phone.
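sentence_cities = []
# use a set: `in` on a pandas Series checks the index labels, not the values
cities = set(city_state.city)
for text in sentence.sentence:
    # one entry per sentence: the first word that is a known city, else None
    sentence_cities.append(next((word for word in text.split() if word in cities), None))
sentence['city'] = sentence_cities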
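A note on membership tests, since loops like the one above depend on them: the `in` operator on a pandas Series checks the index labels, not the stored values. The short sketch below shows the difference on a made-up three-city Series; the variable names are only for illustration.

```python
import pandas as pd

cities = pd.Series(["dothan", "chicago", "des moines"])

in_series = "dothan" in cities         # tests the index labels 0, 1, 2
in_index = 0 in cities                 # 0 is an index label
in_values = "dothan" in cities.values  # tests the underlying array
in_set = "dothan" in set(cities)       # plain Python set membership

print(in_series, in_index, in_values, in_set)
```

Converting the column to a set (or a list) before the loop gives the value-based test that was intended. Splitting on whitespace still cannot find names that contain a space, such as des moines.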