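I have been searching for a while but cannot find an answer to this small problem.
I have the following code, which is supposed to split the string after every three words: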
import pandas as pd
import numpy as np
df1 = {
'State':['Arizona AZ asdf hello abc','Georgia GG asdfg hello def','Newyork NY asdfg hello ghi','Indiana IN asdfg hello jkl','Florida FL ASDFG hello mno']}
df1 = pd.DataFrame(df1,columns=['State'])
df1
def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words
splitTextToTriplet(df1)
The current output is:
['0 [Arizona, AZ, asdf, hello, abc]\n1 [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2 [Newyork, NY, asdfg, hello, ghi]\n3 [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4 [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']
But what I actually expect is output in 5 rows and one column of the dataframe:
['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']
How can I change the regex so that it produces the expected output?
For efficiency, you can use a regex with str.extractall + groupby/agg:
(df1['State']
 .str.extractall(r'((?:\w+\b\s*){1,3})')[0]
 .groupby(level=0).agg(list)
)
Output:
0 [Arizona AZ asdf , hello abc]
1 [Georgia GG asdfg , hello def]
2 [Newyork NY asdfg , hello ghi]
3 [Indiana IN asdfg , hello jkl]
4 [Florida FL ASDFG , hello mno]
The regex:
(                 # start capturing
  (?:\w+\b\s*)    # a word, followed by optional whitespace
  {1,3}           # repeated one to three times (three at most)
)                 # end capturing
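Because the pattern ends each word with `\s*`, a captured group can carry a trailing space (visible as `asdf ,` in the output above). A minimal sketch of one way to trim it, assuming a plain `strip` is acceptable for your data; the names `PATTERN`, `sample`, and `states` are only illustrative:

```python
import re

import pandas as pd

# Same pattern as in the answer above.
PATTERN = r'((?:\w+\b\s*){1,3})'

# On a single string, re.findall returns the captured groups.
sample = 'Arizona AZ asdf hello abc'
raw_groups = re.findall(PATTERN, sample)      # first group keeps its trailing space
clean_groups = [g.strip() for g in raw_groups]
print(clean_groups)

# The same idea on a Series: strip each match before collecting into lists.
states = pd.Series(['Arizona AZ asdf hello abc', 'Georgia GG asdfg hello def'])
result = (states
          .str.extractall(PATTERN)[0]
          .str.strip()
          .groupby(level=0).agg(list))
print(result.tolist())
```

With the strip step, each list element matches the format asked for in the question.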
You can do:
def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words
df1.apply(lambda row: splitTextToTriplet(row), axis=1)
This outputs the following:
| | 0 |
|---|---|
| 0 | ['Arizona AZ asdf', 'hello abc'] |
| 1 | ['Georgia GG asdfg', 'hello def'] |
| 2 | ['Newyork NY asdfg', 'hello ghi'] |
| 3 | ['Indiana IN asdfg', 'hello jkl'] |
| 4 | ['Florida FL ASDFG', 'hello mno'] |
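For context, the code in the question fails because `df['State'].str.split()` returns a Series of word lists, so `text[i:i+n]` slices rows of the Series rather than words within one string. Working on one string at a time avoids that. The following is a small sketch of the same approach applied to the column directly and stored in a new column; the function name `split_every_n_words` and the column name `Triplets` are hypothetical, chosen only for illustration:

```python
import pandas as pd


def split_every_n_words(text, n=3):
    """Split one string into chunks of at most n words each."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


df = pd.DataFrame({'State': ['Arizona AZ asdf hello abc',
                             'Florida FL ASDFG hello mno']})

# Series.apply calls the function once per string, so no lambda or axis=1 is needed.
df['Triplets'] = df['State'].apply(split_every_n_words)
print(df['Triplets'].tolist())
```

Applying the function to the column instead of to each row keeps the helper independent of the dataframe layout.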