Python: Split a string every three words in a dataframe



I have been searching for a while, but I can't seem to find an answer to this small problem.

I have this code, which is supposed to split the string after every three words:

import pandas as pd
import numpy as np
df1 = {
    'State': ['Arizona AZ asdf hello abc', 'Georgia GG asdfg hello def', 'Newyork NY asdfg hello ghi', 'Indiana IN asdfg hello jkl', 'Florida FL ASDFG hello mno']}
df1 = pd.DataFrame(df1, columns=['State'])
df1
def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words
splitTextToTriplet(df1)

The current output is:

['0     [Arizona, AZ, asdf, hello, abc]\n1    [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2    [Newyork, NY, asdfg, hello, ghi]\n3    [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4    [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']

But what I actually expect is output in 5 rows and one column of the dataframe:

['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']

How can I change the regex so that it produces the expected output?

For efficiency, you can use a regex with str.extractall plus groupby/agg:

(df1['State']
 .str.extractall(r'((?:\w+\b\s*){1,3})')[0]
 .groupby(level=0).agg(list)
)

Output:

0     [Arizona AZ asdf , hello abc]
1    [Georgia GG asdfg , hello def]
2    [Newyork NY asdfg , hello ghi]
3    [Indiana IN asdfg , hello jkl]
4    [Florida FL ASDFG , hello mno]

Regex:

(             # start capturing
(?:\w+\b\s*)  # a word, followed by optional whitespace
{1,3}         # as many as possible, up to three
)             # end capturing
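
To see what the pattern captures on its own, here is a minimal sketch using only the standard `re` module on a single sample string from the question. The names `pattern` and `chunks` are just illustrative. Because `\s*` sits inside the group, each capture can carry trailing whitespace, which is why the pandas output above shows a space before the comma; the sketch strips it.

```python
import re

# Same pattern as in the answer: one to three words, each followed by optional whitespace
pattern = re.compile(r'((?:\w+\b\s*){1,3})')

sample = 'Arizona AZ asdf hello abc'

# findall returns the captured group for each non-overlapping match
chunks = pattern.findall(sample)

# Remove the trailing whitespace that \s* pulled into each capture
stripped = [chunk.strip() for chunk in chunks]
print(stripped)
```

On a pandas Series, the same cleanup could probably be done by calling `.str.strip()` on the extracted column before the groupby step.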

You can do:

def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words
df1.apply(lambda row: splitTextToTriplet(row), axis=1)

This outputs the following:

                                   0
0   ['Arizona AZ asdf', 'hello abc']
1  ['Georgia GG asdfg', 'hello def']
2  ['Newyork NY asdfg', 'hello ghi']
3  ['Indiana IN asdfg', 'hello jkl']
4  ['Florida FL ASDFG', 'hello mno']
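
The difference from the code in the question is that the slicing now happens on the words of one row, not on the Series of rows: `len(text)` in the question was the number of rows, so `text[i:i+n]` picked out groups of rows. A minimal sketch of the same chunking logic as a plain function on one string follows; the name `split_every_n_words` and its `n` parameter are hypothetical, chosen only for illustration.

```python
def split_every_n_words(text, n=3):
    """Split a string into chunks of at most n whitespace-separated words."""
    words = text.split()
    # Step through the word list n items at a time and rejoin each slice
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


result = split_every_n_words('Arizona AZ asdf hello abc')
print(result)
```

Assuming the `df1` from the question, something like `df1['State'].apply(split_every_n_words)` should apply it to each row without needing `axis=1` or a lambda.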
