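I'm trying to take a column, extract its numbers, and then split them into 3 columns, but I keep running into errors. This is what I'm trying:
dsc = df["Descricao"].str.findall(r"\d+")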
dsc
The Output:
0 []
1 [475, 2000, 3]
2 [65, 2000, 2]
3 [51, 2000, 3]
4 [320, 2000, 3]
...
2344 NaN
2345 [480, 2000, 1]
2346 [32, 2000, 6]
2347 [250, 2000, 1]
2348 NaN
Name: Descricao, Length: 2349, dtype: object
Then I try to split, and every time I get this error:
df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
df.head(5)
The Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15388/2481153233.py in <module>
----> 1 df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
2 df.head(5)
~\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3598 self._setitem_frame(key, value)
3599 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3600 self._setitem_array(key, value)
3601 elif isinstance(value, DataFrame):
3602 self._set_item_frame_value(key, value)
~\anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value)
3637 else:
3638 if isinstance(value, DataFrame):
-> 3639 check_key_length(self.columns, key, value)
3640 for k1, k2 in zip(key, value.columns):
3641 self[k1] = value[k2]
~\anaconda3\lib\site-packages\pandas\core\indexers.py in check_key_length(columns, key, value)
426 if columns.is_unique:
427 if len(value.columns) != len(key):
--> 428 raise ValueError("Columns must be same length as key")
429 else:
430 # Missing keys in columns are represented as -1
ValueError: Columns must be same length as key
I think this has to do with str.findall producing lists. Does anyone know how I can solve this? For reference, all my columns are of object dtype.
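To illustrate the suspicion above, here is a minimal sketch (the sample strings and the stand-in frame `one_col` are made up for illustration). It shows that each element returned by `str.findall` is already a Python list rather than a comma-separated string, and that assigning a frame whose column count differs from the number of keys raises the same kind of `ValueError` as in the traceback.

```python
import pandas as pd

# Hypothetical sample text; the real column contents are not shown in the question.
s = pd.Series(["475 x 2000 x 3", "65 x 2000 x 2"])
dsc = s.str.findall(r"\d+")

# Each element is already a list of strings, so there is nothing left to split on ','.
print(type(dsc.iloc[0]))  # <class 'list'>

df = pd.DataFrame({"Descricao": s})
# Stand-in for any right-hand side that does not have exactly three columns.
one_col = pd.DataFrame({0: [None, None]})

error_message = None
try:
    df[["Larg", "comp", "qtd"]] = one_col
except ValueError as exc:
    # Three keys on the left, one column on the right.
    error_message = str(exc)
print(error_message)
```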
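You can try this:
dsc = pd.DataFrame(df["Descricao"].str.findall(r"\d+").tolist(), columns=['Larg','comp', 'qtd'])
df = pd.concat([df, dsc], axis=1)
Note that this may not work if any row ever produces more than three columns, i.e. more than three numbers (given your attempt, I assume that is not the case).
This approach was taken from here.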
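Before relying on exactly three columns, it may help to check how many numbers each row actually yields. A minimal sketch, assuming made-up sample data and the illustrative names `matches`, `counts` and `unusual`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering empty, valid, too-long, missing and too-short rows.
df = pd.DataFrame({"Descricao": ["", "475 2000 3", "1 2 3 4", np.nan, "1 2"]})

matches = df["Descricao"].str.findall(r"\d+")
# str.len() also works on list elements; NaN inputs stay NaN, so count them as 0.
counts = matches.str.len().fillna(0).astype(int)

# Distribution of match counts, and the rows that do not have exactly three numbers.
print(counts.value_counts().sort_index())
unusual = df[counts != 3]
print(unusual)
```

If `unusual` is empty, every row has exactly three numbers and no special handling should be needed.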
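In the general case, some inputs may not be strings that parse into 3 numeric values.
Here is one way to do what the question asks while filling the new columns with NaN for any unusual rows. If different behavior is wanted for non-standard rows, the logic can be adjusted as needed.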
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Descricao' : ['', '475,2000,3', '65,2000,2', np.nan, 'abc,def,ghi', '1,2', '1']
})
print(f"\nInput dataframe:\n{df}")
df2 = df["Descricao"].str.findall(r"\d+").to_frame()
print(f"\nDataframe with lists of 3 where possible:\n{df2}")
# Keep lists of exactly 3 matches; replace anything else (NaN, short or empty lists) with 3 NaNs.
df2["Descricao"] = df2.apply(lambda x:
    x["Descricao"]
    if (len(x["Descricao"]) if isinstance(x["Descricao"], list) else 0) == 3 else
    [np.nan]*3,
    axis=1)
print(f"\nDataframe with lists include NaNs for incomplete data:\n{df2}")
# Expand each 3-element list into its own columns, then drop the helper column.
df2[['Larg','comp', 'qtd']] = pd.DataFrame(df2["Descricao"].tolist(), columns=['Larg','comp', 'qtd'])
df2 = df2.drop(['Descricao'], axis=1)
print(f"\nResult dataframe with NaNs for incomplete inputs:\n{df2}")
Sample output:
Input dataframe:
Descricao
0
1 475,2000,3
2 65,2000,2
3 NaN
4 abc,def,ghi
5 1,2
6 1
Dataframe with lists of 3 where possible:
Descricao
0 []
1 [475, 2000, 3]
2 [65, 2000, 2]
3 NaN
4 []
5 [1, 2]
6 [1]
Dataframe with lists include NaNs for incomplete data:
Descricao
0 [nan, nan, nan]
1 [475, 2000, 3]
2 [65, 2000, 2]
3 [nan, nan, nan]
4 [nan, nan, nan]
5 [nan, nan, nan]
6 [nan, nan, nan]
Result dataframe with NaNs for incomplete inputs:
Larg comp qtd
0 NaN NaN NaN
1 475 2000 3
2 65 2000 2
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Thanks, everyone! Following @constantstranger's solution, I took part of it and developed a new version. It was a good starting point. In the end, my solution was:
dsc = ndf['descricao'].str.findall(r'\d+')  # Keep only the numeric elements
# Create 3 lists for the elements
larg = []
comp = []
qtd = []
for lines in dsc:
    # Skip NaN entries and rows that do not have exactly 3 numbers.
    if not isinstance(lines, list) or len(lines) != 3:
        continue
    for item in enumerate(lines):
        if item[0] == 0:
            larg.append(item[1])
        if item[0] == 1:
            comp.append(item[1])
        if item[0] == 2:
            qtd.append(item[1])
# Then I checked the length of each list
print(len(larg), len(comp), len(qtd))
lis = [larg, comp, qtd]
df1 = pd.DataFrame(lis).transpose()
df1.columns = ['larg', 'comp', 'qtd']
df1
Output:
larg comp qtd
0 32 2000 6
1 46 1000 1
2 320 100 20
3 220 100 50
4 220 50 30
... ... ... ...
1404 50 2000 1
1405 52 200 2
1406 48 2000 1
1407 325 3000 1
1408 33 2000 2
1409 rows × 3 columns
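I suppose this isn't an ideal solution for large data, but it works for now. I tried the .findall expression with to_frame(), but for some reason every length came out as zero. So now I'll look for a more optimized approach.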
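As one possible direction for that optimization, here is a sketch that avoids the explicit Python loop. It has not been benchmarked, and the sample data and the names `nums`, `is_triple`, `triples`, `parts` and `result` are made up for illustration. It filters on the list length, builds the three columns in one step, and keeps the original row index so the result can be joined back onto the source frame.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real 'descricao' contents are not shown in the thread.
ndf = pd.DataFrame({"descricao": ["", "475 x 2000 - 3", "65 x 2000 - 2", np.nan, "1 x 2"]})

nums = ndf["descricao"].str.findall(r"\d+")
# str.len() works on lists; NaN rows compare as False and are dropped.
is_triple = nums.str.len() == 3
triples = nums[is_triple]

# Build the three columns at once, preserving the original index of the valid rows.
parts = pd.DataFrame(triples.tolist(), index=triples.index, columns=["larg", "comp", "qtd"])
parts = parts.astype(int)  # the regex matches are strings, so convert them to integers

# Left join on the index: rows without exactly three numbers get NaN in the new columns.
result = ndf.join(parts)
print(result)
```

Unlike the loop version, the rows of `parts` stay aligned with `ndf`, so it remains clear which input row each triple came from.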