How to isolate the numbers from a column and create 3 new columns



I'm trying to access a column, filter out its numbers, and split them into 3 columns, but I keep running into errors. This is what I'm trying:

dsc = df["Descricao"].str.findall(r"\d+")
dsc
The Output:
0                   []
1       [475, 2000, 3]
2        [65, 2000, 2]
3        [51, 2000, 3]
4       [320, 2000, 3]
...      
2344               NaN
2345    [480, 2000, 1]
2346     [32, 2000, 6]
2347    [250, 2000, 1]
2348               NaN
Name: Descricao, Length: 2349, dtype: object

Then I try to split, and every time I get this error:

df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
df.head(5)
The Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15388/2481153233.py in <module>
----> 1 df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
2 df.head(5)
~\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3598             self._setitem_frame(key, value)
3599         elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3600             self._setitem_array(key, value)
3601         elif isinstance(value, DataFrame):
3602             self._set_item_frame_value(key, value)
~\anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value)
3637         else:
3638             if isinstance(value, DataFrame):
-> 3639                 check_key_length(self.columns, key, value)
3640                 for k1, k2 in zip(key, value.columns):
3641                     self[k1] = value[k2]
~\anaconda3\lib\site-packages\pandas\core\indexers.py in check_key_length(columns, key, value)
426     if columns.is_unique:
427         if len(value.columns) != len(key):
--> 428             raise ValueError("Columns must be same length as key")
429     else:
430         # Missing keys in columns are represented as -1
ValueError: Columns must be same length as key

I think this has to do with str.findall producing lists. Does anyone know how I can solve it? For reference, all of my columns have dtype object.
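A quick way to check that hypothesis is to look at what each element of the findall result actually is. The sketch below uses made-up sample strings (the real contents of "Descricao" are not shown in the post); str.split is designed to split string elements, whereas here every non-missing element is already a Python list.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real "Descricao" column.
sample = pd.Series(["475 x 2000 - 3 un", np.nan, "no digits"], name="Descricao")

found = sample.str.findall(r"\d+")

# Each non-missing element is a list of digit strings, not a single string,
# so there is no comma-separated text left for str.split to work on.
is_list = found.map(lambda v: isinstance(v, list))

print(found.tolist())
print(is_list.tolist())
```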

You can try this:

dsc = pd.DataFrame(df["Descricao"].str.findall(r"\d+").tolist(), columns=['Larg','comp', 'qtd'])
df = pd.concat([df, dsc], axis=1)

Note that this may not work if any row ever yields more than three values (given your attempt, I don't think that will be the case).

This approach comes from here.
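If some rows are missing or yield a different number of matches, one possible variation is to pad those rows before building the frame and to reuse the original index so that concat lines the rows up. This is only a sketch: the sample strings and the custom index are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real frame and its values are not shown in the post.
df = pd.DataFrame(
    {"Descricao": ["475 x 2000 - 3", np.nan, "12 x 34"]},
    index=[10, 11, 12],
)

found = df["Descricao"].str.findall(r"\d+")

# Keep only rows that produced exactly three numbers; pad everything else with NaN.
rows = [v if isinstance(v, list) and len(v) == 3 else [np.nan] * 3 for v in found]

# Reusing df.index keeps the new columns aligned with the original rows.
parts = pd.DataFrame(rows, columns=["Larg", "comp", "qtd"], index=df.index)
result = pd.concat([df, parts], axis=1)
print(result)
```

The values in the new columns are still digit strings at this point; they can be converted afterwards if numeric types are needed.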

In the general case, some inputs may not be strings that parse into 3 numeric values.

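Here is a way to do what the question asks while filling the new columns with NaN for any unusual rows. If a different behavior is needed for non-standard rows, the logic can be adjusted accordingly.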

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Descricao': ['', '475,2000,3', '65,2000,2', np.nan, 'abc,def,ghi', '1,2', '1']
})
print(f"\nInput dataframe:\n{df}")

# Extract every run of digits; rows with NaN input stay NaN
df2 = df["Descricao"].str.findall(r"\d+").to_frame()
print(f"\nDataframe with lists of 3 where possible:\n{df2}")

# Keep lists of exactly 3 items; replace everything else with 3 NaNs
df2["Descricao"] = df2.apply(lambda x:
    x["Descricao"]
    if (len(x["Descricao"]) if isinstance(x["Descricao"], list) else 0) == 3 else
    [np.nan] * 3,
    axis=1)
print(f"\nDataframe with lists include NaNs for incomplete data:\n{df2}")

# Expand the 3-item lists into three columns
df2[['Larg', 'comp', 'qtd']] = pd.DataFrame(df2["Descricao"].tolist(), columns=['Larg', 'comp', 'qtd'])
df2 = df2.drop(['Descricao'], axis=1)
print(f"\nResult dataframe with NaNs for incomplete inputs:\n{df2}")

Sample output:


Input dataframe:
Descricao
0
1   475,2000,3
2    65,2000,2
3          NaN
4  abc,def,ghi
5          1,2
6            1
Dataframe with lists of 3 where possible:
Descricao
0              []
1  [475, 2000, 3]
2   [65, 2000, 2]
3             NaN
4              []
5          [1, 2]
6             [1]
Dataframe with lists include NaNs for incomplete data:
Descricao
0  [nan, nan, nan]
1   [475, 2000, 3]
2    [65, 2000, 2]
3  [nan, nan, nan]
4  [nan, nan, nan]
5  [nan, nan, nan]
6  [nan, nan, nan]
Result dataframe with NaNs for incomplete inputs:
Larg  comp  qtd
0  NaN   NaN  NaN
1  475  2000    3
2   65  2000    2
3  NaN   NaN  NaN
4  NaN   NaN  NaN
5  NaN   NaN  NaN
6  NaN   NaN  NaN

Thanks, everyone! I followed @constantstranger's solution, took part of it, and developed a new version. It gave me an easy start. In the end, my solution was:

dsc = ndf['descricao'].str.findall(r'\d+')        # Separated only the numerical elements
# Created 3 lists for the elements
larg = []
comp = []
qtd = []
# Assumes dsc contains only lists (a NaN entry would not be iterable)
for lines in dsc:
    for item in enumerate(lines):
        if len(lines) != 3:       # If the length of the elements is not 3, does nothing.
            continue
        if item[0] == 0:
            larg.append(item[1])
        if item[0] == 1:
            comp.append(item[1])
        if item[0] == 2:
            qtd.append(item[1])
# Then I checked the length of all of them
print(len(larg), len(comp), len(qtd))
lis = [larg, comp, qtd]
df1 = pd.DataFrame(lis).transpose()
df1.columns = ['larg', 'comp', 'qtd']
df1

Output:

      larg  comp  qtd
0       32  2000    6
1       46  1000    1
2      320   100   20
3      220   100   50
4      220    50   30
...    ...   ...  ...
1404    50  2000    1
1405    52   200    2
1406    48  2000    1
1407   325  3000    1
1408    33  2000    2
1409 rows × 3 columns

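I guess this isn't an ideal solution for big data, but it's working for now. I tried the .findall expression with to_frame(), but for some reason every length came out as zero. So now I'll look for an optimized approach.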
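One candidate for such an optimized approach, swapping the explicit loop for str.extract with three capture groups, might look like the sketch below. It is not taken from the answers above. The sample strings are invented, and the pattern assumes that a valid row contains exactly three runs of digits separated by non-digit characters; rows that do not fit get NaN and are then dropped, which mirrors what the loop does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "ndf" and "descricao" follow the names used in the post.
ndf = pd.DataFrame(
    {"descricao": ["32 x 2000 - 6", "sem medidas", np.nan, "46 x 1000 - 1", "1 x 2"]}
)

# Exactly three runs of digits separated by non-digits; anything else does not match.
pattern = r"^\D*(\d+)\D+(\d+)\D+(\d+)\D*$"

# str.extract returns one column per capture group and NaN where the pattern fails.
parts = ndf["descricao"].str.extract(pattern)
parts.columns = ["larg", "comp", "qtd"]

# Drop rows without three numbers, then convert the digit strings to numbers.
df1 = parts.dropna().apply(pd.to_numeric)
print(df1)
```

Unlike the loop, this keeps the original row labels, so the result can be joined back onto ndf; calling reset_index(drop=True) would give the 0..n-1 numbering shown in the output above.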
