How can I use lambda functions to clean Excel data in Python?

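I'm trying to clean Excel data using Python.

The requirements are as follows: there are 100 Excel files used to manage Amazon inventory, and I need to filter them to get the Amazon Prime products that cost less than or equal to $20 and are for males.

All the Excel files look like this: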

product cost gender prime?
book    20   male   yes
pencil  10   female no
short   15   male   yes
...

Also, I want the prime? column removed from the result.

So the result should look like this:

product cost gender
book    20   male 
short   15   male  
...

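I use the code below to import the Excel files:

import os
import pandas as pd

cwd = os.path.abspath('')
files = os.listdir(cwd)

# Read every Excel file in the working directory, then combine them once.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
frames = [pd.read_excel(file) for file in files
          if file.lower().endswith('.xlsx')]
df = pd.concat(frames, ignore_index=True)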

I did the cleaning by converting the result to a list of lists:

array = df.to_numpy().tolist()

The resulting array looks like this:

array = [['comic', 15, 'male', 'yes'], 
['paint', 14, 'male', 'no'], 
['pen', 5, 'female', 'yes'], 
['phone case', 9, 'male', 'yes'], 
['headphone', 40, 'male', 'yes'], 
['book', 20, 'male', 'yes']]

I process it with code like this:

for line in array:
    for element in line:
        pass  # add action here

to get the following result:

array = [['comic', 15, 'male'],
['phone case', 9, 'male'], 
['book', 20, 'male']]

I get the result I want, and then export it to a clean Excel file:

result = pd.DataFrame(array)
result.to_excel('clean_data.xlsx')

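However, I'd like the code to use apply and a lambda function to reduce the number of lines, and I'm not sure whether the array approach is appropriate.

I know lambda is just a coding style, but this assignment also requires using a lambda function.

Can a few lines of lambda code cover all the requirements?

Can anyone show Python code that does this? Thanks in advance.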

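You can combine filter with lambda to filter out the sublists that don't meet the criteria, then use a list comprehension to keep everything except the last item of each sublist, which drops the yes/no value: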

>>> [x[:-1] for x in filter(lambda x: x[1]<=20 and x[-1]=='yes', array)]
[['comic', 20, 'male'], ['pen', 5, 'male'], ['book', 15, 'male'], ['pencil ', 10, 'female']]

You can also combine map and filter with lambda:

>>> list(map(lambda x: x[:-1], filter(lambda x: x[1]<=20 and x[-1]=='yes', array)))
[['comic', 20, 'male'], ['pen', 5, 'male'], ['book', 15, 'male'], ['pencil ', 10, 'female']]
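
The two snippets above test only the cost and the prime? flag. As a minimal sketch, assuming the rows have the layout [product, cost, gender, prime?] and using the sample array from the question, the gender requirement could be added to the same filter/map pattern like this (the names rows and cleaned are just illustrative):

```python
# Sample rows copied from the question: [product, cost, gender, prime?]
rows = [
    ['comic', 15, 'male', 'yes'],
    ['paint', 14, 'male', 'no'],
    ['pen', 5, 'female', 'yes'],
    ['phone case', 9, 'male', 'yes'],
    ['headphone', 40, 'male', 'yes'],
    ['book', 20, 'male', 'yes'],
]

# filter keeps rows meeting all three requirements;
# map then drops the trailing prime? value from each kept row.
cleaned = list(map(
    lambda r: r[:-1],
    filter(lambda r: r[1] <= 20 and r[2] == 'male' and r[3] == 'yes', rows),
))
print(cleaned)
# [['comic', 15, 'male'], ['phone case', 9, 'male'], ['book', 20, 'male']]
```

This matches the result the question asks for with that sample data.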

Update:

Since the updated question now has a DataFrame, you can do the filtering on the DataFrame and then convert the result to a list:

>>> df[(df['cost'].le(20))&df['prime?'].eq('yes')].iloc[:,:-1].values.tolist()
[['comic', 20, 'male'], ['pen', 5, 'male'], ['book', 15, 'male'], ['pencil ', 10, 'female']]

There's no reason to use lambda for this. A list comprehension is the ordinary "Pythonic" way to filter a list.

array = [['comic', 20, 'male', 'yes'], 
['paint', 14, 'male', 'no'], 
['pen', 5, 'male', 'yes'], 
['phone case', 9, 'female', 'no'], 
['headphone', 40, 'female', 'yes'], 
['book', 15, 'male', 'yes'], 
['pencil ', 10, 'female', 'yes'],  
['shirt', 25, 'male', 'no']]
result = [entry[:3] for entry in array if entry[1] <= 20 
and entry[2] == "male" and entry[3] == "yes"]
print(result)

Don't use lambda + .apply; that's a last resort. You should use pandas the way it was designed, with vectorized operations. Given:

In [10]: df
Out[10]:
  product  cost  gender prime?
0    book    20    male    yes
1  pencil    10  female     no
2   short    15    male    yes

Then write it like this:

In [11]: df[(df['prime?'] == 'yes') & (df['gender'] =='male') & (df['cost'] <= 20)].drop('prime?', axis=1)
Out[11]:
  product  cost gender
0    book    20   male
2   short    15   male
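
For the 100-file case, the same vectorized filter could be wrapped in a small function and applied to the combined DataFrame. This is only a sketch: clean_inventory is a hypothetical name, the column names are assumed to match the sample above, and an in-memory DataFrame stands in for the data read from Excel.

```python
import pandas as pd


def clean_inventory(df):
    """Keep Prime items for males costing at most 20, then drop 'prime?'."""
    mask = (
        (df['prime?'] == 'yes')
        & (df['gender'] == 'male')
        & (df['cost'] <= 20)
    )
    # Boolean indexing keeps matching rows; drop removes the prime? column.
    return df[mask].drop(columns='prime?').reset_index(drop=True)


# Stand-in for the DataFrame built from the Excel files.
sample = pd.DataFrame({
    'product': ['book', 'pencil', 'short'],
    'cost': [20, 10, 15],
    'gender': ['male', 'female', 'male'],
    'prime?': ['yes', 'no', 'yes'],
})

result = clean_inventory(sample)
print(result)
```

The result can then be exported as in the question, for example with result.to_excel('clean_data.xlsx', index=False) if the index column is not wanted in the output file.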

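One important thing to understand: if you do use .apply, using a lambda expression is purely a matter of style. A lambda expression creates a function object of exactly the same kind as one created by a function definition statement.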
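
If the assignment really does require apply with a lambda, a row-wise version might look like the sketch below. It assumes the same column names as the sample DataFrame; is_wanted is a hypothetical helper included only to show that the def form and the lambda form behave identically. Expect it to be slower than the vectorized version on large data.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['book', 'pencil', 'short'],
    'cost': [20, 10, 15],
    'gender': ['male', 'female', 'male'],
    'prime?': ['yes', 'no', 'yes'],
})


def is_wanted(row):
    # Named-function form of the row test.
    return row['prime?'] == 'yes' and row['gender'] == 'male' and row['cost'] <= 20


# axis=1 passes each row to the function as a Series.
mask_def = df.apply(is_wanted, axis=1)
mask_lambda = df.apply(
    lambda row: row['prime?'] == 'yes'
    and row['gender'] == 'male'
    and row['cost'] <= 20,
    axis=1,
)

# Both masks select the same rows; drop the prime? column afterwards.
result = df[mask_lambda].drop(columns='prime?')
print(result)
```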

An option without the unnecessary lambda: [x[:-1] for x in array if x[1]<=20 and x[-1]=='yes']
