匹配熊猫excel中的索引功能



Excel中有一个匹配索引函数,如果元素存在于必需列中,我用它来匹配

=iferror(INDEX($B$2:$F$8,MATCH($J4,$B$2:$B$8,0),MATCH(K$3,$B$1:$F$1,0)),0)

这是我现在正在使用的功能,它给我带来了很好的结果,但我想用 python 实现它。

brand   N   Z   None
Honor   63  96  190     
Tecno   0   695 763     

从这张桌子我想要

brand L   N   Z
Honor 0   63  96
Tecno 0   0   695

它应该比较列和索引并给出适当的值

我已经尝试过 Python 中的查找功能,但这给了我

ValueError: Row labels must have same size as column labels

你基本上用你的excel公式做的是创建类似数据透视表的东西,你也可以用熊猫来做。 例如:

# Define the columns and brands, you like to have in your result table
# along with the dataframe in variable df it's the only input
columns_query=['L', 'N', 'Z']
brands_query=['Honor', 'Tecno', 'Bar']
# no begin processing by selecting the columns
# which should be shown and are actually present
# add the brand, even if it was not selected
columns_present= {col for col in set(columns_query) if col in df.columns}
columns_present.add('brand')
# select the brands in question and take the
# info in columns we identified for these brands
# from this generate a "flat" list-like data
# structure using melt
# it contains records containing
# (brand, column-name and cell-value)
flat= df.loc[df['brand'].isin(brands_query), columns_present].melt(id_vars='brand')
# if you also want to see the columns and brands,
# for which you have no data in your original df
# you can use the following lines (if you don't
# need them, just skip the following lines until
# the next comment)
# the code just generates data points for the
# columns and rows, which would otherwise not be
# displayed and fills them wit NaN (the pandas 
# equivalent for None)
columns_missing= set(columns_query).difference(columns_present)
brands_missing=  set(brands_query).difference(df['brand'].unique())
num_dummies= max(len(brands_missing), len(columns_missing))
dummy_records= {
'brand': list(brands_missing) +     [brands_query[0]]  * (num_dummies - len(brands_missing)),
'variable': list(columns_missing) + [columns_query[0]] * (num_dummies - len(columns_missing)),
'value': [np.NaN] * num_dummies
}
dummy_records= pd.DataFrame(dummy_records)
flat= pd.concat([flat, dummy_records], axis='index', ignore_index=True)
# we get the result by the following line:
flat.set_index(['brand', 'variable']).unstack(level=-1)

对于我的测试数据,这输出:

value             
variable     L     N      Z
brand                      
Bar        NaN   NaN    NaN
Honor      NaN  63.0   96.0
Tecno      NaN   0.0  695.0

测试数据是(注意,上面我们没有看到 col None 和 row Foo,但我们看到了行 Bar 和列 L,它们实际上并不存在于测试数据中,而是被"查询"(:

brand   N    Z  None
0  Honor  63   96   190
1  Tecno   0  695   763
2    Foo   8  111   231

您可以使用以下方法生成此测试数据:

import pandas as pd
import numpy as np
import io
raw=
"""brand   N   Z   None
Honor   63  96  190     
Tecno   0   695 763
Foo     8   111 231"""
df= pd.read_csv(io.StringIO(raw), sep='s+')

注意:输出中显示的结果是常规熊猫数据帧。因此,如果您打算将数据写回 excel 工作表,应该没有问题(pandas 提供了将数据帧读取/写入 excel 文件的方法(。

你需要使用熊猫来执行此操作吗?你也可以用简单的python来做到这一点。从一个文本文件中读取并打印出匹配和已处理的字段。

Python 中的基本文件读取是这样的。其中数据文件.csv是你的文件。这将读取一个文件中的所有行并打印出正确的结果。首先,您需要以.csv格式保存文件,以便在字段","之间有一个分隔符。

import csv # use csv
print('brand L N Z') # print new header
with open('datafile.csv', newline='') as csvfile:
spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
next(spamreader, None) # skip old header
for row in spamreader:
# You need to add Excel Match etc... logic here.
print(row[0], 0, row[1], row[2]) # print output            

输入文件:

brand,N,Z,None
Honor,63,96,190
Tecno,0,695,763

打印输出:

brand L N Z
Honor 0 63 96
Tecno 0 0 695

(我不熟悉 Excel Match 函数,因此您可能需要向上面的 Python 脚本添加一些逻辑才能使逻辑处理您的所有数据。

最新更新