我有12个文件:
files = ['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']
我的CSV文件结构:
id itemName NonImportantEntries Entries SomeOtherEntries
1 item1 27 111 163
2 item2 16 22 98
...
5000
我试图计算最近一个月的文件(在这种情况下是10_2021(值的减少/增加%;条目";上一个月的值;条目";每个唯一的id。此外,请注意,不能保证两个文件中都始终存在唯一的标识。
10_2 021.csv:
id itemName NonImportantEntries Entries SomeOtherEntries
1 item1 27 111 163
2 item2 16 22 98
...
5000
09_2021.csv:
id itemName NonImportantEntries Entries SomeOtherEntries
1 item1 27 97 163
2 item2 16 57 98
...
5000
例如id=1:
111(10_2021.csv) - 97 (09_2021.csv) = 14
14 / 97 (09_2021.csv) = 0.1443 * 100 = 14.43
例如id=2:
22(10_2021.csv) - 57 (09_2021.csv) = -35
-35 / 57 (09_2021.csv) = -0.6140 * 100 = -61.40
所需输出为:
id %differenceLatestMonthToPreviousMonth
1 14.43%
2 -61.40%
到目前为止我的代码:
import pandas as pd
from os import listdir
from os.path import isfile, join
#readMyDirectoryForFiles
mypath= <myDirectoryPath>
list_of_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
#GenerateAllFilesInDirList
list_of_files = [mypath+x for x in list_of_files]
#sortListToEnsureLatestTwoMonthsAreOnTop
list_of_files.sort()
#ConsiderOnlyLatestTwoMonths
filesNeeded = list_of_files[:2]
#I'm stuck here: map file names to each unique ID and calculate like on examples provided above for id 1 and id 2.
dataframes = [pd.read_csv(fi) for fi in filesNeeded]
有人能帮忙吗?提前谢谢。
听起来您需要在id
列上pd.DataFrame.join
您的两个DataFrame
,然后计算%差异:
this_month = dataframes[0]
last_month = dataframes[1]
combined = this_month.join(last_month.set_index('id'),
on='id',
lsuffix='_this_month',
rsuffix='_last_month',
)
combined['pct_diff_between_months'] =
((combined['Entries_this_month']/combined['Entries_last_month'] - 1)*100)