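I'm working on a MapReduce project. My input is (day, station, temperature), and my goal is to output the highest and lowest temperature for each station on each day. So basically, for this input, I would have an output that looks like this:
Input: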
20200101, station1, 35
20200101, station1, 44
20200101, station1, 77
20200101, station3, 66
20200101, station3, 99
20200102, station1, 54
20200102, station2, 55
Output:
20200101, station1, max(77) min(35)
20200101, station3, max(99) min(66)
20200102, station1, max(54) min(..)
20200102, station2, max(55) min(..)
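What I have tried so far works only with 2 lists, not with 3: for each day, find every weather station, and for each weather station find every temperature...
Here is the code I have tried so far: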
# Read the txt file in
file1 = open('bigdatatemp.txt', 'r')
Lines = file1.readlines()
Output of Lines (the important variables are Wban Number = station, YearMonthDay = day, Dry Bulb Temp = temperature):
['Wban Number, YearMonthDay, Time, Station Type, Maintenance Indicator, Sky Conditions, Visibility, Weather Type, Dry Bulb Temp, Dew Point Temp, Wet Bulb Temp, % Relative Humidity, Wind Speed (kt), Wind Direction, Wind Char. Gusts (kt), Val for Wind Char., Station Pressure, Pressure Tendency, Sea Level Pressure, Record Type, Precip. Total\n',
'03011,20070401,0050,AO2 ,-,SCT055 ,10SM ,-,32,23,28,69 , 4 ,130,-,0 ,30.13,-,-,AA,-\n',
'03011,20070401,0150,AO2 ,-,BKN055 ,10SM ,-,32,23,28,69 , 4 ,140,-,0 ,30.12,-,-,AA,-\n',
'03011,20070401,0250,AO2 ,-,OVC050 ,10SM ,-,32,23,28,69 , 3 ,130,-,0 ,30.12,-,-,AA,-\n',
'03011,20070401,0350,AO2 ,-,OVC050 ,10SM ,-,34,23,30,64 , 3 ,120,-,0 ,30.12,-,-,AA,-\n',
'03011,20070401,0450,AO2 ,-,BKN050 ,10SM ,-,34,23,30,64 , 4 ,130,-,0 ,30.11,-,-,AA,-\n',
'03011,20070401,0550,AO2 ,-,SCT050 SCT070 ,10SM ,-,32,25,28,75 , 3 ,150,-,0 ,30.10,-,-,AA,-\n',
'03011,20070401,0650,AO2 ,-,SCT070 ,10SM ,-,34,25,30,70 , 3 ,130,-,0 ,30.12,-,-,AA,-\n',
'03012,20070401,0750,AO2 ,-,CLR ,10SM ,-,37,27,34,67 , 4 ,140,-,0 ,30.12,-,-,AA,-\n',
'03011,20070401,0850,AO2 ,-,SCT060 BKN075 ,10SM ,-,41,27,36,58 , 0 ,000,-,0 ,30.13,-,-,AA,-\n',
'03011,20070401,0950,AO2 ,-,SCT060 OVC075 ,10SM ,-,45,23,37,42 , 0 ,000,-,0 ,30.14,-,-,AA,-\n',
Then I created a dictionary and built 3 lists holding the variables I need (station, year, temperature):
# Create a dictionary
# Iterate each line
# If the key doesn't exist, create one equal to empty list
# Otherwise, append temperature to list
# This also uses an interim dictionary (tmp).
years = []
stations = []
temps = []
for line in Lines:
    (station, year, ac, ad, af, ag, ah, aj, temp, al, ae, ar, at, ay, au, ai, alc, ap, ax, av, an) = line.split(',')
    stations.append(station)
    years.append(year)
    temps.append(temp)
Last but not least, here is where I'm stuck. I created a loop over two of the lists and iterated over them:
dayTemps = {d:[] for d in stations}
for d,t in zip(stations,temps): dayTemps[d].append(t)
print(dayTemps)
output:
{'Wban Number': [' Dry Bulb Temp'], '03011': ['32', '32', '32', '34', '34', '32', '34', '41', '45', '55', '54', '54', '52', '46', '43', '43', '43'], '03012': ['37', '46', '54', '46', '45', '43'], '03013': ['50', '52', '50', '46', '45'], '03014': ['45']}
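But I actually need the day variable as well, and I can't work out how to include it. Should it be a dictionary with the day as key and my dictionary above as value? Also, how should I structure it so that I get the highest and lowest temperature for each weather station: is that one step, or two or more steps?
Something more or less like the following: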
data = {}
MIN = 0
MAX = 1
DATE = 0
STATION = 1
VALUE = 2
with open('in.txt') as f:
    lines = [line.strip() for line in f.readlines()]
    for line in lines:
        fields = [f.strip() for f in line.split(',')]
        if data.get(fields[DATE]) is None:
            data[fields[DATE]] = {}
        if fields[STATION] not in data[fields[DATE]]:
            data[fields[DATE]][fields[STATION]] = [None, None]
        if data[fields[DATE]][fields[STATION]][MIN] is None:
            data[fields[DATE]][fields[STATION]][MIN] = (int(fields[VALUE]))
        else:
            if data[fields[DATE]][fields[STATION]][MIN] > int(fields[VALUE]):
                data[fields[DATE]][fields[STATION]][MIN] = (int(fields[VALUE]))
        if data[fields[DATE]][fields[STATION]][MAX] is None:
            data[fields[DATE]][fields[STATION]][MAX] = (int(fields[VALUE]))
        else:
            if data[fields[DATE]][fields[STATION]][MAX] < int(fields[VALUE]):
                data[fields[DATE]][fields[STATION]][MAX] = (int(fields[VALUE]))
for date, stations in data.items():
    for station, values in stations.items():
        print(f'{date} {station} {values}')
in.txt
20200101, station1, 35
20200101, station1, 44
20200101, station1, 77
20200101, station3, 66
20200101, station3, 99
20200102, station1, 54
20200102, station2, 55
Output
20200101 station1 [35, 77]
20200101 station3 [66, 99]
20200102 station1 [54, 54]
20200102 station2 [55, 55]
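The nested lookups above could probably be shortened by keying a single dictionary on a (date, station) tuple. The following is only a minimal sketch under that assumption: the function name `min_max_by_day_station` and the inline `sample` list are hypothetical, and the rows are assumed to follow the same three-column layout as in.txt.

```python
def min_max_by_day_station(lines):
    """Return {(date, station): [min, max]} for 'date, station, value' rows."""
    result = {}
    for line in lines:
        fields = [part.strip() for part in line.split(',')]
        if len(fields) < 3 or not fields[2]:
            continue  # skip blank or incomplete rows
        key = (fields[0], fields[1])
        value = int(fields[2])
        if key not in result:
            # First reading for this day and station is both min and max
            result[key] = [value, value]
        else:
            result[key][0] = min(result[key][0], value)
            result[key][1] = max(result[key][1], value)
    return result


# Same rows as in.txt, inlined here so the sketch runs without a file
sample = [
    '20200101, station1, 35',
    '20200101, station1, 44',
    '20200101, station1, 77',
    '20200101, station3, 66',
    '20200101, station3, 99',
    '20200102, station1, 54',
    '20200102, station2, 55',
]

for (date, station), (low, high) in min_max_by_day_station(sample).items():
    print(f'{date} {station} [{low}, {high}]')
```

A tuple key avoids the two-level dictionary, at the cost of having to regroup the keys if a per-day view is needed later.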
lines = ['Wban Number, YearMonthDay, Time, Station Type, Maintenance Indicator, Sky Conditions, Visibility, Weather Type, Dry Bulb Temp, Dew Point Temp, Wet Bulb Temp, % Relative Humidity, Wind Speed (kt), Wind Direction, Wind Char. Gusts (kt), Val for Wind Char., Station Pressure, Pressure Tendency, Sea Level Pressure, Record Type, Precip. Total\n',
    '03011,20070401,0050,AO2 ,-,SCT055 ,10SM ,-,32,23,28,69 , 4 ,130,-,0 ,30.13,-,-,AA,-\n',
    '03011,20070401,0150,AO2 ,-,BKN055 ,10SM ,-,32,23,28,69 , 4 ,140,-,0 ,30.12,-,-,AA,-\n',
    '03011,20070401,0250,AO2 ,-,OVC050 ,10SM ,-,32,23,28,69 , 3 ,130,-,0 ,30.12,-,-,AA,-\n',
    '03011,20070401,0350,AO2 ,-,OVC050 ,10SM ,-,34,23,30,64 , 3 ,120,-,0 ,30.12,-,-,AA,-\n',
    '03011,20070401,0450,AO2 ,-,BKN050 ,10SM ,-,34,23,30,64 , 4 ,130,-,0 ,30.11,-,-,AA,-\n',
    '03011,20070401,0550,AO2 ,-,SCT050 SCT070 ,10SM ,-,32,25,28,75 , 3 ,150,-,0 ,30.10,-,-,AA,-\n',
    '03011,20070401,0650,AO2 ,-,SCT070 ,10SM ,-,34,25,30,70 , 3 ,130,-,0 ,30.12,-,-,AA,-\n',
    '03012,20070401,0750,AO2 ,-,CLR ,10SM ,-,37,27,34,67 , 4 ,140,-,0 ,30.12,-,-,AA,-\n',
    '03011,20070401,0850,AO2 ,-,SCT060 BKN075 ,10SM ,-,41,27,36,58 , 0 ,000,-,0 ,30.13,-,-,AA,-\n',
    '03011,20070401,0950,AO2 ,-,SCT060 OVC075 ,10SM ,-,45,23,37,42 , 0 ,000,-,0 ,30.14,-,-,AA,-\n',]
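# Keep station, day and dry-bulb temperature (columns 0, 1 and 8); skip the header row
lst = [i.split(',')[0:2] + [i.split(',')[8]] for i in lines[1:]]
station = set(i[0] for i in lst)
# Compare temperatures as numbers; comparing the raw sub-lists would compare strings
temp_of = lambda row: int(row[2])
data = [(max((l for l in lst if l[0] == s), key=temp_of), min((l for l in lst if l[0] == s), key=temp_of)) for s in station]
for collected_data in data:
    print(collected_data[0][1], collected_data[0][0], ' max(', collected_data[0][2], ')', ' min(', collected_data[1][2], ')')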
>>> 20070401 03012 max( 37 ) min( 37 )
20070401 03011 max( 45 ) min( 32 )
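Create a list of sub-lists, each holding station, day and temperature.
Then create another collection holding the distinct station numbers.
Then iterate over the sub-lists of each station to get the max and min.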
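Since the question asks for one result per station per day, the steps above could be extended to group on the pair (day, station) instead of on the station alone. Below is a rough sketch of that idea, not a definitive implementation. It assumes the column positions seen in the sample (0 = Wban Number, 1 = YearMonthDay, 8 = Dry Bulb Temp), skips rows whose temperature is not an integer, and uses a hypothetical function name `daily_extremes`. The header row in `rows` is abbreviated because it is skipped anyway.

```python
from collections import defaultdict


def daily_extremes(rows):
    """Return {(day, station): (max_temp, min_temp)} from raw comma-separated rows."""
    groups = defaultdict(list)
    for row in rows[1:]:  # assumption: the first row is the header
        cols = row.split(',')
        if len(cols) < 9:
            continue  # row too short to contain the temperature column
        try:
            temp = int(cols[8].strip())
        except ValueError:
            continue  # assumption: non-numeric readings are ignored
        groups[(cols[1].strip(), cols[0].strip())].append(temp)
    return {key: (max(temps), min(temps)) for key, temps in groups.items()}


# Three data rows taken from the sample above, plus an abbreviated header
rows = [
    'Wban Number, YearMonthDay, Time',
    '03011,20070401,0050,AO2 ,-,SCT055 ,10SM ,-,32,23,28,69 , 4 ,130,-,0 ,30.13,-,-,AA,-',
    '03012,20070401,0750,AO2 ,-,CLR ,10SM ,-,37,27,34,67 , 4 ,140,-,0 ,30.12,-,-,AA,-',
    '03011,20070401,0950,AO2 ,-,SCT060 OVC075 ,10SM ,-,45,23,37,42 , 0 ,000,-,0 ,30.14,-,-,AA,-',
]

extremes = daily_extremes(rows)
for (day, station), (high, low) in sorted(extremes.items()):
    print(f'{day}, {station}, max({high}) min({low})')
```

Collecting the readings in one pass and reducing each group afterwards mirrors the map and reduce phases, with (day, station) acting as the key.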