我正在尝试合并3个任务数据框架并相应地查询它们。数据框架是从forma1数据的SQL表的.csv文件创建的。表
laptimes: raceId, driverId, lap, position, time
races: raceId, year, round
drivers: driverId, forename, surname
下面的laptimes
表与races
和drivers
表有外键关系。我已经读取,合并了数据帧,并试图查询数据帧:
from dask import dataframe as df
import pandas as pd
pd.set_option("max_rows", None)
df_times = df.read_csv("lap_times.csv", blocksize=64000000)
df_races = df.read_csv("races.csv",blocksize=64000000)
df_drivers = df.read_csv("drivers.csv", blocksize=64000000)
pd1 = df_times.merge(df_drivers, on = "driverId").merge(df_races, on = "raceId")
.drop(labels = ["dob", "nationality", "milliseconds", "url_x", "url_y", "time_y", "date"],
axis = 1)
.loc[(df_drivers["forename"] == "Sebastian")
& (df_drivers["surname"] == "Vettel")
& (df_races["year"] == 2018)
& (df_races["round"] == 1), :]
我能够运行上面的代码没有错误。但是,如果我尝试使用pd1.compute()
将过滤后的任务数据框转换为pandas数据框,则会得到错误:
Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
我已经为此搜索了解决方案,并且看到,对于pandas数据框架,重置索引是一个解决方案,但不适用于任务数据框架。我错过了什么?如有任何帮助,不胜感激。
我能够获得所需的子集,但是我必须分别执行每个合并:
pd1 = df_times.merge(df_drivers, on = "driverId")
.drop(labels = ["dob", "nationality", "milliseconds", "url"],axis = 1)
pd1 = pd1.merge(df_races, on = "raceId")
pd2 = pd1.drop(["url", "time_y", "date"], axis = 1)
pd2.loc[(pd2["forename"] == "Lewis") & (pd2["surname"] == "Hamilton") & (pd2["year"] == 2018) & (pd2["round"] == 1)].compute()
这不是最优的,我仍然想找到一个解决方案,在一个语句中执行所有3个表之间的合并。