Calculate statistics for a subset of a DataFrame based on values in each row (latitude and longitude)



I want to calculate summary statistics for a subset of a DataFrame, where the subset depends on specific values within each row.

For example, I have a DataFrame containing latitude, longitude, and a count of people.

import pandas as pd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

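For each row, I want to know the total number of people within 0.05 miles of that row. This is easy to build with a loop, but it becomes unusable as the dataset grows.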

Current approach / sample:

from geopy.distance import distance

def distance_calc(row, focus_lat, focus_long):
    # Geodesic distance in miles between this row and the focus point
    start = (row['latitude'], row['longitude'])
    stop = (focus_lat, focus_long)
    return distance(start, stop).miles

df['total_people_within_05'] = 0
df['total_rows_within_05'] = 0
for index, row in df.iterrows():
    focus_lat = df['latitude'][index]
    focus_long = df['longitude'][index]
    # Recompute the distance from every row to the current focus row
    new_df = df.copy()
    new_df['distance'] = new_df.apply(lambda row: distance_calc(row, focus_lat, focus_long), axis=1)
    df.at[index, 'total_people_within_05'] = new_df.loc[new_df.distance <= .05]['people'].sum()
    df.at[index, 'total_rows_within_05'] = new_df.loc[new_df.distance <= .05].shape[0]

Is there a Pythonic way to do this?

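  • Take the Cartesian product of the DataFrame with itself to get every pair of rows. This is expensive on large datasets: it generates N^2 rows, so 25 rows in this example.
  • Compute the distance for each pair.
  • Use query() to keep only the pairs within the required distance.
  • Use groupby() to get the total number of people. It also collects the indices of the rows included in the total into a list, for transparency.
  • Finally, join() this back onto the original DataFrame and you have what you want.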
import geopy.distance as gd
import pandas as pd

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

df = df.join(
    # Cross join: pair every row with every row via a constant key
    df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
    # Geodesic distance in miles for each (x, y) pair
    .assign(distance=lambda dfa: dfa.apply(
        lambda r: gd.distance((r.latitude_x, r.longitude_x),
                              (r.latitude_y, r.longitude_y)).miles, axis=1))
    .query("distance<=0.05")
    .rename(columns={"people_y": "nearby"})
    # Per focus row: total people nearby and the list of matching row indices
    .groupby("index_x").agg({"nearby": "sum", "index_y": lambda x: list(x)})
)
print(df.to_markdown())  # to_markdown() needs the tabulate package installed
Output (index_y column, rows 0 to 4): [0, 1, 2], [0, 1, 2], [0, 1, 2], [3], [4]
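If the row-by-row apply() is the bottleneck, one possible alternative is to compute all pairwise distances at once with NumPy broadcasting. The following is only a sketch under stated assumptions: it swaps geopy's geodesic distance for the haversine (spherical Earth) formula, which differs slightly from geopy's result, and the helper name haversine_matrix_miles and the constant EARTH_RADIUS_MILES are made up for this illustration. The output column names are taken from the question.

```python
import numpy as np
import pandas as pd

# Assumption: mean Earth radius in miles (spherical approximation, not geopy's ellipsoid)
EARTH_RADIUS_MILES = 3958.8

def haversine_matrix_miles(lat_deg, lon_deg):
    """Hypothetical helper: N x N matrix of great-circle distances in miles."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column vector against a row vector gives all pairwise differences
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

# Boolean N x N mask: True where row j lies within 0.05 miles of row i (including i itself)
within = haversine_matrix_miles(df['latitude'], df['longitude']) <= 0.05

# The matrix product sums 'people' over the True entries of each row of the mask
df['total_people_within_05'] = within @ df['people'].to_numpy()
df['total_rows_within_05'] = within.sum(axis=1)
print(df)
```

This still builds an N x N matrix, so memory grows quadratically with the number of rows. For much larger data, a spatial index such as scikit-learn's BallTree with the haversine metric may be worth looking at.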
