Limit on Pandas groupby count



I have a DataFrame crimes_df:

>> crimes_df.size
6198374

I need to count events that share the same "s_lat", "s_lon", and "date". I used groupby:

crimes_count_df = (
    crimes_df
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
)

But it doesn't give the right answer, because if you compute the sum you can see that most of the events are missing:

>> crimes_count_df.sum()
crimes    476798
dtype: int64

I also tried agg:

crimes_count_df = (
    crimes_df
    .groupby(["s_lat", "s_lon", "date"])
    .agg(['count'])
)

But the result is the same:

crimes_count_df.sum()
Unnamed: 0            count    476798
area                  count    476798
arrest                count    476798
description           count    476798
domestic              count    476798
latitude              count    476798
location_description  count    475712
longitude             count    476798
time                  count    476798
type                  count    476798

EDIT: I found that this aggregation seems to have a limit! Look at these commands:

(crimes_df.head(100)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
    .sum())
crimes    100
dtype: int64
(crimes_df.head(1000)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
    .sum())
crimes    1000
dtype: int64
(crimes_df.head(10000)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
    .sum())
crimes    10000
dtype: int64
(crimes_df.head(100000)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
    .sum())
crimes    100000
dtype: int64
(crimes_df.head(1000000)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
    .sum())
crimes    476798
dtype: int64
(crimes_df.head(10000000)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
    .sum())
crimes    476798
dtype: int64
(crimes_df.head(476799)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
    .sum())
crimes    476798
dtype: int64
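The plateau pattern can be reproduced on synthetic data (a sketch with made-up values, not the crime data): the summed group sizes track head(n) until n reaches the number of rows, then stop growing.

```python
import numpy as np
import pandas as pd

# Synthetic sketch: 500 rows with a random grouping key.
np.random.seed(1)
df = pd.DataFrame({"k": np.random.choice(list("abc"), size=500)})

# The summed group sizes grow with head(n) until n exceeds len(df).
totals = [df.head(n).groupby("k").size().sum() for n in (10, 100, 1000)]
print(totals)  # [10, 100, 500] -- the apparent "limit" is just len(df)
```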

If you want to check it yourself, here is the file with the data:

https://www.dropbox.com/s/ib0kq16t4c2e5a2/crimedatawithsquare.csv?dl=0

You can load it like this:

from pandas import read_csv, DataFrame
crimes_df = read_csv("CrimeDataWithSquare.csv")

Info:

crimes_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 476798 entries, 0 to 476797
Data columns (total 13 columns):
Unnamed: 0              476798 non-null int64
area                    476798 non-null float64
arrest                  476798 non-null bool
date                    476798 non-null object
description             476798 non-null object
domestic                476798 non-null bool
latitude                476798 non-null float64
location_description    475712 non-null object
longitude               476798 non-null float64
time                    476798 non-null object
type                    476798 non-null object
s_lon                   476798 non-null float64
s_lat                   476798 non-null float64
dtypes: bool(2), float64(5), int64(1), object(5)
memory usage: 40.9+ MB

I don't think this is a bug. The size attribute is not always equal to the number of rows. Let's look at your case:

import pandas as pd
crimes_df = pd.read_csv("CrimeDataWithSquare.csv")
crimes_df.shape
#(476798, 13)
crimes_df.shape[0] * crimes_df.shape[1]
#6198374
crimes_df.size
#6198374
len(crimes_df)
#476798

What does the documentation say about size?

Number of elements in the NDFrame

In general, a DataFrame has 2 dimensions (x rows and y columns). So the DataFrame's size attribute returns x times y (the number of elements).
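A minimal sketch (toy values, not the crime data) makes the distinction concrete:

```python
import pandas as pd

# .size counts elements (rows * columns), not rows.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(df.shape)  # (3, 2)
print(df.size)   # 6  -> 3 rows * 2 columns
print(len(df))   # 3  -> number of rows
```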

What if you have just one column?

crimes_df2 = crimes_df.iloc[:, 0]
len(crimes_df2) == crimes_df2.size
#True

That is the result you expected.

Could some of your data have missing values, e.g. in date? If I remember correctly, NaN values are not grouped (though I may be wrong). Have you tried using fillna(0)?

crimes_count_df = (
    crimes_df
    .fillna(0)
    .groupby(["s_lat", "s_lon", "date"])
    .size()
    .to_frame("crimes")
)
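To illustrate the point about missing keys (a toy sketch, not the crime data): by default, groupby drops rows whose key is NaN, so those rows vanish from the summed counts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", np.nan, "a", "b"], "val": [1, 2, 3, 4]})

# The NaN-key row is silently dropped by groupby (dropna=True is the default).
print(df.groupby("key").size().sum())                      # 3
# Filling the key column first keeps every row in the counts.
print(df.fillna({"key": 0}).groupby("key").size().sum())   # 4
```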

Try the following:

import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'a': [1, 2, 3] * 4,
    'b': np.random.choice(['q','w','a'], size=12),
    'c': 1
})
df
    a  b  c
0   1  q  1
1   2  w  1
2   3  q  1
3   1  w  1
4   2  w  1
5   3  a  1
6   1  q  1
7   2  a  1
8   3  q  1
9   1  q  1
10  2  q  1
11  3  a  1
df.groupby(['a', 'b']).count()
     c
a b   
1 q  3
  w  1
2 a  1
  q  1
  w  2
3 a  2
  q  2
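The per-group counts in this toy example always add back up to the number of rows, which you can verify directly (a self-contained sketch repeating the setup above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'a': [1, 2, 3] * 4,
    'b': np.random.choice(['q', 'w', 'a'], size=12),
    'c': 1
})

# No rows are lost by groupby here: the counts sum back to len(df).
assert df.groupby(['a', 'b']).count()['c'].sum() == len(df) == 12
```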
