Pandas groupby returns indices that are not in a DataFrame subset created with copy()



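Problem: find the ZIPs in df that are not repeated. A ZIP must appear exactly once, and df.ST must not have the value ".".
So I subset the original DataFrame and applied groupby, but this still brought in some rows that do not meet the subset condition (df.ST != '.'). I then created a separate df_us by subsetting with copy(). groupby still gives the same indices.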

grouped = df[df.ST != '.'].groupby(['ZIP_CD'],sort=False) # grouping
df_size = pd.DataFrame({'ZIP':grouped.size().index, 'Count':grouped.size().values}) # Forming df around the group
df_count = df_size[df_size.Count==1] #df with Count=1
one_index = df_count.index.tolist() #gathering index
df_one = df.loc[one_index] #final df
df_us = df_data[df.ST != '.'].copy() # tried this too

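The last snippet above still returns some indices for rows with "." values when I group. But df_us does not contain any "." at all, so I end up with the same index list as in the approach above, except that for the "." rows the remaining values are empty, because df_us does not have those rows!

groupby keeps finding the indices of the "." rows no matter what I do. Is there any solution?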

Update: sample data =
index ST ZIP_CD
123 ca 94025
124 Toronto
125 ga 30306
126 Italy
127 ca 94025

So the correct answer is

ST      ZIP_CD 
0   123     ca  94025

Update: @Naveed's solution and my own below both work fine. I am still not sure why the code above is flawed.

You can use groupby and keep your original approach:

grouped = df[df.ST != '.'].groupby(['ZIP_CD'],sort=False) # grouping
item = grouped.size()[grouped.size() < 2].index # finding zip values that occur only once
df_one = df[df.ZIP_CD.isin(item)] #final df

I tested it and it worked.
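
A variation on the same idea, as a minimal sketch: `transform("size")` returns the group size aligned to the original row labels, so the filter can be applied directly without going back through ZIP values. The DataFrame here is illustrative data made up for this example (the ZIP_CD values for the "." rows are assumptions), not the asker's real data.

```python
import pandas as pd

# Illustrative data only; the values for the "." rows are assumed.
df = pd.DataFrame(
    {
        "ST": ["ca", ".", "ga", ".", "ca"],
        "ZIP_CD": ["94025", "Toronto", "30306", "Italy", "94025"],
    },
    index=[123, 124, 125, 126, 127],
)

# Drop the invalid rows first.
valid = df[df["ST"] != "."]

# Size of each ZIP_CD group, repeated for every row and indexed like `valid`.
counts = valid.groupby("ZIP_CD")["ZIP_CD"].transform("size")

# Keep only rows whose ZIP_CD occurs exactly once among the valid rows.
df_one = valid[counts == 1]
print(df_one)
```

Because `counts` carries the same index as `valid`, the boolean mask lines up row by row and no separate index bookkeeping is needed.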

# use loc where zip_cd ne .
# and zip_cd is not duplicated
# df.duplicated(subset='ZIP') : identifies the duplicates based on ZIP code and results in True/False series
# df.loc selects the rows from df, where duplicated is true
# df['ZIP'].isin : check if any of the DF is part of the zip filtered by df.loc
# using negation we eliminate them from being selected
# first condition is to check using loc that ZIP is not equal to "."
# combining these two together with logical AND, we filter DF where it holds true
# please note: while the same DF is used repeatedly, the filtered result is different for each of them.

(df.loc[df['ZIP'].ne('.') & 
~df['ZIP'].isin(df.loc[df.duplicated(subset='ZIP')]['ZIP'])]
)
index   ST  ZIP_CD
0   123     ca  94025
2   125     ga  30306
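
If the goal is to drop every row whose ZIP occurs more than once, a shorter variant is possible with `keep=False`, which marks all members of a duplicated group rather than only the later ones. This is a sketch on made-up data; the column name `ZIP_CD` and the values for the "." rows are assumptions for illustration.

```python
import pandas as pd

# Illustrative data only; the values for the "." rows are assumed.
df = pd.DataFrame(
    {
        "ST": ["ca", ".", "ga", ".", "ca"],
        "ZIP_CD": ["94025", "Toronto", "30306", "Italy", "94025"],
    },
    index=[123, 124, 125, 126, 127],
)

# Remove the invalid rows, then drop every row whose ZIP_CD is repeated.
valid = df[df["ST"].ne(".")]
unique_rows = valid[~valid.duplicated(subset="ZIP_CD", keep=False)]
print(unique_rows)
```

With the default `keep="first"`, the first occurrence of a repeated ZIP is not flagged as a duplicate, so which rows survive depends on how the mask is used afterwards.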

@Naveed, thank you for helping me learn. Your use of negation was new to me. I also wrote an alternative using your negation.

df1 = df[df.ZIP != '.'] # eliminate invalid entries
v = df1.ZIP.value_counts() # counts values
df2 = df1[~df1.ZIP.isin(v.index[v.gt(1)])] # gets values more than once and negates

Link to try it: https://trinket.io/python3/b26eae2e0e

Also worth mentioning is the modified approach from @Jerrold110:

items = v[v<2].index # items that appear less than twice
df2 = df1[df1['ZIP'].isin(items)]

I don't know why the original groupby solution failed.
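
One likely explanation, sketched under assumptions: `df_size` is a brand-new DataFrame with its own 0..n-1 index, so `df_count.index` holds row positions inside `df_size`, not labels from `df`. Passing those numbers to `df.loc` then selects unrelated rows of the original frame, including "." rows. The data below is made up for illustration and assumes `df` has a default integer index.

```python
import pandas as pd

# Illustrative data with a default 0..4 index (an assumption).
df = pd.DataFrame(
    {
        "ST": ["ca", ".", "ga", ".", "ca"],
        "ZIP_CD": ["94025", "Toronto", "30306", "Italy", "94025"],
    }
)

# The steps from the question.
grouped = df[df.ST != "."].groupby(["ZIP_CD"], sort=False)
df_size = pd.DataFrame({"ZIP": grouped.size().index, "Count": grouped.size().values})
df_count = df_size[df_size.Count == 1]

# These are positions in df_size, not row labels of df.
one_index = df_count.index.tolist()
wrong = df.loc[one_index]  # picks row 1 of df, which is a "." row

# Selecting by ZIP value instead of by index avoids the mismatch.
valid = df[df.ST != "."]
df_one = valid[valid.ZIP_CD.isin(df_count["ZIP"])]
print(wrong)
print(df_one)
```

This would also fit the empty rows seen with df_us: as far as I know, older pandas versions returned NaN rows when `.loc` was given a list containing labels missing from the frame, whereas newer versions raise a `KeyError`.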
