Finding strings that are repeated across multiple columns in Python (preferably with a Pandas DataFrame)



I have a dataset with three columns that looks like this:

    X1 X5 X1
    X2 X9 X2
    X3 X3 X5
    X4 X8 X3
    X5 X1 X4

I want to find the values that are present in every column. In this case, the output would be

    X1
    X3
    X5

Can someone help me do this in Python?

If you .apply value_counts to the columns, you get the following:

In [25]: df
Out[25]:
    a   b   c
0  X1  X5  X1
1  X2  X9  X2
2  X3  X3  X5
3  X4  X8  X3
4  X5  X1  X4
In [26]: df.apply(pd.Series.value_counts)
Out[26]:
      a    b    c
X1  1.0  1.0  1.0
X2  1.0  NaN  1.0
X3  1.0  1.0  1.0
X4  1.0  NaN  1.0
X5  1.0  1.0  1.0
X8  NaN  1.0  NaN
X9  NaN  1.0  NaN

So you want all the rows that contain no null values...

In [28]: result = df.apply(pd.Series.value_counts).notnull().all(axis=1)
In [29]: result
Out[29]:
X1     True
X2    False
X3     True
X4    False
X5     True
X8    False
X9    False
dtype: bool

Then you can get a list of the values that are True:

In [30]: [i for i, x in result.items() if x]
Out[30]: ['X1', 'X3', 'X5']
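
For anyone who wants to run this outside an interactive session, here is a minimal sketch of the same value_counts idea as a standalone script. The column names `a`, `b`, `c` follow the session above, and the names `counts` and `common` are made up for illustration. It uses a boolean mask instead of the list comprehension, which is only a stylistic choice.

```python
import pandas as pd

# Same data as in the question; column names a, b, c as in the session above
df = pd.DataFrame({
    "a": ["X1", "X2", "X3", "X4", "X5"],
    "b": ["X5", "X9", "X3", "X8", "X1"],
    "c": ["X1", "X2", "X5", "X3", "X4"],
})

# One row per distinct value, one column per original column;
# NaN marks a value that never occurs in that column
counts = df.apply(pd.Series.value_counts)

# True only for values that have a count in every column
result = counts.notna().all(axis=1)

# Keep the index labels where the mask is True, sorted for a stable order
common = sorted(result[result].index)
print(common)
```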

And a slightly different approach:

In [50]: df
Out[50]:
    a   b   c
0  X1  X5  X1
1  X2  X9  X2
2  X3  X3  X5
3  X4  X8  X3
4  X5  X1  X4
In [51]: uniq = pd.Series(np.unique(df.values))
In [52]: uniq
Out[52]:
0    X1
1    X2
2    X3
3    X4
4    X5
5    X8
6    X9
dtype: object
In [53]: result = df.apply(uniq.isin).all(axis=1)
In [54]: result.index = uniq
In [55]: result
Out[55]:
X1     True
X2    False
X3     True
X4    False
X5     True
X8    False
X9    False
dtype: bool
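
It may help to look at the intermediate table that `df.apply(uniq.isin)` produces before it is reduced with `.all(axis=1)`. The sketch below is a self-contained version of the session above with the imports it needs; the names `membership`, `in_all` and `common` are made up for illustration, and it assumes the frame contains no missing values, since sorting mixed strings and NaN with `np.unique` can fail.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["X1", "X2", "X3", "X4", "X5"],
    "b": ["X5", "X9", "X3", "X8", "X1"],
    "c": ["X1", "X2", "X5", "X3", "X4"],
})

# Every distinct value found anywhere in the frame, in sorted order
uniq = pd.Series(np.unique(df.to_numpy()))

# For each column, flag which of the distinct values occur in that column
membership = df.apply(uniq.isin)
membership.index = uniq  # label the rows with the values themselves
print(membership)

# A value is common to all columns when its whole row is True
in_all = membership.all(axis=1)
common = in_all[in_all].index.tolist()
print(common)
```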

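The simplest solution I can think of:

1. Build a set of the values in each column.
2. Take the intersection of all the sets obtained in the previous step.
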
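import pandas as pd
df = pd.DataFrame(
    {'a': ['x1', 'x2', 'x3', 'x4', 'x5'], 'b': ['x5', 'x9', 'x3', 'x8', 'x1'], 'c': ['x1', 'x1', 'x5', 'x3', 'x4']})
# One set of values per column
sets = [set(df[column]) for column in df.columns]
# Intersect the sets; the order of the resulting list is arbitrary
result = list(set.intersection(*sets))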
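
If this is needed in more than one place, the two steps could be wrapped in a small helper. This is only a sketch: the function name `values_in_every_column` is hypothetical, and dropping missing values and sorting the result are assumptions about what is wanted, not part of the snippet above.

```python
import pandas as pd


def values_in_every_column(frame: pd.DataFrame) -> list:
    """Return the sorted values that occur in every column of frame."""
    # Step 1: one set per column; dropna() keeps NaN out of the sets
    sets = [set(column.dropna()) for _, column in frame.items()]
    if not sets:
        # A frame without columns has no common values
        return []
    # Step 2: intersect all the sets, then sort for a stable order
    return sorted(set.intersection(*sets))


df = pd.DataFrame({
    "a": ["X1", "X2", "X3", "X4", "X5"],
    "b": ["X5", "X9", "X3", "X8", "X1"],
    "c": ["X1", "X2", "X5", "X3", "X4"],
})
print(values_in_every_column(df))  # ['X1', 'X3', 'X5']
```

Iterating with `frame.items()` rather than `frame[name]` also copes with the duplicated column labels shown in the question (`X1 X5 X1`), where `frame[name]` would return a DataFrame instead of a single column.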
