Include in a list the values that occur in an array at least a given percentage of the time

I have an array named "data" that contains the following information:

[['amazon',
'phone',
'serious',
'mind',
'blown',
'serious',
'enjoy',
'use',
'applic',
'full',
'blown',
'websit',
'allow',
'quick',
'track',
'packag',
'descript',
'say'],
['would',
'say',
'app',
'real',
'thing',
'show',
'ghost',
'said',
'quot',
'orang',
'quot',
'ware',
'orang',
'cloth',
'app',
'adiquit',
'would',
'recsmend',
'want',
'talk',
'ghost'],
['love',
'play',
'backgammonthi',
'game',
'offer',
'varieti',
'difficulti',
'make',
'perfect',
'beginn',
'season',
'player']]

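I want to save in a list the values that make up at least 1% of the elements in this array.

The closest approximation I have found is the following, but it does not return what I need. Any ideas?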

import numpy as np
import numpy_indexed as npi

idx = [np.ones(len(a))*i for i, a in enumerate(tokens_list_train)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(tokens_list_train))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)

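Let's see. We can use itertools.chain.from_iterable to remove the nesting, but we also need the total length. To avoid looping over the data twice, we can compute it by wrapping the input in another generator. We also need to count the repetitions, which Counter does for us.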

from collections import Counter
from itertools import chain

total_length = 0

def sum_sublist_length(some_list):  # to sum the lengths of the sub-lists
    global total_length
    for value in some_list:
        total_length += len(value)
        yield value

# my_list is the nested list from the question
counts = Counter(chain.from_iterable(sum_sublist_length(my_list)))
# keep only the items that make up at least 1% of all elements
items = [item for item in counts if counts[item]/total_length >= 0.01]
print(items)

['amazon', 'phone', 'serious', 'mind', 'blown', 'enjoy', 'use', 'applic', 'full', 'websit', 'allow', 'quick', 'track', 'packag', 'descript', 'say', 'would', 'app', 'real', 'thing', 'show', 'ghost', 'said', 'quot', 'orang', 'ware', 'cloth', 'adiquit', 'recsmend', 'want', 'talk', 'love', 'play', 'backgammonthi', 'game', 'offer', 'varieti', 'difficulti', 'make', 'perfect', 'beginn', 'season', 'player']
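
The output above contains every distinct token because the sample holds only 51 tokens in total, so a single occurrence already accounts for about 1.96%. The following is a minimal sketch of the same Counter idea without the global variable, at the cost of materialising the flattened list once. The function name `frequent_values` and the parameter `min_share` are hypothetical names introduced here for illustration; they are not part of the answer above.

```python
from collections import Counter
from itertools import chain


def frequent_values(nested, min_share=0.01):
    """Return the values whose share of all elements is at least min_share."""
    flat = list(chain.from_iterable(nested))  # removes one level of nesting
    if not flat:
        return []
    counts = Counter(flat)
    total = len(flat)
    # Counter keeps first-seen order, so the result follows the input order.
    return [value for value, n in counts.items() if n / total >= min_share]


# The sample data from the question.
data = [
    ['amazon', 'phone', 'serious', 'mind', 'blown', 'serious', 'enjoy', 'use', 'applic',
     'full', 'blown', 'websit', 'allow', 'quick', 'track', 'packag', 'descript', 'say'],
    ['would', 'say', 'app', 'real', 'thing', 'show', 'ghost', 'said', 'quot', 'orang', 'quot',
     'ware', 'orang', 'cloth', 'app', 'adiquit', 'would', 'recsmend', 'want', 'talk', 'ghost'],
    ['love', 'play', 'backgammonthi', 'game', 'offer', 'varieti', 'difficulti', 'make',
     'perfect', 'beginn', 'season', 'player'],
]

total = sum(len(sub) for sub in data)
print(total)                                   # 51
print(len(frequent_values(data)))              # 43: every distinct token passes 1%
print(frequent_values(data, min_share=0.03))   # only tokens that occur at least twice
# ['serious', 'blown', 'say', 'would', 'app', 'ghost', 'quot', 'orang']
```

Raising the threshold to 3% shows the filter doing real work: only the eight tokens that occur twice (2/51, about 3.9%) remain.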

Here is another way to generate the list of elements that appear 1% of the time or more, this time using a pandas.Series:


import numpy as np
import pandas as pd

# == Define `flatten` function to combine objects with multi-level nesting =======
def flatten(iterable, base_type=None, levels=None):
    """Flatten an iterable with multiple levels of nesting.

        >>> iterable = [(1, 2), ([3, 4], [[5], [6]])]
        >>> list(flatten(iterable))
        [1, 2, 3, 4, 5, 6]

    Binary and text strings are not considered iterable and
    will not be collapsed.

    To avoid collapsing other types, specify *base_type*:

        >>> iterable = ['ab', ('cd', 'ef'), ['gh', 'ij']]
        >>> list(flatten(iterable, base_type=tuple))
        ['ab', ('cd', 'ef'), 'gh', 'ij']

    Specify *levels* to stop flattening after a certain level:

        >>> iterable = [('a', ['b']), ('c', ['d'])]
        >>> list(flatten(iterable))  # Fully flattened
        ['a', 'b', 'c', 'd']
        >>> list(flatten(iterable, levels=1))  # Only one level flattened
        ['a', ['b'], 'c', ['d']]
    """
    def walk(node, level):
        if (
            ((levels is not None) and (level > levels))
            or isinstance(node, (str, bytes))
            or ((base_type is not None) and isinstance(node, base_type))
        ):
            yield node
            return
        try:
            tree = iter(node)
        except TypeError:
            yield node
            return
        else:
            for child in tree:
                yield from walk(child, level + 1)

    yield from walk(iterable, 0)

# == Problem Solution ==========================================================
# `array` is the nested list from the question.
# 1. Flatten the array into a single-level list of elements, then convert it
#    to a `pandas.Series`.
series_array = pd.Series(list(flatten(array)))
# 2. Get the total number of elements in the flattened list.
element_count = len(series_array)
# 3. Use the method `pandas.Series.value_counts()` to count the number of times
#    each element appears, then divide each count by the total number of
#    elements in the flattened list (`element_count`).
elements = (
    (series_array.value_counts()/element_count)
    # 4. Use `pandas.Series.loc` to select only the values that appear at least
    #    1% of the time.
    .loc[lambda value: value >= 0.01]
    # 5. Select the elements, and convert the result to a list.
    .index.to_list()
)
print(elements)
['would', 'serious', 'blown', 'quot', 'orang', 'app', 'ghost', 'say', 'use', 'adiquit', 'enjoy', 'said', 'cloth', 'thing', 'applic', 'talk', 'player', 'track', 'recsmend', 'beginn', 'packag', 'allow', 'perfect', 'want', 'real', 'love', 'full', 'show', 'play', 'make', 'backgammonthi', 'mind', 'amazon', 'game', 'difficulti', 'offer', 'descript', 'websit', 'quick', 'season', 'phone', 'varieti', 'ware']
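
When the data is nested exactly one level deep, as in the question, the custom flatten helper is not strictly needed. The sketch below is an assumption-laden alternative rather than the answer's own method: it relies on `Series.explode()` to unnest the sub-lists and on `value_counts(normalize=True)` to return shares directly. The function name `frequent_values_pd`, the parameter `min_share`, and the small `sample` list are hypothetical and used only for illustration.

```python
import pandas as pd


def frequent_values_pd(nested, min_share=0.01):
    """Return values whose share of all elements is at least min_share.

    Assumes `nested` is a list of lists (exactly one level of nesting).
    """
    # Each sub-list becomes one cell; explode() turns every item into its own row.
    shares = pd.Series(nested).explode().value_counts(normalize=True)
    # Boolean mask keeps only the values at or above the threshold.
    return shares[shares >= min_share].index.tolist()


# Hypothetical sample: 'app' is 3/6, 'ghost' is 2/6, 'say' is 1/6 of all items.
sample = [['app', 'ghost', 'app'], ['ghost', 'say'], ['app']]
print(sorted(frequent_values_pd(sample, min_share=0.3)))  # ['app', 'ghost']
```

The result is sorted before printing because the order of values with equal counts in `value_counts()` should not be relied upon.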
