How to filter dictionary keys by their frequency in a corpus



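So, I'm working on an assignment and I'm stuck on this part. I have a dictionary with tuples of strings as keys and a corresponding value for each key. Now I have to filter the dictionary by removing the keys that occur fewer than 8 times in the Brown corpus, using the paras method.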

I've looked everywhere and can't find any pseudocode showing how to do this.

[{('love', 'sex'): '6.77',
  ('tiger', 'cat'): '7.35',
  ('tiger', 'tiger'): '10.00',
  ('book', 'paper'): '7.46',
  ('computer', 'keyboard'): '7.62',
  ('computer', 'internet'): '7.58',
  ('plane', 'car'): '5.77',
  ('train', 'car'): '6.31',
  ('telephone', 'communication'): '7.50',
  ('television', 'radio'): '6.77',
  ('media', 'radio'): '7.42',
  ('drug', 'abuse'): '6.85',
  .
  . 
  .

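So what I'm supposed to do with this dictionary is remove the keys whose tokens (the word pair) are not in alphabetical order, as well as the word pairs (keys) in which at least one word has a document frequency of less than 8 in the Brown corpus.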

I don't know what a document is in this context, so this answer may be flawed.

Input:

mylist = [{('love', 'sex'): '6.77',
  ('tiger', 'cat'): '7.35',
  ('tiger', 'tiger'): '10.00',
  ('book', 'paper'): '7.46',
  ('computer', 'keyboard'): '7.62',
  ('computer', 'internet'): '7.58',
  ('computer', 'car'): '7.58',
  ('computer', 'plane'): '7.58',
  ('computer', 'train'): '7.58',
  ('computer', 'television'): '7.58',
  ('computer', 'radio'): '7.58',
  ('computer', 'tiger'): '7.58',
  ('computer', 'test1'): '7.58',
  ('computer', 'test2'): '7.58',
  ('tiger', 'tz1'): '7.58',
  ('tiger', 'tz2'): '7.58',
  ('tiger', 'tz3'): '7.58',
  ('tiger', 'tz4'): '7.58',
  ('tiger', 'tz5'): '7.58',
  ('tiger', 'tz6'): '7.58',
  ('tiger', 'tz7'): '7.58',
  ('tiger', 'tz8'): '7.58',
  ('plane', 'car'): '5.77',
  ('train', 'car'): '6.31',
  ('telephone', 'communication'): '7.50',
  ('television', 'radio'): '6.77',
  ('media', 'radio'): '7.42',
  ('drug', 'abuse'): '6.85'}]

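Solution: Note that the solution has to iterate over the dictionary twice (although the second loop will usually only cover part of the dictionary). I also treat each dictionary in the list as its own thing, so you may need to move some statements around.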

# This will be the keys we want to remove
removable_keys = set()
# This will be the number of times we see a key part (left or right)
occurences = dict()
# For each dictionary in our list
for dic in mylist:
    # For each key in that dictionary
    for key in dic:
        # If the key is not in alphabetical order
        if list(key) != sorted(list(key)):
            # We will remove that key
            removable_keys.add(key)
        # Else this is a valid key
        else:
            # Increment the number of times we have seen this key
            left, right = key
            occurences[left] = 1 if left not in occurences else occurences[left] + 1
            occurences[right] = 1 if right not in occurences else occurences[right] + 1
    # Now we need to look for keys that had fewer than 8 occurrences.
    for key in dic.keys() - removable_keys:
        left, right = key
        if occurences[left] < 8 or occurences[right] < 8:
            removable_keys.add(key)
    # Finally remove all those keys from our dict
    for key in removable_keys:
        del dic[key]
    print(dic)

Output:

{('tiger', 'tiger'): '10.00', ('computer', 'tiger'): '7.58'}
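
Note that the code above counts how often each word appears across the dictionary keys, not in the Brown corpus. If "document frequency" is meant as the number of Brown paragraphs that contain a word (an assumption on my part, since the question mentions the paras method), a minimal sketch could look like the following. The helper names `document_frequency` and `filter_pairs`, the lowercasing of tokens, and the toy corpus are all hypothetical and only there for illustration; the toy corpus mimics the nested shape (paragraphs of sentences of words) that `brown.paras()` returns.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each lowercased word, how many documents contain it."""
    df = Counter()
    for doc in documents:
        # Assumption: a document is a list of sentences, each a list of tokens.
        # Using a set means a word is counted at most once per document.
        words = {token.lower() for sentence in doc for token in sentence}
        df.update(words)
    return df


def filter_pairs(pairs, df, min_df=8):
    """Keep pairs in alphabetical order whose words both reach min_df."""
    return {
        (left, right): score
        for (left, right), score in pairs.items()
        if left <= right and df[left] >= min_df and df[right] >= min_df
    }


# Hypothetical toy corpus standing in for the real one:
# eight identical one-sentence paragraphs plus one different paragraph.
toy_paras = [[["The", "computer", "saw", "a", "tiger"]]] * 8 + [[["A", "cat", "sat"]]]
pairs = {
    ("computer", "tiger"): "7.58",
    ("tiger", "cat"): "7.35",   # not in alphabetical order
    ("cat", "tiger"): "7.35",   # "cat" appears in only one paragraph
}

df = document_frequency(toy_paras)
print(filter_pairs(pairs, df))
# prints {('computer', 'tiger'): '7.58'}
```

With NLTK installed and the Brown corpus downloaded, the toy corpus could presumably be swapped for `brown.paras()` from `nltk.corpus`. Whether words should be lowercased before counting depends on what the assignment expects.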
