How to create a list containing only the first instance of each word in a string (excluding punctuation, newlines, etc.)



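OK, all you genius programmers and developers... I could really use some help with this one.

I'm currently taking "Python for Everybody" on Coursera (https://www.coursera.org/specializations/python) and I'm stuck on an assignment.

I can't figure out how to create a list containing only the first instance of each word in a string:

Example string: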

my_string = ("How much wood would a woodchuck chuck, "
             "if a woodchuck would chuck wood?")

Desired list:

words_list = ['How', 'much', 'wood', 'would',
              'a', 'woodchuck', 'chuck', 'if']

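Thank you all for your time, consideration, and contributions!

You can build a list of the words you have already seen and filter out non-alphabetic characters: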

my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
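final_l = []
for word in my_string.split():
    # Keep only the alphabetic characters, which drops punctuation such as ',' and '?'
    word = ''.join(i for i in word if i.isalpha())
    # Append the word only the first time it is seen (skip tokens that were all punctuation)
    if word and word not in final_l:
        final_l.append(word)
print(final_l)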

Output:

['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']

This can be done in 2 steps: first remove the punctuation, then add the words to a set, which removes the duplicates.

Python 3:

from string import punctuation #  This is a string of all ascii punctuation characters
trans = str.maketrans('', '', punctuation)
text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(trans)
words = set(text.split())

Python 2:

from string import punctuation #  This is a string of all ascii punctuation characters
text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(None, punctuation)
words = set(text.split())
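
A set does not remember the order in which words were added, so the result above may not match the order of the desired list in the question. A minimal sketch of the same two steps that keeps first-appearance order, assuming Python 3.7+ (where plain dicts preserve insertion order); the function name unique_in_order is only illustrative:

```python
from string import punctuation

def unique_in_order(text):
    # Step 1: strip all ASCII punctuation characters.
    cleaned = text.translate(str.maketrans('', '', punctuation))
    # Step 2: dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops duplicates while keeping each word's first position.
    return list(dict.fromkeys(cleaned.split()))

words = unique_in_order('How much wood would a woodchuck chuck, if a woodchuck would chuck wood?')
print(words)
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```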

Since all instances of a word are identical, I'll take the question to mean that you want a list of unique words. Probably the simplest way is:

import re
non_unique_words = re.findall(r'\w+', my_string)
unique_words = list(set(non_unique_words))

The 're.findall' call returns every word; converting to a set and back to a list then makes the result unique.

Try:

my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
def replace(word, block):
    for i in block:
        word = word.replace(i, '')
    return word
my_string = replace(my_string, ',?')
result = list(set(my_string.split()))

You can use the re module and cast the result to a set to remove duplicates:

>>> import re
>>> my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
>>> words_list = re.findall(r'\w+', my_string)  # Find all words in your string (without punctuation)
>>> words_list_unique = sorted(set(words_list), key=words_list.index)  # Cast to a set to remove duplicates, then sort by first occurrence to get an ordered list back.
>>> print(words_list_unique)
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']

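Explanation:

  • \w matches a single word character (a letter, digit, or underscore), and \w+ matches a whole word.
  • So you use re.findall(r'\w+', my_string) to find all the words in my_string.
  • A set is a collection of unique elements, so you cast the list returned by re.findall() to a set.
  • Then you cast it back to a list (with sorted) to get a list of unique words.
  • Edit - if you want to keep the order of the words, you can use sorted with key=words_list.index, since sets are unordered collections.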
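
To make the first bullet concrete, here is a small sketch comparing \w+ with a letters-only pattern. The sample sentence is made up for illustration and is not from the question; it shows that \w+ also picks up digits and underscores and that both patterns split on apostrophes:

```python
import re

# Hypothetical sample text, chosen to contain digits, an underscore and apostrophes.
sample = "It's 2 o'clock, said user_42."

word_chars = re.findall(r'\w+', sample)         # letters, digits and underscores
letters_only = re.findall(r'[A-Za-z]+', sample)  # ASCII letters only

print(word_chars)
# ['It', 's', '2', 'o', 'clock', 'said', 'user_42']
print(letters_only)
# ['It', 's', 'o', 'clock', 'said', 'user']
```

For the woodchuck sentence both patterns give the same words, since it contains only letters, spaces, a comma, and a question mark.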

If you need to preserve the order in which the words appear:

import string
from collections import OrderedDict
def unique_words(text):
    without_punctuation = text.translate({ord(c): None for c in string.punctuation})
    words_dict = OrderedDict((k, None) for k in without_punctuation.split())
    return list(words_dict.keys())
unique_words("How much wood would a woodchuck chuck, if a woodchuck would chuck wood?")
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']

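I use OrderedDict because there does not appear to be an ordered set in the Python standard library.

Edit:

To make the word list case-insensitive, you can lowercase the dictionary keys: (k.lower(), None) for k in ...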
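
Lowercasing the keys as suggested returns every word in lowercase. If the first-seen spelling should be kept instead, one possible variation is to store the original word as the value. This is a sketch, not part of the answer above; the name unique_words_ignore_case is hypothetical, and it assumes Python 3.7+ so that a plain dict keeps insertion order:

```python
import string

def unique_words_ignore_case(text):
    # Remove ASCII punctuation, the same way as unique_words() above.
    cleaned = text.translate({ord(c): None for c in string.punctuation})
    first_seen = {}
    for word in cleaned.split():
        # setdefault stores the word only the first time its lowercase form appears.
        first_seen.setdefault(word.lower(), word)
    return list(first_seen.values())

result = unique_words_ignore_case("Wood would, if wood WOULD? If so, how")
print(result)
# ['Wood', 'would', 'if', 'so', 'how']
```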

It should be enough to find all the words and then filter out the duplicates.

import re
words = re.findall('[a-zA-Z]+', my_string)
# Keep a word only if it does not appear earlier in the list
words_list = [w for idx, w in enumerate(words) if w not in words[:idx]]
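
The slice words[:idx] is rescanned for every word, which is fine for a sentence but grows quadratically on long texts. A hedged sketch of the same filter using a set of already-seen words for constant-time lookups (the function name first_occurrences is illustrative):

```python
import re

def first_occurrences(text):
    seen = set()   # words encountered so far, for fast membership tests
    result = []    # first occurrences, in order of appearance
    for word in re.findall(r'[a-zA-Z]+', text):
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

words_list = first_occurrences(
    "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
)
print(words_list)
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```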
