Python 3 - Handling hyphenated words: combining and splitting

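I want to process hyphenated words. For example, I want to handle the word "well-known" in two different ways.

The first is to combine the word, i.e. ("wellknown"); the second is to split it, i.e. ("well", "known").

The input would be "well-known", and the expected output is: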

--wellknown
--well
--known

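However, I can only do one of the two operations on each word, not both. When I loop through the text file and find a hyphenated word, I combine it first.

After combining, I don't know how to get back to the original word and perform the split. Below is a short excerpt from my code. (Please let me know if you need to see more details.)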

import re
import string

# contents holds the text read from the input file
for text in contents.split():
    if not re.search(r'\d', text):                # remove numbers
        if text not in string.punctuation:        # remove punctuation
            if '-' in text:
                combine = text.replace("-", '')       # ??problem part (wellknown)
                separate = re.split(r'[^a-z]', text)  # ??problem part (well, known)

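I think the reason I can't do both operations is that once I replace the hyphen, the hyphenated word is gone, so I can no longer find it to do the split (the "separate" operation in the code). Does anyone know how to do this, or how to fix the logic?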

Why not just use a tuple that contains both the separated words and the combined word?

Split first, then combine:

Sample code

separate = text.split('-')
combined = ''.join(separate)
words = (combined, separate[0], separate[1])

Output

('wellknown', 'well', 'known')
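
The tuple above assumes exactly one hyphen. A slightly more general sketch follows, using a hypothetical helper named expand_hyphenated (not part of the question or the answer) that accepts any number of hyphens. It also relies on Python strings being immutable: split and replace return new objects and leave the original word unchanged, so both operations can be applied to the same variable.

```python
def expand_hyphenated(word):
    """Return the combined form followed by the individual parts.

    Hypothetical helper for illustration only.
    """
    # Drop empty pieces produced by leading, trailing, or doubled hyphens.
    parts = [part for part in word.split('-') if part]
    if len(parts) < 2:
        # Nothing to combine (no hyphen, or a bare "-"), so keep the word as-is.
        return (word,)
    combined = ''.join(parts)
    return (combined, *parts)


text = "well-known"
print(expand_hyphenated(text))                # ('wellknown', 'well', 'known')
print(expand_hyphenated("state-of-the-art"))  # ('stateoftheart', 'state', 'of', 'the', 'art')
print(text)                                   # well-known (the original is untouched)
```

Returning a one-element tuple for unhyphenated words keeps the return type uniform, which is a design choice made for this sketch.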

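Treat tokens as objects rather than strings; then you can create tokens with multiple attributes.

For example, we can use the collections.namedtuple container as a simple object to hold a token: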

from collections import namedtuple
from nltk import word_tokenize
Token = namedtuple('Token', ['surface', 'splitup', 'combined'])
text = "This is a well-known example of a small-business grant of $123,456."
tokenized_text = []
for token in word_tokenize(text):
    if '-' in token:
        this_token = Token(token, tuple(token.split('-')),  token.replace('-', ''))
    else:
        this_token = Token(token, token, token)
    tokenized_text.append(this_token)
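
The loop above stores a plain string in splitup for unhyphenated tokens and a tuple otherwise. As a possible variant, the sketch below always stores a tuple, so hyphenated tokens are simply those with more than one part. The make_token helper is a hypothetical name, and str.split() stands in for word_tokenize so the example runs without NLTK; unlike word_tokenize, it leaves punctuation attached to words.

```python
from collections import namedtuple

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])


def make_token(word):
    """Build a Token whose splitup field is always a tuple (hypothetical helper)."""
    return Token(word, tuple(word.split('-')), word.replace('-', ''))


# str.split() is a stand-in for word_tokenize; punctuation stays attached.
tokens = [make_token(word) for word in "a well-known small-business grant".split()]

# Hyphenated tokens are exactly those with more than one part.
hyphenated = [token.surface for token in tokens if len(token.splitup) > 1]
print(hyphenated)  # ['well-known', 'small-business']
```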

Then you can iterate over tokenized_text, which is a list of Token namedtuples. For example, if we only need the surface strings:

for token in tokenized_text:
    print(token.surface)

[out]:

This
is
a
well-known
example
of
a
small-business
grant
of
$
123,456
.

If you need to access the combined tokens:

for token in tokenized_text:
    print(token.combined)

[out]:

This
is
a
wellknown
example
of
a
smallbusiness
grant
of
$
123,456
.

If you want to access the split tokens, use the same loop, but note that hyphenated tokens give you tuples instead of strings, e.g.

for token in tokenized_text:
    print(token.splitup)

[out]:

This
is
a
('well', 'known')
example
of
a
('small', 'business')
grant
of
$
123,456
.

You can also use a list comprehension to access the attributes of the Token namedtuples, e.g.

>>> [token.splitup for token in tokenized_text]
['This', 'is', 'a', ('well', 'known'), 'example', 'of', 'a', ('small', 'business'), 'grant', 'of', '$', '123,456', '.']

To identify the tokens that contained a hyphen and were split, you can simply check the type of splitup, e.g.

>>> [type(token.splitup) for token in tokenized_text]
[str, str, str, tuple, str, str, str, tuple, str, str, str, str, str]
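
Tying this back to the question, the sketch below prints the combined form followed by each part for every hyphenated token, in the format the question asked for. It uses isinstance rather than comparing type objects, and a plain str.split() instead of word_tokenize so that it runs without NLTK; both are choices made for this illustration, and the shortened sample sentence is an assumption as well.

```python
from collections import namedtuple

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])

text = "This is a well-known example"
tokenized_text = []
for word in text.split():  # stand-in for nltk.word_tokenize
    if '-' in word:
        tokenized_text.append(Token(word, tuple(word.split('-')), word.replace('-', '')))
    else:
        tokenized_text.append(Token(word, word, word))

output_lines = []
for token in tokenized_text:
    # Only hyphenated tokens carry a tuple in splitup.
    if isinstance(token.splitup, tuple):
        output_lines.append('--' + token.combined)
        output_lines.extend('--' + part for part in token.splitup)

print('\n'.join(output_lines))
# --wellknown
# --well
# --known
```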
