Splitting a hashtag term into multiple words



I'm trying to split terms that contain a multi-word hashtag, such as "#iamgreat" or "#awesome-dayofmylife".
The output I'm looking for is:

 I am great
 awesome day of my life

All I've managed so far is this:

 >>> import re
 >>> name = "big #awesome-dayofmylife because #iamgreat"
 >>> name = re.sub(r'#([^\s]+)', r'\1', name)
 >>> print name
 big awesome-dayofmylife because iamgreat

In case anyone asks whether I have a list of possible words: the answer is "no", so any guidance here would be great. Any NLP experts around?

All the comments above are of course correct: a hashtag without spaces or other clear separators between the words (especially in English) is generally ambiguous and cannot be parsed correctly in every case.

However, the word-list idea is fairly simple to implement and can produce useful (though sometimes wrong) results, so I implemented a quick version of it:

import re

wordList = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()
wordOr = '|'.join(wordList)

def splitHashTag(hashTag):
  for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
    print ':', wordSequence
    for word in re.findall(wordOr, wordSequence):
      print word,
    print

for hashTag in '''awesome-dayofmylife iamgreat something
somethingsunclear'''.split():
  print '###', hashTag
  splitHashTag(hashTag)

This prints:

### awesome-dayofmylife
: awesome
awesome
: dayofmylife
day of my life
### iamgreat
: iamgreat
i am great
### something
: something
something
### somethingsunclear
: somethingsunclear
something sun clear

As you can see, it falls right into the trap qstebom set for it ;-)

EDIT:

Some explanation of the code above:

The variable wordOr contains a string of all the words, separated by the pipe symbol (|). In regular expressions, that means "one of these words".

The first findall is given a pattern meaning "a sequence of one or more of these words", so it matches strings like "dayofmylife". findall finds all such sequences, so I iterate over them (for wordSequence in …). For each word sequence, I then search for every single word within the sequence (again using findall) and print that word.
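The two findall passes can be seen in isolation in this minimal sketch (Python 3 syntax here; the word list is a made-up subset):

```python
import re

# Hypothetical miniature word list; the answer above uses a longer one.
word_list = ['awesome', 'day', 'of', 'my', 'life', 'i', 'am', 'great']
word_or = '|'.join(word_list)  # 'awesome|day|of|my|life|i|am|great'

# Outer pass: match maximal runs built entirely from known words.
sequences = re.findall('(?:' + word_or + ')+', 'dayofmylife')
print(sequences)  # ['dayofmylife']

# Inner pass: split one such run back into the individual words.
words = re.findall(word_or, sequences[0])
print(words)  # ['day', 'of', 'my', 'life']
```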

The problem can be broken down into several steps:

  1. Fill a list with English words
  2. Split the sentence into terms separated by spaces
  3. Treat terms starting with "#" as hashtags
  4. For each hashtag, find the words by repeatedly looking for the longest match present in the word list

Here is a solution using that approach:

# Returns a list of common english terms (words)
def initialize_words():
    content = None
    with open('C:\\wordlist.txt') as f: # A file containing common english words
        content = f.readlines()
    return [word.rstrip('\n') for word in content]

def parse_sentence(sentence, wordlist):
    new_sentence = "" # output    
    terms = sentence.split(' ')    
    for term in terms:
        if term[0] == '#': # this is a hashtag, parse it
            new_sentence += parse_tag(term, wordlist)
        else: # Just append the word
            new_sentence += term
        new_sentence += " "
    return new_sentence 

def parse_tag(term, wordlist):
    words = []
    # Remove hashtag, split by dash
    tags = term[1:].split('-')
    for tag in tags:
        word = find_word(tag, wordlist)    
        while word != None and len(tag) > 0:
            words.append(word)            
            if len(tag) == len(word): # Special case for when eating rest of word
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)
    return " ".join(words)

def find_word(token, wordlist):
    i = len(token) + 1
    while i > 1:
        i -= 1
        if token[:i] in wordlist:
            return token[:i]
    return None 

wordlist = initialize_words()
sentence = "big #awesome-dayofmylife because #iamgreat"
parse_sentence(sentence, wordlist)

It prints:

'big awe some day of my life because i am great '

You'll have to strip the trailing space, but that's easy. :)
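The tag-parsing part of the pipeline can also be run self-contained, with an inline word list (a made-up subset, standing in for the file) and Python 3 syntax:

```python
# Condensed sketch of the longest-match approach above, no file needed.
def find_word(token, wordlist):
    # Longest-prefix match: try the whole token first, then shorter prefixes.
    for i in range(len(token), 0, -1):
        if token[:i] in wordlist:
            return token[:i]
    return None

def parse_tag(term, wordlist):
    words = []
    for tag in term[1:].split('-'):  # drop '#', split on dashes
        word = find_word(tag, wordlist)
        while word is not None and len(tag) > 0:
            words.append(word)
            if len(tag) == len(word):  # the match ate the rest of the tag
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)
    return ' '.join(words)

# Hypothetical word list; real runs would load a large one from disk.
wordlist = {'awesome', 'day', 'of', 'my', 'life', 'i', 'am', 'great'}
print(parse_tag('#awesome-dayofmylife', wordlist))  # awesome day of my life
print(parse_tag('#iamgreat', wordlist))             # i am great
```

Note that because this made-up list happens to contain "awesome", the longest-prefix rule picks the whole word rather than the "awe some" split shown above; the quality of the output depends entirely on the word list used.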

I got the word list from http://www-personal.umich.edu/~jlawler/wordlist.
