Finding the last word in a tweepy tweet response in Python

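I am receiving a stream of tweets with Python and would like to extract the last word, or know where to reference it.

For example, in

NC don’t like working together www.linktowtweet.org

return

together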

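I am not familiar with tweepy, so I am assuming you already have the data in a Python string; there may be a better, tweepy-specific answer.

However, given a string in Python, extracting the last word is simple.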

Solution 1

Use str.rfind(' '). The idea here is to find the space before the last word. Here is an example.

text = "NC don’t like working together"
text = text.rstrip() # Remove any spaces at the end, which would otherwise confuse the algorithm.
last_word = text[text.rfind(' ')+1:] # Take every character *after* the last space.
print(last_word)

Note: if the given string contains no words, last_word will be an empty string.
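To make the note above concrete, here is a minimal sketch of the edge cases. The helper name last_word_rfind is hypothetical and simply wraps the same rstrip and rfind steps.

```python
def last_word_rfind(text):
    # Hypothetical helper name; same rstrip + rfind steps as the example above.
    text = text.rstrip()
    # rfind returns -1 when there is no space, so the slice starts at 0.
    return text[text.rfind(' ') + 1:]

for sample in ["", "   ", "together", "working together  "]:
    print(repr(sample), "->", repr(last_word_rfind(sample)))
```

An empty or all-space string gives an empty string, and a single word is returned unchanged because rfind finds no space.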

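Now, this assumes that all words are separated by spaces. To handle newlines and tabs as well, use str.replace to convert them to spaces. Whitespace in Python is "\t\n\x0b\x0c\r " (the last character is a space), but I think only newlines and tabs would be found in Twitter messages.

See also: string.whitespace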
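As a side note, string.whitespace can be inspected directly, and str.split() with no arguments already splits on any run of whitespace. The following is only a sketch of that alternative; the helper name last_word_split is an assumption, not part of the original answer.

```python
import string

# Show the characters Python's string module treats as ASCII whitespace.
print(repr(string.whitespace))

def last_word_split(text):
    # Hypothetical helper: split() with no argument splits on any
    # whitespace (spaces, tabs, newlines) and drops empty pieces.
    words = text.split()
    return words[-1] if words else ''

print(last_word_split("NC don’t like\tworking\ntogether\n"))
```

This avoids the separate str.replace calls, at the cost of building a list of all the words.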

So a complete example (wrapped as a function) would be

def last_word(text):
    text = text.replace('\n', ' ') # Replace newlines with spaces.
    text = text.replace('\t', ' ') # Replace tabs with spaces.
    text = text.rstrip(' ') # Remove trailing spaces.
    return text[text.rfind(' ')+1:]

print(last_word("NC don’t like working together")) # Outputs "together".

This is probably still the best approach for basic parsing. For larger problems, there are better tools.

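Solution 2

Regular expressions

These are another way of handling strings in Python, and a much more flexible one. A regex, as they are usually called, uses its own language to specify a portion of text.

For example, .*\s(\S+) specifies the last word in a string.

Here it is again with a longer explanation.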

.*               # Match as many characters as possible.
\s               # Until a whitespace character ("\t\n\x0b\x0c\r ").
(                # Remember the next section for the answer.
\S+              # Match as much of a ~word~ (non-whitespace) as possible.
)                # End saved section.
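The annotated breakdown above can be kept inside the code itself with the re.VERBOSE flag, which lets a pattern span several lines and carry comments. This is a sketch of the same pattern; the name LAST_WORD_VERBOSE is only illustrative.

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace in the
# pattern is ignored and the comments can live next to each piece.
LAST_WORD_VERBOSE = re.compile(r"""
    .*      # Match as many characters as possible.
    \s      # Until a whitespace character.
    (       # Remember the next section for the answer.
    \S+     # One or more non-whitespace characters (the word).
    )       # End saved section.
""", re.DOTALL | re.VERBOSE)

m = LAST_WORD_VERBOSE.match("NC don’t like working together")
print(m.group(1) if m else '')
```

Note that the pattern requires a whitespace character before the last word, so a string containing a single word with no whitespace does not match at all.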

Then, in Python, you would use it as follows.

import re # Import the regex library.

# Compile the pattern (DOTALL makes . match \n as well).
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was no last word in this text.
        return ''
    return m.group(1) # Otherwise return the last word.

print(last_word("NC don’t like working together")) # Outputs "together".

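Now, although this method is less obvious, it has a couple of advantages. First, it is more customizable. If you want the last word but not a link, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" will match the last word, but skip a link if that is the last thing in the text.

Example: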

import re # Import the regex library.

# Compile the pattern (DOTALL makes . match \n as well).
LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)

def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was no last word in this text.
        return ''
    return m.group(1) # Otherwise return the last word.

print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".

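The second advantage of this method is speed.

You can try it online! There, the regex approach is almost as fast as the string manipulation, and in some cases even faster. (I actually found the regex to execute .2 usec faster on my machine than in the demo.)

Either way, the regex executes very quickly, even in this simple case, and for more complicated parsing it will usually beat an equivalent string algorithm written in pure Python. So using a regex can also speed up your code.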
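To check the speed claim on your own machine, a rough timing sketch with timeit might look like the following. The function names last_word_str and last_word_re, the sample string, and the iteration count are all assumptions chosen for illustration; actual timings will vary by machine and Python version.

```python
import re
import timeit

LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def last_word_str(text):
    # String-method version (Solution 1).
    text = text.replace('\n', ' ').replace('\t', ' ').rstrip(' ')
    return text[text.rfind(' ') + 1:]

def last_word_re(text):
    # Regex version (Solution 2).
    m = LAST_WORD_PATTERN.match(text)
    return m.group(1) if m else ''

sample = "NC don’t like working together"  # Illustrative input only.
runs = 100_000  # Arbitrary iteration count.
for func in (last_word_str, last_word_re):
    seconds = timeit.timeit(lambda: func(sample), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.3f} usec per call")
```

The lambda adds a small, equal overhead to both measurements, so the numbers are best read as a comparison rather than as absolute costs.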

Edit: changed the URL-avoiding regex from

re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)

to

re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)

so that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.
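The difference between the two patterns can be seen side by side in a short sketch. The names OLD_PATTERN, NEW_PATTERN and last_word_with are hypothetical and used only for this comparison.

```python
import re

# Pattern before the edit: does not know about "://".
OLD_PATTERN = re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
# Pattern after the edit: excludes ":" and rejects anything followed by "://".
NEW_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)

def last_word_with(pattern, text):
    # Hypothetical helper: apply the given compiled pattern to the text.
    m = pattern.match(text)
    return m.group(1) if m else ''

tweet = "NC don’t like working together http://www.linktowtweet.org"
print(last_word_with(OLD_PATTERN, tweet))  # http://
print(last_word_with(NEW_PATTERN, tweet))  # together
```

With the old pattern the scheme prefix still qualifies as a "word" because it ends at a word boundary before www; the added "://" lookahead and the excluded colon rule that out.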

To see how this regex works, take a look at https://regex101.com/r/sdwpqB/2.

It's simple. If your text is:

import re
text = "NC don’t like working together www.linktowtweet.org"
text = re.sub(r'(?:https?://|www\.)\S*', '', text) # Remove any URL, with or without a scheme.
words = text.split() # Split the sentence into words on any whitespace.
last_word = words[-1]

So there you go! Now you get the last word, "together".
