Extracting @user mentions and URL links from Twitter text data with Python regular expressions

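I have a list of strings holding tweet text, such as the data below (the real data set is much larger than this sample). I want to extract every username that follows an @, as well as every URL link, from the tweet text, for example: galaxy5univ and the URL links.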

    tweet_text = ['@galaxy5univ I like you',
        "RT @BestOfGalaxies: Let's sit under the stars ...",
        '@jonghyun__bot .........((thanks)',
        'RT @yosizo: thanks.ddddd <https://yahoo.com>',
        'RT @LDH_3_yui: #fam, ccccc https://msn.news.com']

My code:

import re
pu = re.compile(r'http\S+')   # "http" followed by any run of non-whitespace
pn = re.compile(r'@(\S+)')    # any run of non-whitespace after "@"
for row in tweet_text:
    text = pu.findall(row)
    name = pn.findall(row)
    print("url: ", text)
    print("name: ", name)

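After testing the code on a large amount of tweet data, I found that both of my patterns, the one for URLs and the one for names, are wrong (although they give correct results on some tweets). Do you have any documentation or links on extracting names and URLs from tweet text when the Twitter data set is large?

If you have any suggestions for extracting names and URLs from Twitter data, please let me know. Thanks!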

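Note that your pn = re.compile(r'@(\S+)') regex captures any 1+ non-whitespace characters after the @, so a trailing : ends up in the match.

To exclude the : from the match, you need to convert the shorthand class \S into its negated character class equivalent, [^\s], and add : to it: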

pn = re.compile(r'@([^\s:]+)')

Now the capture stops just before the first : (or the first whitespace character, whichever comes first).
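To make the difference concrete, here is a minimal sketch that runs both patterns on one of the sample tweets from the question. The variable names `greedy` and `bounded` are only illustrative labels and do not appear in the original code.

```python
import re

tweet = "RT @BestOfGalaxies: Let's sit under the stars ..."

greedy = re.compile(r'@(\S+)')       # any run of non-whitespace after "@"
bounded = re.compile(r'@([^\s:]+)')  # stops at whitespace or at a colon

greedy_names = greedy.findall(tweet)
bounded_names = bounded.findall(tweet)

print(greedy_names)   # ['BestOfGalaxies:']  (the colon is captured too)
print(bounded_names)  # ['BestOfGalaxies']
```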

If you need to capture everything up to the last :, just add a : after the capturing group: pn = re.compile(r'@(\S+):').
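A short sketch of how that variant behaves may help before choosing it. Because the pattern requires a literal colon, it only matches mentions that are followed by one, which is typical of retweet prefixes. The third input string is a made-up example, not taken from the question's data.

```python
import re

trailing_colon = re.compile(r'@(\S+):')

with_colon = trailing_colon.findall('RT @yosizo: thanks.ddddd')
without_colon = trailing_colon.findall('@galaxy5univ I like you')
inner_colon = trailing_colon.findall('RT @a:b: hypothetical')  # invented input

print(with_colon)     # ['yosizo']
print(without_colon)  # []  (no colon after the name, so nothing matches)
print(inner_colon)    # ['a:b']  (greedy \S+ backtracks to the last colon)
```

If plain mentions such as @galaxy5univ must also be found, the negated character class version is probably the better fit.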

As for matching URLs with a regex, there are many such patterns on the web; just pick the one that suits your data best.

Here is some sample code:

import re
# Mentions: capture everything after "@" up to whitespace or a colon
p = re.compile(r'@([^\s:]+)')
test_str = "@galaxy5univ I like you\nRT @BestOfGalaxies: Let's sit under the stars ...\n@jonghyun__bot .........((thanks)\nRT @yosizo: thanks.ddddd <https://yahoo.com>\nRT @LDH_3_yui: #fam, ccccc https://msn.news.com"
print(p.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']
# URLs: scheme, dotted host name, then an optional path/query part
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['https://yahoo.com', 'https://msn.news.com']
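Since the question iterates over a list of tweets rather than one joined string, the same idea can be sketched per tweet. The helper name `extract_entities` and the shorter URL pattern `https?://[^\s<>]+` are assumptions made for this illustration. The shorter pattern is looser than the one above and may keep trailing punctuation, so treat it as a starting point only.

```python
import re

MENTION_RE = re.compile(r'@([^\s:]+)')
# Assumption: a simplified URL pattern that stops at whitespace or angle brackets
URL_RE = re.compile(r'https?://[^\s<>]+')

def extract_entities(tweet):
    """Return the mentions and URLs found in a single tweet string."""
    return {
        'names': MENTION_RE.findall(tweet),
        'urls': URL_RE.findall(tweet),
    }

tweet_text = [
    '@galaxy5univ I like you',
    "RT @BestOfGalaxies: Let's sit under the stars ...",
    '@jonghyun__bot .........((thanks)',
    'RT @yosizo: thanks.ddddd <https://yahoo.com>',
    'RT @LDH_3_yui: #fam, ccccc https://msn.news.com',
]

results = [extract_entities(tweet) for tweet in tweet_text]
for found in results:
    print("name: ", found['names'], "url: ", found['urls'])
```

Compiling the patterns once at module level and reusing them for every row keeps the per-tweet work down to the two `findall` calls, which matters when the list is large.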

If the usernames contain no special characters (only letters, digits, and underscores), you can use:

@([\w]+)
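One caveat with this pattern is that it also fires on the @ inside an e-mail address. The sketch below shows that, together with one possible guard: a negative lookbehind that rejects an @ preceded by a word character. The guard and the sample text are additions for illustration, not part of the original answer.

```python
import re

simple = re.compile(r'@(\w+)')
# Assumption: ignore "@" that directly follows a word character (e.g. in e-mails)
guarded = re.compile(r'(?<!\w)@(\w+)')

text = 'RT @LDH_3_yui: write to someone@example.com'  # invented sample text

simple_names = simple.findall(text)
guarded_names = guarded.findall(text)

print(simple_names)   # ['LDH_3_yui', 'example']
print(guarded_names)  # ['LDH_3_yui']
```

Note that in Python 3 `\w` matches Unicode word characters by default; passing `re.ASCII` restricts it to `[a-zA-Z0-9_]` if that is what the data calls for.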
