Python: splitting a string to separate URLs from text

At the moment I use a script like this to extract the URLs from a string:

import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.findall(r'(https?://\S+)', s)
print(result)
['http://www.example.com/blah', 'http://blabla.com']

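Now I need to extend the script so that it builds a dictionary entry for every chunk of the string. I need to tell the URLs apart from the normal text, but I also need to keep the normal text, and split the original string into a dictionary like this: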

my_dict_result = {
    0: {
        "type": "text",
        "value": "This is my tweet check it out"
    },
    1: {
        "type": "url",
        "value": "http://www.example.com/blah"
    },
    2: {
        "type": "text",
        "value": "and"
    },
    3: {
        "type": "url",
        "value": "http://blabla.com"
    }
}

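But I can't tell whether a function exists that would simplify this. If a dict like mine can't be produced directly, a list would also be fine: I could then iterate over the list, check whether each item is a URL or text, and build my dict from that.

Does anyone know which function I could use to achieve this? Thanks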

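To split the text so that the result contains both the substrings you are interested in and everything else, you can use re.split with a pattern (its first argument) that contains a capturing group. Your pattern already has one, so you can simply do: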

import re
s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
result = re.split(r'(https?://\S+)', s)
print(result)

Output:

['This is my tweet check it out ', 'http://www.example.com/blah', ' and ', 'http://blabla.com', '']
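The capturing group is what keeps the URLs in the result. As a small sketch using the same example string, compare the pattern with and without the parentheses (the variable names without_group and with_group are only illustrative):

```python
import re

s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'

# Without a capturing group, the matched URLs act as pure separators
# and are dropped from the result.
without_group = re.split(r'https?://\S+', s)

# With a capturing group, the matched URLs are kept in the result,
# interleaved with the surrounding text.
with_group = re.split(r'(https?://\S+)', s)

print(without_group)
print(with_group)
```

Only the second form gives you both the text and the URLs in one list.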

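Note that whatever the pattern matched always ends up at an odd index, even when the match is at the very start of the string (in that case the first element is an empty string):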

s = 'http://www.example.com something http://www.blahblahblah.com'
result = re.split(r'(https?://\S+)', s)
print(result)

This gives:

['', 'http://www.example.com', ' something ', 'http://www.blahblahblah.com', '']
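Building on the odd-index observation, one possible way to get the dict from the question is sketched below. The helper name split_text_and_urls is made up for this example, and stripping whitespace and skipping empty pieces are assumptions based on the desired output shown in the question, not something re.split does for you:

```python
import re

def split_text_and_urls(s):
    """Hypothetical helper: split s into numbered text/url chunks."""
    parts = re.split(r'(https?://\S+)', s)
    chunks = []
    for index, part in enumerate(parts):
        # Odd positions hold what the capturing group matched (the URLs).
        kind = "url" if index % 2 == 1 else "text"
        value = part.strip()
        # Assumption: empty leftovers (e.g. the leading or trailing '') are dropped.
        if value:
            chunks.append({"type": kind, "value": value})
    # Number the remaining chunks 0, 1, 2, ... as in the question.
    return dict(enumerate(chunks))

s = 'This is my tweet check it out http://www.example.com/blah and http://blabla.com'
my_dict_result = split_text_and_urls(s)
print(my_dict_result)
```

If a list is enough, return chunks directly instead of wrapping it in dict(enumerate(...)).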
