Finding the URL and user agent in a fairly complex log file



I have this regex: http://regexr.com/39rbe

1413323829.0907|172.168.1.0|  |somedomain.com|OK|0015e248f2484591f52ed37030001|st=bla&cp=huh%2Cs_de%2Cf_bt%2Ce_rc%2Ch_sub%2Cl_ol%2Ca_noapp%2Cp_npaid%2Ci_t-e&sv=i2&pt=CP&rf=www.google.de&r2=https%3A%2F%2Fwww.google.de%2F&ur=mydomain.de&xy=1366x768x24&lo=DE%asdaasdasdcb=0009&vr=306&id=guccjs&lt=1413373830843&ev=&cs=w2dwmo&mo=1&la=1413773766|i00=0615e248f8484591f52ed47030001%3B543e5f46%3B55966cde|Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36|http://mydomain.de/uriPath|023|web|OK|OK

I am trying to capture the user agent string on lines where the URL equals http://mydomain.de/uriPath. This attempt, for example, does not work yet:

[^|]+(?=https?://(?:www.)?mydomain.de[^|]+)

How about

\|[^|]+\|(?=https?://(?:www\.)?mydomain\.de[^|]+)

Example: http://regex101.com/r/tF4jD3/5

If you do not want the leading and trailing | in the match, move them into lookaround assertions:

(?<=\|)[^|]+(?=\|https?://(?:www\.)?mydomain\.de[^|]+)

This outputs

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36

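What it does

(?<=\|) asserts that the match is immediately preceded by |

[^|]+ matches one or more characters other than |

(?=\|https?://(?:www\.)?mydomain\.de[^|]+) asserts that this run of non-| characters is followed by | and then a mydomain.de URL, here |http://mydomain.de/uriPath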
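As a rough illustration of the lookaround version in Python, here is a minimal sketch. The shortened log line is a made-up stand-in with the same pipe-delimited layout, and the names line, pattern and user_agent are only illustrative.

```python
import re

# Shortened, hypothetical log line: the user agent field sits directly
# before the URL field, as in the sample from the question.
line = ("1413323829.0907|172.168.1.0|somedomain.com|"
        "Mozilla/5.0 (Windows NT 6.1; WOW64)|"
        "http://mydomain.de/uriPath|023|web|OK")

# The lookbehind and lookahead check for the surrounding pipes and the URL
# without making them part of the match.
pattern = re.compile(r'(?<=\|)[^|]+(?=\|https?://(?:www\.)?mydomain\.de[^|]+)')

match = pattern.search(line)
user_agent = match.group(0) if match else None
print(user_agent)
```

Because both assertions are zero-width, group(0) is the bare field and no stripping of | is needed afterwards.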

EDIT

Using a capture group

\|([^|]+)\|(?:https?://(?:www\.)?mydomain\.de[^|]+)

Use a positive lookahead like the one below,

[^|]+(?=\|[^|]*(?:https?://)(?:www\.)?mydomain\.de[^|]+)

DEMO

Using a capture group,

\|([^|]+)\|[^|]*(?:https?://)(?:www\.)?mydomain\.de[^|]+

DEMO

>>> s = "1413323829.0907|172.168.1.0|  |somedomain.com|OK|0015e248f2484591f52ed37030001|st=bla&cp=huh%2Cs_de%2Cf_bt%2Ce_rc%2Ch_sub%2Cl_ol%2Ca_noapp%2Cp_npaid%2Ci_t-e&sv=i2&pt=CP&rf=www.google.de&r2=https%3A%2F%2Fwww.google.de%2F&ur=mydomain.de&xy=1366x768x24&lo=DE%asdaasdasdcb=0009&vr=306&id=guccjs&lt=1413373830843&ev=&cs=w2dwmo&mo=1&la=1413773766|i00=0615e248f8484591f52ed47030001%3B543e5f46%3B55966cde|Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36|http://mydomain.de/uriPath|023|web|OK|OK"
>>> import re
>>> re.search(r'\|([^|]+)\|[^|]*(?:https?://)(?:www\.)?mydomain\.de[^|]+', s).group(1)
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36'

By splitting,

import re
s = "1413323829.0907|172.168.1.0|  |somedomain.com|OK|0015e248f2484591f52ed37030001|st=bla&cp=huh%2Cs_de%2Cf_bt%2Ce_rc%2Ch_sub%2Cl_ol%2Ca_noapp%2Cp_npaid%2Ci_t-e&sv=i2&pt=CP&rf=www.google.de&r2=https%3A%2F%2Fwww.google.de%2F&ur=mydomain.de&xy=1366x768x24&lo=DE%asdaasdasdcb=0009&vr=306&id=guccjs&lt=1413373830843&ev=&cs=w2dwmo&mo=1&la=1413773766|i00=0615e248f8484591f52ed47030001%3B543e5f46%3B55966cde|Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36|http://mydomain.de/uriPath|023|web|OK|OK"
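fields = s.split('|')
previous = ''
for field in fields:
    # A field holding a mydomain.de URL means the field before it is the user agent.
    if re.match(r'[^|]*(?:https?://)(?:www\.)?mydomain\.de[^|]+', field):
        print(previous)
    previous = field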

Output:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36
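For a whole log file rather than a single string, the splitting idea could be wrapped in a small generator. This is only a sketch under assumptions: the function name user_agents_for, its domain parameter and the three sample lines are invented for illustration, and the URL check is a slightly stricter variant of the pattern above (it must start the field and be followed by / or the end of the field).

```python
import re

def user_agents_for(lines, domain="mydomain.de"):
    """Yield the field that directly precedes a URL field for the given domain."""
    # Assumption: the URL field starts with the scheme; re.escape keeps the
    # dots in the domain literal.
    url_re = re.compile(r'https?://(?:www\.)?' + re.escape(domain) + r'(?:/|$)')
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Walk over neighbouring pairs: (previous field, current field).
        for previous, current in zip(fields, fields[1:]):
            if url_re.match(current):
                yield previous

# Hypothetical, heavily shortened log lines.
sample = [
    "1.0|10.0.0.1|AgentA/1.0|http://mydomain.de/uriPath|023|web|OK",
    "2.0|10.0.0.2|AgentB/2.0|http://otherdomain.de/x|023|web|OK",
    "3.0|10.0.0.3|AgentC/3.0|https://www.mydomain.de/other|023|web|OK",
]
result = list(user_agents_for(sample))
print(result)
```

The same function can be fed an open file object, since iterating over a file yields one line at a time.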
