Why is this regular expression greedy, and why does the example code repeat forever?



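I've been trying to figure this out. It's been three days now and I'm about ready to give up. The code below should return a list of all phone numbers and email addresses on the clipboard, without duplicates.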

#! python 3
#! Phone number and email address scraper

#take user input for:
#1. webpage to scrape
# - user will be prompted to copy a link
#2. file & location to save to
#3. back to 1 or exit

import pyperclip, re, os.path

#function for locating phone numbers
def phoneNums(clipboard):
    phoneNums = re.compile(r'^(?:\d{8}(?:\d{2}(?:\d{2})?)?|\(\+?\d{2,3}\)\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\1\d{2}(?:\1\d{2})?))$')
    #(\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
    #(\s)?                          #Optional space
    #(\(\d\))?                      #Optional bracketed area code
    #(\d\d(\s)?\d | \d{3})          #3 digits with optional space between
    #(\s)?                          #Optional space
    #(\d{3})                        #3 digits
    #(\s)?                          #Optional space
    #(\d{4})                        #Last four
    #)
    #)', re.VERBOSE)
    #nos = phoneNums.search(clipboard)  #ignore for now. Failed test of .group()
    return phoneNums.findall(clipboard)

#function for locating email addresses
def emails(clipboard):
    emails = re.compile(r'''(
        [a-z0-9._%+-]*     #username
        @                  #@ sign
        [a-z0-9.-]+        #domain name
        )''', re.I | re.VERBOSE)
    return emails.findall(clipboard)

#function for copying email addresses and numbers from webpage to a file
def scrape(fileName, saveLoc):
    newFile = os.path.join(saveLoc, fileName + ".txt")
    #file = open(newFile, "w+")
    #add phoneNums(currentText) +
    print(currentText)
    print(emails(currentText))
    print(phoneNums(currentText))
    #file.write(emails(currentText))
    #file.close()

url = ''
currentText = ''
file = ''
location =  ''
while True:
    print("Please paste text to scrape. Press ENTER to exit.")
    currentText = str(pyperclip.waitForNewPaste())
    #print("Filename?")
    #file = str(input())
    #print("Where shall I save this? Defaults to C:")
    #location = str(input())
    scrape(file, location)

The emails are returned correctly, but the phone number output for the hashed-out (commented) section is as follows:

[('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600'), ('+30 210 458 6601', '+30', ' ', '', '210', '', ' ', '458', ' ', '6601')]

As you can see, the numbers are identified correctly, but my code is greedy, so I tried adding "+?":

def phoneNums(clipboard):
    phoneNums = re.compile(r'''(
        (\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        (\s)?                          #Optional space
        (\(\d\))?                      #Optional bracketed area code
        (\d\d(\s)?\d | \d{3})          #3 digits with optional space between
        (\s)?                          #Optional space
        (\d{3})                        #3 digits
        (\s)?                          #Optional space
        (\d{4})                        #Last four
        )+?''', re.VERBOSE)

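No joy. I then tried plugging in a regex example from here: Find phone numbers in python script

Now, I know that one works because someone else has tested it. What I get is: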

Please paste text to scrape. Press ENTER to exit. 
[] [] 
Please paste text to scrape. Press ENTER to exit. 
[] [('', '', '', '', '', '', '','', '', '')] 
...forever...

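That last one doesn't even give me a chance to copy anything to the clipboard. waitForNewPaste() should do what it says on the tin, but when I run the code, the program grabs whatever is already on the clipboard and tries to process it (badly).

Clearly there is a problem in my code, but I can't see it. Any ideas?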

As you pointed out, the regex works.

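The input portion "+30 210 458 6600" is matched once. Because the pattern contains capturing groups, findall() reports that match as a tuple of all captured subgroups rather than as a single string, which is why the output looks greedy.

Note that the first element of the tuple is the whole match.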
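As a minimal sketch of that behaviour, the snippet below runs the all-capturing pattern from the question over a made-up sample string (the sample text and the variable names are assumptions for illustration, not part of the original script). It shows two ways to get just the full numbers without changing the pattern: taking element 0 of each tuple, or using finditer() and group(0).

```python
import re

# The question's pattern, with every group capturing.
PHONE_CAPTURING = re.compile(r'''(
    (\+\d{1,4})?            # optional country code
    (\s)?                   # optional space
    (\(\d\))?               # optional bracketed area code
    (\d\d(\s)?\d | \d{3})   # 3 digits with optional space between
    (\s)?                   # optional space
    (\d{3})                 # 3 digits
    (\s)?                   # optional space
    (\d{4})                 # last four
    )''', re.VERBOSE)

# Made-up input for illustration.
sample = "Tel: +30 210 458 6600, fax: +30 210 458 6601"

# With capturing groups, findall() returns one tuple per match,
# holding one element per group.
tuples = PHONE_CAPTURING.findall(sample)

# Element 0 is the outermost group, i.e. the whole number.
numbers_from_tuples = [t[0] for t in tuples]

# finditer() yields match objects; group(0) is always the whole match.
numbers_from_matches = [m.group(0) for m in PHONE_CAPTURING.finditer(sample)]

print(numbers_from_tuples)
print(numbers_from_matches)
```

Both lists contain only the complete numbers, so either approach works if you would rather keep the groups for later use.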

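If you make all the groups non-capturing by inserting "?:" after each opening parenthesis, there are no capture groups left, and the result is just the full match "+30 210 458 6600" as a str.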

phoneNums = re.compile(r'''
    (?:\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
    (?:\s)?                          #Optional space
    (?:\(\d\))?                      #Optional bracketed area code
    (?:\d\d(?:\s)?\d | \d{3})        #3 digits with optional space between
    (?:\s)?                          #Optional space
    (?:\d{3})                        #3 digits
    (?:\s)?                          #Optional space
    (?:\d{4})                        #Last four
    ''', re.VERBOSE)
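Since the question also asks for a list without duplicates, here is a rough sketch of how the non-capturing pattern could be combined with order-preserving de-duplication. The function name unique_phone_numbers and the sample string are hypothetical. The strip() call is there because the pattern starts with an optional space, so a number without a country code can pick up the space in front of it.

```python
import re

# Non-capturing version of the pattern: findall() returns plain strings.
PHONE = re.compile(r'''
    (?:\+\d{1,4})?              # optional country code
    (?:\s)?                     # optional space
    (?:\(\d\))?                 # optional bracketed area code
    (?:\d\d(?:\s)?\d | \d{3})   # 3 digits with optional space between
    (?:\s)?                     # optional space
    (?:\d{3})                   # 3 digits
    (?:\s)?                     # optional space
    (?:\d{4})                   # last four
    ''', re.VERBOSE)


def unique_phone_numbers(text):
    """Return each matched number once, in order of first appearance."""
    stripped = (number.strip() for number in PHONE.findall(text))
    # dict.fromkeys() keeps insertion order (Python 3.7+), so repeats
    # are dropped without reordering the remaining numbers.
    return list(dict.fromkeys(stripped))


# Made-up input: the first number appears twice.
sample = "+30 210 458 6600, +30 210 458 6601, +30 210 458 6600"
print(unique_phone_numbers(sample))
```

The same dict.fromkeys() step could be applied to the email results if they also need to be unique.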

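The code "repeats forever" because the while True block is an infinite loop. If you want it to stop after one iteration, put a break statement at the end of the block to leave the loop.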

while True:
    currentText = str(pyperclip.waitForNewPaste())
    scrape(file, location)
    break
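If the goal is to keep scraping until the user is done rather than to stop after a single pass, the break can be made conditional. The sketch below is one possible shape and rests on assumptions: run_scraper, next_text and handle are hypothetical names, and an empty string is used as the exit signal. A canned list stands in for the clipboard so the loop can be run without pyperclip.

```python
def run_scraper(next_text, handle):
    """Call handle() on each text from next_text() until it is empty.

    next_text: callable returning the next pasted string (or None).
    handle:    callable that processes one pasted string.
    Returns the number of texts processed.
    """
    processed = 0
    while True:
        text = next_text()
        # Exit condition: nothing (or only whitespace) was supplied.
        if text is None or not text.strip():
            break
        handle(text)
        processed += 1
    return processed


# Stand-in for the clipboard: two pastes followed by an empty one.
pastes = iter(["+30 210 458 6600", "info@example.com", ""])
seen = []
processed = run_scraper(lambda: next(pastes), seen.append)
print(processed, seen)
```

In the real script, next_text could wrap pyperclip.waitForNewPaste() and handle could call scrape(); what counts as the exit signal is a design choice that the original code leaves open.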
