Optimizing URL-to-page-title conversion in Python 2.7



I have a list of tweets in a document named twfile.txt. A sample of the text file might look like this:

RT @CriticalReading: How #Islamophobia works. #Germanwings http://t.co/rX6XVxARiD
Family of Australian victims visit the #Germanwings #GermanWingsCrash crash site in #FrenchAlps #A320Crash #A320 http://t.co/ztReJ1tifU
RT @morningshowon7: #Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
Three generations from the same family were killed in the #Germanwings Alps crash: http://t.co/6F5MgvBSZG http://t.co/HzJZCZKVZe
Alps crash pilot's hidden illness sparks medical privacy debate #Germanwings. http://t.co/Efe89rxwJG
#Germanwings crash: church in #AndreasLubitz's home town stands by his family http://t.co/QkePs5sG4W http://t.co/irdDnHhxF7
Breaking: #Germanwings co-pilot had been treated 4 suicidal tendencies: http://t.co/6qEynKMSEI/s/KJKu http://t.co/TVdqP4EeWu/s/b4vR @Reuters
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
Audio last 60 seconds from flight deck http://t.co/T4IYK26NrG     #Germanwings #GermanWingsCrash #GermanyWings #4U9525 #AndreasLubitz
#Germanwings: Australian relatives have visited the memorial site in the French alps. #TMS7 http://t.co/BmfiLxHPkC
RT @surfinwav: American intelligence contractor among those killed in Alps plane crash http://t.co/m4L0EOd9L2 #Germanwings #GermanWingsCrash
Excellent help & resources from our friends @MindframeMedia over responsible reporting re #Germanwings http://t.co/EQG0kxyQgd  #NoStigma
.@Boba71 @Reuters So in Germany any sick psycho can fly a commercial plane hiding behind the so called privacy laws? #germanwings
The World Will Never Forget  https://t.co/Th41xouUiS  #4U9525 #GermanWings #A320Crash #indeepsorrow #AndreasLubitz
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
I am uncomfortable using word 'depression' for the #Germanwings pilot, depression does not kill other people.
Google Maps has blurred out the home of #Germanwings crash pilot Andreas Lubitz. http://t.co/VTm5sfmT6e
#Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/YpDB8trKFL http://t.co/uML8h6vwD8
#Lufthansa #Germanwings prepare for negligence charges since copilot was known to be suicidal 7 years ago
ICYMI: @swaindiana's interview w. lawyer who represents 4 families, who lost loved ones in #Germanwings crash. http://t.co/dnUXKkCD46 #CBCNN
An airplane crashes, after a couple of HOURS we get who's guilty, with the perfect solution for everybody. I don't buy it. #Germanwings
#Germanwings Crash Settlements Are Likely to Vary by Passenger Nationality - #aviationlaw #montrealconvention http://t.co/MWM8nSEYwG
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
German prosecutors confirm #Germanwings pilot "had continued to see psychiatrists and neurologists until recently" http://t.co/ma1v9zeiIV
RT @Reuters: #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/Qb75hM3shv http://t.co/7twzPvaAQV
RT @MindframeMedia: MEDIA: tips when including #mentalillness in stories to avoid perpetuating #stigma http://t.co/W7RlJVe9Lq #Germanwings
#Germanwings plane crash in French Alps: First clues - CNN : http://t.co/AbMPbXFfjG
RT @MindframeMedia: MEDIA: Get to know the facts about  #mentalillness & avoiding  stigmatising stories http://t.co/ZDd7AFOAir #Germanwings
RT @michaelhallida4: Am I Mad Enough To Crash A Plane Into A Mountain? https://t.co/M9d5nlf4bM #auspol #Germanwings
It's a sick world! How can this happen? RT @Reuters #Germanwings co-pilot had been treated for suicidal tendencies: http://t.co/ryw6nTmTNF
RT @Reuters: #Germanwings co-pilot Andreas Lubitz had been treated for suicidal tendencies: http://t.co/p7wqBNvoEW http://t.co/KKAGnvXFDd
I suffer #depression too but I would never risk other people's life. #Germanwings

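The following code reads from the file. It expands each URL and replaces the old URL with the new one. It also checks whether the URL points to an image: if it does not, the URL is replaced with the title of the web page; otherwise it is left as is. The code works fine except for one problem: it takes far too long, which makes it unsuitable for documents with thousands of tweets. How can I make it run faster?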

import codecs
from bs4 import BeautifulSoup
import urllib
output = codecs.open('tw1file.txt','w','utf-8')
with open('twfile.txt','r') as inputf:
    for line in inputf:
        try:
            list1 = line.split(' ')
            for i in range(len(list1)):
                a = list1[i]
                if "http" in list1[i]:
                    ##print list1[i]
                    response = urllib.urlopen(list1[i])
                    a = response.url
                    ##print a
                    if 'photo' in a:
                        ##print a                       
                        list1[i] = a + ' '
                        ##print list1[i]
                    else:
                        html = response.read()
                        soup = BeautifulSoup(html)
                        list1[i] = soup.html.head.title
                        t = str(list1[i])
                        # strip the surrounding <title>...</title> tags
                        list1[i] = t[8:-9] + ' '

                    list1[i] = ''.join(ch for ch in list1[i])
                else:
                    list1[i] = ''.join(ch for ch in list1[i])
            line = ' '.join(list1)
            print line
            output.write(line)
        except:
            pass

# inputf has already been closed by the with block
output.close()

Possibly by buying more bandwidth…

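See "Accurately measure time python function takes", and then work out where your time is going. I'd bet that most of the script's time is spent downloading the websites…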
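As a rough illustration of that measurement step, here is a minimal sketch that wraps a call and reports how long it took. The helper name `timed` is hypothetical (it is not part of the question's code), and `timeit.default_timer` is used because it is available in both Python 2.7 and Python 3.

```python
from __future__ import print_function
from timeit import default_timer


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds).

    Hypothetical helper for illustration only.
    """
    start = default_timer()
    result = func(*args, **kwargs)
    elapsed = default_timer() - start
    return result, elapsed


def report(label, elapsed):
    # Print one timing line so the slow stage stands out.
    print("%-10s %.3f s" % (label, elapsed))
```

Wrapping the download and the parse separately, for example `response, t = timed(urllib.urlopen, url)` followed by `soup, t2 = timed(BeautifulSoup, html)`, should show whether the network or the HTML parsing dominates.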

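If you spend a lot of time idle on the network (because the sites are slower than your bandwidth), you can try putting the lines in a processing queue and letting a pool of worker threads do the actual work.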

See "Threading pool similar to the multiprocessing Pool?" (for example code that uses workers, see dgorissen's answer).
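The worker-pool idea can be sketched as follows. This is not the code from the linked answer: it is a minimal sketch that assumes the thread-backed `Pool` from `multiprocessing.dummy` is acceptable, and the names `rewrite_line`, `rewrite_lines`, `expand_url`, and the `workers` parameter are hypothetical. The URL lookup is passed in as a function so that the pool logic can be tried without touching the network.

```python
from __future__ import print_function
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

try:
    from urllib import urlopen           # Python 2.7
except ImportError:
    from urllib.request import urlopen   # Python 3


def expand_url(url):
    """Follow redirects and return the final URL (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.geturl()
    finally:
        response.close()


def rewrite_line(line, resolve):
    """Replace every word containing 'http' with resolve(word)."""
    words = line.rstrip("\n").split(" ")
    for i, word in enumerate(words):
        if "http" in word:
            try:
                words[i] = resolve(word)
            except Exception:
                pass  # keep the original URL if the lookup fails
    return " ".join(words)


def rewrite_lines(lines, resolve, workers=8):
    """Process lines concurrently; results come back in input order."""
    pool = Pool(workers)
    try:
        return pool.map(lambda line: rewrite_line(line, resolve), lines)
    finally:
        pool.close()
        pool.join()
```

With this shape, something like `rewrite_lines(open('twfile.txt'), expand_url, workers=20)` would let many downloads wait on the network at the same time. The title lookup from the question could be swapped in for `expand_url`, and the best value for `workers` is something to find by measurement.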
