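I have a database from MaxMind that gives me location information for an IP address. I have written the following functions to retrieve the city and country from an IP: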
import geoip2.database

def country(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.country.iso_code
            return response
        except:
            return 'NA'

def city(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.city.name
            return response
        except:
            return 'NA'
I run this every minute, applying the functions to the raddr column of a pandas DataFrame:
df['country']=df['raddr'].apply(country)
df['city']=df['raddr'].apply(city)
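The problem is that each iteration takes more than 3 minutes to execute. I get about 150,000 rows, and I apply the functions to every row.
I would like to complete this operation in under a minute. Any suggestions?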
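Your functions are not optimized. As written, every call opens the database again, so applying them to the column opens it once per row. MaxMind's GitHub page even points out specifically that creating the Reader object is expensive: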
>>> # This creates a Reader object. You should use the same object
>>> # across multiple requests as creation of it is expensive.
What you should do is pass the reader to the functions as an additional keyword argument:
def country(ipa, reader):
    try:
        response = reader.city(ipa)
        response = response.country.iso_code
        return response
    except:
        return 'NA'

def city(ipa, reader):
    try:
        response = reader.city(ipa)
        response = response.city.name
        return response
    except:
        return 'NA'
Then call apply with the extra keyword argument, so that a single reader is opened once and shared by every row:
with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
    df['country'] = df['raddr'].apply(country, reader=reader)
    df['city'] = df['raddr'].apply(city, reader=reader)
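A further possible refinement, offered only as a sketch: the code above still calls reader.city twice per row (once for the country, once for the city), and repeats the lookup for any IP that appears more than once. The helper below returns both values from one lookup and caches results per distinct IP. The name make_cached_lookup is hypothetical and not part of geoip2; the only assumption about reader is that it behaves like geoip2.database.Reader, as used above. How much this helps depends on how often addresses repeat in raddr, which the question does not say.

```python
from functools import lru_cache


def make_cached_lookup(reader, maxsize=None):
    """Return a function mapping an IP string to (country_iso_code, city_name).

    Assumption: `reader` behaves like geoip2.database.Reader, i.e.
    reader.city(ip) returns an object exposing .country.iso_code and
    .city.name, and raises an exception for unknown or invalid addresses.
    """

    @lru_cache(maxsize=maxsize)
    def lookup(ipa):
        try:
            # One database lookup yields both fields.
            response = reader.city(ipa)
            return response.country.iso_code, response.city.name
        except Exception:
            # Same fallback value the original functions use.
            return 'NA', 'NA'

    return lookup
```

With the DataFrame from the question, this could be used inside the same with block: build lookup = make_cached_lookup(reader), compute pairs = df['raddr'].map(lookup), then assign df['country'] = [p[0] for p in pairs] and df['city'] = [p[1] for p in pairs]. The cache lives only as long as the lookup function, so it should be rebuilt whenever a new reader is opened.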