Fast apply in pandas



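I have a database from MaxMind that gives me location information for an IP address. I have written the following functions to retrieve the city and country from an IP: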

import geoip2.database

def country(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.country.iso_code
            return response
        except:
            return 'NA'

def city(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.city.name
            return response
        except:
            return 'NA'

I run this every minute, applying the functions to the raddr column of a pandas DataFrame:

df['country']=df['raddr'].apply(country)
df['city']=df['raddr'].apply(city)

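The problem is that each run takes more than 3 minutes. I get about 150,000 rows per run, and the functions are applied to every row.

I want the whole operation to finish in under a minute. Any suggestions?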

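Your functions are not optimized. As written, the database is opened again for every row that apply processes, twice per row in fact, once in each function. MaxMind's GitHub documentation even notes specifically that creating a Reader object is expensive: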

>>> # This creates a Reader object. You should use the same object
>>> # across multiple requests as creation of it is expensive.

What you should do is pass the reader to your functions as an extra keyword argument:

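def country(ipa, reader):
    # reader is an already-open geoip2.database.Reader shared across calls
    try:
        response = reader.city(ipa)
        response = response.country.iso_code
        return response
    except Exception:  # unknown or malformed address
        return 'NA'

def city(ipa, reader):
    try:
        response = reader.city(ipa)
        response = response.city.name
        return response
    except Exception:  # unknown or malformed address
        return 'NA'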

Then call apply with the extra keyword argument, so the reader is created only once:

with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
    df['country'] = df['raddr'].apply(country, reader=reader)
    df['city'] = df['raddr'].apply(city, reader=reader)
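
As a possible further step (a variation that goes beyond the answer above), each distinct address could be looked up only once, with both fields taken from the same response. This may help if raddr contains many repeated addresses, though the gain depends on your data. The sketch below assumes only that the reader has a city(ip) method returning an object with country.iso_code and city.name, as in the functions above. The helper lookup_country_city and the FakeReader stand-in are hypothetical names introduced for illustration, and the sample values are made up.

```python
from types import SimpleNamespace


def lookup_country_city(addresses, reader):
    """Return (countries, cities) lists aligned with `addresses`.

    Each distinct address is looked up once; failed lookups map to 'NA'.
    """
    cache = {}
    for ip in set(addresses):
        try:
            response = reader.city(ip)
            cache[ip] = (response.country.iso_code, response.city.name)
        except Exception:  # unknown or malformed address
            cache[ip] = ('NA', 'NA')
    countries = [cache[ip][0] for ip in addresses]
    cities = [cache[ip][1] for ip in addresses]
    return countries, cities


class FakeReader:
    """Hypothetical stand-in for geoip2.database.Reader (demo only)."""

    def __init__(self):
        self.calls = 0
        # Made-up sample record, not real GeoIP data.
        self._data = {'192.0.2.1': ('XX', 'Example City')}

    def city(self, ip):
        self.calls += 1
        code, name = self._data[ip]  # raises KeyError for unknown addresses
        return SimpleNamespace(
            country=SimpleNamespace(iso_code=code),
            city=SimpleNamespace(name=name),
        )


# With the real database, usage might look like this (untested assumption):
# with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
#     df['country'], df['city'] = lookup_country_city(df['raddr'].tolist(), reader)
```

The cache trades a little memory for fewer lookups. If nearly every address is unique, this should perform about the same as a single apply pass.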
