Excluding unwanted variables based on length



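I am scraping data on mobile games and their current rankings. Some games have no ranking, so the page content is empty. They may be ranked some day, though, and because I run the script once a day I don't want to remove them entirely, only skip them for now.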

The error appears to occur on the 8th URL, whose content is literally just:

[]

I have also added the error below, after the code.

As far as I can tell, the error occurs because there is nothing in the DataFrame to split. How should I proceed from here?
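The failure can be reproduced without any network access. The sketch below assumes that the empty column ended up with a numeric dtype (float64), which is what the traceback suggests for this pandas version. Newer pandas releases may infer a different dtype for empty data, so the dtype is set explicitly here.

```python
import pandas as pd

# Assumption: the empty column was inferred as float64. It is set
# explicitly so the behaviour does not depend on the pandas version.
empty_col = pd.Series([], dtype="float64")

try:
    empty_col.str.split(",")
    error_message = None
except AttributeError as exc:
    # The .str accessor refuses to work on non-string dtypes.
    error_message = str(exc)

print(error_message)
```

With string data in the column the same call succeeds, which is consistent with the error only showing up for the unranked game.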

So far I have played around with different variations of this for loop:

for s in df:
    if s == []:
        continue
    else:
        pass

I'm just not sure whether this is the right approach.
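One likely reason the loop above never skips anything is that iterating over a DataFrame yields its column labels rather than its cell values. A minimal sketch, using a made-up URL as the column name:

```python
import pandas as pd

# Hypothetical stand-in for the scraped data: one column keyed by a URL.
df = pd.DataFrame({"https://example.com/a": ["x,1", "y,2"]})

# Iterating a DataFrame yields its column labels, not its rows or cells.
labels = [s for s in df]
print(labels)

# Each label is a string here, so comparing it with [] is always False.
matches = [s == [] for s in df]
print(matches)
```

So the `continue` branch is never taken, whether or not the column holds any data.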

My goal: I want to skip every URL whose content is "[]".
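One way to express that goal is to check the length of the parsed records before doing any cleaning. The sketch below is only an illustration: `rank_pages`, `pages` and the sample records are hypothetical stand-ins for what get_games() extracts from each URL.

```python
def rank_pages(pages):
    """Yield (name, records) only for pages that contain rankings.

    `pages` is a hypothetical mapping of game name -> list of parsed
    records, standing in for what get_games() extracts per URL.
    """
    for name, records in pages.items():
        if len(records) == 0:   # the page body was just "[]"
            continue            # skip today, try again on the next run
        yield name, records


# Made-up sample data: the second game is currently unranked.
pages = {
    "Game A": ['"date":"2019-08-27","category":"game"'],
    "Game B": [],
}
ranked = dict(rank_pages(pages))
print(list(ranked))
```

The unranked game stays in the input mapping, so it is picked up automatically once its page starts returning data.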

My code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
#this function gets me all the content of a certain URL
def get_games(url):
    url_get = requests.get(url, headers=headers, verify=False)
    soup = BeautifulSoup(url_get.content, 'lxml')
    pros = {}
    for idx, link in enumerate(soup.find()):
        pros["{}".format(idx)] = link.get_text()
    pros_list = list(pros.items())
    p = "".join(str(x) for x in pros_list)
    pp = re.findall('{(.*?)}', p)      #splits the list
    data = {url: pp}
    return data
#this function cleans the data variable
def cleaner(to_get_cleaned):
    df = pd.DataFrame(get_games(url))
    date = pd.datetime.now().strftime("%d/%m/%Y")
    df[date],df["category"],df["chart_type"],df["country"],df["previous_rank"] = df[url].str.split("," ,0).str #error seems to happen here
    df.drop([url],axis=1,inplace=True)       #removes first col, which includes all data in csv format
    df = df.replace(to_replace=r"^.*?:", value = "", regex=True)    #removes everything before ":"
    df = df.replace(to_replace='"', value = "", regex=True)       # removes all "
    df = df.set_index('country').reset_index()      #moves country to first col
    western = df.loc[df['country'].isin(['US', 'FR', "JP", "DE", "GB"])]
    western = western.loc[western["category"].isin(["game"])]
    western = western.loc[western["chart_type"].isin(["topgrossing"])]
    western = western.drop(["category", "chart_type", "previous_rank"], axis=1)
    western = western.T    #transposes dataframe
    return western.to_string(header=None)
if __name__ == "__main__":
    url = {
        "Empire: Four Kingdoms":    "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=air.com.goodgamestudios.empirefourkingdoms&date=2019-08-27T00%3A00%3A00.000Z",
        "Big Farm Mobile Harvest":  "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.goodgamestudios.bigfarmmobileharvest&date=2019-08-27T00%3A00%3A00.000Z",
        "Age of Lords":             "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.eRepublikLabs.AgeOfLords&date=2019-08-27T00%3A00%3A00.000Z",
        "Battle Pirates HQ":        "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.kixeye.BPCompanion&date=2019-08-27T00%3A00%3A00.000Z",
        "Call of War":              "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.bytro.callofwar1942&date=2019-08-27T00%3A00%3A00.000Z",
        "Empire: Age of Knights":   "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.goodgamestudios.ageofknights&date=2019-08-27T00%3A00%3A00.000Z",
        "Empire: Millennium Wars":  "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.goodgamestudios.millennium&date=2019-08-27T00%3A00%3A00.000Z",
        "eRepublik":                "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.erepubliklabs.erpkmobile&date=2019-08-27T00%3A00%3A00.000Z",
        "Game of Emperors":         "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.io.gameofemperors&date=2019-08-27T00%3A00%3A00.000Z",
        "Game of Trenches":         "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.erepubliklabs.ww1&date=2019-08-27T00%3A00%3A00.000Z",
        "Imperia Online":           "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=org.imperiaonline.android.v6&date=2019-08-27T00%3A00%3A00.000Z",
        "Imperial Hero":            "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=org.imperialhero.android&date=2019-08-27T00%3A00%3A00.000Z",
        "Mars Tomorrow":            "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=de.gamefab.mars&date=2019-08-27T00%3A00%3A00.000Z",
        "One Epic Knight":          "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.simutronics.oneepicknight&date=2019-08-27T00%3A00%3A00.000Z",
        "Seasons of War":           "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=org.imperiaonline.android.seasons&date=2019-08-27T00%3A00%3A00.000Z",
        "SIEGE: TITAN WARS":        "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.gamealliance.siege&date=2019-08-27T00%3A00%3A00.000Z",
        "SIEGE: World War II":      "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.simutronics.b17&date=2019-08-27T00%3A00%3A00.000Z",
        "Skytopia - City Tycoon":   "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.goodgamestudios.skytopia&date=2019-08-27T00%3A00%3A00.000Z",
        "Supremacy 1914":           "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.bytro.supremacy1914&date=2019-08-27T00%3A00%3A00.000Z",
        "Tactical Heroes 2: Platoons": "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.erepubliklabs.vietnamwar&date=2019-08-27T00%3A00%3A00.000Z",
        "Twin Shooter - Invaders":  "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.erepubliklabs.twinshooter&date=2019-08-27T00%3A00%3A00.000Z",
        "VEGA Conflict":            "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.kixeye.vegaconflict&date=2019-08-27T00%3A00%3A00.000Z",
        "War and Peace":            "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.erepubliklabs.warandpeace&date=2019-08-27T00%3A00%3A00.000Z",
        "War Commander: Rogue Assault": "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.kixeye.wcm&date=2019-08-27T00%3A00%3A00.000Z",
        "World at War: WW2 Strategy MMO":   "https://sensortower.com/api/android/rankings/for_app_and_date?app_id=com.erepubliklabs.worldatwar&date=2019-08-27T00%3A00%3A00.000Z"
    }
    for category, url in url.items():
        total_items = cleaner(url)
        print("{}".format(category, url) + ":\n{}".format(total_items) + "\n")
        time.sleep(1)
        #total_items.to_excel(excel_writer="ranking.xlsx", index=False)

Here are the error and traceback:

Traceback (most recent call last):
  File "<ipython-input-1-f82b46453409>", line 1, in <module>
    runfile('/Users/M/Desktop/games_scraper.py', wdir='/Users/M/Desktop')
  File "/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/Users/M/Desktop/games_scraper.py", line 69, in <module>
    total_items = cleaner(url)
  File "/Users/M/Desktop/games_scraper.py", line 26, in cleaner
    df[date],df["category"],df["chart_type"],df["country"],df["previous_rank"] = df[url].str.split("," ,0).str #error seems to happen here
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 5063, in __getattr__
    return object.__getattribute__(self, name)
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/accessor.py", line 171, in __get__
    accessor_obj = self._accessor(obj)
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 1796, in __init__
    self._validate(data)
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/strings.py", line 1818, in _validate
    raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

For future reference, and for anyone else with the same problem, here is how I fixed it:

if df.empty:
    pass
else:
    print("I'm running")

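Explanation: "if df.empty" uses the pandas attribute DataFrame.empty, which is True when the DataFrame contains no data. If it is empty, "pass" does nothing and the script carries on; if it is not empty, I print "I'm running" just to check what happens. In the actual script, the splitting and cleaning code would go in the else branch so that it only runs on non-empty data.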

To my own surprise, it worked :)
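As a rough sketch of how that check might sit inside a cleaning function, the example below returns None for empty data so the caller can skip that game. The name `clean` and the sample dictionaries are illustrative only; they mimic the {url: [records]} shape returned by get_games() and are not taken from the original script.

```python
import pandas as pd


def clean(data):
    """Return a DataFrame for `data`, or None when there is nothing to clean.

    `data` mimics the {url: [records]} dict returned by get_games();
    the function name and sample values are illustrative only.
    """
    df = pd.DataFrame(data)
    if df.empty:          # no rows: the page contained only "[]"
        return None
    return df


# Made-up inputs: one page with records, one page that returned "[]".
samples = {
    "ranked": {"u1": ["a,b", "c,d"]},
    "unranked": {"u2": []},
}
results = {name: clean(data) for name, data in samples.items()}
skipped = [name for name, df in results.items() if df is None]
print(skipped)
```

Returning None instead of raising keeps the daily loop running, and the caller decides whether to print, log, or ignore the skipped games.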
