How to scrape text by category and write it to a JSON file



We are scraping the website www.theft-alerts.com. So far we can get all of the text.

import json
import urllib2  # Python 2; on Python 3 use urllib.request instead

from bs4 import BeautifulSoup

connection = urllib2.urlopen('http://www.theft-alerts.com')
soup = BeautifulSoup(connection.read().replace("<br>", "\n"), "html.parser")
theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
    for wd in sp.select("div.itemindentmodified"):
        text = wd.text
        if not text.startswith("Images :"):
            print(text)
with open("theft-alerts.json", 'w') as outFile:
    json.dump(theftalerts, outFile, indent=2)

Output:

STOLEN : A LARGE TAYLORS OF LOUGHBOROUGH BELL
Stolen from Bromyard on 7 August 2014
Item : The bell has a diameter of 37 1/2" is approx 3' tall weighs just shy of half a ton and was made by Taylor's of Loughborough in 1902. It is stamped with the numbers 232 and 11.
The bell had come from Co-operative Wholesale Society's Crumpsall Biscuit Works in Manchester.
Any info to : PC 2361. Tel 0300 333 3000
Messages : Send a message
Crime Ref : 22EJ / 50213D-14
No of items stolen : 1
Location : UK > Hereford & Worcs
Category : Shop, Pub, Church, Telephone Boxes & Bygones
ID : 84377
User : 1 ; Antique/Reclamation/Salvage Trade ;  (Administrator)
Date Created : 11 Aug 2014 15:27:57
Date Modified : 11 Aug 2014 15:37:21;

How can we categorize this text in the JSON file? At the moment the JSON file is empty.

JSON output:

[]

You can define a list and append every dictionary object you create to it. For example:

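import json

theftalerts = []

# Build a separate dict for each alert; reusing one dict would make
# both list entries point at the same object.
atheftobject = {}
atheftobject['location'] = 'UK > Hereford & Worcs'
atheftobject['category'] = 'Shop, Pub, Church, Telephone Boxes & Bygones'
theftalerts.append(atheftobject)

atheftobject = {}
atheftobject['location'] = 'UK'
atheftobject['category'] = 'Shop'
theftalerts.append(atheftobject)

with open("theft-alerts.json", 'w') as outFile:
    # sort_keys=True keeps the key order stable across Python versions
    json.dump(theftalerts, outFile, indent=2, sort_keys=True)

After running this, theft-alerts.json contains the following JSON:

[
  {
    "category": "Shop, Pub, Church, Telephone Boxes & Bygones",
    "location": "UK > Hereford & Worcs"
  },
  {
    "category": "Shop",
    "location": "UK"
  }
]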

You can use this approach to generate your own JSON objects. See the documentation for the json module.
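As a minimal sketch of the list-of-dictionaries idea (the helper name `make_alert` is hypothetical, and the values are taken from the sample output in the question), each record gets its own dictionary. The second half shows what can go wrong when a single dictionary is mutated and appended twice: both list entries end up referring to the same object.

```python
import json


def make_alert(location, category):
    """Return a fresh dict for one alert (hypothetical helper)."""
    return {"location": location, "category": category}


# Each call builds a new dict, so the entries stay independent.
theftalerts = [
    make_alert("UK > Hereford & Worcs",
               "Shop, Pub, Church, Telephone Boxes & Bygones"),
    make_alert("UK", "Shop"),
]

# Pitfall: reusing one dict object makes both list entries change together.
shared = {"location": "UK > Hereford & Worcs"}
aliased = [shared]
shared["location"] = "UK"
aliased.append(shared)  # aliased[0] and aliased[1] are the same object

# sort_keys=True keeps the key order stable across Python versions.
serialized = json.dumps(theftalerts, indent=2, sort_keys=True)
print(serialized)
```

`json.dumps` returns the text that `json.dump` would write to a file, which makes it convenient for checking the result before saving it.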

The JSON output stays empty because the loop never appends anything to the list.

Here is how I would extract the category name:

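theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
    # Join the text of every block in this item, skipping the image block.
    item_text = "\n".join(
        [wd.text for wd in sp.select("div.itemindentmodified")
         if not wd.text.startswith("Images :")])
    # Assumes the second line of the 'itemsmall' span is "Category : ...";
    # [11:] drops that 11-character prefix.
    category = sp.find(
        'span', {'class': 'itemsmall'}).text.split('\n')[1][11:]
    theftalerts.append({'text': item_text, 'category': category})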
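If you want every labelled field rather than only the category, one option is to split each line on its " : " separator and then group the alerts by category. The sketch below works on a few lines copied from the sample output in the question instead of the live page; the function names `parse_fields` and `group_by_category` are hypothetical, and it assumes every labelled line uses " : " exactly as in that sample.

```python
import json
from collections import defaultdict

# Sample lines copied from the output shown in the question.
SAMPLE_TEXT = """STOLEN : A LARGE TAYLORS OF LOUGHBOROUGH BELL
Stolen from Bromyard on 7 August 2014
Crime Ref : 22EJ / 50213D-14
No of items stolen : 1
Location : UK > Hereford & Worcs
Category : Shop, Pub, Church, Telephone Boxes & Bygones
ID : 84377"""


def parse_fields(item_text):
    """Split 'Key : Value' lines into a dict; other lines go under 'notes'."""
    fields = {"notes": []}
    for line in item_text.splitlines():
        # partition splits on the first " : " only, so values may contain colons.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
        elif line.strip():
            fields["notes"].append(line.strip())
    return fields


def group_by_category(items):
    """Map each category name to the list of alerts filed under it."""
    grouped = defaultdict(list)
    for fields in items:
        grouped[fields.get("Category", "Unknown")].append(fields)
    return dict(grouped)


alerts = [parse_fields(SAMPLE_TEXT)]
by_category = group_by_category(alerts)
print(json.dumps(by_category, indent=2, sort_keys=True))
```

In the scraping loop you would call `parse_fields(item_text)` for each item and pass the collected list to `group_by_category` before dumping it to theft-alerts.json.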
