HTML divs split by heading type: how do I extract the one I'm interested in?



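Given a page like this one, where two jobs (ignoring "open application" for now) are fully described one after the other, I can detect whether there is a job matching a keyword by applying the following XPath: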

//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]

In Python:

from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen instead

import lxml.html as lh

url = 'http://jobs.kelkoo.co.uk/'
response = urlopen(url)
content = response.read()
root = lh.fromstring(content)
# Headings (h2, h3 or h4) whose text contains the keyword
job_titles = root.xpath("//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]")

I can then work out which heading type is involved, like so:

tags = [e.tag for e in job_titles]

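Knowing that we are dealing with an <h2>, I want to extract the individual job specification. I know I can get the description belonging to each <h2> with the following: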

//div[count(preceding-sibling::h2)=1]

But how do I tie together where the relevant job title was found, its tag type, and the description expression above?

I tried putting the keyword back into the description XPath above, but I am told this is not a valid expression:

//div[count(preceding-sibling::h2[contains(text(), 'Country Manager')]=1]
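One likely reason the expression is rejected is that the parenthesis opened by count( is never closed before the =1 comparison. The sketch below is only an illustration under assumptions: SAMPLE is a hypothetical fragment modelled on the page structure described here (not the live page), and is_valid is a helper name introduced for this example. It compares the expression as written with a version that has the parenthesis balanced.

```python
import lxml.html as lh
from lxml import etree

# Hypothetical fragment modelled on the page structure (not the live page).
SAMPLE = """
<div>
  <div class="jobitem">
    <h2>Country Manager - Uk</h2>
    <div class="jobspecs">specs A</div>
    <div class="jobdesc">desc A</div>
  </div>
  <div class="jobitem">
    <h2>Account Manager</h2>
    <div class="jobspecs">specs B</div>
  </div>
</div>
"""

root = lh.fromstring(SAMPLE)

# As written in the question: count( is opened but never closed.
broken = "//div[count(preceding-sibling::h2[contains(text(), 'Country Manager')]=1]"
# Same expression with the closing parenthesis of count(...) added.
fixed = "//div[count(preceding-sibling::h2[contains(text(), 'Country Manager')])=1]"


def is_valid(expr):
    """Return True if lxml can evaluate the expression, False on an XPath error."""
    try:
        root.xpath(expr)
        return True
    except etree.XPathError:
        return False


# Classes of the divs that have exactly one matching <h2> before them.
matched = [d.get("class") for d in root.xpath(fixed)]
print(is_valid(broken), is_valid(fixed), matched)
```

With the parenthesis balanced, the expression selects only the divs that follow the matching <h2> inside the same parent, which on this fragment are the "jobspecs" and "jobdesc" blocks of the first job.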

Look for the following sibling div with class="jobspecs":

for title in job_titles:
    print(title.text_content())
    for spec in title.xpath("following-sibling::div[@class='jobspecs']/ul/li/span[@class='label']"):
        spec_name = spec.text_content().strip()
        spec_value = spec.xpath("following-sibling::text()")[0].strip()
        print(spec_name, spec_value)
    print("----")

Prints:

Country Manager - Uk
Contract type: Permanent
Hours per week: 40
Site: London
----
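If the label/value pairs are needed as data rather than printed output, the same following-sibling lookup can feed a dictionary. This is a minimal sketch under assumptions: SAMPLE is a hypothetical inline copy of the markup so that it runs without the live page, specs_for is a name introduced here, and the value is read from the label's .tail (the text node right after the <span>) so that a label without a value yields an empty string instead of an IndexError.

```python
import lxml.html as lh

# Hypothetical inline copy of the markup, so the sketch runs offline.
SAMPLE = """
<div class="jobitem">
  <h2>Country Manager - Uk</h2>
  <div class="jobspecs">
    <ul>
      <li><span class="label">Contract type: </span>Permanent</li>
      <li><span class="label">Hours per week: </span>40</li>
      <li></li>
      <li><span class="label">Site: </span>London</li>
    </ul>
  </div>
</div>
"""


def specs_for(title):
    """Return {label: value} for the jobspecs block that follows a heading element."""
    specs = {}
    labels = title.xpath(
        "following-sibling::div[@class='jobspecs']/ul/li/span[@class='label']")
    for label in labels:
        name = label.text_content().strip().rstrip(":")
        # .tail is the text directly after the <span>; it is None when absent.
        specs[name] = (label.tail or "").strip()
    return specs


root = lh.fromstring(SAMPLE)
titles = root.xpath(
    "//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]")
result = {t.text_content().strip(): specs_for(t) for t in titles}
print(result)
```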

Each job on the sample page sits inside a <div class="jobitem">:

    <div class="jobitem">
        <h2>Country Manager - Uk</h2>
        <div class="jobspecs">
            <ul>
                <li><span class="label">Contract type: </span>Permanent</li>
                <li><span class="label">Hours per week: </span>40</li>
                <li></li>
                <li><span class="label">Site: </span>London</li>
                <li></li>
                <li></li>
            </ul>
        </div>
        <div class="jobdesc">
            <p>Role overview:</p>
            ...

So you can:

  • select the "jobitem" elements by looking at their heading and its text content,

Getting the job elements:

from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen instead

import lxml.html as lh

url = 'http://jobs.kelkoo.co.uk/'
response = urlopen(url)
content = response.read()
root = lh.fromstring(content)
jobs = root.xpath('''
    //div[@class='jobitem']
         [child::*[self::h2 or self::h3 or self::h4]
                  [contains(., $query)]]''',
    query="Country Manager")

(The above uses XPath variables, which lxml supports, but you could just as well hard-code [contains(., "Country Manager")].)

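  • loop over them,
  • in the loop, select the child <div> elements you want with relative expressions (XPath expressions starting with ./ are safe), for example using @alecxe's suggestion [@class="jobspecs"].

Like this: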

>>> for job in jobs:
...     title = job.xpath('normalize-space(h2|h3|h4)')
...     specs = job.xpath('string(./div[@class="jobspecs"])').strip()
...     desc = job.xpath('string(./div[@class="jobdesc"])').strip()
...     print('-------')
...     print(title)
...     print('-------')
...     print(specs)
...     print('-------')
...     print(desc)
...     
... 
-------
Country Manager - Uk
-------
Contract type: Permanent
                    Hours per week: 40
                    Site: London
-------
Role overview:
Reporting in to the European Commercial Director, the UK/IE Country Manager is a senior manager with full responsibility for the sales, traffic and product functions across two countries. He/She will drive the UK sales and traffic functions and manage a team of highly skilled digital account managers based in London.
The role involves sales planning, account growth planning, forecasting, data analysis and high level presentations with senior internal and external parties. The CM is responsible for the Gross Margin position and goals of the country, managing yield prices, cost of sale prices and the overall financial management of conversion over a large number of merchants and traffic partners.
The critical equations of broking between revenue, cost of leads and understanding the merchant perspective on volume, performance and quality is key to this role. This person will need little day to day management and will be a natural leader who is respected for their knowledge, commitment and ability.
Accountabilities and Deliverables:
-Develop strong relationships with key UK merchants and agencies that drive growth and take best advantage of all opportunities
- Work closely with EU counterparts to identify and maximise pan-euro opportunities where required, drive these deals through to completion either on own initiative or as part of the wider European team
- Use initiative to identity and push new opportunities; from growth of existing channels to creation of new ones
- Full control and management of the UK/IE commercial teams; able to delegate tasks and responsibilities while respecting their staffs experience and ability;
Previous Experience/Skills required:
- 6+ years experience in a proven sales/marketing management role, in digital/e-commerce.
- Understanding of the price comparison market.
- Understanding of digital marketing and online advertising.
- Contacts in online retail
Person Specification/Competencies:
- Good negotiation skills and ability to close deals quickly.
- Very strong communication and presentation skills to get best results in both local country and where required across Europe (proven track record in creating and maintaining a productive network)
- Excellent internal and external customer relationship and interpersonal skills.
- Team player with strong work ethic and ability to adapt to and drive change.
- Commercially minded.
- Strategic thinker and able to think analytically at a detailed level.
- Proven leadership skills.
- Ability to strongly influence those outside direct control for positive results.
- Displays respect to all colleagues and encourages this behaviour in own team.
- Able to deliver at a consistently high level in a demanding commercial environment.
- Manages conflict in a positive and assertive manner for best outcome
Academic Background:
- Strong academic background; preferably a minimum 2:1 degree or equivalen
Requirements/Other Information:
- Role holder must be able to travel freely across Europe and be eligible to work in the UK
Good reasons to join us
- Company highly recognized in its market
- Help our customer drive core of their business
- Opportunity to show your full potential in a growing business
- Chance to work with incredibly smart, talented, and interesting folks                
                        Apply

                        Download details
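The steps above (select the "jobitem" containers, loop, then query relatively) can also be wrapped in one function that keeps the heading's tag type, the title, and the two blocks together per job. This is a rough sketch, not a definitive implementation: SAMPLE is a hypothetical, shortened stand-in for the page, its second job and description texts are made up for illustration, and find_jobs and JOB_XPATH are names introduced here.

```python
import lxml.html as lh

# Hypothetical, shortened stand-in for the page; the texts are illustrative only.
SAMPLE = """
<div id="jobs">
  <div class="jobitem">
    <h2>Country Manager - Uk</h2>
    <div class="jobspecs"><ul><li><span class="label">Site: </span>London</li></ul></div>
    <div class="jobdesc"><p>Role overview: example text.</p></div>
  </div>
  <div class="jobitem">
    <h3>Account Manager</h3>
    <div class="jobspecs"><ul><li><span class="label">Site: </span>Paris</li></ul></div>
    <div class="jobdesc"><p>Another example text.</p></div>
  </div>
</div>
"""

# $query is an XPath variable; lxml fills it from a keyword argument.
JOB_XPATH = """
//div[@class='jobitem']
     [child::*[self::h2 or self::h3 or self::h4][contains(., $query)]]
"""


def find_jobs(root, query):
    """Return one dict per job whose heading contains `query`."""
    jobs = []
    for job in root.xpath(JOB_XPATH, query=query):
        # Relative expression: only the heading that is a child of this job.
        heading = job.xpath("./*[self::h2 or self::h3 or self::h4]")[0]
        jobs.append({
            "tag": heading.tag,
            "title": " ".join(heading.text_content().split()),
            "specs": job.xpath("normalize-space(./div[@class='jobspecs'])"),
            "desc": job.xpath("normalize-space(./div[@class='jobdesc'])"),
        })
    return jobs


root = lh.fromstring(SAMPLE)
managers = find_jobs(root, "Manager")
country = find_jobs(root, "Country Manager")
print([job["tag"] for job in managers])
print(country)
```

Because every lookup inside the loop is relative to the current "jobitem", the heading type no longer has to be known in advance to find the matching description.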
