Python lxml etree.tostring(): why is \r encoded as &#13;?



I am trying to parse this page:

    import urllib2  # Python 2 standard library
    from lxml import etree

    # Download the page and re-encode it from GB2312 to UTF-8.
    request = urllib2.Request(url="http://2012.qq.com/sports/")
    response = urllib2.urlopen(request)
    content = response.read()
    uni_content = content.decode("gb2312", "ignore")
    tecent = uni_content.encode("utf-8")
    # Parse the HTML and select the list items.
    tecent_page = etree.HTML(tecent, parser=etree.HTMLParser(encoding='utf-8'))
    test_tags = tecent_page.xpath("/html/body/div[@class='page']/div[@class='layout']/div/div[@class='bd']/ul[@class='list']/li")
    # Serialize each <li> element and print it.
    for i, item in enumerate(test_tags):
        content = etree.tostring(item, encoding="utf-8", pretty_print=True)
        print content

Why does the result look like this:

    <li class="item">&#13;
                        <a class="pic" href="http://2012.qq.com/sports/judo/index.htm" target="_blank"><img width="96" height="96" src="http://mat1.gtimg.com/2012/samanthasun/allevents/roudao.png" alt="柔道"/></a>&#13;
                        <p><a href="http://2012.qq.com/sports/judo/index.htm" target="_blank">柔道</a></p>&#13;
                        <p><a href="http://2012.qq.com/l/sports/judo/judochn/list2011079114946.htm" target="_blank">新闻</a> | <a href="http://2012.qq.com/l/photos/33xiangmu/roudao/list2011079115124.htm" target="_blank">图片</a> | <a href="http://2012.qq.com/l/video/xm/vjudo/list.htm" target="_blank">视频</a></p>&#13;
                    </li>&#13;

Why does the output contain &#13;? It appears at the end of every line.

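Because the original document (http://2012.qq.com/sports/) uses CR+LF line breaks. The character code of a carriage return (CR) is 13, so lxml serializes each one as the character reference &#13;.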
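To make the link between the number 13 and the &#13; in the output concrete, here is a small sketch that uses only the Python 3 standard library (the question's own code is Python 2). The `sample` string is made up for illustration and is not taken from the page.

```python
import html

# A carriage return (CR, "\r") is code point 13; a line feed (LF, "\n") is 10.
cr_code = ord("\r")
lf_code = ord("\n")

# "&#13;" is the numeric character reference for code point 13,
# so decoding it gives back a carriage return.
decoded = html.unescape("&#13;")

# Hypothetical sample line ending with a CR+LF (Windows-style) line break.
sample = '<li class="item">\r\n</li>'
has_crlf = "\r\n" in sample

print(cr_code, lf_code, repr(decoded), has_crlf)
```

So a page saved with CR+LF line endings carries one CR per line, and each of them can show up as &#13; once the tree is serialized.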

A simple workaround is to convert the line breaks before parsing: tecent = uni_content.encode("utf-8").replace('\r\n', '\n')
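The same idea can be sketched as a small Python 3 helper. The name `normalize_newlines` and the sample string are hypothetical, and the helper goes slightly beyond the workaround by also converting lone CR characters, which is an assumption about what is wanted rather than part of the original answer.

```python
def normalize_newlines(text):
    """Convert CR+LF and lone CR line breaks to LF."""
    # Replace CR+LF first so that a pair does not become two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical snippet standing in for the decoded page content.
uni_content = '<li class="item">\r\n<p>judo</p>\r\n</li>\r\n'

cleaned = normalize_newlines(uni_content)
# UTF-8 bytes that could then be handed to etree.HTML, as in the question.
tecent = cleaned.encode("utf-8")

print(repr(cleaned))
```

Normalizing the decoded text before encoding keeps the replacement on str objects, which avoids mixing str and bytes arguments in Python 3.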
