Parsing Google Image search results with Java and jsoup



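I am using jsoup to parse Google Image results, and I am trying to get the src of each image. This is my code so far. For some reason the output is truncated and I cannot access the src attribute. Does anyone know why this happens and what I can do to fix it? Thanks a lot!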

public static void main(String[] args) {
    try {
        // Does a google image search for "test"
        final Document doc = Jsoup.connect("https://www.google.com/search?q=test&tbm=isch").userAgent(USER_AGENT).get();
        // selects images
        Elements elements = doc.select("img.rg_ic.rg_i");
        // cycles through elements and prints attributes
        for (Element e : elements) {
            System.out.print(e);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output:

<img class="rg_ic rg_i" data-sz="f" name="XWXPqrX1RFJiaM:" alt="Image result for test" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)">

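The code below uses jsoup to get the URLs of the first 100 image results. If you need all of the results, you have to use a headless browser (I recommend PhantomJS; see this answer for usage).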

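The static HTML source holds the image URLs of only the first 100 results, and it stores them in JSON objects. To parse the scraped JSON objects, I use JSON.simple.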

Each JSON object is contained in a <div> element with the class rg_meta and has the following form:

{"st":"Uber","tu":"https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTSEUMluu1kigjR3JU40BYfaH0fQ6JW1vk9WScBiXr--lsMILf2","ru":"https://newsroom.uber.com/uberkittens-are-back/","tw":300,"pt":"UberKittens Delivers Kittens to Play or Stay","ou":"https://newsroom.uber.com/wp-content/uploads/2015/10/HQ_uberkittens_blog_960x540_r1v1.jpg","ow":960,"cl":6,"isu":"newsroom.uber.com","rid":"vLA3QXY8xPE4PM","cr":3,"ity":"jpg","sc":1,"ct":15,"s":"Clear Your Calendars\u2014#UberKITTENS Are Back","th":168,"oh":540,"id":"qCR7qXt7VX38iM:","itg":false,"cb":15}

To get the image URL, we need to extract the value of the key "ou".
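If JSON.simple is not on the classpath, the "ou" value can also be pulled out with a regular expression. The following is only a minimal sketch under stated assumptions: the class name OuFieldSketch, the method extractOu and the example.com sample URL are made up for illustration, the returned value is the raw JSON string with any backslash escapes left undecoded, and the pattern could be fooled by an "ou" key that appears inside another string value. A real JSON parser remains the more robust choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or JSON.simple.
final class OuFieldSketch {

    // Matches "ou":"<value>"; the value may contain backslash-escaped characters.
    private static final Pattern OU =
            Pattern.compile("\"ou\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the raw (still JSON-escaped) value of the first "ou" key, if any.
    static Optional<String> extractOu(String json) {
        Matcher m = OU.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened, made-up object in the same shape as the rg_meta JSON.
        String sample = "{\"tw\":300,\"ou\":\"https://example.com/a.jpg\",\"ow\":960}";
        System.out.println(extractOu(sample).orElse("(no ou key)"));
    }
}
```

Returning an Optional keeps the "key not present" case explicit, so callers can skip rg_meta elements that carry no image URL.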

Example code

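import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// can only grab first 100 results
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=kittens&gws_rd=cr";
List<String> resultUrls = new ArrayList<>();
try {
    Document doc = Jsoup.connect(url).userAgent(userAgent).referrer("https://www.google.com/").get();
    // each div.rg_meta holds one JSON object describing a single result
    Elements elements = doc.select("div.rg_meta");
    JSONParser parser = new JSONParser();
    for (Element element : elements) {
        if (element.childNodeSize() > 0) {
            // element.text() yields the unescaped JSON; Node.toString() would HTML-escape it (e.g. & as &amp;)
            JSONObject jsonObject = (JSONObject) parser.parse(element.text());
            // "ou" is the URL of the original image
            resultUrls.add((String) jsonObject.get("ou"));
        }
    }
    System.out.println("number of results: " + resultUrls.size());
    for (String imageUrl : resultUrls) {
        System.out.println(imageUrl);
    }
} catch (IOException | ParseException e) {
    e.printStackTrace();
}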

number of results: 100
https://newsroom.uber.com/wp-content/uploads/2015/10/HQ_uberkittens_blog_960x540_r1v1.jpg
https://pbs.twimg.com/profile_images/562466745340817408/_nIu8KHX.jpeg
http://leecamp.net/wp-content/uploads/kitten-3.jpg 
...
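The search URL in the example hard-codes the query kittens. For arbitrary search terms the query has to be URL-encoded before it is placed in the q parameter. A minimal sketch, assuming the same URL layout as above still applies; the class name ImageSearchUrlSketch and the method buildUrl are hypothetical names chosen for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds the image search URL for a given query.
final class ImageSearchUrlSketch {

    static String buildUrl(String query) {
        try {
            // URLEncoder turns spaces into '+' and reserved characters into %XX escapes.
            return "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&gws_rd=cr";
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=grey+kittens+%26+cats&gws_rd=cr
        System.out.println(buildUrl("grey kittens & cats"));
    }
}
```

The resulting string can be passed to Jsoup.connect in place of the fixed url variable.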
