Parsing a web page whose links set JavaScript cookies, using Ruby, Nokogiri & Mechanize



Hi everyone,

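I need to parse a web page that sets a JavaScript cookie for every link. I can already parse a normal search: every product is listed and imported into a MySQL database.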

I am able to scrape each product and its fields from the search results. This is the code I have:

    require 'rubygems'
    require 'logger'
    require 'mechanize'
    require 'mysql2'
    
    # the WWW:: namespace was dropped in Mechanize 1.0; the class is now just Mechanize
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
    #agent.set_proxy('a-proxy', '8080')
    agent.read_timeout = 60
    
    def add_cookie(agent, uri, cookie)
      uri = URI.parse(uri)
      Mechanize::Cookie.parse(uri, cookie) do |parsed|
        agent.cookie_jar.add(uri, parsed)
      end
    end
    
    
    # get main page
    page = agent.get "http://www.site.com.mx"
    
    # get login form
    form = page.forms.first
    form.correo_ingresar = "user"
    form.password = "password"
    
    # submit login form
    page = agent.submit form
    
    # parse cookies
    # the literal parentheses must be escaped, otherwise they act as capture groups
    myarray = page.body.scan(/SetCookie\("(.+?)", "(.+?)"\)/)
    
    # set session cookies
    myarray.each do |item|
      add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
    end
    # show 1000 search results per page
    add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
    
    # order results
    add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
    
    # section results
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
    
    # get main page
    page = agent.get "http://www.site.com.mx/tienda/index.php"
    
    search_form = page.forms.first
    
    search_result = agent.submit search_form
    
    doc = Nokogiri::HTML(search_result.body)
    
    rows = doc.css("table.articulos tr")
    
    i = 0
    details = rows.collect do |row|
      detail = {}
      [
        [:sku, 'td[3]/text()'],
        [:desc, 'td[4]/text()'],
        [:qty, 'td[5]/text()'],
        [:qty2, 'td[5]/p/b/text()'],
        [:price, 'td[6]/text()']
      ].collect do |name, xpath|
        detail[name] = row.at_xpath(xpath).to_s.strip
      end
      i = i + 1
      detail
    end
    
    # walk through paginator links
    # uniq (not uniq!) so an Array is returned even when there are no duplicates
    links = doc.css("a.paginar").map {|l| "http://www.site.com.mx#{l['href']}"}.uniq
    
    links.each do |l|
        page = agent.get l
    
        doc = Nokogiri::HTML(page.body)
    
        rows = doc.css("table.articulos tr")
    
        rows.each do |row|
            detail = {}
            [
                    [:sku, 'td[3]/text()'],
                    [:desc, 'td[4]/text()'],
                    [:qty, 'td[5]/text()'],
                    [:qty2, 'td[5]/p/b/text()'],
                    [:price, 'td[6]/text()']
            ].collect do |name, xpath|
                    detail[name] = row.at_xpath(xpath).to_s.strip
            end
            details << detail
        end
    end
    
    # update db
    client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
    
    details.each do |d|
        if d[:sku] != ""
            price = d[:price].split
    
            if price[1] == "D"
                currency = 144
            else
                currency = 168
            end
    
            cost = price[0].gsub(",", "").to_f
    
            if d[:qty] == ""
                qty = d[:qty2]
            else
                qty = d[:qty]
            end 
    
            # escape scraped values before interpolating them into SQL
            sku   = client.escape(d[:sku])
            desc  = client.escape(d[:desc])
            stock = client.escape(qty)
    
            results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{sku}' LIMIT 1;")
            if results.count == 1
                product = results.first
    
                client.query("UPDATE jos_vm_product SET product_sku = '#{sku}', product_name = '#{desc}', product_desc = '#{desc}', product_in_stock = '#{stock}' WHERE product_id = #{product['product_id']};")
    
                client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
            else
                client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{sku}', '#{desc}', '#{desc}', '#{stock}');")
                last_id = client.last_id
    
                client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
            end
        end
    end
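
The price and quantity handling inside the database loop above could be pulled out into small helpers, which makes it testable without a database. This is only a sketch: the names `parse_price` and `pick_qty` are hypothetical, and the currency ids 144 and 168 are simply the ones used in the script.

```ruby
# Currency ids copied from the script; what they stand for is not shown there.
CURRENCY_ID_D     = 144 # used when the price text is followed by "D"
CURRENCY_ID_OTHER = 168 # used for every other price

# Splits a scraped price such as "1,234.50 D" into [cost, currency_id].
# An empty cell yields a cost of 0.0 instead of raising on nil.
def parse_price(text)
  amount, flag = text.to_s.split
  cost = amount.to_s.delete(",").to_f
  currency = flag == "D" ? CURRENCY_ID_D : CURRENCY_ID_OTHER
  [cost, currency]
end

# Mirrors the qty/qty2 fallback: use the bold quantity when the plain cell is empty.
def pick_qty(detail)
  detail[:qty].to_s.empty? ? detail[:qty2] : detail[:qty]
end

cost, currency = parse_price("1,234.50 D")
puts "cost=#{cost} currency=#{currency}"
```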

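Now I don't want to run a search; I want to parse the products from the category list instead.

Main page link: http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_strar=11

This page shows a table like the one below, where every entry is a link. The name at the top, ACCESORIOS, links to the ACCESORIOS category; the bold names listed under it are subcategories, and the names under each bold name are brands. If I click ACCESORIOS, it shows every brand and every subcategory mixed together, and so on.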

ACCESORIOS
Accesorios Multimedia (6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta (1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels (1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales (1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio (32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles (60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod (3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)
Accesorios Para Mesas (3)
MANHATTAN (2), PERFECT CHOICES (1)
Accesorios Para Redes (13)
INTELLINET NETWORK SOLUTIONS (5), MANHATTAN (8)
Accesoriso Para Celulares (14)
BLACKBERRY (14)
Adaptador Bluetooth (6)
ACTECK DE MEXICO (1), MANHATTAN (2), PERFECT CHOICES (3)
Adaptadores Para Mouse Y Teclado (3)
MANHATTAN (2), PERFECT CHOICES (1)
Audifono/diademas Y Microfonos (49)
ACTECK DE MEXICO (14), BTO (1), GENIUS (3), LOGITECH (2), MANHATTAN (11), PERFECT CHOICES (18)

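This is the markup of that table. Every link sets cookies in its onClick handler instead of pointing to a URL, which is why I have been having such a hard time scraping it.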

    <table width="95%" cellspacing="0" cellpadding="3" border="0">
    <tbody>
    <tr>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
    </tr>
    <tr>
    <td width="20" valign="top" align="left"></td>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    </td>
    </tr>
    </tbody>
    </table>
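
The cookie names and values are all visible in the `onClick` attributes of the markup above, so they can be pulled out with a regular expression. The sketch below is a guess at how that might look: `cookies_from_onclick` is a hypothetical helper, and it assumes that `fijar_filtro(name, value)` stores its arguments as a cookie the same way `SetCookie` does (the definition of `fijar_filtro` is not shown anywhere in the page source quoted here).

```ruby
# Matches both SetCookie('name','value') and fijar_filtro('name','value').
# ASSUMPTION: fijar_filtro stores its two arguments as a cookie, like SetCookie.
FILTER_CALL = /(?:SetCookie|fijar_filtro)\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/

# Returns a Hash of the cookies a single link would set, keyed by cookie name.
# A nil or empty attribute gives an empty Hash.
def cookies_from_onclick(onclick)
  onclick.to_s.scan(FILTER_CALL).to_h
end

onclick = "SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')"
p cookies_from_onclick(onclick)
```

With Nokogiri, the attribute of each link could then be fed into the helper, for example `doc.css("a.busquedas").map { |a| cookies_from_onclick(a["onclick"]) }`. The HTML parser normally lowercases attribute names, so `a["onclick"]` rather than `a["onClick"]` is probably the key to read.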

So the question is: what should I add to my code to be able to visit each of these links, given that they work through JavaScript cookies?

Cookies used:
name, value range
codigoseccion_buscar, 11-30
codigomarca_buscar, 100-736
codigolinea_buscar, 15-1385

I managed to scrape the content behind one of the links by adding its cookies in my Ruby code:

    # set cookies
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")
    add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")
    add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")
    add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")

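The strange thing is that it does not work if I add only one of the cookies. I have to add all of them, even the ones with an empty value, because each link sets its own cookies and also deletes or clears the ones saved previously.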
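
Since every filter cookie has to be present, one option is to start from a table of blank defaults and overwrite only the filters a given link selects. A minimal sketch, assuming the four cookie names from the snippet above are the complete set the site reads; `filter_cookie_strings` is a hypothetical helper, not part of Mechanize:

```ruby
# Filter cookies the site appears to read, all blank by default.
# ASSUMPTION: these four names (taken from the snippet above) are the full set.
FILTER_DEFAULTS = {
  "codigoseccion_buscar" => "",
  "codigolinea_buscar"   => "",
  "codigomarca_buscar"   => "",
  "textobuscar"          => ""
}.freeze

# Builds one cookie string per filter, blanking every filter not in `selected`.
# `domain` defaults to the host used throughout the script.
def filter_cookie_strings(selected, domain = "www.site.com.mx")
  FILTER_DEFAULTS.merge(selected).map do |name, value|
    "#{name}=#{value}; path=/; domain=#{domain}"
  end
end

selected = { "codigoseccion_buscar" => "11", "codigolinea_buscar" => "338" }
filter_cookie_strings(selected).each do |cookie|
  puts cookie
  # In the real script each string would go to the existing helper:
  # add_cookie(agent, "http://www.site.com.mx", cookie)
end
```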

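Now I need to scrape those cookie values from the links, use them as variables, and loop over them (or something along those lines). Can anyone help me? For example, from a link like this one: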

    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
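
One possible shape for the loop is to take a list of cookie sets (one per link), store each set, and request the listing page again. The sketch below is hypothetical throughout: `crawl_filters` is an invented name, the two callables stand in for `add_cookie` and `agent.get` so the example runs without network access, and it is an assumption that the filtered products show up at the same listing URL the script already uses.

```ruby
# Visits `url` once per filter set and returns whatever `fetch` returns each time.
#   set_cookie: callable (name, value), e.g. a lambda wrapping add_cookie
#   fetch:      callable (url), e.g. ->(u) { agent.get(u) }
def crawl_filters(filter_sets, url, set_cookie:, fetch:)
  filter_sets.map do |filters|
    filters.each { |name, value| set_cookie.call(name, value) }
    fetch.call(url)
  end
end

# Stand-ins for the Mechanize agent so the sketch is runnable on its own.
jar = {}
filter_sets = [
  { "codigoseccion_buscar" => "11", "codigolinea_buscar" => "542", "codigomarca_buscar" => "" },
  { "codigoseccion_buscar" => "11", "codigolinea_buscar" => "542", "codigomarca_buscar" => "635" }
]

# ASSUMPTION: the listing URL from the script also serves the filtered results.
pages = crawl_filters(
  filter_sets,
  "http://www.site.com.mx/tienda/index.php",
  set_cookie: ->(name, value) { jar[name] = value },
  fetch: ->(url) { "#{url} brand=#{jar['codigomarca_buscar']}" }
)
puts pages
```

Each returned page could then go through the same `table.articulos` row extraction the script already applies to the search results.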
