How do I get the HTML of the web page I am trying to scrape into a string?



I wrote the following code:

require "http/client"
require "myhtml"
puts "Give me the URL of the page to be scraped."
url = gets
html=<<-HTML
[Here goes the html of the website to be scraped]
HTML
myhtml = Myhtml::Parser.new(html)
myhtml.nodes(:div).each do |node|
  id = node.attribute_by("id")
  if first_link = node.scope.nodes(:a).first?
    href = first_link.attribute_by("href")
    link_text = first_link.inner_text
    puts "div with id #{id} have link [#{link_text}](#{href})"
  else
    puts "div with id #{id} have no links"
  end
end

How can I get the HTML of the web page I am trying to scrape into a string, so that I can replace

html=<<-HTML
[Here goes the html of the website to be scraped]
HTML

with something like

response = requests.get(url)
html = BeautifulSoup(response.text, 'html.parser')

from the following Python code:


import requests
from bs4 import BeautifulSoup

url = input("What is the address of the web page in question?\n")
response = requests.get(url)
html = BeautifulSoup(response.text, 'html.parser')

or with something like

let html = reqwest::get(url).await?.text().await?;

from the following Rust code:

println!("Give me the URL of the page to be scraped."); 
let mut url = String::new();
io::stdin().read_line(&mut url).expect("Failed to read line");
let html = reqwest::get(url).await?.text().await?;

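The documentation for the myhtml shard does not give enough examples for me to figure this out. Can this be done with the HTTP client from Crystal's standard library? When I replace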

html=<<-HTML
[Here goes the html of the website to be scraped]
HTML

with

response = HTTP::Client.get url
html = response.body

I get the following error:

response = HTTP::Client.get url #no overload matches 'HTTP::Client.get' with type (String | Nil)
^--
Error: no overload matches 'HTTP::Client.get' with type (String | Nil)
Overloads are:
- HTTP::Client.get(url : String | URI, headers : HTTP::Headers | ::Nil = nil, body : BodyType = nil, tls : TLSContext = nil)
- HTTP::Client.get(url : String | URI, headers : HTTP::Headers | ::Nil = nil, body : BodyType = nil, tls : TLSContext = nil, &block)
- HTTP::Client.get(url, headers : HTTP::Headers | ::Nil = nil, tls : TLSContext = nil, *, form : String | IO | Hash)
- HTTP::Client.get(url, headers : HTTP::Headers | ::Nil = nil, tls : TLSContext = nil, *, form : String | IO | Hash, &block)
Couldn't find overloads for these types:
- HTTP::Client.get(Nil)

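I was able to get the text of the web page by hard-coding the address, e.g. response = HTTP::Client.get "https://github.com/monero-project/monero/releases", but that is not enough, because I want the application to be interactive.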

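You are close; it is the type system that is complaining. HTTP::Client.get expects a String (or, more precisely, String | URI). In your code, however, the url variable can also be nil: its type is String?, which is shorthand for String | Nil. If you hard-code the URL, it can never be nil and is always a String, so the HTTP::Client.get call works.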

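Look at the documentation for the gets method:

def gets(chomp = true) : String?

Reads a line from this IO. A line is terminated by the \n character. Returns nil if called at the end of this IO.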

There are several ways to solve this, but the basic idea is that you have to make sure url cannot be nil when you make the HTTP call. For example:

url = gets
if url
  # now url cannot be nil
  response = HTTP::Client.get url
  html = response.body
  puts html
end
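
The same nil check can be folded into small helpers. What follows is only a sketch under some assumptions: read_url and fetch_html are hypothetical names, not part of Crystal or myhtml, and treating a blank line or a non-2xx response as nil is one possible choice rather than the only reasonable one.

```crystal
require "http/client"

# Hypothetical helper: returns the trimmed line read from `io`,
# or nil when the input is at end-of-file or the line is blank.
def read_url(io : IO) : String?
  # Assigning inside `if` narrows `line` from String? to String.
  if line = io.gets
    url = line.strip
    url.empty? ? nil : url
  end
end

# Hypothetical helper: returns the response body,
# or nil when the server answers with a non-2xx status.
def fetch_html(url : String) : String?
  response = HTTP::Client.get(url)
  response.success? ? response.body : nil
end
```

Both helpers return String?, so the caller narrows the result the same way as above, for example with if url = read_url(STDIN) followed by if html = fetch_html(url). Inside the inner branch html is a plain String and can be passed to Myhtml::Parser.new(html).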

Further reading: the "if var" section of the Crystal language reference.
