Web scraping a list of items



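This is my first time programming in Rust (I'm currently reading the book), and I recently needed to scrape the list of diseases and conditions from this website. After trying a few guides, I ended up with this small snippet. I'm currently stuck iterating over an ol: instead of getting each li as a separate item in an array, I get the whole list as a single element.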

use error_chain::error_chain;
use select::document::Document;
use select::predicate::Class;

error_chain! {
    foreign_links {
        ReqError(reqwest::Error);
        IoError(std::io::Error);
    }
}

// Source: https://rust-lang-nursery.github.io/rust-cookbook/web/scraping.html#extract-all-links-from-a-webpage-html
#[tokio::main]
async fn main() -> Result<()> {
    let res = reqwest::get("https://www.cdc.gov/diseasesconditions/az/a.html")
        .await?
        .text()
        .await?;

    Document::from(res.as_str())
        .find(Class("unstyled-list")) // This is returning the whole "ol"
        .for_each(|i| print!("{};", i.text()));

    Ok(())
}

The output; note how the whole list is printed as a single item, rather than each li being printed followed by the expected separator ;:

Abdominal Aortic Aneurysm — see Aortic AneurysmAcanthamoeba InfectionACE (Adverse Childhood Experiences)Acinetobacter InfectionAcquired Immune Deficiency Syndrome (AIDS) — see HIVAcute Flaccid Myelitis (AFM)Adenovirus InfectionAdenovirus VaccinationADHD [Attention Deficit/Hyperactivity Disorder]Adult VaccinationsAdverse Childhood Experiences (ACE)AFib, AF (Atrial fibrillation)AFMAfrican Trypanosomiasis — see Sleeping SicknessAgricultural Safety — see Farm Worker InjuriesAHF (Alkhurma hemorrhagic fever)AIDS (Acquired Immune Deficiency Syndrome)Alkhurma hemorrhagic fever (AHF)ALS [Amyotrophic Lateral Sclerosis]Alzheimer's DiseaseAmebiasis, Intestinal [Entamoeba histolytica infection]American Trypanosomiasis — see Chagas DiseaseAmphibians and Fish, Infections from — see Fish and Amphibians, Infections fromAmyotrophic Lateral Sclerosis — see ALSAnaplasmosis, HumanAncylostoma duodenale Infection, Necator americanus Infection — see Human HookwormAngiostrongylus InfectionAnimal-Related DiseasesAnisakiasis — see Anisakis InfectionAnisakis Infection [Anisakiasis]Anthrax VaccinationAnthrax [Bacillus anthracis Infection]Antibiotic-resistant Infections - ListingAntibiotic and Antimicrobial ResistanceAntibiotic Use, Appropriatesee also U.S. Antibiotic Awareness Week (USAAW)Aortic AneurysmAortic Dissection — see Aortic AneurysmArenavirus InfectionsArthritisChildhood ArthritisFibromyalgiaGoutOsteoarthritis (OA)Rheumatoid Arthritis (RA)Ascariasis — see Ascaris InfectionAscaris Infection [Ascariasis]Aseptic Meningitis — see Viral MeningitisAspergillosis — see Aspergillus InfectionAspergillus Infection [Aspergillosis]AsthmaAtrial fibrillation (AFib, AF)Attention Deficit/Hyperactivity Disorder — see ADHDAutismsee also Genetics and GenomicsAvian Influenza  ;

The expected output would be:

Abdominal Aortic Aneurysm — see Aortic Aneurysm;Acanthamoeba Infection;ACE (Adverse Childhood Experiences);Acinetobacter Infection; etc...

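find() returns an iterator over the elements that match the criteria, which here is just the <ol> itself. You need to call .children() on it to get the <li>s: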

Document::from(res.as_str())
    .find(Class("unstyled-list"))
    .next() // Get the first match
    .expect("no matching <ol>")
    .children()
    .for_each(|i| print!("{};", i.text()));
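
One caveat: .children() yields every child node, so whitespace text nodes between the <li> tags may show up as empty entries, and print!("{};", ...) always leaves a trailing separator. A minimal, standard-library-only sketch of one way to tidy that up is below. The helper name join_items is hypothetical (it is not part of the select crate), and the sample strings are made-up stand-ins for what i.text() would return:

```rust
/// Hypothetical helper: trims each piece of text, drops blank entries,
/// and joins the rest with `sep` (so there is no trailing separator).
fn join_items<I, S>(texts: I, sep: &str) -> String
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    texts
        .into_iter()
        .map(|t| t.as_ref().trim().to_string())
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join(sep)
}

fn main() {
    // Stand-in data, assumed for illustration: the "\n" entries mimic
    // whitespace-only text nodes sitting between <li> elements.
    let texts = vec![
        "\n",
        "Acanthamoeba Infection",
        "\n  ",
        "Adenovirus Infection",
        "\n",
    ];
    println!("{}", join_items(texts, ";"));
}
```

In the scraper itself, the same helper could presumably be fed with .children().map(|i| i.text()) in place of the final for_each.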
