How do I extract data from multiple websites with a RESTful API and Spring?



I have an assignment at school where I have to do the following:

Implement a RESTful API endpoint that makes simultaneous calls to the following websites:

  • https://pizzerijalimbo.si/meni/
  • https://pizzerijalimbo.si/kontakt/
  • https://pizzerijalimbo.si/my-account/
  • https://pizzerijalimbo.si/o-nas/

The input to the endpoint is an `integer` that specifies the number of simultaneous calls to the web pages above (a minimum of 1 means all calls are made sequentially, a maximum of 4 means all calls are made simultaneously).

From each page, extract a short header text and save it in a common global structure (an array, map, or similar). The program should also count how many calls succeed. At the end, the service should report the number of successful calls, the number of failed calls, and the saved texts from all the web pages.
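Since several calls run at the same time, the "common global structure" the task mentions should be safe to update from multiple threads. A minimal sketch of such a structure (the class and method names here are my own, not part of the assignment):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Shared, thread-safe summary that all concurrent calls write into
public class ScrapeSummary {
    private final AtomicInteger successCount = new AtomicInteger();
    private final AtomicInteger failureCount = new AtomicInteger();
    private final List<String> texts = Collections.synchronizedList(new ArrayList<>());

    public void recordSuccess(String headerText) {
        successCount.incrementAndGet();
        texts.add(headerText);
    }

    public void recordFailure() {
        failureCount.incrementAndGet();
    }

    // The final report the endpoint has to produce
    public String report() {
        return "successful=" + successCount.get()
                + ", failed=" + failureCount.get()
                + ", texts=" + texts;
    }
}
```

`AtomicInteger` and a synchronized list avoid lost updates when two `@Async` calls finish at the same moment, which a plain `int` counter and `ArrayList` would not.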

With some help I managed to get something working, but I still need help with the data extraction, using Jsoup or any other approach.

Here is my code:

import java.util.Arrays;
import java.util.List;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

@RestController
public class APIcontroller {

    @Autowired
    private RestTemplate restTemplate;

    List<String> websites = Arrays.asList("https://pizzerijalimbo.si/meni/",
            "https://pizzerijalimbo.si/kontakt/",
            "https://pizzerijalimbo.si/my-account/",
            "https://pizzerijalimbo.si/o-nas/");

    @GetMapping("/podatki")
    public List<Object> getData(@RequestParam(required = true) int numberOfWebsites) {
        List<String> websitesToScrape = websites.subList(0, numberOfWebsites);

        for (String website : websitesToScrape) {
            Document doc = Jsoup.connect("https://pizzerijalimbo.si/meni/").get();
            log(doc.title());
            Elements newsHeadlines = doc.select("#mp-itn b a");
            for (Element headline : newsHeadlines) {
                log("%snt%s",
                        headline.attr("title"), headline.absUrl("href"));
            }
        }
    }
}

I also need to do this in parallel, so that the calls to the individual websites run at the same time. But right now the main problem is that the `log` function does not work.

What I have tried:

I tried to solve the problem with the Jsoup library, but I don't seem to understand it well enough, so I get an error in the for loop saying that the method `log` is undefined. I also need to add a try/catch in order to count the calls that might fail, and to count the successful ones, as you can see in the task description.
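The `log(...)` calls appear to be copied from Jsoup's documentation examples, where `log` is a local helper that does not exist in this project, hence the "method log is undefined" error; any real logger fixes that. The success/failure counting is then just a try/catch around each fetch. A minimal stdlib-only sketch, where each `Callable` stands in for one real page download (e.g. `() -> Jsoup.connect(url).get().title()`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.logging.Logger;

public class CallCounter {
    // java.util.logging used here so the sketch has no extra dependencies;
    // in the Spring project an SLF4J Logger works the same way
    private static final Logger log = Logger.getLogger(CallCounter.class.getName());

    private int success = 0;
    private int failed = 0;
    private final List<String> texts = new ArrayList<>();

    // Each Callable stands in for one page fetch returning the page's header text
    public void runAll(List<Callable<String>> fetchers) {
        for (Callable<String> fetcher : fetchers) {
            try {
                texts.add(fetcher.call());   // successful call: keep the extracted text
                success++;
            } catch (Exception e) {          // network error, parse error, ...
                log.warning("Call failed: " + e.getMessage());
                failed++;
            }
        }
    }

    public int getSuccess() { return success; }
    public int getFailed() { return failed; }
    public List<String> getTexts() { return texts; }
}
```

Keeping the counting logic separate from Jsoup like this also makes it easy to unit-test without hitting the network.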

WebScrapperController.java

package com.stackovertwo.stackovertwo;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
@RestController
public class WebScrapperController {

    @GetMapping("/")
    public String index() {
        return "Greetings from Spring Boot!";
    }

    @Autowired
    WebScrapperService webScrapperService;

    List<String> websites = Arrays.asList("https://pizzerijalimbo.si/meni/",
            "https://pizzerijalimbo.si/kontakt/",
            "https://pizzerijalimbo.si/my-account/",
            "https://pizzerijalimbo.si/o-nas/");

    @GetMapping("/podatki")
    public ResponseEntity<Object> getData(@RequestParam(required = true) int numberOfWebsites)
            throws InterruptedException, ExecutionException {
        List<SiteResponse> webSitesToScrape = new ArrayList<>();

        CompletableFuture<SiteResponse> futureData1 = webScrapperService.getWebScrappedContent(websites.get(0));
        CompletableFuture<SiteResponse> futureData2 = webScrapperService.getWebScrappedContent(websites.get(1));

        webSitesToScrape.add(futureData1.get());
        webSitesToScrape.add(futureData2.get());

        List<SiteResponse> result = webSitesToScrape.stream().collect(Collectors.toList());
        return ResponseEntity.ok().body(result);
    }
}

WebScrapperService.java

package com.stackovertwo.stackovertwo;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;
import java.util.concurrent.CompletableFuture;

@Service
public class WebScrapperService {

    @Autowired
    private RestTemplate restTemplate;

    Logger logger = LoggerFactory.getLogger(WebScrapperService.class);

    @Async
    public CompletableFuture<SiteResponse> getWebScrappedContent(String webSiteURL) {
        logger.info("Starting: getWebScrappedContent for webSiteURL {} with thread {}", webSiteURL, Thread.currentThread().getName());
        // exchange() already returns a ResponseEntity<String>, so no cast is needed below
        ResponseEntity<String> response = restTemplate.exchange(webSiteURL,
                HttpMethod.GET, null, String.class);
        SiteResponse webSiteSummary = null;
        String resultString = response.getBody();

        HttpHeaders headers = response.getHeaders();
        int statusCode = response.getStatusCode().value();
        System.out.println(statusCode);
        System.out.println("HEADERS " + headers);
        try {
            // Jsoup.parse already returns a org.jsoup.nodes.Document
            Document doc = Jsoup.parse(resultString);
            Elements header = doc.select(".elementor-inner h2.elementor-heading-title.elementor-size-default");
            System.out.println(header.get(0).html());
            webSiteSummary = new SiteResponse(statusCode, header.get(0).html());
        } catch (Exception e) {
            System.out.println("Exception " + e.getMessage());
        }
        logger.info("Complete: getWebScrappedContent for webSiteURL {} with thread {}", webSiteURL, Thread.currentThread().getName());
        return CompletableFuture.completedFuture(webSiteSummary);
    }
}

SpringBootApp.java

package com.stackovertwo.stackovertwo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.*;
import org.apache.http.conn.ssl.*;

@SpringBootApplication
public class SpringBootApp {

    public static void main(String[] args) {
        SpringApplication.run(SpringBootApp.class, args);
    }

    @Bean
    public RestTemplate restTemplate() throws KeyManagementException, NoSuchAlgorithmException, KeyStoreException {
        // Trust all certificates: acceptable for the assignment, never for production
        TrustStrategy acceptingTrustStrategy = (X509Certificate[] chain, String authType) -> true;
        SSLContext sslContext = org.apache.http.ssl.SSLContexts.custom()
                .loadTrustMaterial(null, acceptingTrustStrategy)
                .build();
        SSLConnectionSocketFactory csf = new SSLConnectionSocketFactory(sslContext);
        CloseableHttpClient httpClient = HttpClients.custom()
                .setSSLSocketFactory(csf)
                .build();
        HttpComponentsClientHttpRequestFactory requestFactory =
                new HttpComponentsClientHttpRequestFactory();
        requestFactory.setHttpClient(httpClient);

        return new RestTemplate(requestFactory);
    }
}

Note: I disabled SSL verification when calling the web URLs via RestTemplate, which is not recommended in production (for the assignment it is OK). In production you would instead import the certificate into the Java keystore: https://myshittycode.com/2015/12/17/java-https-unable-to-find-valid-certification-path-to-requested-target-2/

LATEST UPDATE