r - How to match image URLs that share the same address apart from the image size, and keep only one URL to download



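I have a df of URLs together with the image URLs found within each parent URL. My web scraping script downloads every image URL. As you can see, image URLs 2-8 are the same picture at different sizes. I would like to keep just one of them, so that the other 6 versions of the photo are not downloaded. It does not matter which size is kept, as long as there is only one.

This is also part of a larger dataset that is processed once a week. There are roughly 1,500 image URLs per week, so the code has to be generic/standardized. I considered grouping by article URL, taking the maximum URL length, and matching on the start and end of the string, but I don't think that is the best approach, and it could drop some images on sites with shorter URLs.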

x <- structure(list(image_url = c("https://cloudfront-eu-central-1.images.arcpublishing.com/rtl/HDS4PPYVRD7SEHYOS6N6TFAROU.jpg", 
"https://ais-akamai.rtl.de/masters/1769218/1024x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg", 
"https://ais-akamai.rtl.de/masters/1769218/290x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg", 
"https://ais-akamai.rtl.de/masters/1769218/345x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg", 
"https://ais-akamai.rtl.de/masters/1769218/395x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg", 
"https://ais-akamai.rtl.de/masters/1769218/728x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg", 
"https://ais-akamai.rtl.de/masters/1769218/399x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg", 
"https://ais-akamai.rtl.de/masters/1769218/527x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg"
), URL = c("https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html", 
"https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html", 
"https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html", 
"https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html", 
"https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html", 
"https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html", 
"https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html", 
"https://www.rtl.de/cms/heisser-wuestenendspurt-live-bei-rtl-faellt-am-sonntag-die-formel-1-entscheidung-4875243.html"
)), row.names = 1:8, class = "data.frame")

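The following code works as long as the duplicate images have the same name after the last forward slash (/).

It adds a new column, pic, containing the characters from the last slash (/) to the end of image_url, groups the rows by pic, and keeps (slices) only the first row of each group.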
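To see the extraction step in isolation, here is a minimal base R sketch. It is only an illustration: it uses `sub()` and `duplicated()` instead of the dplyr/stringr calls in the answer, and the vector name `urls` is an assumption made for this example.

```r
# Three of the image URLs from the question (two sizes of one photo, plus one other image)
urls <- c(
  "https://ais-akamai.rtl.de/masters/1769218/1024x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg",
  "https://ais-akamai.rtl.de/masters/1769218/290x0/lewis-hamilton-trennen-nur-noch-acht-punkte-von-max-verstappen.jpg",
  "https://cloudfront-eu-central-1.images.arcpublishing.com/rtl/HDS4PPYVRD7SEHYOS6N6TFAROU.jpg"
)

# Drop everything up to and including the last "/" to get the file name
pic <- sub(".*/", "", urls)

# Keep the first URL seen for each file name
keep <- urls[!duplicated(pic)]
```

Here `pic` is identical for the first two URLs, so `keep` contains the 1024x0 version and the arcpublishing image.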


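library(dplyr)
library(stringr)

# x is the data frame from the question
x %>% 
  # pic = everything after the last "/" in image_url
  mutate(pic = str_extract(image_url, "[^/]+$")) %>%  
  group_by(pic) %>% 
  slice(1) %>%   # keep the first row of each group
  ungroup()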
#> # A tibble: 2 x 3
#>   image_url                     URL                       pic                   
#>   <chr>                         <chr>                     <chr>                 
#> 1 https://cloudfront-eu-centra… https://www.rtl.de/cms/h… HDS4PPYVRD7SEHYOS6N6T…
#> 2 https://ais-akamai.rtl.de/ma… https://www.rtl.de/cms/h… lewis-hamilton-trenne…

Created on 2022-03-10 by the reprex package (v0.3.0)
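If two different images on different pages happen to share a file name, grouping on the file name alone would merge them. One possible variant, sketched below in base R, builds the key from the article URL plus the image URL with its size segment collapsed. This assumes the size always appears as a path segment of the form `<width>x<height>` (as in `1024x0`); the helper `normalize_size()`, the `{size}` placeholder, and the example.com URLs are hypothetical and not part of the original answer.

```r
# Hypothetical helper: replace the first "/<width>x<height>/" path segment
# with a fixed placeholder so URLs that differ only in size become equal.
normalize_size <- function(url) {
  sub("/[0-9]+x[0-9]+/", "/{size}/", url)
}

# Made-up example data with the same column names as the question's df
x <- data.frame(
  image_url = c(
    "https://example.com/img/1/1024x0/photo.jpg",
    "https://example.com/img/1/290x0/photo.jpg",
    "https://example.com/img/2/290x0/photo.jpg",  # same file name, different image id
    "https://example.com/other/ABC.jpg"           # no size segment, left unchanged
  ),
  URL = "https://example.com/article.html",
  stringsAsFactors = FALSE
)

# Deduplicate on article URL + size-normalized image URL
key <- paste(x$URL, normalize_size(x$image_url))
deduped <- x[!duplicated(key), ]
```

With this key, rows 1 and 2 collapse into one, while row 3 is kept because its path differs in more than the size segment. Sites that encode the size differently (for example in a query string) would need their own pattern.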
