Splitting a string on an alternating delimiter



I have a list of URLs like this:

mydata <- read.table(header=TRUE, text="
      Id
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrickpattern%3ADecorative%2FArt+Deco%3Abrickpattern%3AFloral%3Abrickpattern%3AGeometric%3Abrickpattern%3AGraphic%3Abrickpattern%3ATropical%3Aprice%3A300%2C10500&page=7&gridValue=4  
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2040%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3ABlue%3Averticalcolorfamily%3AWhite
      https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=CjwKEAjw9_jJBRCXycSarr3csWcSJABthk07W_H0RxQtOPZX7VdD9CSmK4S01BMYdXbtc0XxC0OeChoCky_w_wcB
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI%3Abrand%3AUNITED%20COLORS%20OF%20BENETTON
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2038%3Averticalsizegroupformat%3AIN%2039%3Averticalsizegroupformat%3AIN%20M%3Averticalsizegroupformat%3AUK%2039%3Averticalsizegroupformat%3AUK%20M%3Averticalsizegroupformat%3AUK%20S%3Averticalsizegroupformat%3AUS%20M%3Averticalsizegroupformat%3AUS%20S%3Abrickpattern%3ASolid%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3AWhite
      https://www.example.com/dp/c/830216013?q=%3Aprce-asc%3Abricksleeve%3AShort%3Aprice%3A300%2C10500&page=2&gridValue=4
      https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abrand%3AUS+POLO%3Abricksleeve%3AShort%3Aprice%3A300%2C10500
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AAJIO%3Abrand%3ABASICS%3Abrand%3ACelio%3Abrand%3ADNMX%3Abrand%3AGAS%3Abrand%3ALEVIS%3Abrand%3ANETPLAY%3Abrand%3ASIN%3Abrand%3ASUPERDRY%3Abrand%3AUS%20POLO%3Abrand%3AVIMAL%3Abrand%3AVIMAL%20APPARELS%3Abrand%3AVOI%20JEANS
      https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3ABritish+Club%3Abrand%3ACelio%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500&page=1&gridValue=4          
                         ")      

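I need to extract the parameter values from the URLs, such as brand, verticalcolorfamily, q=, and so on. These parameters are the filters applied on the website.
The output I am looking for is a data frame with three columns: the parameter, the value, and how often that value occurs. For example: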

parameter |      value     | frequency
----------|----------------|----------
brand     | FLYING+MACHINE | 2  
q=        | relevance      | 5  
price     | 300%2C10500    | 2  
brand     | BASICS         | 1

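So far, the only approach I can think of is to collect each URL into a vector of strings, splitting on every other "%3A": [q=%3Arelevance, brickpattern%3ADecorative%2FArt+Deco, brickpattern%3AFloral, brickpattern%3AGeometric, brickpattern%3AGraphic, brickpattern%3ATropical, price%3A300%2C10500].

I would then put each element in a column of a data frame, split again on "%3A", and do a group-by. Suggestions for other approaches would be much appreciated. Also, if I should go with this approach, I don't know how to use every other "%3A" as the delimiter.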

urltools looks like a great package for what you want to do. In the meantime, here is a hacky answer. Starting from your data frame:

# Convert to a character vector
# Strip the URL prefix ("\\?" matches a literal "?")
# Split by "%3A" and flatten into one long vector
L <- as.character(mydata$Id)
L <- gsub("https://www.example.com/dp/c/830216013\\?", "", L)
L <- unlist(strsplit(L, "%3A"))
head(L)
[1] "q="                    "relevance"             "brickpattern"         
[4] "Decorative%2FArt+Deco" "brickpattern"          "Floral"
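
One caveat with the gsub() call above: the fixed prefix matches only the category id 830216013, and anything after q (such as &page=7&gridValue=4) stays attached to the last value. A minimal base-R sketch of a per-URL alternative follows, assuming the filters always live in the q parameter and that q always starts with "%3A". The urls vector is a shortened, illustrative stand-in for the question's data, and to_pairs is a hypothetical helper name.

```r
# Illustrative input: two shortened URLs in the same shape as the question's data
urls <- c(
  "https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3ACelio%3Aprice%3A300%2C10500&page=1&gridValue=4",
  "https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=abc"
)

# Keep only the value of q: drop everything through "q=", then anything from the next "&"
q <- sub("&.*$", "", sub("^.*[?&]q=", "", urls))

# Split on the literal "%3A"; each q value starts with "%3A", so the first piece is ""
pieces <- strsplit(q, "%3A", fixed = TRUE)

# Hypothetical helper: relabel the empty first piece as "q=" and pair up neighbours
# (assumes an even number of pieces; an odd count makes data.frame() fail)
to_pairs <- function(p) {
  p[1] <- "q="
  data.frame(parameter = p[c(TRUE, FALSE)],   # pieces 1, 3, 5, ...
             value     = p[c(FALSE, TRUE)],   # pieces 2, 4, 6, ...
             stringsAsFactors = FALSE)
}

pairs <- do.call(rbind, lapply(pieces, to_pairs))
pairs
```

Working one URL at a time keeps a malformed URL from shifting the parameter/value alignment of the URLs that follow it, which can happen once everything is flattened into a single vector.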

Then:

# Convert to 2-column data frame (odd positions are parameters, even positions are values)
# Count unique parameter:value pairs
library(dplyr)
df <- data.frame(parameter = L[seq(1,length(L),2)], value = L[seq(2,length(L),2)]) %>%
      group_by(parameter, value) %>%
      summarize(frequency=sum(!is.na(value)))
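
If dplyr is not available, the same counting step can be sketched in base R with aggregate(). The pairs data frame below is made-up illustrative data, not the question's, and freq is just a name chosen for this example.

```r
# Illustrative parameter/value pairs (not the question's data)
pairs <- data.frame(
  parameter = c("brand", "brand", "price", "brand"),
  value     = c("Celio", "Celio", "300%2C10500", "BASICS"),
  stringsAsFactors = FALSE
)

# Add a column of ones, then sum it within each parameter/value combination
freq <- aggregate(frequency ~ parameter + value,
                  data = transform(pairs, frequency = 1L),
                  FUN  = sum)
freq
```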

I'll show only the entries where frequency >= 2:

# Show only entries with frequency >= 2
filter(df, frequency >= 2)
            parameter     value frequency
               <fctr>    <fctr>     <int>
1               brand     Celio         2
2         bricksleeve     Short         2
3                  q= relevance         6
4 verticalcolorfamily     Black         2
5 verticalcolorfamily     White         2

Note that the frequency of brand FLYING+MACHINE is not 2, because the same value appears in two forms: FLYING%20MACHINE and FLYING+MACHINE.
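
One way to make the two spellings count as the same value is to decode them before grouping. The following is a minimal sketch, assuming that "+" in these URLs stands for a space (a literal plus sign would be encoded as %2B); normalize_value is a hypothetical helper name, and utils::URLdecode is base R.

```r
# Hypothetical helper: treat "+" as a space, then decode percent-escapes
normalize_value <- function(x) {
  x <- gsub("+", " ", x, fixed = TRUE)
  # Apply URLdecode element by element so the helper works on whole vectors
  vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

normalize_value(c("FLYING+MACHINE", "FLYING%20MACHINE", "300%2C10500"))
# [1] "FLYING MACHINE" "FLYING MACHINE" "300,10500"
```

Applying this to the value column before group_by() would merge the two FLYING MACHINE rows into one.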
