Extract a substring from a value in a CSV and rewrite the CSV



I have a file containing records like the following:

"1652476614-55","https://tr.uspoloassn.com/erkek-yesil-polo-yaka-t-shirt-basic-50249149-vr083/?integration_color=VR004","Erkek Açık Sarı Polo Yaka T-Shirt Basic","299,95 TL","50249149-VR004","<a href=""#"" class=""js-variant "" data-name=""integration_size"" data-value=""XL"" data-isvariant=""true"" data-pk=""165742"">XL</a>"

I want to strip everything from the last field except the text between the quotes after data-value, then rewrite the file so that it looks like this:

"1652476614-55","https://tr.uspoloassn.com/erkek-yesil-polo-yaka-t-shirt-basic-50249149-vr083/?integration_color=VR004","Erkek Açık Sarı Polo Yaka T-Shirt Basic","299,95 TL","50249149-VR004","XL"

Any suggestions are welcome (Python, shell script, etc.).

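One approach is to use awk. This needs GNU awk (gawk), because the three-argument form of match() is a gawk extension.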

BEGIN {
    FS = "\",\""    # input: quoted fields separated by commas
    OFS = "\";\""   # output: quoted fields separated by semicolons
                    # (use "\",\"" here instead to keep commas)
}
{
    # Alternative: capture the text inside the HTML <a> tag
    # match($6, />([^<]*)<\/a>/, arr)
    # Capture the value in data-value:   data-value=""CONTENT HERE""
    match($6, /data-value=""([^"]*)/, arr)
    gsub(/^"/, "", $1)    # remove the leading quote from field 1
    # Modifying $1 rebuilds the record; add the opening and closing quotes back
    print "\"" $1, $2, $3, $4, $5, arr[1] "\""
}

Input (fields abbreviated for simplicity):

"55","http","TS","9,95 TL","VR004","<a href=""#"" data-value=""XL"" data-isvariant=""true"" data-pk=""165742"">XL</a>"

Result:

"55";"http";"TS";"9,95 TL";"VR004";"XL"
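
Since the question also mentions Python, here is a minimal sketch of the same idea using the standard csv module. It lets the parser handle the doubled quotes inside the last field and keeps commas as the delimiter, as in the desired output. The helper names extract_data_value and rewrite_rows are made up for this illustration, and the code assumes the HTML is always in the last column and that a field without a data-value attribute should be left unchanged.

```python
import csv
import io
import re

# After csv parsing, the doubled quotes ("") become single quotes,
# so the attribute looks like data-value="XL".
DATA_VALUE = re.compile(r'data-value="([^"]*)"')

def extract_data_value(cell):
    """Return the data-value attribute found in cell, or cell unchanged if there is none."""
    m = DATA_VALUE.search(cell)
    return m.group(1) if m else cell

def rewrite_rows(src, dst):
    """Copy CSV rows from src to dst, replacing the last field with its data-value."""
    reader = csv.reader(src)
    # QUOTE_ALL keeps every field quoted, matching the original file.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        if row:  # blank lines come back as empty lists
            row[-1] = extract_data_value(row[-1])
        writer.writerow(row)

# Demonstration on the abbreviated sample record from above.
sample = (
    '"55","http","TS","9,95 TL","VR004",'
    '"<a href=""#"" data-value=""XL"" data-isvariant=""true"" '
    'data-pk=""165742"">XL</a>"\n'
)
out = io.StringIO()
rewrite_rows(io.StringIO(sample), out)
result = out.getvalue()
print(result, end="")
```

For a real file, the same function should work with file objects opened with newline="" and an explicit encoding such as utf-8 (the sample data contains Turkish characters). Writing to a second file and renaming it afterwards is safer than overwriting the input while it is being read.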
