I'm new to Google Cloud DLP. I ran a POST to https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, with cloudStorageOptions used to save the .csv output.
The .parquet file is 53.93 MB.
When I make the API call against the .parquet file, I get:
"processedBytes": "102308122",
"infoTypeStats": [{
"infoType": {
"name": "AMERICAN_BANKERS_CUSIP_ID"
},
"count": "1"
}, {
"infoType": {
"name": "IP_ADDRESS"
},
"count": "17"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "148"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "30"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "22"
}]
When I convert the .parquet file to .csv, I get a 360.58 MB file. If I then make the API call against the .csv file, I get:
"processedBytes": "377530307",
"infoTypeStats": [{
"infoType": {
"name": "CREDIT_CARD_NUMBER"
},
"count": "56546"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "372527"
}, {
"infoType": {
"name": "NETHERLANDS_BSN_NUMBER"
},
"count": "5"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "1331321"
}, {
"infoType": {
"name": "AUSTRALIA_TAX_FILE_NUMBER"
},
"count": "52269"
}, {
"infoType": {
"name": "PHONE_NUMBER"
},
"count": "28"
}, {
"infoType": {
"name": "US_DRIVERS_LICENSE_NUMBER"
},
"count": "114"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "141383"
}, {
"infoType": {
"name": "KOREA_RRN"
},
"count": "56144"
}],
Clearly, not all infoTypes were detected when I scanned the .parquet file, compared with the scan of the .csv file, where I verified that all of the email addresses were detected.
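To make the gap concrete, the two infoTypeStats arrays above can be compared side by side. The following is only a rough sketch: the counts are copied from the two responses in this post, and the helper names (`counts_by_info_type`, `compare_scans`) are made up for illustration, not part of any DLP client library.

```python
# Counts copied from the two API responses quoted in this post.
# The API returns "count" as a string, so the stats are kept in that shape.
PARQUET_STATS = [
    {"infoType": {"name": "AMERICAN_BANKERS_CUSIP_ID"}, "count": "1"},
    {"infoType": {"name": "IP_ADDRESS"}, "count": "17"},
    {"infoType": {"name": "US_TOLLFREE_PHONE_NUMBER"}, "count": "148"},
    {"infoType": {"name": "EMAIL_ADDRESS"}, "count": "30"},
    {"infoType": {"name": "US_STATE"}, "count": "22"},
]
CSV_STATS = [
    {"infoType": {"name": "CREDIT_CARD_NUMBER"}, "count": "56546"},
    {"infoType": {"name": "EMAIL_ADDRESS"}, "count": "372527"},
    {"infoType": {"name": "NETHERLANDS_BSN_NUMBER"}, "count": "5"},
    {"infoType": {"name": "US_TOLLFREE_PHONE_NUMBER"}, "count": "1331321"},
    {"infoType": {"name": "AUSTRALIA_TAX_FILE_NUMBER"}, "count": "52269"},
    {"infoType": {"name": "PHONE_NUMBER"}, "count": "28"},
    {"infoType": {"name": "US_DRIVERS_LICENSE_NUMBER"}, "count": "114"},
    {"infoType": {"name": "US_STATE"}, "count": "141383"},
    {"infoType": {"name": "KOREA_RRN"}, "count": "56144"},
]


def counts_by_info_type(stats):
    """Turn an infoTypeStats list into {infoType name: integer count}."""
    return {entry["infoType"]["name"]: int(entry["count"]) for entry in stats}


def compare_scans(parquet_stats, csv_stats):
    """Return (name, parquet_count, csv_count) rows for every infoType seen in either scan."""
    parquet_counts = counts_by_info_type(parquet_stats)
    csv_counts = counts_by_info_type(csv_stats)
    names = sorted(set(parquet_counts) | set(csv_counts))
    return [(n, parquet_counts.get(n, 0), csv_counts.get(n, 0)) for n in names]


for name, in_parquet, in_csv in compare_scans(PARQUET_STATS, CSV_STATS):
    print(f"{name:30} parquet={in_parquet:>8} csv={in_csv:>8}")
```

Every infoType that appears in both scans is reported far more often for the .csv file, and several infoTypes appear only in the .csv scan.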
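I couldn't find any documentation about compressed files such as Parquet, so I assume Google Cloud DLP doesn't offer this capability.
Any help would be greatly appreciated.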
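Parquet files are currently scanned as binary objects, because the system does not yet parse them intelligently. In the V2 API, the supported file types are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype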
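Since the .csv export in the question was scanned as text, one possible workaround is to keep converting to CSV and submit that file instead. Below is a minimal sketch of what a V2 request body for such a scan might look like. It assumes the field names from the V2 REST reference as I understand them (`inspectJob`, `storageConfig.cloudStorageOptions.fileSet.url`, `fileTypes`, `inspectConfig.infoTypes`); the bucket path and project ID are hypothetical placeholders, and the body should be checked against the current documentation before use.

```python
import json

# Assumed V2 endpoint for creating an inspection job; {project} is a placeholder.
DLP_JOBS_URL = "https://dlp.googleapis.com/v2/projects/{project}/dlpJobs"


def build_inspect_job(gcs_url, info_types):
    """Build a request body that scans a Cloud Storage path as text.

    gcs_url    -- e.g. "gs://my-bucket/exports/*.csv" (hypothetical path)
    info_types -- list of infoType names such as "EMAIL_ADDRESS"
    """
    return {
        "inspectJob": {
            "storageConfig": {
                "cloudStorageOptions": {
                    "fileSet": {"url": gcs_url},
                    # Restrict the scan to files treated as text, not binary.
                    "fileTypes": ["TEXT_FILE"],
                }
            },
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_types],
            },
        }
    }


body = build_inspect_job(
    "gs://my-bucket/exports/*.csv",  # hypothetical bucket and path
    ["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER", "US_TOLLFREE_PHONE_NUMBER"],
)
print(DLP_JOBS_URL.format(project="my-project"))  # hypothetical project ID
print(json.dumps(body, indent=2))
```

The sketch only builds and prints the JSON; sending it requires an authenticated POST, which is left out here.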