How to remove JSON text (including CRLF) inside a delimited record in PowerShell



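I have an odd problem: I need to remove JSON text from a tilde-delimited file (the JSON breaks the import because of the CRLF at the end of each JSON line). Example: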

Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
"trace": "0HM5285F2",
"errors": [
{
"code": "server_error",
"message": "Couldn't access service ",
"moreInfoUrl": null,
"target": {
"type": null,
"name": null
}
}
]
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM

Or something like this, which has no JSON but still follows the same pattern I need:

Test Plan Pay Work~Response Status: InternalServerError Internal Server Error,Response Content: Error,Request: https://api.test.com Headers: Accept: application/json
Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5c
SubscriberId: eb7aee
~9d05b16e-e57b-44be-b028-b6ddsdfsdf62a5~1/20/2021 7:07:53 PM

Both kinds of records need to end up in this format:

Test Plan Work~Response Status: BadRequest Bad Request~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM

The JSON (including the CRLF at the end of each JSON line) is breaking the data import into PowerShell. Any help or insight would be greatly appreciated!

PowerShell (or rather, .NET) has two unusual features in its regex engine that may be a very good fit for this use case: balancing groups and conditionals!

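Balancing groups are a complex feature that I can't explain fully here, but they essentially let us "keep count" of occurrences of specific named subexpressions in a regex pattern. Applied, it looks like this: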

PS ~> $string = 'Here is text { but wait { it has } nested { blocks }} here is more text'
PS ~> $string -replace '\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}'
Here is text  here is more text

Let's break down the regex pattern:

\{                    # match literal '{'
(?>                   # begin atomic group*
     \{(?<depth>)     #     match literal '{' and increment counter
  |  [^{}]+           #  OR match any sequence of characters that are NOT '{' or '}'
  |  \}(?<-depth>)    #  OR match literal '}' and decrement counter
)*                    # end atomic group, whole group should match 0 or more times
(?                    # begin conditional group*
   (depth)(?!)        # if the 'depth' counter > 0, then FAIL!
)                     # end conditional group
\}                    # match literal '}' (corresponding to the initial '{')

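*) The (?>...) atomic group prevents backtracking into the group, which guards against accidentally counting a brace more than once.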
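
To see the counting behaviour in isolation, here is a minimal sketch that runs the same pattern against one balanced and one unbalanced string. The variable names ($balanced, $ok, $partial) and the sample strings are made up for illustration.

```powershell
# Same balanced-brace pattern as above, stored in a variable for reuse.
$balanced = '\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}'

# Balanced input: the outer block, including the nested one, is matched as one unit.
$ok = [regex]::Match('a { b { c } d } e', $balanced)
$ok.Value        # { b { c } d }

# Unbalanced input: the outer '{' is never closed, so only the inner block matches.
$partial = [regex]::Match('a { b { c } d', $balanced)
$partial.Value   # { c }
```

In the unbalanced case the engine cannot find a closing brace for the first '{', gives up at that position, and moves on to the next '{', which does have a matching '}'.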

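To deal with the CRLF characters in the remaining fields, we can prefix the pattern with (?s): this makes the regex engine include newlines when matching the . ("any") metacharacter, so .*? can keep going until we reach the position right before ~87c5...:
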
(?s),Response Content:\s*\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\}.*?(?=~)
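
The effect of (?s) is easiest to see on a tiny input. The following sketch uses a made-up two-line string ($text is hypothetical) and a stripped-down pattern, only to show how . treats line breaks with and without the option.

```powershell
# Hypothetical two-line value: a comma, some text, a CRLF, more text, then '~'.
$text = "a,first`r`nsecond~b"

# Without (?s), '.' does not match the line feed, so the lookahead for '~' is never reached.
$withoutS = $text -match ',.*?(?=~)'

# With (?s), '.' also matches CR and LF, so the match can span both lines.
$withS = $text -match '(?s),.*?(?=~)'

$withoutS   # False
$withS      # True
```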

Alternatively, we can describe the fields that follow the JSON more precisely, as repeated pairs of a , followed by "not ,":

,Response Content:\s*(?:\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\})?\s*(?:,[^,]+?)*(?=~)

Let's try it against your multi-line input string:

$string = @'
Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
"trace": "0HM5285F2",
"errors": [
{
"code": "server_error",
"message": "Couldn't access service ",
"moreInfoUrl": null,
"target": {
"type": null,
"name": null
}
}
]
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
'@
$string -replace ',Response Content:\s*(?:\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\})?\s*(?:,[^,]+?)*(?=~)'

Output:

Test Plan Work~Response Status: BadRequest Bad Request~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
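
The second record shape from the question has plain text (Error) instead of a JSON block after Response Content:, and as far as I can tell the comma-based pattern will not match it, because Error does not start with a comma. One possible variation, sketched below, makes the balanced JSON block optional and then lazily skips everything up to the next ~. This is an assumption-laden sketch, not a tested solution for the real file: the sample data is shortened, the column names (Name, Status, Id, Timestamp) are invented, and it assumes no ~ appears inside the text being removed.

```powershell
# Shortened, hypothetical sample containing both record shapes from the question.
$raw = @'
Test Plan Work~Response Status: BadRequest Bad Request,Response Content: {
"trace": "0HM5285F2",
"target": {
"type": null
}
},Request: https://www.test.com Headers: Accept: application/json
SubscriberId: 
~87c5de00-5906-4d2d-b65f-4asdfsdfsdfa29~3/17/2020 1:54:08 PM
Test Plan Pay Work~Response Status: InternalServerError Internal Server Error,Response Content: Error,Request: https://api.test.com Headers: Accept: application/json
SubscriberId: eb7aee
~9d05b16e-e57b-44be-b028-b6ddsdfsdf62a5~1/20/2021 7:07:53 PM
'@

# Optional balanced JSON block, then lazily skip (across line breaks) to just before the next '~'.
$pattern = '(?s),Response Content:\s*(?:\{(?>\{(?<depth>)|[^{}]+|\}(?<-depth>))*(?(depth)(?!))\})?.*?(?=~)'
$clean = $raw -replace $pattern

# Each record is now a single line; split on line breaks and parse as '~'-delimited data.
# The header names are made up for illustration.
$rows = @(($clean -split '\r?\n') | ConvertFrom-Csv -Delimiter '~' -Header Name, Status, Id, Timestamp)
$rows | Format-Table
```

For a real file, the raw text could be read in one piece with Get-Content -Raw before applying the same replacement, since the pattern needs to see the line breaks inside each record.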