正则表达式,用于解析"column"中包含 JSON 的 CSV



我需要解析一个逗号分隔的文本文件,其中一行可以包含json作为列值。该文件还包含一个标题行。

我正在尝试构建正则表达式,以便我可以解析文件并为文件的每行列表创建一个hashmap。

我意识到,这可能会为此变得过于复杂,但是我不知道我还能设计代码以获取每一行的hashmap。

我尝试了不同的方式 - 从SO和其他站点上的示例中,但无法获得良好的工作表达。他们在不同的零件上工作,但整个数据

不使用
    String[] values = readLine.split(",(?=([^"]*"[^"]*")*(?![^"]*"))", -1);
    String[] values = readLine.split(",(?=(?:(?:[^'",]*(?:'|")){2})*[^'",]*)", -1);
    String[] values = readLine.split(",("([^"]|"")*")", -1);
    String[] values = readLine.split(",(\w+\s)?("[^"]+"|\w+)(\(\w\d(,\w\d)*\))?", -1);
    String[] values = readLine.split(",(?=(?:(?:[^'",]*(?:'|")){2})*[^'",]*$)", -1);

以下是数据示例

id,apptoolID,apptoolUUID,accountNumber,accountName,name,description,mac,status,pGroups,oemCode,oemTagList,locationID,userCode,businessUnit,customerDescription,notification,ptdd,hdptdd,ptddSchedule,compLive,optInfo,config1,scriptDebugging,clearLocalStorage,created_at,updated_at
5703,2535,9e849f81-56c4-4415-b36e-b0fd370b3986,21126,"3 Way ChProduct",ZYX-21126TC-P1,"3 Way ChProduct - CHSH2112601DSHCL - Showroom",24-1c-04-0a-84-9c,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHSH2112601DSHCL,"{"oemCode":"222923"}",Showroom,Showroom,false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5704,4248,494af61a-54ca-4ff2-a82d-7e795fd671ac,21126,"3 Way ChProduct",ZYX-21126TC-P2,"3 Way ChProduct - CISW2112602DTHCL - Service Waiting Area",e0-d5-5e-63-51-b9,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":27,"name":"FLEX Shop TV Players","description":"FLEX Shop TV Players group","numberOfPlayers":169},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CI,"ChProduct,CADILLAC,ALL,SHOP-SPECIFIC",CISW2112602DTHCL,"{"oemCode":"222923"}",Service_Lounge,"Service Waiting Area",true,true,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":"PSA337,PSA335,PSA332,PSA338,PSA326,PSA323,PS212,PS213,PSA293,PSA289","OPT_IN_SP":"SPCH1807","OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":"044077,034160,029205,022398,019888,019881,019880,019860,017596,010857","OPT_OUT_FA":"FACA1900,FACA1903,FACA1904,FACA1902,FACA1901,FACA1704,FACA1705,FACA1702,FACA1703,FACA1700,FACA1701,FACA1625,FACA1624,FACA1623,FACA1602,FACA1601,FACA1603,FACA0025,FACA0024,FACA0023,FACA1621,FACA00020,FACA00019,FACA1622,FACH1701","OPT_OUT_GT":false,"OPT_IN_TT":"TT043,TT037,TT033,TT032,TT031,TT027,TT028,TT026,TT029,TT030"}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5705,2537,94c4e9dc-e94e-4942-862a-1e4c98276f09,21126,"3 Way ChProduct",ZYX-21126TC-P3,"3 Way ChProduct - CHMA2112603MBHCL - Service Advisor Area & Service Waiting Area",24-1c-04-0a-84-b7,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHMA2112603MBHCL,"{"oemCode":"222923"}",Service,"Service Advisor Area & Service Waiting Area",false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"

基本上,将这种类型的数据(内部[](作为单个值遇到困难。该文件还具有其他类型的数据,例如嵌套引号,引号中的逗号等。

"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]"

对如何进行有关如何进行的任何帮助/建议。

谢谢

如果您的输入数据会很干净,则可以简单地解决。但是,您的JSON字符串无法正确逃脱,例如"{"OPT_IN_FT":false,"OPT_IN_PSA":"PSA337...使用双引号在CSV上下文以及JSON上下文中指示字符串。

我的建议是在JSON上下文中使用单个引号作为您的项目,并在JSON字符串项目中逃脱引号。

然后,您可以轻松地适应这样的解决方案,也可以使用Regex或自定义书面解析器代码的解决方案。

如果要将CSV上下文中的字符串定界符更改为单个报价,则可以尝试使用这些行:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Test
{
    public static void main (String[] args) throws java.lang.Exception
    {
      String input = "...";
      Pattern re1= Pattern.compile("(?<=^|,)(?:"(\[.*?\])"|"(\{.*?\})"|"([^"\{\}\[\]]*)")(?=,|$)", Pattern.DOTALL);
      Matcher matcher = re1.matcher(input);
      String singleQuoted = matcher.replaceAll("'$1$2$3'");
      Pattern re = Pattern.compile(",(?=(?:[^']*'[^']*')*[^']*$)");
      String[] parts = re.split(singleQuoted);
        for(int partsIdx = 0; partsIdx < parts.length; partsIdx++ ){
          System.out.println( "[" + partsIdx + "] = " + parts[partsIdx]);
        }
     }
}

显然有更优雅的方法可以做到这一点。

如果定界符为'|'此CSV文件可能由

解析
public List<Map<String, String>> parse(Path csvPath) throws IOException {
        Reader in = new FileReader(csvPath.toFile());
        Iterable<CSVRecord> records = CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withIgnoreEmptyLines(true)
                .withDelimiter('|')
                .withTrim()
                .withQuote(null)
                .parse(in);

        return StreamSupport
                        .stream(records.spliterator(), false)
                        .map(csvRecord -> csvRecord.toMap().entrySet().stream()
                                .collect(Collectors.toMap(
                                        e -> e.getKey(),
                                        e -> StringUtils.unwrap(e.getValue(), """)
                                )))
                        .collect(Collectors.toList());
} 

您需要libs:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.9</version>
</dependency>

请看一个项目https://github.com/sbzdev/stackoverflow/tree/master/master/question56028130使用单位测试

尽管已经2年以上,但我希望这个答案可以帮助来自Google的其他人。

我的解决方案是一一读取字符,并使用括号匹配的概念,这是从"堆栈"引入"。

"的引入。

当涉及a {&quot时或"&quot",推入堆栈。

涉及a}&quot时或"]&quot,从堆栈中弹出。

当涉及到",检查堆栈。如果是空的,那么您已经阅读了列字段,或者继续读取(此逗号在JSON内(。

这是Java的示例代码,我只是使用BracketCount变量而不是真实堆栈:

        File file = new File("D:/test.csv");
        FileReader fileReader = new FileReader(file);
        BufferedReader reader = new BufferedReader(fileReader);
        List<List<String>> results = reader.lines().map(str -> {
            List<String> lineResult = new ArrayList<>(256);
            char[] chars = str.toCharArray();
            int bracketCount = 0;
            StringBuilder fieldBuilder = new StringBuilder();
            for (int index = 0; index < chars.length; index++) {
                fieldBuilder.append(chars[index]);
                if (chars[index] == '{' || chars[index] == '[') {
                    bracketCount++;
                } else if (chars[index] == '}' || chars[index] == ']') {
                    bracketCount--;
                } else if (chars[index] == ',') {
                    if (bracketCount == 0) {
                        lineResult.add(fieldBuilder.substring(0, fieldBuilder.length() - 1));
                        fieldBuilder.setLength(0);
                    }
                } else if (index == chars.length - 1) {
                    lineResult.add(fieldBuilder.substring(0, fieldBuilder.length()));
                    fieldBuilder.setLength(0);
                }
            }
            return lineResult;
        }).collect(Collectors.toList());

ps:不考虑性能。

最新更新