我需要解析一个逗号分隔的文本文件,其中一行可以包含json作为列值。该文件还包含一个标题行。
我正在尝试构建正则表达式,以便我可以解析文件并为文件的每行列表创建一个hashmap。
我意识到,这可能会为此变得过于复杂,但是我不知道我还能设计代码以获取每一行的hashmap。
我尝试了不同的方式 - 从SO和其他站点上的示例中,但无法获得良好的工作表达。他们在不同的零件上工作,但整个数据
不使用 String[] values = readLine.split(",(?=([^"]*"[^"]*")*(?![^"]*"))", -1);
String[] values = readLine.split(",(?=(?:(?:[^'",]*(?:'|")){2})*[^'",]*)", -1);
String[] values = readLine.split(",("([^"]|"")*")", -1);
String[] values = readLine.split(",(\w+\s)?("[^"]+"|\w+)(\(\w\d(,\w\d)*\))?", -1);
String[] values = readLine.split(",(?=(?:(?:[^'",]*(?:'|")){2})*[^'",]*$)", -1);
以下是数据示例
id,apptoolID,apptoolUUID,accountNumber,accountName,name,description,mac,status,pGroups,oemCode,oemTagList,locationID,userCode,businessUnit,customerDescription,notification,ptdd,hdptdd,ptddSchedule,compLive,optInfo,config1,scriptDebugging,clearLocalStorage,created_at,updated_at
5703,2535,9e849f81-56c4-4415-b36e-b0fd370b3986,21126,"3 Way ChProduct",ZYX-21126TC-P1,"3 Way ChProduct - CHSH2112601DSHCL - Showroom",24-1c-04-0a-84-9c,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHSH2112601DSHCL,"{"oemCode":"222923"}",Showroom,Showroom,false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5704,4248,494af61a-54ca-4ff2-a82d-7e795fd671ac,21126,"3 Way ChProduct",ZYX-21126TC-P2,"3 Way ChProduct - CISW2112602DTHCL - Service Waiting Area",e0-d5-5e-63-51-b9,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":27,"name":"FLEX Shop TV Players","description":"FLEX Shop TV Players group","numberOfPlayers":169},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CI,"ChProduct,CADILLAC,ALL,SHOP-SPECIFIC",CISW2112602DTHCL,"{"oemCode":"222923"}",Service_Lounge,"Service Waiting Area",true,true,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":"PSA337,PSA335,PSA332,PSA338,PSA326,PSA323,PS212,PS213,PSA293,PSA289","OPT_IN_SP":"SPCH1807","OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":"044077,034160,029205,022398,019888,019881,019880,019860,017596,010857","OPT_OUT_FA":"FACA1900,FACA1903,FACA1904,FACA1902,FACA1901,FACA1704,FACA1705,FACA1702,FACA1703,FACA1700,FACA1701,FACA1625,FACA1624,FACA1623,FACA1602,FACA1601,FACA1603,FACA0025,FACA0024,FACA0023,FACA1621,FACA00020,FACA00019,FACA1622,FACH1701","OPT_OUT_GT":false,"OPT_IN_TT":"TT043,TT037,TT033,TT032,TT031,TT027,TT028,TT026,TT029,TT030"}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5705,2537,94c4e9dc-e94e-4942-862a-1e4c98276f09,21126,"3 Way ChProduct",ZYX-21126TC-P3,"3 Way ChProduct - CHMA2112603MBHCL - Service Advisor Area & Service Waiting Area",24-1c-04-0a-84-b7,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHMA2112603MBHCL,"{"oemCode":"222923"}",Service,"Service Advisor Area & Service Waiting Area",false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
基本上,将这种类型的数据(内部[](作为单个值遇到困难。该文件还具有其他类型的数据,例如嵌套引号,引号中的逗号等。
"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]"
对如何进行有关如何进行的任何帮助/建议。
谢谢
如果您的输入数据会很干净,则可以简单地解决。但是,您的JSON字符串无法正确逃脱,例如"{"OPT_IN_FT":false,"OPT_IN_PSA":"PSA337...
使用双引号在CSV上下文以及JSON上下文中指示字符串。
我的建议是在JSON上下文中使用单个引号作为您的项目,并在JSON字符串项目中逃脱引号。
然后,您可以轻松地适应这样的解决方案,也可以使用Regex或自定义书面解析器代码的解决方案。
如果要将CSV上下文中的字符串定界符更改为单个报价,则可以尝试使用这些行:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Test
{
public static void main (String[] args) throws java.lang.Exception
{
String input = "...";
Pattern re1= Pattern.compile("(?<=^|,)(?:"(\[.*?\])"|"(\{.*?\})"|"([^"\{\}\[\]]*)")(?=,|$)", Pattern.DOTALL);
Matcher matcher = re1.matcher(input);
String singleQuoted = matcher.replaceAll("'$1$2$3'");
Pattern re = Pattern.compile(",(?=(?:[^']*'[^']*')*[^']*$)");
String[] parts = re.split(singleQuoted);
for(int partsIdx = 0; partsIdx < parts.length; partsIdx++ ){
System.out.println( "[" + partsIdx + "] = " + parts[partsIdx]);
}
}
}
显然有更优雅的方法可以做到这一点。
如果定界符为'|'此CSV文件可能由
解析public List<Map<String, String>> parse(Path csvPath) throws IOException {
Reader in = new FileReader(csvPath.toFile());
Iterable<CSVRecord> records = CSVFormat.DEFAULT
.withFirstRecordAsHeader()
.withIgnoreEmptyLines(true)
.withDelimiter('|')
.withTrim()
.withQuote(null)
.parse(in);
return StreamSupport
.stream(records.spliterator(), false)
.map(csvRecord -> csvRecord.toMap().entrySet().stream()
.collect(Collectors.toMap(
e -> e.getKey(),
e -> StringUtils.unwrap(e.getValue(), """)
)))
.collect(Collectors.toList());
}
您需要libs:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.9</version>
</dependency>
请看一个项目https://github.com/sbzdev/stackoverflow/tree/master/master/question56028130使用单位测试
尽管已经2年以上,但我希望这个答案可以帮助来自Google的其他人。
我的解决方案是一一读取字符,并使用括号匹配的概念,这是从"堆栈"引入"。
"的引入。当涉及a {&quot时或"&quot",推入堆栈。
涉及a}&quot时或"]&quot,从堆栈中弹出。
当涉及到",检查堆栈。如果是空的,那么您已经阅读了列字段,或者继续读取(此逗号在JSON内(。
这是Java的示例代码,我只是使用BracketCount变量而不是真实堆栈:
File file = new File("D:/test.csv");
FileReader fileReader = new FileReader(file);
BufferedReader reader = new BufferedReader(fileReader);
List<List<String>> results = reader.lines().map(str -> {
List<String> lineResult = new ArrayList<>(256);
char[] chars = str.toCharArray();
int bracketCount = 0;
StringBuilder fieldBuilder = new StringBuilder();
for (int index = 0; index < chars.length; index++) {
fieldBuilder.append(chars[index]);
if (chars[index] == '{' || chars[index] == '[') {
bracketCount++;
} else if (chars[index] == '}' || chars[index] == ']') {
bracketCount--;
} else if (chars[index] == ',') {
if (bracketCount == 0) {
lineResult.add(fieldBuilder.substring(0, fieldBuilder.length() - 1));
fieldBuilder.setLength(0);
}
} else if (index == chars.length - 1) {
lineResult.add(fieldBuilder.substring(0, fieldBuilder.length()));
fieldBuilder.setLength(0);
}
}
return lineResult;
}).collect(Collectors.toList());
ps:不考虑性能。