How do I write a list of struct vectors using Apache Arrow's Java libraries?



I have been struggling for a few days trying to write a list of struct vectors using Apache Arrow.

I am basically trying to build the following:

[[{"key1", "value1"},{"key1", "value1"},...],[{"key1", "value1"}, {"key1", "value1"}...]...]

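I have tried many variations, but below is a version that I think should work for a list of struct vectors, where each struct contains several varchar and dateday fields plus one int field: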

    ListVector listVector = (ListVector) root.getVector("units");
    listVector.allocateNew();
    UnionListWriter listWriter = listVector.getWriter();
    for (int i = 0; i < allUnits.size(); i++) {
        listWriter.setPosition(i);
        listWriter.startList();
        BaseWriter.StructWriter structWriter = listWriter.struct("unit");
        StructVector structVector =
            (StructVector) structWriter.getField()
                .createVector(allocator);
        structVector.allocateNew();
        // using this alternative below, I can see the StructVector filling up, but still nothing in the ListVector
        // StructVector structVector =
        //       (StructVector) listVector.getChildrenFromFields().get(0);
        // structVector.allocateNew();
        // BaseWriter.StructWriter structWriter = structVector.getWriter();
        ArrayNode units = allUnits.get(i);
        // "accn", "form", "fp", "fy", "type" -> field names of 'varchar' type
        for (int x = 0; x < units.size(); x++) {
            structWriter.start();
            structWriter.setPosition(x);
            JsonNode unitNode = units.get(x);
            for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
                String varCharVal = unitNode.get(varCharFieldName).asText();
                byte[] bytes = varCharVal.getBytes();
                try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                    tempBuf.setBytes(0, bytes, 0, bytes.length);
                    structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
                }
            }
            // "end", "filed" -> field names of 'dateday' type
            for (String dateFieldName : UNIT_DATE_FIELDS) {
                LocalDate date =
                    LocalDate.parse(unitNode.get(dateFieldName).asText(), ISO_LOCAL_DATE);
                structWriter.dateDay(dateFieldName)
                    .writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
            }
            structWriter.bigInt("val").writeBigInt(unitNode.get("val").asInt());
            structVector.setIndexDefined(x);
            structWriter.end();
        }
        structVector.setValueCount(units.size());
        listWriter.endList();
    }
    listVector.setValueCount(allUnits.size());

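I can see the structVector filling up with data for a given "unit" struct, but the writes do not propagate to the list of "unit" structs, that is, to the "units" list field itself.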
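
For context, Arrow's columnar format stores a list column as an offsets buffer over one shared child vector, so the structs of every list sit back to back in a single child. A vector obtained from `Field.createVector(allocator)` is a new standalone vector, which would explain why filling it leaves the list untouched. The plain-Java model below is only a sketch of that layout; `ListOfStructsModel` and its methods are hypothetical names for illustration and are not part of the Arrow API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is not Arrow's ListVector.
final class ListOfStructsModel {
    // One shared child buffer holding the structs of every list, back to back.
    private final List<Map<String, Object>> child = new ArrayList<>();
    // offsets.get(i) .. offsets.get(i + 1) delimits the structs of list i.
    private final List<Integer> offsets = new ArrayList<>(List.of(0));

    // Appends one struct to the shared child (roughly start()/end() on a struct writer).
    void writeStruct(Map<String, Object> struct) {
        child.add(struct);
    }

    // Closes the current list by recording where its structs stop.
    void endList() {
        offsets.add(child.size());
    }

    int listCount() {
        return offsets.size() - 1;
    }

    // Returns a view over the slice of the shared child that belongs to list i.
    List<Map<String, Object>> getList(int i) {
        return child.subList(offsets.get(i), offsets.get(i + 1));
    }

    public static void main(String[] args) {
        ListOfStructsModel model = new ListOfStructsModel();
        model.writeStruct(Map.of("val", 1L));
        model.writeStruct(Map.of("val", 2L));
        model.endList(); // list 0 holds two structs
        model.writeStruct(Map.of("val", 3L));
        model.endList(); // list 1 holds one struct
        System.out.println(model.listCount());
        System.out.println(model.getList(0).size());
        System.out.println(model.getList(1).size());
    }
}
```

In this picture, a struct only becomes part of a list when it is written into the shared child between the start and end of that list; a separately allocated struct container is never covered by any offset.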

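Below is a gist of a Google Colab notebook that will more or less run this example. It is best to drop the code into an IDE of your choice and run it there, with the Maven dependencies specified in the notebook.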

https://gist.github.com/gmsharpe/52aee837db9ebcacaf87a7ac07667bac

I removed the structVector and the structWriter.close() call from inside the listWriter loop, so it now populates the units array. Maybe you can continue from here:

    ListVector listVector = (ListVector) root.getVector("units");
    UnionListWriter listWriter = listVector.getWriter();
    listWriter.allocate();
    listVector.allocateNew();
    List<ArrayNode> allUnits = nodes.stream()
        .map(n -> (ArrayNode) (n.get("units").get("USD")))
        .collect(Collectors.toList());
    for (int i = 0; i < allUnits.size(); i++) {
        listWriter.setPosition(i);
        listWriter.startList();
        // Take the struct writer from the list writer so writes land in the list's own child vector.
        BaseWriter.StructWriter structWriter = listWriter.struct();
        ArrayNode units = allUnits.get(i);
        for (int x = 0; x < units.size(); x++) {
            structWriter.start();
            JsonNode unitNode = units.get(x);
            // "accn", "form", "fp", "fy", "type"
            for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
                byte[] bytes;
                if (varCharFieldName.equals("type")) {
                    bytes = "USD".getBytes(StandardCharsets.UTF_8); // java.nio.charset.StandardCharsets
                } else {
                    String varCharVal = unitNode.get(varCharFieldName).asText();
                    bytes = varCharVal.getBytes(StandardCharsets.UTF_8);
                }
                // Release the temporary buffer once its bytes have been written.
                try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                    tempBuf.setBytes(0, bytes);
                    structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
                }
            }
            // "end", "filed"
            for (String dateFieldName : UNIT_DATE_FIELDS) {
                LocalDate date = LocalDate.parse(unitNode.get(dateFieldName).asText(),
                    DateTimeFormatter.ISO_LOCAL_DATE);
                structWriter.dateDay(dateFieldName).writeDateDay(Math.toIntExact(date.toEpochDay()));
            }
            // bigInt is a 64-bit field, so read the value as a long to avoid truncation.
            structWriter.bigInt("val").writeBigInt(unitNode.get("val").asLong());
            structWriter.end();
        }
        // endList() records where list i stops; the overall value count is set once after the loop.
        listWriter.endList();
    }
    listVector.setValueCount(allUnits.size());
    System.out.println(root.contentToTSVString());
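
The two value conversions used in the loop can be checked on their own, without Arrow on the classpath. The sketch below assumes ISO-8601 date strings such as "2020-01-01" and UTF-8 text; `UnitValueEncoding` and its methods are hypothetical helper names, not part of Arrow or of the original code. A dateday value is a 32-bit count of days since 1970-01-01, and varchar data is UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; it only prepares values and does not touch Arrow.
final class UnitValueEncoding {
    // Converts an ISO-8601 date string to days since 1970-01-01 as an int.
    // Math.toIntExact throws ArithmeticException instead of silently truncating.
    static int toEpochDay(String isoDate) {
        LocalDate date = LocalDate.parse(isoDate, DateTimeFormatter.ISO_LOCAL_DATE);
        return Math.toIntExact(date.toEpochDay());
    }

    // Encodes text with an explicit charset so the result does not depend on the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toEpochDay("1970-01-02"));
        System.out.println(toEpochDay("2020-01-01"));
        System.out.println(toUtf8("USD").length);
    }
}
```

The int returned by `toEpochDay` is the kind of value passed to `writeDateDay`, and the array returned by `toUtf8` is what gets copied into the temporary `ArrowBuf` before `writeVarChar`.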
