How to generate datasets based on combinations of indexes



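I want to compute the datasets that result from combinations of indexes given as a list of integers. For example, suppose I have the integer list [0, 1, 2, 3] and the following initial dataset: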

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000| 
|  1|     B        |     1000|
|  2|     C        |     2000|
|  3|     D        |     3000|
+---+--------------+---------+

Then the resulting index combinations of size 2 are:

[0, 1]
[0, 2]
[0, 3]
[1, 2]
[1, 3]
[2, 3]

I want the following corresponding datasets:

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000|
|  1|     B        |     1000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000|
|  2|     C        |     2000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000|
|  3|     D        |     3000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  1|     B        |     1000|
|  2|     C        |     2000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  1|     B        |     1000|
|  3|     D        |     3000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  2|     C        |     2000|
|  3|     D        |     3000|
+---+--------------+---------+

Currently I generate the combinations on a single node, using the classic approach in Java, with the following code:

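// Recursively fills data[index..] with increasing values taken from [start, end].
private void helper(List<int[]> combinations, int data[], int start, int end, int index) {
    if (index == data.length) {
        // Every slot is filled: store a copy of the current combination.
        int[] combination = data.clone();
        combinations.add(combination);
    } else if (start <= end) {
        data[index] = start;
        // Keep `start` in this slot and move on to the next slot.
        helper(combinations, data, start + 1, end, index + 1);
        // Skip `start` and try the next candidate for the same slot.
        helper(combinations, data, start + 1, end, index);
    }
}

// Returns all r-element combinations of the indexes 0..n-1.
public List<int[]> generate(int n, int r) {
    List<int[]> combinations = new ArrayList<>();
    helper(combinations, new int[r], 0, n - 1, 0);
    return combinations;
}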

// Collect the rows once on the driver instead of once per combination.
List<Row> rows = initialDataset.collectAsList();
// One result dataset per combination.
List<Dataset<Row>> datasetOfRows = new ArrayList<>();
List<int[]> combinations = generate(numberOfRows, k);
for (int[] combination : combinations) {
    List<Row> datasetRows = new ArrayList<>();
    for (int index : combination) {
        datasetRows.add(rows.get(index));
    }
    Dataset<Row> datasetOfSRows = sparksession.createDataFrame(datasetRows, schema);
    datasetOfRows.add(datasetOfSRows);
}

But I would like a native Spark solution that uses many nodes to compute the resulting datasets (for example via map()). How can this be done in Java/Scala?

You may want to look at Spark SQL's isin. This link explains how to use it: https://sparkbyexamples.com/spark/spark-isin-is-not-in-operator-example/

After studying your code, I tried to come up with the following. I hope it helps.

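// combination is an int[]; box it so that each id is passed to isin as a separate value.
initialDataset.filter(col("id").isin(Arrays.stream(combination).boxed().toArray()));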
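
One detail to watch when calling isin from Java with the int[] combinations from the question: isin takes a variable number of Object arguments, and Java passes a primitive int[] to such a method as a single argument rather than spreading its elements. The plain-Java sketch below illustrates this without Spark. The class IsinVarargsSketch and the methods countArgs and box are hypothetical names used only for illustration; countArgs merely stands in for a varargs method of the same shape and is not part of the Spark API.

```java
import java.util.Arrays;

public class IsinVarargsSketch {
    // Hypothetical stand-in for a varargs method shaped like isin(Object... list):
    // it only reports how many arguments it received.
    static int countArgs(Object... values) {
        return values.length;
    }

    // Box an int[] into an Object[] so that each id becomes its own vararg.
    static Object[] box(int[] combination) {
        return Arrays.stream(combination).boxed().toArray();
    }

    public static void main(String[] args) {
        int[] combination = {0, 2};
        // The primitive array is wrapped as one Object argument: prints 1.
        System.out.println(countArgs(combination));
        // The boxed Object[] is spread over the varargs: prints 2.
        System.out.println(countArgs(box(combination)));
    }
}
```

If the combinations were stored as Integer[] or List<Integer> instead of int[], the boxing step would presumably not be needed, since toArray() on those already yields an Object[].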
