Creating a Spark RDD or DataFrame from strings read from a Java InputStream

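I have a series of strings in Java that come from a CSV file on another machine. As shown below, I obtain an InputStream, wrap it in a BufferedReader, and read the CSV file line by line.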

        // Call a method that returns the InputStream
        InputStream stream = getInputStreamOfFile();
        BufferedReader lineStream = new BufferedReader(new InputStreamReader(stream));
        String inputLine;
        while ((inputLine = lineStream.readLine()) != null) {
            System.out.println("******************new Line***********");
            System.out.println(inputLine);
        }
        lineStream.close();
        stream.close();

Now I want to create a Spark RDD or DataFrame from it.

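One solution I have been using is to create a new RDD for each line, maintain a global RDD, and keep merging each new RDD into it. Is there any other solution?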

Note: this file is not on the same machine. It comes from remote storage, and I have an HTTP URL for the file.
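
The question does not show `getInputStreamOfFile()`. Since the file is reachable through a URL, one plausible shape for it is sketched below. The class name `RemoteCsv` and the method `openStream` are hypothetical, and the sketch assumes the URL needs no authentication or custom headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

final class RemoteCsv {
    // Hypothetical stand-in for getInputStreamOfFile(): opens the given
    // absolute URL and returns its raw byte stream.
    // The caller is responsible for closing the returned stream.
    static InputStream openStream(String fileUrl) throws IOException {
        return URI.create(fileUrl).toURL().openStream();
    }
}
```

The returned stream can be wrapped in an `InputStreamReader` and a `BufferedReader` exactly as in the loop above.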

If the contents of the input stream fit in memory, we can use the following:

private static List<String> displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and collect each line.
    // try-with-resources closes the reader (and the underlying stream) when done.
    List<String> result = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(input))) {
        String line;
        while ((line = reader.readLine()) != null) {
            result.add(line);
        }
    }
    return result;
}

Now we can convert the List<String> into the corresponding RDD:

S3Object fullObject = s3Client.getObject(new GetObjectRequest("bigdataanalytics", each.getKey()));
List<String> listVals = displayTextInputStream(fullObject.getObjectContent());
JavaRDD<String> s3Rdd = sc.parallelize(listVals);
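
As a middle ground between one RDD per line (the approach described in the question) and one list for the whole stream, the lines can be pulled in fixed-size batches. The following is only a sketch: the class `LineBatches`, the method `nextBatch`, and the batch size are hypothetical names and choices, not part of the original answer or of Spark.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class LineBatches {
    // Pulls up to maxLines lines from the iterator.
    // Returns an empty list once the iterator is exhausted.
    static List<String> nextBatch(Iterator<String> lines, int maxLines) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxLines && lines.hasNext()) {
            batch.add(lines.next());
        }
        return batch;
    }
}
```

A possible use is to obtain the iterator from `reader.lines().iterator()`, call `nextBatch` until it returns an empty list, pass each batch to `sc.parallelize(...)`, and combine the resulting RDDs with `JavaRDD.union`. That way the number of RDDs depends on the number of batches rather than the number of lines.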
