I am working with a very large file: a plain-text document of roughly 2 GB.
It looks like this:
#*MOSFET table look-up models for circuit simulation
#t1984
#cIntegration, the VLSI Journal
#index1
#*The verification of the protection mechanisms of high-level language machines
#@Virgil D. Gligor
#t1984
#cInternational Journal of Parallel Programming
#index2
#*Another view of functional and multivalued dependencies in the relational database model
#@M. Gyssens, J. Paredaens
#t1984
#cInternational Journal of Parallel Programming
#index3
#*Entity-relationship diagrams which are in BCNF
#@Sushil Jajodia, Peter A. Ng, Frederick N. Springsteel
#t1984
#cInternational Journal of Parallel Programming
#index4
I want to read this file in Spark, split it on the blank lines between records, and turn each such block into one record in PySpark, like this:
#*Entity-relationship diagrams which are in BCNF #@Sushil Jajodia, Peter A. Ng, Frederick N. Springsteel #t1984 #cInternational Journal of Parallel Programming #index4
The code I have written so far is:
rdd = sc.textFile('acm.txt').flatMap(lambda x: x.split("\n\n"))
As far as I understand, you want to read this text file in Spark with one record per paragraph. To do that, you can change the record delimiter (the default is \n) like this:
In Scala:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
val rdd = sc.textFile("acm.txt")
In Python (you need to go through the Java Spark context to reach the Hadoop configuration):
sc._jsc.hadoopConfiguration().set("textinputformat.record.delimiter", "\n\n")
rdd = sc.textFile("acm.txt")