How to retrieve rows from a dataset in Spark based on a time period, two given dates, or a range of years



I recently started using Spark. I am practicing on the spark-shell.

I have a dataset "movies.dat" with the following format:

MovieID,Title,Genres

Sample record:

2,Jumanji (1995),Adventure|Children|Fantasy

I want to generate a list of "Horror" movies released between 1985 and 1995.

Here is my approach.

scala> val movies_data = sc.textFile("file:///home/cloudera/cs/movies.dat")
scala> val tags=movies_data.map(line=>line.split(","))
scala> tags.take(5)
res3: Array[Array[String]] = Array(Array(1, Toy Story (1995), Adventure|Animation|Children|Comedy|Fantasy), Array(2, Jumanji (1995), Adventure|Children|Fantasy), Array(3, Grumpier Old Men (1995), Comedy|Romance), Array(4, Waiting to Exhale (1995), Comedy|Drama|Romance), Array(5, Father of the Bride Part II (1995), Comedy))
scala> val horrorMovies = tags.filter(genre=>genre.contains("Horror"))
scala> horrorMovies.take(5)
res4: Array[Array[String]] = Array(Array(177, Lord of Illusions (1995), Horror), Array(220, Castle Freak (1995), Horror), Array(841, Eyes Without a Face (Les Yeux sans visage) (1959), Horror), Array(1105, Children of the Corn IV: The Gathering (1996), Horror), Array(1322, Amityville 1992: It's About Time (1992), Horror))

I want to do this using only the Spark shell. I was able to retrieve all movies of the "Horror" genre. Now, is there a way to filter these movies further and keep only those released between 1985 and 1995?

You can write logic that extracts the year from the second element of the split line (the array) and compares it against the range:

scala> val movies_data = sc.textFile("file:///home/cloudera/cs/movies.dat")
movies_data: org.apache.spark.rdd.RDD[String] = file:///home/cloudera/cs/movies.dat MapPartitionsRDD[5] at textFile at <console>:25
scala> val tags=movies_data.map(line=>line.split(","))
tags: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[6] at map at <console>:27
scala> val horrorMovies = tags.filter(genre => {
| val date = genre(1).substring(genre(1).lastIndexOf("(")+1, genre(1).lastIndexOf(")")).toInt
| date >= 1985 && date <= 1995 && genre(2).contains("Horror")
| })
horrorMovies: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at filter at <console>:29
scala> horrorMovies.take(3)
res1: Array[Array[String]] = Array(Array(177, " Lord of Illusions (1995)", " Horror"), Array(220, " Castle Freak (1995)", " Horror"), Array(1322, " Amityville 1992: It's About Time (1992)", " Horror"))
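The year-extraction step itself can be checked outside Spark. A minimal plain-Scala sketch of the same logic used in the filter above (the helper name `releaseYear` is introduced here for illustration):

```scala
// Extract the year from the trailing "(YYYY)" in a title,
// exactly as the substring/lastIndexOf logic in the filter does.
def releaseYear(title: String): Int =
  title.substring(title.lastIndexOf("(") + 1, title.lastIndexOf(")")).toInt

val year = releaseYear("Castle Freak (1995)")
println(year)                          // 1995
println(year >= 1985 && year <= 1995)  // true
```

Note that `lastIndexOf` deliberately picks the last pair of parentheses, so titles that contain earlier parentheses still parse correctly.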

Hope this answer helps.

EDIT

You can also implement the above logic with a regex:

scala> val horrorMovies = tags.filter(genre => {
| val str = """(\d+)""".r findAllIn genre(1) mkString
| val date = if(str.length == 4) str.toInt else 0
| date >= 1985 && date <= 1995 && genre(2).contains("Horror")
| })
warning: there was one feature warning; re-run with -feature for details
horrorMovies: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at filter at <console>:33

The rest of the code is the same as above.
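One caveat with `findAllIn ... mkString`: a title such as "Amityville 1992: It's About Time (1992)" contains digits outside the parentheses, so the concatenated digit string is longer than four characters and the movie is silently dropped. A sketch that anchors the year to the last parenthesized four-digit group avoids this (the regex and the helper name are assumptions about the title format, not part of the original answer):

```scala
// Match the last "(YYYY)" group in the title; ignore any other digits.
val yearPattern = """\((\d{4})\)""".r

def releaseYear(title: String): Option[Int] =
  yearPattern.findAllMatchIn(title).map(_.group(1).toInt).toSeq.lastOption

println(releaseYear("Amityville 1992: It's About Time (1992)")) // Some(1992)
println(releaseYear("Untitled"))                                // None
```

In the Spark filter this would become `releaseYear(genre(1)).exists(y => y >= 1985 && y <= 1995) && genre(2).contains("Horror")`, which also handles titles with no year at all instead of throwing.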

