Splitting a dataset based on a column value in Spark



I am trying to split a dataset into separate datasets based on the contents of the Manufacturer column, but it is very slow.
Please suggest a way to improve the code so that it executes faster and uses less Java code.

List<Row> lsts = unq_manf.collectAsList();

for (Row lst : lsts) {
    // Row.toString() yields e.g. "[ael]"; strip the surrounding brackets
    String man = lst.toString();
    man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
    Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
    DF.show();
}

The code and the input and output datasets are shown below.

package org.sparkexample;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class GroupBy {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\winutils");
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);
        SparkSession spark = SparkSession.builder().appName("split datasets").getOrCreate();
        sc.setLogLevel("ERROR");

        Dataset<Row> src = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("sample.csv");

        // distinct manufacturer values, collected to the driver
        Dataset<Row> unq_manf = src.select("Manufacturer").distinct();
        List<Row> lsts = unq_manf.collectAsList();

        for (Row lst : lsts) {
            // Row.toString() yields e.g. "[ael]"; strip the surrounding brackets
            String man = lst.toString();
            man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
            Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
            DF.show();
        }
    }
}

Input table

+------+------------+--------------------+---+
|ItemID|Manufacturer|       Category name|UPC|
+------+------------+--------------------+---+
|   804|         ael|Brush & Broom Han...|123|
|   805|         ael|Wheel Brush Parts...|124|
|   813|         ael|      Drivers Gloves|125|
|   632|        west|       Pipe Wrenches|126|
|   804|         bil|     Masonry Brushes|127|
|   497|        west|   Power Tools Other|128|
|   496|        west|   Power Tools Other|129|
|   495|         bil|           Hole Saws|130|
|   499|         bil|    Battery Chargers|131|
|   497|        west|   Power Tools Other|132|
+------+------------+--------------------+---+

Output

+------------+
|Manufacturer|
+------------+
|         ael|
|        west|
|         bil|
+------------+
+------+------------+--------------------+---+
|ItemID|Manufacturer|       Category name|UPC|
+------+------------+--------------------+---+
|   804|         ael|Brush & Broom Han...|123|
|   805|         ael|Wheel Brush Parts...|124|
|   813|         ael|      Drivers Gloves|125|
+------+------------+--------------------+---+
+------+------------+-----------------+---+
|ItemID|Manufacturer|    Category name|UPC|
+------+------------+-----------------+---+
|   632|        west|    Pipe Wrenches|126|
|   497|        west|Power Tools Other|128|
|   496|        west|Power Tools Other|129|
|   497|        west|Power Tools Other|132|
+------+------------+-----------------+---+
+------+------------+----------------+---+
|ItemID|Manufacturer|   Category name|UPC|
+------+------------+----------------+---+
|   804|         bil| Masonry Brushes|127|
|   495|         bil|       Hole Saws|130|
|   499|         bil|Battery Chargers|131|
+------+------------+----------------+---+

In this case, you have two options:

  1. First, collect the distinct manufacturer values, then map over the resulting array (a Java version of this approach follows the list):

    val df = Seq(("HP", 1), ("Brother", 2), ("Canon", 3), ("HP", 5)).toDF("k", "v")    
    val brands = df.select("k").distinct.collect.flatMap(_.toSeq)
    val BrandArray = brands.map(brand => df.where($"k" <=> brand))
    BrandArray.foreach { x =>
      x.show()
      println("---------------------------------------")
    }
    
  2. You can also save the DataFrame partitioned by manufacturer:

    // "k" holds the manufacturer in the toy df above
    df.write.partitionBy("k").saveAsTable("parquet")
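
Since the question is in Java, here is a minimal Java sketch of option 1 (an illustrative rewrite, not the asker's original): Row.getString(0) reads the value directly instead of round-tripping through toString() and a regex, and Column.equalTo avoids building SQL strings by hand.

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SplitByManufacturer {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("split datasets").master("local[*]").getOrCreate();
        Dataset<Row> src = spark.read().option("header", "true").csv("sample.csv");

        // collect the distinct manufacturer values on the driver
        List<Row> manufacturers = src.select("Manufacturer").distinct().collectAsList();

        // one filtered Dataset per manufacturer
        for (Row row : manufacturers) {
            String man = row.getString(0); // read the value directly, no regex needed
            Dataset<Row> part = src.filter(src.col("Manufacturer").equalTo(man));
            part.show();
        }
        spark.stop();
    }
}

Note that show() triggers one Spark job per manufacturer, so the loop still scales linearly with the number of distinct values.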

Rather than splitting the dataset/DataFrame by manufacturer up front, if you frequently need to query by manufacturer it is probably best to write the DataFrame out with manufacturer as the partition key.
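
Continuing in Java with the src Dataset and spark session from the question, a sketch of option 2; the output path output/by_manufacturer is a made-up example:

// Write once, partitioned by Manufacturer: each distinct value gets its own
// Manufacturer=<value>/ subdirectory under the output path.
src.write().partitionBy("Manufacturer").parquet("output/by_manufacturer");

// Later reads that filter on Manufacturer touch only the matching
// subdirectory (partition pruning) instead of scanning the whole dataset.
Dataset<Row> ael = spark.read().parquet("output/by_manufacturer")
        .filter("Manufacturer = 'ael'");
ael.show();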

In case you still want separate outputs per column value, here is one approach using PySpark and Spark 2.0+:

from pyspark.sql import functions as F

df = spark.read.csv("sample.csv", header=True)

# collect the distinct manufacturer values
manufacturers = df.select('Manufacturer').distinct().collect()

# loop over the manufacturers, filter df by each value, and write each subset separately
for m in manufacturers:
    df1 = df.where(F.col('Manufacturer') == m[0])
    # optionally repartition before writing: df1 = df1.repartition(repartition_col)
    df1.write.parquet(<write_path>)  # pass mode=<write_mode> if needed
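
Note that both this loop and option 1 launch one Spark job per distinct manufacturer, and each job re-scans the source data unless df is cached; the single partitionBy write shown earlier performs the same split in one job, which is usually why it is the faster choice.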
