在频率为其他列的行中填充空值



在spark结构化流上下文中,我有这样的数据框架:

+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|null     |
|BR300 |1632899155|null     |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|null     |
|BR22  |1632900176|null     |
+------+----------+---------+

我想用品牌的频率来替换空值在批处理中,为了获得像这样的数据帧:

+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|2        | 
|BR300 |1632899155|2        |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|2        |
|BR22  |1632900176|2        |
+------+----------+---------+

我使用Spark 2.4.3版本和SQLContext,使用scala语言。

With "count"Over窗口函数:

val df = Seq(
("BR1", 1632899456, Some(4)),
("BR1", 1632901256, Some(4)),
("BR300", 1632901796, None),
("BR300", 1632899155, None),
("BR90", 1632901743, Some(1)),
("BR1", 1632899933, Some(4)),
("BR1", 1632899756, Some(4)),
("BR22", 1632900776, None),
("BR22", 1632900176, None)
).toDF("brand", "Timestamp", "frequency")
val brandWindow = Window.partitionBy("brand")
val result = df.withColumn("frequency", when($"frequency".isNotNull, $"frequency").otherwise(count($"brand").over(brandWindow)))

结果:

+-----+----------+---------+
|BR1  |1632899456|4        |
|BR1  |1632901256|4        |
|BR1  |1632899933|4        |
|BR1  |1632899756|4        |
|BR22 |1632900776|2        |
|BR22 |1632900176|2        |
|BR300|1632901796|2        |
|BR300|1632899155|2        |
|BR90 |1632901743|1        |
+-----+----------+---------+

GroupBy的解:

val countDF = df.select("brand").groupBy("brand").count()

df.alias("df")
.join(countDF.alias("cnt"), Seq("brand"))
.withColumn("frequency", when($"df.frequency".isNotNull, $"df.frequency").otherwise($"cnt.count"))
.select("df.brand", "df.Timestamp", "frequency")

我是java程序员。最好是循环遍历frequency列并搜索第一个null及其相关品牌。数这个的个数,直到表的最后,纠正那个品牌的空值,再去纠正另一个品牌的空值。这是我的java解决方案:(我没有测试这个代码只是写它的文本编辑器,但我希望工作得很好,70%;)

//this is your table  +  dimensions
table[9][3];    
int repeatCounter = 0;
String brand;
boolean thereIsNull = true;
//define an array to save the address of the specified null brand
int[tablecolumns.length()] brandmemory; 
while (thereisnull) {
for (int i = 0; i < tablecolumns.length(); i++) {

if (array[i][3] == null) {
thereIsNull = true;
brand = array[i][1];
for (int n = i; n < tablecolumns.length(); i++) {
if (brand == array[i][1]) {
repeatCounter++;
// making an array to save address of  the null brand in table:
brandmemory[repeatCounter] = i;
else{
break ;
}
}
for (int p = 1; p = repeatCounter ; p++) {
//changing null values to number of repeats 
array[brandmemory[p]][3] = repeatCounter;
}
}
}
else{
continue;
//check if the table has any null content if no :end of program. 
for(int w>i ; w=tablecolumns.length();w++ ){
if(array[w] != null  ){
thereIsNull = false;
else{ thereIsNull = true;
break;

}
}
}
}
}

最新更新