Consider an example dataframe:
df =
+-------+-----+
| tech|state|
+-------+-----+
| 70|wa |
| 50|mn |
| 20|fl |
| 50|mo |
| 10|ar |
| 90|wi |
| 30|al |
| 50|ca |
+-------+-----+
I want to change the "tech" column so that any value of 50 becomes 1 and every other value becomes 0.
The output would look like this:
df =
+-------+-----+
| tech|state|
+-------+-----+
| 0 |wa |
| 1 |mn |
| 0 |fl |
| 1 |mo |
| 0 |ar |
| 0 |wi |
| 0 |al |
| 1 |ca |
+-------+-----+
Here is what I have so far:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType
changing_column = 'tech'
udf_first = UserDefinedFunction(lambda x: 1, IntegerType())
udf_second = UserDefinedFunction(lambda x: 0, IntegerType())
first_df = zero_df.select(*[udf_first(changing_column) if column == 50 else column for column in zero_df])
second_df = first_df.select(*[udf_second(changing_column) if column != 50 else column for column in first_df])
second_df.show()
Hope this helps:
from pyspark.sql.functions import when

df = spark.createDataFrame(
    [(70, 'wa'),
     (50, 'mn'),
     (20, 'fl')],
    ["tech", "state"])

df.select("*", when(df.tech == 50, 1)
               .otherwise(0)
               .alias("tech")) \
  .show()
+----+-----+----+
|tech|state|tech|
+----+-----+----+
| 70| wa| 0|
| 50| mn| 1|
| 20| fl| 0|
+----+-----+----+