I am reading some parquet files whose timezone is GMT-4:
from pyspark.sql import SparkSession

def get_spark():
    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
    spark.conf.set("spark.sql.session.timeZone", "GMT-4")
    return spark
The file shows:
base_so.where(base_so.ID_NUM_CLIENTE == 2273).show()
+--------------+-----------+----------------+------------------+-------------------+-------------------+-------------------+
|ID_NUM_CLIENTE|NUM_TRAMITE|COD_TIPO_1 |COD_TIPO_2 | FECHA_TRAMITE| FECHA_INGRESO| FECHA_INICIO_PAGO|
+--------------+-----------+----------------+------------------+-------------------+-------------------+-------------------+
| 2273| 238171| X| NN |2005-10-25 00:00:00|2005-10-25 09:26:54|1995-05-03 00:00:00|
| 2273| 238171| X| NMP|2005-10-25 00:00:00|2005-10-25 09:26:54|1995-05-03 00:00:00|
+--------------+-----------+----------------+------------------+-------------------+-------------------+-------------------+
When I create a DataFrame in my tests, it does not keep the same dates in the columns:
from datetime import datetime
from decimal import Decimal

from pyspark.sql.types import (
    DecimalType, StringType, StructField, StructType, TimestampType
)

spark = get_spark()
df_busqueda = spark.createDataFrame(
    data=[
        [Decimal(2273), Decimal(238171), "SO", datetime.strptime('2005-10-25 00:00:00', '%Y-%m-%d %H:%M:%S')],
    ],
    schema=StructType(
        [
            StructField('ID_NUM_CLIENTE', DecimalType(), True),
            StructField('NUM_TRAMITE', DecimalType(), True),
            StructField('COD_TIPO_1', StringType(), True),
            StructField('FECHA_TRAMITE', TimestampType(), True),
        ]
    ),
)
+--------------+-----------+----------------+-------------------+
|ID_NUM_CLIENTE|NUM_TRAMITE|COD_TIPO_1 | FECHA_TRAMITE|
+--------------+-----------+----------------+-------------------+
| 2273| 238171| SO|2005-10-24 23:00:00|
+--------------+-----------+----------------+-------------------+
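The one-hour shift above is what you get when a naive datetime is interpreted in one timezone (the driver JVM's default) and then displayed in another (the session's GMT-4). A minimal pure-Python illustration of that mechanism, assuming for the sake of example a driver zone of UTC-3 (note the POSIX sign inversion: `Etc/GMT+3` in tzdata means UTC-3, and `Etc/GMT+4` means UTC-4):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The naive wall-clock value passed to createDataFrame.
naive = datetime(2005, 10, 25, 0, 0, 0)

# Pin it to a hypothetical driver default zone of UTC-3...
as_driver = naive.replace(tzinfo=ZoneInfo("Etc/GMT+3"))

# ...and render the same instant in the session zone, UTC-4.
in_session = as_driver.astimezone(ZoneInfo("Etc/GMT+4"))

print(in_session)  # 2005-10-24 23:00:00-04:00 -- the one-hour shift
```

The fix is therefore to make the two zones agree (or to pass timezone-aware values), as the answers below do.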
How can I configure things so that the parquet files and the created DataFrame keep the same timezone?
You can set the timezone.
Example:
For Spark >= 3.0:
spark.sql("SET TIME ZONE 'America/New_York'").show()
//+--------------------------+----------------+
//|key |value |
//+--------------------------+----------------+
//|spark.sql.session.timeZone|America/New_York|
//+--------------------------+----------------+
spark.sql("select current_timestamp()").show()
//+--------------------------+
//|current_timestamp() |
//+--------------------------+
//|2021-08-25 16:23:16.096459|
//+--------------------------+
For spark < 3.0:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("select current_timestamp()").show()
//+--------------------+
//| current_timestamp()|
//+--------------------+
//|2021-08-25 20:26:...|
//+--------------------+
# import packages
import os, time
from dateutil import tz

Format the timestamp using the following snippet:

os.environ['TZ'] = 'GMT+4'
time.tzset()
time.strftime('%X %x %Z')
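The `tzset` approach above changes the process-local timezone, which you can verify against a fixed instant. A small sketch (note that POSIX `TZ` strings invert the sign, so `'GMT+4'` means four hours *behind* UTC, i.e. UTC-4, and `time.tzset()` is available on Unix only):

```python
import os
import time

# POSIX TZ inverts the sign: 'GMT+4' means UTC-4.
os.environ['TZ'] = 'GMT+4'
time.tzset()  # re-read TZ into the C library; Unix only

# Render the UTC epoch (1970-01-01 00:00:00 UTC) in the zone set above.
print(time.strftime('%Y-%m-%d %H:%M', time.localtime(0)))
# 1969-12-31 20:00  (i.e. UTC minus four hours)
```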
In my case the files are ingested through NiFi, and I had to modify the bootstrap so everything uses the same timezone:
spark-defaults.conf
spark.driver.extraJavaOptions -Duser.timezone=America/Santiago
spark.executor.extraJavaOptions -Duser.timezone=America/Santiago
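If editing spark-defaults.conf is not an option, the same settings can also be passed per job on the spark-submit command line (a sketch; `your_job.py` is a placeholder for your application). Since `extraJavaOptions` must reach the JVM before it starts, spark-submit or spark-defaults.conf are the reliable places for them:

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Duser.timezone=America/Santiago" \
  --conf "spark.executor.extraJavaOptions=-Duser.timezone=America/Santiago" \
  --conf "spark.sql.session.timeZone=America/Santiago" \
  your_job.py
```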