在pyspark中使用format将数组传递到SQL查询中



我想通过将concepts的值作为参数值传递给UDFhas_any_concept来执行以下查询。

环境

concepts
['CREATININE_QUANTITATIVE_24_HOUR_DIALYSIS_FLUID_OBSTYPE',
'CREATININE_QUANTITATIVE_24_HOUR_URINE_OBSTYPE',
'CREATININE_QUANTITATIVE_SERUM_OBSTYPE']

这是不传递参数的查询。

(spark.sql("""
select 

resultCode.standard.primaryDisplay                                           as display

from results 
WHERE has_any_concept(resultCode, array("CREATININE_QUANTITATIVE_24_HOUR_DIALYSIS_FLUID_OBSTYPE","CREATININE_QUANTITATIVE_24_HOUR_URINE_OBSTYPE","CREATININE_QUANTITATIVE_SERUM_OBSTYPE"))

LIMIT 3
""".format(concepts = concepts))
.toPandas()
)
display
0   Creatinine [Mass/volume] in Serum or Plasma
1   Creatinine [Mass/volume] in Serum or Plasma
2   Creatinine [Mass/volume] in Serum or Plasma

(spark.sql("""
select 

resultCode.standard.primaryDisplay                                           as display,
ontologicalCategoryAliases                                                   as category

from results 
WHERE has_any_concept(resultCode, array("{concepts[0]}","{concepts[1]}","{concepts[2]}"))

LIMIT 3
""".format(concepts = concepts))
.toPandas()
)
display     category
0   Creatinine [Mass/volume] in Serum or Plasma     [LABS_OBSTYPE]
1   Creatinine [Mass/volume] in Serum or Plasma     [LABS_OBSTYPE]
2   Creatinine [Mass/volume] in Serum or Plasma     [LABS_OBSTYPE]

这行不通

(spark.sql("""
select 

resultCode.standard.primaryDisplay                                           as display,
ontologicalCategoryAliases                                                   as category

from results 
WHERE has_any_concept(resultCode, array({concepts}))

LIMIT 3
""".format(concepts = [''' "{concept}"   '''.format(concept = concept) for concept in concepts]))
.toPandas()
)
ParseException: 'nmismatched input 'from' expecting <EOF>(line 7, pos 3)nn== SQL ==nnselect n   n   resultCode.standard.primaryDisplay                                           as display,n   ontologicalCategoryAliases                                                   as categoryn   n   from results n---^^^n   WHERE has_any_concept(resultCode, array([' "CREATININE_QUANTITATIVE_24_HOUR_DIALYSIS_FLUID_OBSTYPE"   ', ' "CREATININE_QUANTITATIVE_24_HOUR_URINE_OBSTYPE"   ', ' "CREATININE_QUANTITATIVE_SERUM_OBSTYPE"   ']))n   AND normalizedValue.typedValue.type = "NUMERIC" n   AND interpretation.standard.primaryDisplay NOT IN ('Not applicable', 'Normal')n   n   LIMIT 10n'

我没有写UDFhas_any_concepts

如果使用python 3.6+,使用f-strings可以使代码看起来更简洁。

在SQL语法中不能直接将列表传递给数组函数。

spark.sql(
f"""
select 
resultCode.standard.primaryDisplay as display,
ontologicalCategoryAliases as category
from results 
WHERE has_any_concept(resultCode, array({", ".join([f"'{x}'" for x in concepts])}))
LIMIT 3
"""
).toPandas()

最新更新