In Spark 3.0.2, I am writing a Dataset to a Parquet file. My code ends like this:
etablissements = etablissements.repartition(col("codeDepartement"));
etablissements = etablissements.sortWithinPartitions(col("siret"));
etablissements = etablissements.persist();
// Write it to a file whose name contains the year of the data, the selections, and the sorting.
// The underlying statement that writes the parquet file is:
// ds.write().partitionBy(colonnesPartionnement /* = codeDepartement */)
saveToStore(etablissements, new String[] {"codeDepartement"},
"{0}_{1,number,#0}_{2}_{3}", "etablissements", anneeSIRENE, actifsSeulement,
communesValides);
codeDepartement is a StringType, because French department codes are codes of up to three characters.
# schema() :
|-- codeDepartement: string (nullable = true)
It is visible in the last third of the show() output (three columns before the upper-case city name), with the value "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|codeDepartement|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |01 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |01 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
The folders under my Parquet output directory look right:
codeDepartement=01
codeDepartement=2A
codeDepartement=75
codeDepartement=971
Note: because of values such as 2A (for Corse), the department code can never be converted to a numeric value.
The snappy.parquet chunks are stored in their respective folders, for example /data/tmp/etablissements_2020_true_true/codeDepartement=01, so that part is fine.
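To make the layout concrete, here is a minimal plain-Java sketch (no Spark involved) of how a folder name such as codeDepartement=01 splits into a column name and a raw string value. The class and method names are hypothetical and used only for illustration. It shows that the folder name itself still carries the leading zero, so nothing is lost at write time.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class PartitionFolderName {
    /** Splits "column=value" at the first '=' and keeps the value as raw text. */
    static Map.Entry<String, String> parse(String folderName) {
        int eq = folderName.indexOf('=');
        if (eq <= 0) {
            throw new IllegalArgumentException("Not a column=value folder name: " + folderName);
        }
        return new SimpleEntry<>(folderName.substring(0, eq), folderName.substring(eq + 1));
    }

    public static void main(String[] args) {
        String[] folders = {"codeDepartement=01", "codeDepartement=2A", "codeDepartement=971"};
        for (String folder : folders) {
            Map.Entry<String, String> entry = parse(folder);
            // The value is printed in quotes to make a leading zero visible.
            System.out.println(entry.getKey() + " -> \"" + entry.getValue() + "\"");
        }
    }
}
```

Whatever turns "01" into "1" therefore has to happen when the folder names are interpreted at read time.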
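At read time, I try to read the content back from that store. When I search for a city whose code starts with "01" (in France, a city code starts with its department code), the expected Parquet file and chunk are read: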
2021-03-24 07:14:33.825 INFO 13860 --- [er for task 106] o.a.s.s.e.datasources.FileScanRDD : Reading File path: file:/data/tmp/etablissements_2020_true_true/codeDepartement=01/part-00024-f7d33eea-6d79-4f1a-bf35-0666dcc5e0f5.c000.snappy.parquet, range: 0-5246504, partition values: [1]
When the department is displayed (it is now the last column of the Dataset show() output), its value is "1" instead of "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |codeDepartement|
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |1 |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
even though the Parquet file still declares it as a StringType:
|-- codeDepartement: string (nullable = true)
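What is happening?
I suspect the repartition() statement is the cause of this mess, but I don't see how. If that command is tricky and data cannot be partitioned by string values, how do programs partition data by colors spelled out as red, blue, and yellow?
I don't understand the overall behavior (problem?) I am facing.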
I was able to reproduce this problem.
spark.sql("select '01' key, 123 val union all select 'ab', 456").show()
+---+---+
|key|val|
+---+---+
| 01|123|
| ab|456|
+---+---+
spark.sql("select '01' key, 123 val union all select 'ab', 456").write().partitionBy("key").parquet("test")
spark.read().parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123| 1|
+---+---+
To work around this, you can provide a schema when reading:
spark.read().schema(spark.read().parquet("test").schema()).parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123| 01|
+---+---+
(Tested in PySpark; hopefully it works in Java too.)
You can disable the option spark.sql.sources.partitionColumnTypeInference.enabled.
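From the Partition Discovery section of the documentation:
[…] Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which is default to true. When type inference is disabled, string type will be used for the partitioning columns.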
Setting the option (Scala/PySpark syntax; in Java, call spark.conf().set(...) instead):
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
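To illustrate why an inferred value can lose its leading zero, here is a toy model in plain Java. It is an assumption that merely matches the outputs shown in this thread, not Spark's actual implementation, and the class and method names are hypothetical. With inference on, a folder value that parses as an integer is held as a number and then rendered back as text once the column ends up typed as string; with inference off, the raw folder text is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionTypeInferenceModel {
    /** Returns the values as a string column would display them under this toy model. */
    static List<String> readBack(List<String> rawValues, boolean inferTypes) {
        List<String> out = new ArrayList<>();
        for (String raw : rawValues) {
            if (!inferTypes) {
                out.add(raw); // inference off: the raw folder text is kept as-is
                continue;
            }
            try {
                // Inference on: "01" parses as the integer 1, so the zero is gone here...
                int parsed = Integer.parseInt(raw);
                // ...and converting back to text cannot restore it.
                out.add(Integer.toString(parsed));
            } catch (NumberFormatException e) {
                out.add(raw); // "ab" or "2A" is not numeric and stays a string
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> folders = Arrays.asList("01", "ab");
        System.out.println(readBack(folders, true));  // prints [1, ab]
        System.out.println(readBack(folders, false)); // prints [01, ab]
    }
}
```

In this model the column type is still string in both cases, which is consistent with the schema reporting string while the value shows as "1".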