partitionBy removes special characters in PySpark

I have a dataframe (df) with 3 columns (col1 string, col2 int, col3 string) and millions of records:

Test's  123   abcdefgh
Tes#t   456   mnopqrst  
Test's  789   hdskfdss

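When I write this data with PySpark using the statement below, the special characters in col1 are lost: they are replaced by ASCII escape sequences when the directories are created in HDFS. Is there any way to preserve these characters and keep them in the directory path when writing this dataframe to HDFS?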

df.write.partitionBy("col1","col2").text(hdfs_path)

Please let me know if this is confusing and more details are needed. I am using Spark 1.6.1.

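Having special characters in file paths is not recommended. URI paths in the Hadoop shell do not support some special characters, and it is recommended to use only the characters mentioned in the Java URI documentation: http://docs.oracle.com/javase/7/docs/api/java/net/uri.html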

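The characters ' and # are replaced with the sequence of escaped octets that represents each character in the UTF-8 character set: ' becomes %27 and # becomes %23.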
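To make the escaping rule concrete, here is a minimal sketch of how an escaped-octet sequence is derived from a character's UTF-8 bytes. The helper name `percent_encode_char` is hypothetical and only illustrates the rule; it is not the code Spark or Hadoop actually runs.

```python
def percent_encode_char(ch):
    # Encode the character as UTF-8, then write each byte as %XX
    # (two uppercase hexadecimal digits per octet).
    return "".join("%{:02X}".format(byte) for byte in ch.encode("utf-8"))


print(percent_encode_char("'"))  # %27
print(percent_encode_char("#"))  # %23
```

Both characters are single-byte in UTF-8, so each maps to exactly one escaped octet.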


If you want to read the data back using the original string from the file name, use urllib's quote function:

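import urllib.parse  # Python 3: quote/unquote live in the urllib.parse submodule

file_name = "Tes#t"
# Percent-encode the special characters so the name matches the escaped path
url_file_name = urllib.parse.quote(file_name)
print(url_file_name)
# Decode the escaped name back to the original string
print(urllib.parse.unquote(url_file_name))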
    Tes%23t
    Tes#t
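Going the other way, the original values can be recovered from an escaped partition directory path. The following is a rough sketch that assumes a directory layout of the form `col1=.../col2=...` and assumes every escaped character is written as a `%XX` octet; the function name `parse_partition_path` and the sample path are hypothetical. Spark's own escaping rules may not match `urllib.parse.quote` for every character (a space, for example), so decoding with `unquote` is the safer direction.

```python
from urllib.parse import unquote


def parse_partition_path(path):
    """Map each 'name=value' segment of a partition path to its decoded value."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # skip segments that are not partition directories
        name, _, raw_value = segment.partition("=")
        values[name] = unquote(raw_value)  # turn %27 back into ', %23 into #
    return values


print(parse_partition_path("col1=Test%27s/col2=123"))
# {'col1': "Test's", 'col2': '123'}
```

All decoded values come back as strings, so col2 would still need to be cast to int if its original type matters.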
