How to assign an array from a DataFrame to a variable



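I need to get the array field from my DataFrame and assign it to a variable so that I can process it further. I am using the collect() function, but it is not working as expected.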

Input DataFrame:

Department    Language
(A, B, C)     English
[]            Spanish

The simplest solution I came up with is to extract the data with collect and explicitly assign it to predefined variables, as follows:

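from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField

# Reuse the active session, or start one when running this as a standalone script
spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("Department", ArrayType(StringType()), True),
    StructField("Language", StringType(), True)
])
df = spark.createDataFrame([(["A", "B", "C"], "English"), ([], "Spanish")], schema)

# Collect once: every collect() call re-runs the job and pulls all rows to the driver
rows = df.collect()
English = rows[0]["Department"]
Spanish = rows[1]["Department"]
print(f"English: {English}, Spanish: {Spanish}")
# English: ['A', 'B', 'C'], Spanish: []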

Edit: I completely missed that this is a PySpark question.

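The code below may still be useful if you convert your PySpark DataFrame to pandas, which may not be as ridiculous as it sounds for your situation. If the table is too big to fit in a pandas DataFrame, then it is also too big to store all of its arrays in a variable. You can use .filter() and .select() to shrink it first.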

Old answer:


The best way to do this really depends on the complexity of your DataFrame. Here are two ways:

import pandas as pd

# To recreate your dataframe
df = pd.DataFrame({
    'Department': [['A', 'B', 'C']],
    'Language': 'English'
})
df.loc[df.Language == 'English']
# Returns all rows where Language is English. If you only want Department then:
df.loc[df.Language == 'English'].Department
# This returns a Series containing your list. If you always expect a single match, add .iloc[0] as in:
df.loc[df.Language == 'English'].Department.iloc[0]
# Which returns only your list (.iloc[0] takes the first match by position, whereas [0] looks up the index label 0)
# The alternate method below isn't great but might be preferable in some circumstances, and again only if you expect a single match from any query.
department_lookup = df[['Language', 'Department']].set_index('Language').to_dict()['Department']
department_lookup['English']
# Returns your list
# This makes a dictionary where 'Language' is the key and 'Department' is the value. It is more work to set up and only works for a two-column relationship, but you might prefer working with dictionaries depending on the use case.

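If you are having data type problems, the cause is probably how the DataFrame is loaded rather than how you access it. pandas tends to turn lists into strings, for example when a DataFrame is written to CSV and read back.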


# If I saved and reloaded the df like so:
df.to_csv("the_df.csv", index=False)
df = pd.read_csv("the_df.csv")
# Then the Department values come back as strings, as in "['A', 'B', 'C']" rather than ['A', 'B', 'C']
# We can typically correct this by giving pandas a function for converting the incoming string to a list. This is done with the 'converters' argument, which takes a dictionary where the keys are column names and the values are functions, as such:
df = pd.read_csv("the_df.csv", converters={"Department": lambda x: [s.strip(" '\"") for s in x.strip("[]").split(",") if s.strip()]})
# df['Department'] now holds Python lists (pandas reports the column dtype as object)

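It is important to note that this lambda is only reliable when the strings were produced by Python converting a Python list to a string when the DataFrame was saved. Converting a string representation of a list back into a list has already been answered elsewhere.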
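One common way to do that conversion, offered here as a sketch rather than as what the referenced answer necessarily recommends, is the standard library's `ast.literal_eval`, which parses the text as a Python literal instead of splitting on commas. The helper name `parse_list_cell` is made up for this example:

```python
import ast


def parse_list_cell(cell):
    """Turn a cell such as "['A', 'B', 'C']" back into a Python list."""
    # literal_eval only evaluates literals, so it never runs arbitrary code
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value


print(parse_list_cell("['A', 'B', 'C']"))
print(parse_list_cell("[]"))
```

A function like this can be passed to pandas in place of the lambda, as in converters={"Department": parse_list_cell}. It copes with elements that contain commas or quotes, but it raises an error on text that is not a valid Python literal.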
