I'm new to pyspark, and I have this sample dataset:
Ticker_Modelo Ticker Type Period Product Geography Source Unit Test
0 Model1_Index Model1 Index NWE Forties Hydrocraking Daily Refinery Margins NWE Bloomberg None 3
1 Model2_Index Model2 Index NWE Bonny Light Hydrocraking Daily Refinery Margins NWE Bloomberg None 5
2 Model3_Index Model3 Index USGC LLS FCC Daily Refinery Margins USGC Bloomberg None 12
3 Model4_Index Model4 Index USGC Maya Coking Daily Refinery Margins USGC Bloomberg None 67
4 Model6_Index Model6 Index USMC WTI FCC Daily Refinery Margins USMC Bloomberg None 45
5 Model5_Index Model5 Index USMC WCSS Coking Daily Refinery Margins USMC Bloomberg None 22
6 Model7_Index Model7 Index USEC Hibernia FCC Daily Refinery Margins USEC Bloomberg None
7 Model8_Index Model8 Index Singapore Dubai Hydrocracking Daily Refinery Margins Singapore Bloomberg None Null
I need to do a data analysis and store the results in a database.
I have tried Optimus (https://github.com/ironmussa/Optimus/) and pandas_profiling (https://pandas-profiling.github.io/pandas-profiling/docs/), but they run the analysis and give you an HTML report, and I need specific values that they don't compute.
I need to count how many null/NaN/empty-string values there are in each column and build a new table with those counts.
I'm using pandas and pyspark.
I found an answer that I think should help, Python/Pyspark - Count NULL, empty and NaN, but when I try to apply it to a single column:
data_df.filter((data_df["Ticker_Modelo"] == "") | data_df["Ticker_Modelo"].isNull() | isnan(data_df["Ticker_Modelo"])).count()
it gives me an error: AttributeError: 'Series' object has no attribute 'isNull'
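(As far as I can tell, the pattern from that answer expects a Spark DataFrame, while my data_df above is a pandas DataFrame, so .isNull() does not exist on it. A sketch of what I believe the same filter would look like on a Spark DataFrame — spark_df here is just an assumed name for the same data loaded into Spark:)

from pyspark.sql.functions import col, isnan

# same condition as above, but on a Spark DataFrame; .isNull() is a Spark Column
# method, not a pandas Series method, which seems to be where the error comes from
spark_df.filter(
    (col("Ticker_Modelo") == "") | col("Ticker_Modelo").isNull() | isnan(col("Ticker_Modelo"))
).count()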
And then I don't know how to apply it to all the columns and turn the result into something like this:
Count_nulls
Ticker_Modelo 0
Ticker 0
Type 0
Period 0
Product 0
Geography 0
Source 0
Unit 0
Test 2
You can do the following:
First change all the None/Null values to pandas NaN, then count the NaN values per column:
import numpy as np

# treat the literal strings 'None' and 'Null' as missing (assign the result back)
df = df.replace(['None', 'Null'], np.nan)
df.isnull().sum(axis=0).to_frame().rename(columns={0: 'Count_Nulls'})
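If you want the same counts directly in PySpark instead of pandas, here is a minimal sketch, assuming spark_df is a Spark DataFrame holding the same data and that empty strings should also be treated as missing (spark_df and the column handling are illustrative, not part of your code):

from pyspark.sql import functions as F
import pandas as pd

# one count per column: rows that are NULL or an empty string
# (you can add F.isnan(F.col(c)) to the condition for numeric columns)
counts_row = spark_df.select([
    F.count(F.when(F.col(c).isNull() | (F.col(c) == ''), c)).alias(c)
    for c in spark_df.columns
]).collect()[0]

# reshape into one row per column, like the Count_nulls table above
pd.DataFrame(list(counts_row.asDict().items()),
             columns=['column', 'Count_nulls']).set_index('column')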