[原始数据][1]我有一组全年阅读的书名数据。我只想在原始数据的基础上添加一个名为Date的新列。
Tittle Page Author Rating
0 Monday, Mar. 1 NaN NaN NaN
1 Tittle 1 5.0 JHK 1.50
2 Tittle 2 13.0 ABB 0.03
3 Tittle 3 100.0 ACC 3.5
4 Tittle 4 9.0 NN 5.40
5 Tuesday, Jan. 2 NaN NaN NaN
6 Tittle 5 6.0 BBB 6.50
7 Tittle 7 14.0 CCC 10.00
8 Tittle 8 10.0 CNN 2.50
9 Wednesday, Dec.3 NaN NaN NaN
10 Tittle 10 5.0 CBS 1.00
11 Title 20 5.0 ABC 1.00
12 Title 21 25.0 JJJ 3.50
13 Title 22 1.0 NNN 7.50
14 Thursday, Mar.4 NaN NaN NaN
15 Title 25 100.0 VVV 9.00
16 Title 30 6.0 YYYY 9.00
17 Title 35 2.0 QQQ 9.00
我试过使用dropna((,但最终它只是去掉了整行。
dfs = pd.read_csv('Book2.csv')
df = dfs.dropna()
display(df)
Tittle Page Author Rating
1 Tittle 1 5.0 JHK 1.50
2 Tittle 2 13.0 ABB 0.03
4 Tittle 4 9.0 tvN 5.40
6 Tittle 5 6.0 BBB 6.50
7 Tittle 7 14.0 CCC 10.00
8 Tittle 8 10.0 CNN 2.50
10 Tittle 10 5.0 CBS 1.00
11 Title 20 5.0 ABC 1.00
12 Title 21 25.0 JJJ 3.50
13 Title 22 1.0 NNN 7.50
15 Title 25 100.0 VVV 9.00
16 Title 30 6.0 YYYY 9.00
17 Title 35 2.0 QQQ 9.00
我曾尝试使用pd.isna((来制作一个新的数据帧,但结果并不是我想要的样子。
dfs = pd.read_csv('Book2.csv')
df = dfs[dfs.isnull().any(axis=1)]
display(df)
Tittle Page Author Rating
0 Monday, Mar. 1 NaN NaN NaN
5 Tuesday, Jan. 2 NaN NaN NaN
9 Wednesday, Dec. 3 NaN NaN NaN
14 Thursday, Mar. 4 NaN NaN NaN
最后,我只是在excel中手动编辑它,使它看起来像我想要的样子。[我希望它看起来像这样。][2]
Tittle Page Author Rating Date
0 Tittle 1 5 JHK 1.50 Monday, Mar. 1
1 Tittle 2 13 ABB 0.03 Monday, Mar. 1
2 Tittle 3 100 ACC 4.50 Monday, Mar. 1
3 Tittle 4 9 tvN 5.40 Monday, Mar. 1
4 Tittle 5 6 BBB 6.50 Tuesday, Jan. 2
5 Tittle 7 14 CCC 10.00 Tuesday, Jan. 2
6 Tittle 8 10 CNN 2.50 Tuesday, Jan. 2
7 Tittle 10 5 CBS 1.00 Wednesday, Dec. 3
8 Title 20 5 ABC 1.00 Wednesday, Dec. 3
9 Title 21 25 JJJ 3.50 Wednesday, Dec. 3
10 Title 22 1 NNN 7.50 Wednesday, Dec. 3
11 Title 25 100 VVV 9.00 Thursday, Mar. 4
12 Title 30 6 YYYY 9.00 Thursday, Mar. 4
13 Title 35 2 QQQ 9.00 Thursday, Mar. 4
```
So I could sort or groupby the data by date later on if i want to see if there is a trend on the students reading habits in the future but for now I just want to add another date column and use the date that is all ready there in the data set if possible. I have to comb through a total of 181 pages of this and I am hoping pandas could help cut down the hours I will have to spend editing this manually in excel and using the copy and paste.
if you have any other recommendation in how to efficiently wrangle this data-set where the title will not mix in the same column and the NAN will be taken cared off. It will be greatly appreciated.
[1]: https://i.stack.imgur.com/JBihm.png
[2]: https://i.stack.imgur.com/nlJbE.png
用fillna()
填充具有三个或更多NA编号的条件提取的数据帧。将生成的日期序列与NA为零的数据组合。
df1 = df.loc[df.isnull().sum(axis=1) == 3].reindex(index=df.index)
df1.fillna(method='ffill', inplace=True)
df1 = pd.concat([df1[['Tittle']], df.loc[df.isnull().sum(axis=1) == 0]], axis=1, join='inner')
df1.columns =['Date', 'Tittle', 'Page', 'Author', 'Rating']
df1
Date Tittle Page Author Rating
1 Monday, Mar. 1 Tittle 1 5.0 JHK 1.50
2 Monday, Mar. 1 Tittle 2 13.0 ABB 0.03
3 Monday, Mar. 1 Tittle 3 100.0 ACC 3.50
4 Monday, Mar. 1 Tittle 4 9.0 NN 5.40
6 Tuesday, Jan. 2 Tittle 5 6.0 BBB 6.50
7 Tuesday, Jan. 2 Tittle 7 14.0 CCC 10.00
8 Tuesday, Jan. 2 Tittle 8 10.0 CNN 2.50
10 Wednesday, Dec. 3 Tittle 10 5.0 CBS 1.00
11 Wednesday, Dec. 3 Title 20 5.0 ABC 1.00
12 Wednesday, Dec. 3 Title 21 25.0 JJJ 3.50
13 Wednesday, Dec. 3 Title 22 1.0 NNN 7.50
15 Thursday, Mar. 4 Title 25 100.0 VVV 9.00
16 Thursday, Mar. 4 Title 30 6.0 YYYY 9.00
17 Thursday, Mar. 4 Title 35 2.0 QQQ 9.00