I have this long, ugly piece of code, and it will probably grow. I'm not asking you to solve a problem; I'd like to know how best to format it.
# Create data frames.
channel_data = pd.DataFrame(result_channel_statistics, columns=[
    'result_playlist_id',
    'result_channel_name',
    'result_channel_views',
    'result_channel_subscribers',
    'result_channel_total_videos',
])
video_data = pd.DataFrame(video_details, columns=[
    'result_video_id',
    'result_video_upload_time',
    'Published Date',
    'Published Time',
    'result_video_name',
    'result_video_description',
    'result_video_views',
    'result_video_likes',
    'result_video_comments',
])
# Data Frame Structure
channel_data['result_channel_subscribers'] = pd.to_numeric(channel_data['result_channel_subscribers'])
channel_data['result_channel_views'] = pd.to_numeric(channel_data['result_channel_views'])
channel_data['result_channel_total_videos'] = pd.to_numeric(channel_data['result_channel_total_videos'])
upload_time = video_data.groupby('result_video_upload_time', as_index = False).size()
published_date = upload_time['Dates'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.date
published_time = upload_time['Time'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.time
video_data['Published Date'] = published_date #Pull date from string
video_data['Published Time'] = published_time #Pull time from string
# Rename column headers
video_data.columns = ['Video ID', 'Published', 'Upload Date', 'Upload Time', 'Video Title', 'Video Description', 'Views', 'Likes', 'Comments']
channel_data.columns = ['Video ID', 'Channel Title', 'Total Views', 'Total Subs', 'Total Videos']
I figure the easiest way is to throw it into a function, but I'm curious whether you have a better way to express all of this:
# Data Frame Structure
channel_data['result_channel_subscribers'] = pd.to_numeric(channel_data['result_channel_subscribers'])
channel_data['result_channel_views'] = pd.to_numeric(channel_data['result_channel_views'])
channel_data['result_channel_total_videos'] = pd.to_numeric(channel_data['result_channel_total_videos'])
upload_time = video_data.groupby('result_video_upload_time', as_index = False).size()
published_date = upload_time['Dates'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.date
published_time = upload_time['Time'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.time
video_data['Published Date'] = published_date #Pull date from string
video_data['Published Time'] = published_time #Pull time from string
# Rename column headers
video_data.columns = ['Video ID', 'Published', 'Upload Date', 'Upload Time', 'Video Title', 'Video Description', 'Views', 'Likes', 'Comments']
channel_data.columns = ['Video ID', 'Channel Title', 'Total Views', 'Total Subs', 'Total Videos']
Is there any chance I could fold some of these changes into the DataFrame creation itself?
channel_data['result_channel_subscribers'] = pd.to_numeric(channel_data['result_channel_subscribers'])
channel_data['result_channel_views'] = pd.to_numeric(channel_data['result_channel_views'])
channel_data['result_channel_total_videos'] = pd.to_numeric(channel_data['result_channel_total_videos'])
You can do this, since those lines repeat the same operation:
columns_to_change_type = ['result_channel_subscribers', 'result_channel_views', 'result_channel_total_videos']
for col in columns_to_change_type:
    change_type_to_numeric(col)
where the change_type_to_numeric function is:
def change_type_to_numeric(col_name):
    channel_data[col_name] = pd.to_numeric(channel_data[col_name])
That is one example. Try to minimize repetition.
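As a variation on the loop above, pandas can convert several columns in one statement, which avoids a helper that depends on a global `channel_data`. This is only a minimal sketch: the column names come from the question, while the small frame and the `numeric_cols` name are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the real channel data (values arrive as strings).
channel_data = pd.DataFrame({
    'result_channel_name': ['foo', 'bar'],
    'result_channel_views': ['30', '25'],
    'result_channel_subscribers': ['10', '5'],
    'result_channel_total_videos': ['200', '50'],
})

numeric_cols = [
    'result_channel_subscribers',
    'result_channel_views',
    'result_channel_total_videos',
]

# DataFrame.apply calls pd.to_numeric once per selected column.
channel_data[numeric_cols] = channel_data[numeric_cols].apply(pd.to_numeric)
```

Adding another numeric column then only means adding one entry to the list.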
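Beyond that, you can keep the column names as constants in a config class, like this:
class Configs:
    RESULT_CHANNEL_SUBS_COL = 'result_channel_subscribers'
and then use Configs.RESULT_CHANNEL_SUBS_COL instead of the literal column name. These are a few clean-code strategies.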
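To show how such constants might be used for both the type conversion and the renaming step, here is a short sketch. Only `RESULT_CHANNEL_SUBS_COL` appears above; the second constant, the sample values, and the choice of `rename` with a mapping are assumptions added for illustration.

```python
import pandas as pd

class Configs:
    RESULT_CHANNEL_SUBS_COL = 'result_channel_subscribers'
    # Hypothetical second constant, added only for this example.
    RESULT_CHANNEL_VIEWS_COL = 'result_channel_views'

# Made-up sample values.
channel_data = pd.DataFrame({
    Configs.RESULT_CHANNEL_SUBS_COL: ['10', '5'],
    Configs.RESULT_CHANNEL_VIEWS_COL: ['30', '25'],
})

for col in (Configs.RESULT_CHANNEL_SUBS_COL, Configs.RESULT_CHANNEL_VIEWS_COL):
    channel_data[col] = pd.to_numeric(channel_data[col])

# Renaming through a mapping does not depend on column order,
# unlike assigning a full list to .columns.
channel_data = channel_data.rename(columns={
    Configs.RESULT_CHANNEL_SUBS_COL: 'Total Subs',
    Configs.RESULT_CHANNEL_VIEWS_COL: 'Total Views',
})
```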
In your code, instead of
published_date = upload_time['Dates'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.date
published_time = upload_time['Time'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.time
video_data['Published Date'] = published_date #Pull date from string
video_data['Published Time'] = published_time #Pull time from string
you can use this:
pub_dttm = pd.to_datetime(upload_time['result_video_upload_time']).dt
video_data['Published Date'] = upload_time['Dates'] = pub_dttm.date
video_data['Published Time'] = upload_time['Time'] = pub_dttm.time
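If the intermediate `upload_time` frame is not needed elsewhere, the same idea can be applied to `video_data` directly. The sketch below assumes the goal is one date and one time per video row, which is a guess about the intent of the question's code; the two sample rows are made up.

```python
import pandas as pd

# Made-up sample rows using the question's column names.
video_data = pd.DataFrame({
    'result_video_id': ['a1', 'b2'],
    'result_video_upload_time': ['2022-12-05 01:23:45', '2022-12-11 09:08:07'],
})

# Parse the strings once, then reuse the parsed Series for both columns.
uploaded = pd.to_datetime(video_data['result_video_upload_time'])
video_data['Published Date'] = uploaded.dt.date
video_data['Published Time'] = uploaded.dt.time
```

Because `uploaded` shares its index with `video_data`, each date and time lands on the row it was parsed from.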
You could decide to keep your definitions (nickname, target name, conversion) in a few tidy lists.
First, for the sake of the example, some synthetic data:
import pandas as pd

# synthetic data
result_channel_statistics = [
    [1, 'foo', '30', '10', '200'],
    ['2', 'bar', '25', '5', '50'],
]
video_details = [
    [1, '2022-12-05 01:23:45', 'hello', 'descr', '5', '2',
     ['best movie ever', 'ok but not great']],
    ['2', '2022-12-11 09:08:07', 'world', 'descr', '11', '3',
     ['this was great', 'I fell asleep']],
]
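Then, the definitions:
def numeric(s): return pd.to_numeric(s)
def timestamp(s): return pd.to_datetime(s)
def date(s): return pd.to_datetime(s).dt.date
def time(s): return pd.to_datetime(s).dt.time
def identity(s): return s
def process(df, d):
    # Name the raw columns after the distinct source nicknames, in order of first use.
    df = df.copy().set_axis(list(dict.fromkeys(src for src, *_ in d)), axis=1)
    # Build each target column from its source column.
    for src, dst, fun in d:
        df[dst] = fun(df[src])
    # Keep only the target columns, in the order they were defined.
    return df[[dst for src, dst, fun in d]].copy()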
Usage:
dchan = [
    ('id', 'Video ID', numeric),
    ('name', 'Channel Title', identity),
    ('views', 'Total Views', numeric),
    ('subscribers', 'Total Subs', numeric),
    ('n_videos', 'Total Videos', numeric),
]
dvid = [
    ('id', 'Video ID', numeric),
    ('t', 'Upload At', timestamp),
    ('t', 'Upload Date', date),
    ('t', 'Upload Time', time),
    ('name', 'Video Title', identity),
    ('description', 'Video Description', identity),
    ('views', 'Views', numeric),
    ('likes', 'Likes', numeric),
    ('comments', 'Comments', identity),
]
channel_data = process(pd.DataFrame(result_channel_statistics), dchan)
video_data = process(pd.DataFrame(video_details), dvid)
Result:
>>> channel_data
   Video ID Channel Title  Total Views  Total Subs  Total Videos
0         1           foo           30          10           200
1         2           bar           25           5            50
>>> channel_data.dtypes
Video ID          int64
Channel Title    object
Total Views       int64
Total Subs        int64
Total Videos      int64
dtype: object
>>> video_data
   Video ID           Upload At Upload Date Upload Time Video Title
0         1 2022-12-05 01:23:45  2022-12-05    01:23:45       hello
1         2 2022-12-11 09:08:07  2022-12-11    09:08:07       world
  Video Description  Views  Likes                             Comments
0             descr      5      2  [best movie ever, ok but not great]
1             descr     11      3      [this was great, I fell asleep]
>>> video_data.dtypes
Video ID                      int64
Upload At            datetime64[ns]
Upload Date                  object
Upload Time                  object
Video Title                  object
Video Description            object
Views                         int64
Likes                         int64
Comments                     object
dtype: object
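On the original question of folding the changes into DataFrame creation: for the simpler channel table, one possible option is to give the columns their final names at construction time and chain `astype` for the conversions. This is a sketch under assumptions, not a replacement for the table-driven approach above; the `names` and `numeric_cols` lists are illustrative, and it reuses the synthetic rows.

```python
import pandas as pd

# Same synthetic rows as in the example above.
result_channel_statistics = [
    [1, 'foo', '30', '10', '200'],
    ['2', 'bar', '25', '5', '50'],
]

names = ['Video ID', 'Channel Title', 'Total Views', 'Total Subs', 'Total Videos']
numeric_cols = ['Video ID', 'Total Views', 'Total Subs', 'Total Videos']

# Final names are set at creation; astype converts the numeric columns in one call.
channel_data = (
    pd.DataFrame(result_channel_statistics, columns=names)
    .astype({col: 'int64' for col in numeric_cols})
)
```

Note that `astype` raises on values that cannot be parsed as integers, whereas `pd.to_numeric` offers `errors='coerce'`, so the latter may be the safer choice for messy API results.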