Best way to format DataFrame changes?



I have this long, ugly piece of code, and it will probably get longer. I'm not asking you to solve a problem; I'd like to know how best to format it.

# Create data frames.
channel_data = pd.DataFrame(result_channel_statistics, columns=[
    'result_playlist_id',
    'result_channel_name',
    'result_channel_views',
    'result_channel_subscribers',
    'result_channel_total_videos',
])
video_data = pd.DataFrame(video_details, columns=[
    'result_video_id',
    'result_video_upload_time',
    'Published Date',
    'Published Time',
    'result_video_name',
    'result_video_description',
    'result_video_views',
    'result_video_likes',
    'result_video_comments',
])
# Data Frame Structure
channel_data['result_channel_subscribers'] = pd.to_numeric(channel_data['result_channel_subscribers'])
channel_data['result_channel_views'] = pd.to_numeric(channel_data['result_channel_views'])
channel_data['result_channel_total_videos'] = pd.to_numeric(channel_data['result_channel_total_videos'])
upload_time = video_data.groupby('result_video_upload_time', as_index = False).size()
published_date = upload_time['Dates'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.date
published_time = upload_time['Time']  = pd.to_datetime(upload_time['result_video_upload_time']).dt.time
video_data['Published Date'] = published_date #Pull date from string
video_data['Published Time'] = published_time #Pull time from string
# Rename column headers
video_data.columns = ['Video ID', 'Published', 'Upload Date', 'Upload Time', 'Video Title', 'Video Description', 'Views', 'Likes', 'Comments']
channel_data.columns = ['Video ID', 'Channel Title', 'Total Views', 'Total Subs', 'Total Videos']

I think the simplest approach is to throw it into a function, but I'm curious whether you have a better way to express all of this:

# Data Frame Structure
channel_data['result_channel_subscribers'] = pd.to_numeric(channel_data['result_channel_subscribers'])
channel_data['result_channel_views'] = pd.to_numeric(channel_data['result_channel_views'])
channel_data['result_channel_total_videos'] = pd.to_numeric(channel_data['result_channel_total_videos'])
upload_time = video_data.groupby('result_video_upload_time', as_index = False).size()
published_date = upload_time['Dates'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.date
published_time = upload_time['Time']  = pd.to_datetime(upload_time['result_video_upload_time']).dt.time
video_data['Published Date'] = published_date #Pull date from string
video_data['Published Time'] = published_time #Pull time from string
# Rename column headers
video_data.columns = ['Video ID', 'Published', 'Upload Date', 'Upload Time', 'Video Title', 'Video Description', 'Views', 'Likes', 'Comments']
channel_data.columns = ['Video ID', 'Channel Title', 'Total Views', 'Total Subs', 'Total Videos']

Is there any chance I could fold some of these changes into the DataFrame creation lists?

channel_data['result_channel_subscribers'] = pd.to_numeric(channel_data['result_channel_subscribers'])
channel_data['result_channel_views'] = pd.to_numeric(channel_data['result_channel_views'])
channel_data['result_channel_total_videos'] = pd.to_numeric(channel_data['result_channel_total_videos'])

You can do it like this, since those lines repeat.

columns_to_change_type = ['result_channel_subscribers', 'result_channel_views', 'result_channel_total_videos']
for col in columns_to_change_type:
    change_type_to_numeric(col)

The change_type_to_numeric function is as follows:

def change_type_to_numeric(col_name):
    channel_data[col_name] = pd.to_numeric(channel_data[col_name])

This is one example. Try to reduce repetition wherever you can.
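As a variation on the loop above, pandas can run the same conversion over several columns in one statement. This is only a sketch: the column names come from the question, while the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows standing in for result_channel_statistics.
channel_data = pd.DataFrame(
    [["PL1", "foo", "30", "10", "200"], ["PL2", "bar", "25", "5", "50"]],
    columns=[
        "result_playlist_id",
        "result_channel_name",
        "result_channel_views",
        "result_channel_subscribers",
        "result_channel_total_videos",
    ],
)

numeric_cols = [
    "result_channel_subscribers",
    "result_channel_views",
    "result_channel_total_videos",
]

# DataFrame.apply calls pd.to_numeric once per selected column.
channel_data[numeric_cols] = channel_data[numeric_cols].apply(pd.to_numeric)

print(channel_data.dtypes)
```

The list of column names stays the single place to edit when another numeric column shows up.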

Beyond that, you can keep column names as constants in a config class, like this.

class Configs:
    RESULT_CHANNEL_SUBS_COL = 'result_channel_subscribers'

You can then use Configs.RESULT_CHANNEL_SUBS_COL in place of the literal column name. These are a few clean-code strategies.
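To show how such constants might combine with the renaming step from the question, here is a rough sketch. The second constant, the `CHANNEL_RENAMES` mapping, and the sample values are assumptions made for illustration; only `RESULT_CHANNEL_SUBS_COL` appears in the answer above.

```python
import pandas as pd


class Configs:
    RESULT_CHANNEL_SUBS_COL = "result_channel_subscribers"
    # Hypothetical second constant, added only for this example.
    RESULT_CHANNEL_VIEWS_COL = "result_channel_views"


# Hypothetical mapping from source column names to display names.
CHANNEL_RENAMES = {
    Configs.RESULT_CHANNEL_SUBS_COL: "Total Subs",
    Configs.RESULT_CHANNEL_VIEWS_COL: "Total Views",
}

# Made-up sample values.
channel_data = pd.DataFrame(
    {
        Configs.RESULT_CHANNEL_VIEWS_COL: ["30", "25"],
        Configs.RESULT_CHANNEL_SUBS_COL: ["10", "5"],
    }
)

channel_data[Configs.RESULT_CHANNEL_SUBS_COL] = pd.to_numeric(
    channel_data[Configs.RESULT_CHANNEL_SUBS_COL]
)

# rename() matches columns by name, so their order does not matter.
channel_data = channel_data.rename(columns=CHANNEL_RENAMES)

print(channel_data.columns.tolist())
```

Compared with assigning a full list to `.columns`, a name-based mapping does not silently mislabel data if the column order changes.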

In your code,

published_date = upload_time['Dates'] = pd.to_datetime(upload_time['result_video_upload_time']).dt.date
published_time = upload_time['Time']  = pd.to_datetime(upload_time['result_video_upload_time']).dt.time
video_data['Published Date'] = published_date #Pull date from string
video_data['Published Time'] = published_time #Pull time from string

you can use this instead:

pub_dttm = pd.to_datetime(upload_time['result_video_upload_time']).dt
video_data['Published Date'] = upload_time['Dates'] = pub_dttm.date
video_data['Published Time'] = upload_time['Time']  = pub_dttm.time
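The same parse-once idea can also be written with `DataFrame.assign`. The sketch below assumes the timestamps are parsed straight from `video_data` rather than from the grouped `upload_time` frame, so the derived columns stay aligned with the original rows; the sample rows are made up.

```python
import pandas as pd

# Made-up rows; only the timestamp column matters here.
video_data = pd.DataFrame(
    {
        "result_video_id": ["a", "b"],
        "result_video_upload_time": ["2022-12-05 01:23:45", "2022-12-11 09:08:07"],
    }
)

# Parse the strings once, then derive both columns from the parsed Series.
parsed = pd.to_datetime(video_data["result_video_upload_time"])
video_data = video_data.assign(
    **{"Published Date": parsed.dt.date, "Published Time": parsed.dt.time}
)

print(video_data[["Published Date", "Published Time"]])
```

The dictionary unpacking is needed only because the column names contain spaces; names that are valid identifiers could be passed as plain keyword arguments.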

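You could decide to keep your definitions (nickname, destination name, conversion) in a few tidy arrays.

First, for the sake of the example, some synthetic data: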

# synthetic data
result_channel_statistics = [
    [1, 'foo', '30', '10', '200'],
    ['2', 'bar', '25', '5', '50'],
]
video_details = [
    [1, '2022-12-05 01:23:45', 'hello', 'descr', '5', '2',
     ['best movie ever', 'ok but not great']],
    ['2', '2022-12-11 09:08:07', 'world', 'descr', '11', '3',
     ['this was great', 'I fell asleep']],
]

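Then, the definitions:

import pandas as pd

def numeric(s): return pd.to_numeric(s)
def timestamp(s): return pd.to_datetime(s)
def date(s): return pd.to_datetime(s).dt.date
def time(s): return pd.to_datetime(s).dt.time
def identity(s): return s

def process(df, d):
    # Name the raw columns after the distinct source nicknames, in first-seen order
    # (dict.fromkeys deduplicates a plain list, which newer pandas no longer accepts in pd.unique).
    df = df.copy().set_axis(list(dict.fromkeys(src for src, *_ in d)), axis=1)
    # Build each destination column by applying its conversion to the source column.
    for src, dst, fun in d:
        df[dst] = fun(df[src])
    return df[[dst for src, dst, fun in d]].copy()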

Usage:

dchan = [
    ('id',            'Video ID', numeric),
    ('name',     'Channel Title', identity),
    ('views',      'Total Views', numeric),
    ('subscribers', 'Total Subs', numeric),
    ('n_videos',  'Total Videos', numeric),
]
dvid = [
    ('id',                   'Video ID', numeric),
    ('t',                   'Upload At', timestamp),
    ('t',                 'Upload Date', date),
    ('t',                 'Upload Time', time),
    ('name',              'Video Title', identity),
    ('description', 'Video Description', identity),
    ('views',                   'Views', numeric),
    ('likes',                   'Likes', numeric),
    ('comments',             'Comments', identity),
]
channel_data = process(pd.DataFrame(result_channel_statistics), dchan)
video_data = process(pd.DataFrame(video_details), dvid)

Result:

>>> channel_data
   Video ID Channel Title  Total Views  Total Subs  Total Videos
0         1           foo           30          10           200
1         2           bar           25           5            50
>>> channel_data.dtypes
Video ID          int64
Channel Title    object
Total Views       int64
Total Subs        int64
Total Videos      int64
dtype: object
>>> video_data
   Video ID           Upload At Upload Date Upload Time Video Title  \
0         1 2022-12-05 01:23:45  2022-12-05    01:23:45       hello
1         2 2022-12-11 09:08:07  2022-12-11    09:08:07       world

  Video Description  Views  Likes                             Comments
0             descr      5      2  [best movie ever, ok but not great]
1             descr     11      3      [this was great, I fell asleep]
>>> video_data.dtypes
Video ID                      int64
Upload At            datetime64[ns]
Upload Date                  object
Upload Time                  object
Video Title                  object
Video Description            object
Views                         int64
Likes                         int64
Comments                     object
dtype: object
