Create a SAS data step to import a CSV from a pandas DataFrame in Python



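I'm trying to create a string that can be copied and pasted to import a DataFrame into SAS. The opening and closing lines are static, and the middle lines need to adjust to the column names, the length of the names, and the data type of each column, so that the result looks like this: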

data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat A Best32. ;
informat B Best32. ;
informat C Best32. ;
informat D Best32. ;
informat E $11. ;
format A Best12. ;
format B Best12. ;
format C Best12. ;
format D Best12. ;
format E $11. ;
input A
input B
input C
input D 
input E $
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;

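My current code leaves out one column, and which column it leaves out changes with the input DataFrame. With columns named A, B, C and D, it drops D from the middle group of print statements. Adding E makes it drop C from the middle group instead. With another dataset of mine, it dropped the 4th of 5 columns from the first group of print statements. I'm not sure what I'm doing wrong. Here is what I have: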

import numpy as np
import pandas as pd

def sas_import_csv(df):
    '''Takes a dataframe and prepares a data step to import the csv file to SAS.
    '''
    # np.float was removed in NumPy 1.24; np.float64 matches the same columns
    value_fmts = [np.float64,np.int32,np.int64]
    opening = '''data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;'''
    closing = ''';
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;'''
    # Longest string representation found in each column
    measurer = np.vectorize(len)
    dfLen = measurer(df.values.astype(str)).max(axis=0)
    print(f'{opening}')
    # First group: one informat line per column
    for l,col in zip(dfLen,df.columns):
        if df[col].dtypes in value_fmts: infmt = 'Best32. ;'
        else: infmt = f'${l}. ;'
        print(f'\tinformat {col} {infmt}')
    # Middle group: one format line per column
    for l2,col2 in zip(dfLen,df.columns):
        if df[col2].dtypes in value_fmts: fmt = 'Best12. ;'
        else: fmt = f'${l2}. ;'
        print(f'\tformat {col2} {fmt}')
    # Last group: one input line per column, with $ marking character columns
    for col3 in df.columns:
        if df[col3].dtypes in value_fmts: ct = ''
        else: ct = '$'
        print(f'\t\tinput {col3} {ct}')
    print(closing)

dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"
sas_import_csv(df)

The output it produces is missing column C in the format section:

data **infile** %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile **filepath** delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat A Best32. ;
informat B Best32. ;
informat C Best32. ;
informat D Best32. ;
informat E $11. ;
format A Best12. ;
format B Best12. ;
format D Best12. ;
format E $11. ;
input A
input B
input C
input D
input E $
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;
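
One way to narrow down where a line goes missing is to collect the statements in a list instead of printing them one by one, so the number of lines per group can be checked before anything is displayed. The following is only a sketch: `build_statements` is a hypothetical helper that is not part of the original code, it mirrors the line layout of the output above, and it swaps the dtype list for `pd.api.types.is_numeric_dtype` as a simplifying assumption.

```python
import numpy as np
import pandas as pd

def build_statements(df):
    """Return informat, format and input lines as a list (hypothetical helper)."""
    lines = []
    for keyword, numeric_fmt in (("informat", "best32."), ("format", "best12.")):
        for col in df.columns:
            if pd.api.types.is_numeric_dtype(df[col]):
                fmt = numeric_fmt
            else:
                # Character column: width of the longest value rendered as text
                fmt = f"${int(df[col].astype(str).str.len().max())}."
            lines.append(f"{keyword} {col} {fmt} ;")
    for col in df.columns:
        suffix = "" if pd.api.types.is_numeric_dtype(df[col]) else " $"
        lines.append(f"input {col}{suffix}")
    return lines

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
df["E"] = "some string"
lines = build_statements(df)

# Each of the three groups should contain exactly one line per column
assert len(lines) == 3 * len(df.columns)
print("\n".join(lines))
```

If the list holds every expected line but the displayed output does not, the loss is happening in how the text is displayed rather than in the loops.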

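This doesn't answer why the loop failed to print one line in that case, but it is a better way to do what I was originally trying to do. Thanks to @Tom for the guidance.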

import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime, is_object_dtype as is_object

def sas_import_csv(df,sas_date_fmt='yymmddn8.',filePath='',outName = 'X'):
    '''Takes a dataframe and prepares a data step to import the csv file to SAS.
    '''
    value_fmts = [np.float64,np.int32,np.int64]
    opening = f"%let infile = '{filePath}';\ndata {outName}; %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ \ninfile &infile delimiter = ',' MISSOVER DSD TRUNCOVER lrecl=32767 firstobs=2 ;"
    inp = 'input '
    fmt = 'format '
    infmt = 'informat '
    closing = "if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */\nrun;"
    # Longest string representation in each column, used as the character width
    measurer = np.vectorize(len)
    dfLen = measurer(df.values.astype(str)).max(axis=0)
    for l,col in zip(dfLen,df.columns):
        # Character columns get an explicit width on the input statement
        if is_object(df[col]): inp = inp + f'{col} :${l}. '
        # Date columns are read as yymmdd10. and displayed with sas_date_fmt
        elif is_datetime(df[col]):
            inp = inp + f'{col} '
            fmt = fmt + f'{col} {sas_date_fmt} '
            infmt = infmt + f'{col} yymmdd10. '
        else: inp = inp + f'{col} '
    return f'{opening} {inp} ;\n{fmt} ;\n{infmt} ;\n{closing}'

Now you can read the DataFrame into SAS simply by copying and pasting the output of print(c) after running the following code:

import numpy as np
import pandas as pd
dates = pd.date_range(start='1/1/2018', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df['E'] = "some string"
df = df.reset_index().rename(columns = {'index':'Date'})
f = r'C:\Users\user\example.csv'
c = sas_import_csv(df,filePath=f)
df.to_csv(f,index=False)
print(c)
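
The generated step reads date columns with the yymmdd10. informat, which expects 10-character values such as 2018-01-01. The sketch below is not part of the original answer; it checks what `to_csv` writes for a date column. Passing `date_format` explicitly is an assumption added here so the layout does not depend on pandas' defaults, and the names `frame` and `buffer` are illustrative only.

```python
import io
import pandas as pd

dates = pd.date_range(start="1/1/2018", periods=3)
frame = pd.DataFrame({"Date": dates, "A": [1.0, 2.0, 3.0]})

# Write the CSV to memory instead of disk so the text can be inspected
buffer = io.StringIO()
frame.to_csv(buffer, index=False, date_format="%Y-%m-%d")

# Line 0 is the header, which the SAS step skips with firstobs=2
first_data_row = buffer.getvalue().splitlines()[1]
date_field = first_data_row.split(",")[0]
print(date_field)
```

Here `date_field` is a 10-character yyyy-mm-dd string, which is the width the yymmdd10. informat is set up to read.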
