我正在进行一个项目,该项目将接受我们的旧数据库系统查询,并将它们映射到新的数据库系统查询。我希望能够读取FROM语句,将它们拆分为Database、Schema、Table和Alias。数据如下:
FROM DatabaseA.SchemaA.Table1 tbl
INNER JOIN DatabaseB.SchemaC.Table13 tbl13
ON tbl.column12 = tbl.column12
我从完整的SQL查询中创建了一个数据帧,并将from语句提取到一个新的数据帧中(我使用两个不同的函数(:
# Reads file - creates dataframe
def readSQL(file_path):
# convert to a string
file_path=file_path.decode("utf-8")
# Set sql_script as a global variable
# Once sql_script is populated it can
# be called from other functions
global sql_script
# Open file
# Separate every line
df = open(file_path,'r')
lines = df.readlines()
df.close()
# print(lines)
# Create the dataframe
sql_script = pd.DataFrame(columns=('StatementDSC','ColumnTable'))
i = 0
StatementDSC = ""
ColumnTable = ""
for line in lines:
if ('SELECT' in line) or ('FROM' in line) or ('WHERE' in line):
StatementDSC = line.strip() # remove n
else: #(',' in line) or ('.' in line):
ColumnTable = line.strip() # remove t and /n
# Create next line
sql_script.loc[i] = [StatementDSC,ColumnTable]
i = i + 1
return sql_script
# print(sql_script)
# Reads FROM Statement breaks down tables/aliases/joins
def readFROM(full_SQL):
# Create FROM Dataframe
df = full_SQL.loc[full_SQL['StatementDSC']== 'FROM']
# Drop empty rows
# df.drop(df.ColumnTable == "",axis=0)
print(df)
# split ColumnTable by periods
# Create new Columns:
# Database, Schema, Table
split_data = df["ColumnTable"].str.split('.')
data = split_data.to_list()
names = ["Database","Schema","TableAlias"]
new_df = pd.DataFrame(data,columns=names)
print(new_df)
我可以从中获得要拆分的数据:
StatementDSC ColumnTable
5 FROM
6 FROM PPDAC.PLAN.DOCTORS Docs
7 FROM INNER JOIN PPDAC.PLAN.DOCTORSINFO DocInfo
8 FROM ON Docs.DocID = DocInfo.DocID
对此:
Database Schema TableAlias
0 None None
1 PPDAC PLAN DOCTORS Docs
2 INNER JOIN PPDAC PLAN DOCTORSINFO DocInfo
如何添加另一个分隔符,通过同时使用句点和空格将它们分隔为四个?
我看到了其他答案,但它们只针对一行,而不是在多行重复。
使用regex基于多个分隔符进行拆分,在这种情况下,例如[括号]中的任何分隔符。或空间。看看这个玩具的例子:
row1list = ['1.2.3 4']
row2list = ['1.4.5 6']
row3list = ['2.7.8 9']
df = pd.DataFrame([row1list, row2list, row3list], columns=['ColumnTable'])
df2 = df['ColumnTable'].str.split('[ .]', expand=True)
print(df2)
# 0 1 2 3
# 0 1 2 3 4
# 1 1 4 5 6
# 2 2 7 8 9