如何使用两个不同的分隔符将DataFrame列拆分为四部分



我正在进行一个项目,该项目将接受我们的旧数据库系统查询,并将它们映射到新的数据库系统查询。我希望能够读取FROM语句,将它们拆分为Database、Schema、Table和Alias。数据如下:

FROM DatabaseA.SchemaA.Table1 tbl 
INNER JOIN DatabaseB.SchemaC.Table13 tbl13 
ON tbl.column12 = tbl.column12

我从完整的SQL查询中创建了一个数据帧,并将from语句提取到一个新的数据帧中(我使用两个不同的函数(:

# Reads file - creates dataframe
def readSQL(file_path):
# convert to a string
file_path=file_path.decode("utf-8")

# Set sql_script as a global variable
# Once sql_script is populated it can
# be called from other functions
global sql_script

# Open file
# Separate every line
df = open(file_path,'r')
lines = df.readlines()
df.close()
# print(lines)

# Create the dataframe
sql_script = pd.DataFrame(columns=('StatementDSC','ColumnTable'))
i = 0
StatementDSC = ""
ColumnTable = ""
for line in lines:
if ('SELECT' in line) or ('FROM' in line) or ('WHERE' in line):
StatementDSC = line.strip() # remove n
else: #(',' in line) or ('.' in line):
ColumnTable = line.strip() # remove t and /n

# Create next line
sql_script.loc[i] = [StatementDSC,ColumnTable]
i = i + 1

return sql_script
# print(sql_script)

# Reads FROM Statement breaks down tables/aliases/joins
def readFROM(full_SQL):
# Create FROM Dataframe
df = full_SQL.loc[full_SQL['StatementDSC']== 'FROM']

# Drop empty rows
# df.drop(df.ColumnTable == "",axis=0)
print(df)

# split ColumnTable by periods
# Create new Columns: 
# Database, Schema, Table
split_data = df["ColumnTable"].str.split('.')
data = split_data.to_list()
names = ["Database","Schema","TableAlias"]
new_df = pd.DataFrame(data,columns=names)
print(new_df)

我可以从中获得要拆分的数据:

StatementDSC              ColumnTable
5         FROM                                                   
6         FROM                PPDAC.PLAN.DOCTORS Docs
7         FROM  INNER JOIN PPDAC.PLAN.DOCTORSINFO DocInfo
8         FROM                    ON Docs.DocID = DocInfo.DocID

对此:

Database              Schema           TableAlias
0                         None             None
1  PPDAC                  PLAN             DOCTORS Docs
2  INNER JOIN PPDAC     PLAN             DOCTORSINFO DocInfo

如何添加另一个分隔符,通过同时使用句点和空格将它们分隔为四个?

我看到了其他答案,但它们只针对一行,而不是在多行重复。

使用regex基于多个分隔符进行拆分,在这种情况下,例如[括号]中的任何分隔符。或空间。看看这个玩具的例子:

row1list = ['1.2.3 4']
row2list = ['1.4.5 6']
row3list = ['2.7.8 9']
df = pd.DataFrame([row1list, row2list, row3list], columns=['ColumnTable'])
df2 = df['ColumnTable'].str.split('[ .]', expand=True)
print(df2)
#    0  1  2  3
# 0  1  2  3  4
# 1  1  4  5  6
# 2  2  7  8  9

相关内容

  • 没有找到相关文章

最新更新