动态SQL::计算每个索引的NULL百分比



我有一个查询可以帮助我列出数据库中的所有索引,并且运行得很好:

SELECT TableName = t.name, 
IndexName = ind.name, 
IndexId = ind.index_id, 
ColumnId = ic.index_column_id, 
ColumnName = col.name,
--(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
ind.*, 
ic.*, 
col.*
FROM sys.indexes ind
INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
AND ind.index_id = ic.index_id
INNER JOIN sys.columns col ON ic.object_id = col.object_id
AND ic.column_id = col.column_id
INNER JOIN sys.tables t ON ind.object_id = t.object_id
WHERE ind.is_primary_key = 0
AND ind.is_unique = 0
AND ind.is_unique_constraint = 0
AND t.is_ms_shipped = 0
ORDER BY t.name, 
ind.name, 
ind.index_id, 
ic.is_included_column, 
ic.key_ordinal;

不幸的是,如果我取消注释第6行,单词t.name将用红色下划线,如果我运行查询,我会收到错误:

Invalid object name 'TableName'.

如何使此子查询工作?

目标是在每列上具有一定百分比的NULLs

您的问题和查询非常有趣,可以借助于参与索引的列的统计信息来解决,就像您在查询中选择的列一样。

为了获得准确的结果,在运行我下面提供的查询之前更新统计信息是有益的。

;WITH cteColumnAllStats AS
(
SELECT
ST_COL.object_id,
ST_COL.column_id,
ST_COL.stats_id,
-- NOTE: order no among stats of the same column
ROW_NUMBER()
OVER(
PARTITION BY
ST_COL.object_id,
ST_COL.column_id
ORDER BY
ST_COL.stats_id
) AS StatsOrderNo
FROM sys.stats ST
INNER JOIN sys.stats_columns ST_COL
ON  ST_COL.stats_id = ST.stats_id
AND ST_COL.object_id = ST.object_id
)
,cteColumnFirstStats AS
(
SELECT
ST_COL.object_id,
ST_COL.column_id,
-- NOTES:
-- =====
-- this would be null if there were no statistics for the column
-- however not in this case because we are only considering columns
-- participating in an index and all indices have statistics behind
-- the scenes.
--
-- Also consider whether the statistics have been updated:
-- If they have, the result will be a whole number (without decimals)
-- and the result is exact.
-- If they have not, the result is an estimate and in most of the cases
-- there will be decimals or even produce a negative result.
--
-- If you want accurate results, you need to update the statistics:
-- EXEC sp_updatestats
--
SUM(ST_HIST.range_rows) + SUM(ST_HIST.equal_rows) AS NonNullsRowCount
FROM cteColumnAllStats ST_COL
-- NOTE: this is the important bit
CROSS APPLY sys.dm_db_stats_histogram(
ST_COL.object_id,
ST_COL.stats_id
) ST_HIST
WHERE   ST_COL.StatsOrderNo = 1 -- take only the first stats for the column
GROUP BY
ST_COL.object_id,
ST_COL.column_id
)
SELECT TableName = t.name, 
IndexName = ind.name, 
IndexId = ind.index_id, 
ColumnId = ic.index_column_id, 
ColumnName = col.name,
-- NOTE: included these columns for reference purposes (PLEASE REMOVE)
SIND.rowcnt AS [RowCount],
ST_COL.NonNullsRowCount,
SIND.rowcnt - ST_COL.NonNullsRowCount AS NullsRowCount,
--(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
CASE
-- NOTE: stats are definitely out of date
WHEN SIND.rowcnt < ST_COL.NonNullsRowCount THEN NULL
-- NOTE: stats could be out of date (good to update them first)
-- Also we don't want a divide by 0 hence the NULLIF
ELSE (SIND.rowcnt - ST_COL.NonNullsRowCount) * 100.0 / NULLIF(SIND.rowcnt, 0)
END as nulls_percent,
ind.*, 
ic.*, 
col.*
FROM sys.indexes ind
INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
AND ind.index_id = ic.index_id
INNER JOIN sys.columns col ON ic.object_id = col.object_id
AND ic.column_id = col.column_id
INNER JOIN sys.tables t ON ind.object_id = t.object_id
-- NOTE: this gives you the COUNT(*) without querying the table
INNER JOIN sys.sysindexes SIND
ON  SIND.id = t.object_id
-- NOTE:
-- 0 means Heap
-- 1 means Clustered Index
-- Only these are reliable to use their rowcnt.
-- There's always 1 of these and not the other.
AND SIND.indid < 2
-- NOTE: inner join is OK here because all columns participating in a index
-- have associated statistics
INNER JOIN cteColumnFirstStats ST_COL
ON  ST_COL.object_id = t.object_id
AND ST_COL.column_id = col.column_id
WHERE ind.is_primary_key = 0
AND ind.is_unique = 0
AND ind.is_unique_constraint = 0
AND t.is_ms_shipped = 0
ORDER BY t.name, 
ind.name, 
ind.index_id, 
ic.is_included_column, 
ic.key_ordinal;

据我所知,您有一个来自各种表的列的列表,对于每一列,您希望确定NULL的百分比。

这里有一个类似的问答SQL查询,它从DB 的每一列中获取null的数量

关于循环——正如你将从一个表、下一个表和下一个。。。根据@Gordon的评论,需要使用某种循环(好吧,除了单独编写每个命令之外(。

基本方法是

  • 创建要查询的表/列列表
  • 创建一个表以保存结果/输出
  • 动态创建SQL以读取行数和NULL数,并将其保存在结果/输出表中

在我的回答中,我建议创建一个宽表(有许多对列-一对用于列名,一对用于NOT NULL的数量(,例如col1_name、col1_num、col2_name、col2_num等。

以上方法的优点是提高性能——您可以只使用一个完整的表读取来进行所有的数字运算,然后将值放入结果表中。

OP选择了一种稍微不同的方法,他们一次询问每一列。虽然性能较低(例如,需要多次读取每个表——每列一次(,但它是

  • 提供更清晰的输出(例如,表名、列名、行数、NULL数(
  • 允许运行其他统计数据,例如获取最小值和最大值

如果您只需要运行一次它就可以获得数据库的快照,并且对暂时的性能打击很满意,那么在每表1行列中给出答案是一个不错的解决方案。

从前面的问题中,我建议

  • 制作一个具有字段database_name、Schema_name、table_name和Column_name的临时表(甚至是新数据库中专门用于此目的的普通表(。这是循环的输入
  • 用要查看的列填充
  • 制作另一个表来存储输出,其中包含Database_name、Schema_name、table_name、Column_name、Num_Rows和Num_NotNULL
  • 编写一个循环,将下一个列名等放入变量中
  • 运行类似于的命令
-- Assume @Database_Name, @Schema_Name, @Table_Name, @Column_Name have been taken from the loop
SET @CustomSQL = '
INSERT INTO TableResults (Database_Name, Schema_name, Table_Name, Column_Name, Num_Rows, Num_NonNull)
SELECT  ''' + @Database_Name + ''', 
''' + @Schema_Name + ''', 
''' + @Table_Name + ''', 
''' + @Column_Name + ''',
COUNT(*) AS Num_Rows, 
COUNT(' + QUOTENAME(@Column_Name) + ') AS Num_NotNULL
FROM    ' + QUOTENAME(@Database_Name) + '.' + QUOTENAME(@Schema_Name) + '.' + QUOTENAME(@TableName);
EXEC (@CustomSQL);

由它创建的SQL命令示例是

INSERT INTO TableResults (Database_Name, Schema_name, Table_Name, Column_Name, Num_Rows, Num_NonNull)
SELECT  'SalesDB', 
'dbo',
'ProductList', 
'ProductName',
COUNT(*) AS Num_Rows, 
COUNT([ProductName]) AS Num_NotNULL
FROM    [SalesDB].[dbo].[Products];

当然,然后你需要做简单的数学运算来获得所需的数字,例如,要获得NULL的数量,请从行数中减去NotNULL的数量。

类似这样的东西:

DROP TABLE IF EXISTS #TEST;
SELECT TableName = t.name, 
IndexName = ind.name, 
IndexId = ind.index_id, 
ColumnId = ic.index_column_id, 
ColumnName = col.name,
--(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
CAST(-1 AS DECIMAL(9,2)) AS nulls_percent--,
--ind.*, 
--ic.*, 
--col.*
INTO #TEST
FROM sys.indexes ind
INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
AND ind.index_id = ic.index_id
INNER JOIN sys.columns col ON ic.object_id = col.object_id
AND ic.column_id = col.column_id
INNER JOIN sys.tables t ON ind.object_id = t.object_id
WHERE ind.is_primary_key = 0
AND ind.is_unique = 0
AND ind.is_unique_constraint = 0
AND t.is_ms_shipped = 0
ORDER BY t.name, 
ind.name, 
ind.index_id, 
ic.is_included_column, 
ic.key_ordinal;

DECLARE @DynamicTSQLSTatement NVARCHAR(MAX)
,@TableName SYSNAME
,@ColumnID INT
,@nulls_percent DECIMAL(9,2);
WHILE EXISTS(SELECT 1 FROM #TEST WHERE [nulls_percent] = -1)
BEGIN;
SELECT TOP 1 @DynamicTSQLSTatement = '(SELECT @nulls_percent = SUM(CASE WHEN ' + ColumnName + ' IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM ' + TableName+' )'
,@TableName = [TableName]
,@ColumnID = [ColumnId]
FROM #TEST 
WHERE [nulls_percent] = -1;
--SELECT @DynamicTSQLSTatement
EXEC sp_executesql @DynamicTSQLSTatement, N'@nulls_percent DECIMAL(9,2) OUTPUT', @nulls_percent = @nulls_percent OUTPUT;

UPDATE #TEST
SET nulls_percent = ISNULL(@nulls_percent,0)
WHERE [TableName] = @TableName
AND [ColumnId] = @ColumnID;
END;
SELECT *
FROM #TEST;

当然,你需要改进它。例如,添加每个表的schema名称。

最新更新