我有一个查询可以帮助我列出数据库中的所有索引,并且运行得很好:
SELECT TableName = t.name,
IndexName = ind.name,
IndexId = ind.index_id,
ColumnId = ic.index_column_id,
ColumnName = col.name,
--(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
ind.*,
ic.*,
col.*
FROM sys.indexes ind
INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
AND ind.index_id = ic.index_id
INNER JOIN sys.columns col ON ic.object_id = col.object_id
AND ic.column_id = col.column_id
INNER JOIN sys.tables t ON ind.object_id = t.object_id
WHERE ind.is_primary_key = 0
AND ind.is_unique = 0
AND ind.is_unique_constraint = 0
AND t.is_ms_shipped = 0
ORDER BY t.name,
ind.name,
ind.index_id,
ic.is_included_column,
ic.key_ordinal;
不幸的是,如果我取消注释第6行,单词t.name
将用红色下划线,如果我运行查询,我会收到错误:
Invalid object name 'TableName'.
如何使此子查询工作?
目标是在每列上具有一定百分比的NULLs
您的问题和查询非常有趣,可以借助于参与索引的列的统计信息来解决,就像您在查询中选择的列一样。
为了获得准确的结果,在运行我下面提供的查询之前更新统计信息是有益的。
;WITH cteColumnAllStats AS
(
SELECT
ST_COL.object_id,
ST_COL.column_id,
ST_COL.stats_id,
-- NOTE: order no among stats of the same column
ROW_NUMBER()
OVER(
PARTITION BY
ST_COL.object_id,
ST_COL.column_id
ORDER BY
ST_COL.stats_id
) AS StatsOrderNo
FROM sys.stats ST
INNER JOIN sys.stats_columns ST_COL
ON ST_COL.stats_id = ST.stats_id
AND ST_COL.object_id = ST.object_id
)
,cteColumnFirstStats AS
(
SELECT
ST_COL.object_id,
ST_COL.column_id,
-- NOTES:
-- =====
-- this would be null if there were no statistics for the column
-- however not in this case because we are only considering columns
-- participating in an index and all indices have statistics behind
-- the scenes.
--
-- Also consider whether the statistics have been updated:
-- If they have, the result will be a whole number (without decimals)
-- and the result is exact.
-- If they have not, the result is an estimate and in most of the cases
-- there will be decimals or even produce a negative result.
--
-- If you want accurate results, you need to update the statistics:
-- EXEC sp_updatestats
--
SUM(ST_HIST.range_rows) + SUM(ST_HIST.equal_rows) AS NonNullsRowCount
FROM cteColumnAllStats ST_COL
-- NOTE: this is the important bit
CROSS APPLY sys.dm_db_stats_histogram(
ST_COL.object_id,
ST_COL.stats_id
) ST_HIST
WHERE ST_COL.StatsOrderNo = 1 -- take only the first stats for the column
GROUP BY
ST_COL.object_id,
ST_COL.column_id
)
SELECT TableName = t.name,
IndexName = ind.name,
IndexId = ind.index_id,
ColumnId = ic.index_column_id,
ColumnName = col.name,
-- NOTE: included these columns for reference purposes (PLEASE REMOVE)
SIND.rowcnt AS [RowCount],
ST_COL.NonNullsRowCount,
SIND.rowcnt - ST_COL.NonNullsRowCount AS NullsRowCount,
--(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
CASE
-- NOTE: stats are definitely out of date
WHEN SIND.rowcnt < ST_COL.NonNullsRowCount THEN NULL
-- NOTE: stats could be out of date (good to update them first)
-- Also we don't want a divide by 0 hence the NULLIF
ELSE (SIND.rowcnt - ST_COL.NonNullsRowCount) * 100.0 / NULLIF(SIND.rowcnt, 0)
END as nulls_percent,
ind.*,
ic.*,
col.*
FROM sys.indexes ind
INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
AND ind.index_id = ic.index_id
INNER JOIN sys.columns col ON ic.object_id = col.object_id
AND ic.column_id = col.column_id
INNER JOIN sys.tables t ON ind.object_id = t.object_id
-- NOTE: this gives you the COUNT(*) without querying the table
INNER JOIN sys.sysindexes SIND
ON SIND.id = t.object_id
-- NOTE:
-- 0 means Heap
-- 1 means Clustered Index
-- Only these are reliable to use their rowcnt.
-- There's always 1 of these and not the other.
AND SIND.indid < 2
-- NOTE: inner join is OK here because all columns participating in a index
-- have associated statistics
INNER JOIN cteColumnFirstStats ST_COL
ON ST_COL.object_id = t.object_id
AND ST_COL.column_id = col.column_id
WHERE ind.is_primary_key = 0
AND ind.is_unique = 0
AND ind.is_unique_constraint = 0
AND t.is_ms_shipped = 0
ORDER BY t.name,
ind.name,
ind.index_id,
ic.is_included_column,
ic.key_ordinal;
据我所知,您有一个来自各种表的列的列表,对于每一列,您希望确定NULL的百分比。
这里有一个类似的问答SQL查询,它从DB 的每一列中获取null的数量
关于循环——正如你将从一个表、下一个表和下一个。。。根据@Gordon的评论,需要使用某种循环(好吧,除了单独编写每个命令之外(。
基本方法是
- 创建要查询的表/列列表
- 创建一个表以保存结果/输出
- 动态创建SQL以读取行数和NULL数,并将其保存在结果/输出表中
在我的回答中,我建议创建一个宽表(有许多对列-一对用于列名,一对用于NOT NULL的数量(,例如col1_name、col1_num、col2_name、col2_num等。
以上方法的优点是提高性能——您可以只使用一个完整的表读取来进行所有的数字运算,然后将值放入结果表中。
OP选择了一种稍微不同的方法,他们一次询问每一列。虽然性能较低(例如,需要多次读取每个表——每列一次(,但它是
- 提供更清晰的输出(例如,表名、列名、行数、NULL数(
- 允许运行其他统计数据,例如获取最小值和最大值
如果您只需要运行一次它就可以获得数据库的快照,并且对暂时的性能打击很满意,那么在每表1行列中给出答案是一个不错的解决方案。
从前面的问题中,我建议
- 制作一个具有字段database_name、Schema_name、table_name和Column_name的临时表(甚至是新数据库中专门用于此目的的普通表(。这是循环的输入
- 用要查看的列填充
- 制作另一个表来存储输出,其中包含Database_name、Schema_name、table_name、Column_name、Num_Rows和Num_NotNULL
- 编写一个循环,将下一个列名等放入变量中
- 运行类似于的命令
-- Assume @Database_Name, @Schema_Name, @Table_Name, @Column_Name have been taken from the loop
SET @CustomSQL = '
INSERT INTO TableResults (Database_Name, Schema_name, Table_Name, Column_Name, Num_Rows, Num_NonNull)
SELECT ''' + @Database_Name + ''',
''' + @Schema_Name + ''',
''' + @Table_Name + ''',
''' + @Column_Name + ''',
COUNT(*) AS Num_Rows,
COUNT(' + QUOTENAME(@Column_Name) + ') AS Num_NotNULL
FROM ' + QUOTENAME(@Database_Name) + '.' + QUOTENAME(@Schema_Name) + '.' + QUOTENAME(@TableName);
EXEC (@CustomSQL);
由它创建的SQL命令示例是
INSERT INTO TableResults (Database_Name, Schema_name, Table_Name, Column_Name, Num_Rows, Num_NonNull)
SELECT 'SalesDB',
'dbo',
'ProductList',
'ProductName',
COUNT(*) AS Num_Rows,
COUNT([ProductName]) AS Num_NotNULL
FROM [SalesDB].[dbo].[Products];
当然,然后你需要做简单的数学运算来获得所需的数字,例如,要获得NULL的数量,请从行数中减去NotNULL的数量。
类似这样的东西:
DROP TABLE IF EXISTS #TEST;
SELECT TableName = t.name,
IndexName = ind.name,
IndexId = ind.index_id,
ColumnId = ic.index_column_id,
ColumnName = col.name,
--(SELECT SUM(CASE WHEN col.name IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM t.name) as nulls_percent,
CAST(-1 AS DECIMAL(9,2)) AS nulls_percent--,
--ind.*,
--ic.*,
--col.*
INTO #TEST
FROM sys.indexes ind
INNER JOIN sys.index_columns ic ON ind.object_id = ic.object_id
AND ind.index_id = ic.index_id
INNER JOIN sys.columns col ON ic.object_id = col.object_id
AND ic.column_id = col.column_id
INNER JOIN sys.tables t ON ind.object_id = t.object_id
WHERE ind.is_primary_key = 0
AND ind.is_unique = 0
AND ind.is_unique_constraint = 0
AND t.is_ms_shipped = 0
ORDER BY t.name,
ind.name,
ind.index_id,
ic.is_included_column,
ic.key_ordinal;
DECLARE @DynamicTSQLSTatement NVARCHAR(MAX)
,@TableName SYSNAME
,@ColumnID INT
,@nulls_percent DECIMAL(9,2);
WHILE EXISTS(SELECT 1 FROM #TEST WHERE [nulls_percent] = -1)
BEGIN;
SELECT TOP 1 @DynamicTSQLSTatement = '(SELECT @nulls_percent = SUM(CASE WHEN ' + ColumnName + ' IS NULL THEN 1 ELSE 0 END) * 100.0 / count(*) FROM ' + TableName+' )'
,@TableName = [TableName]
,@ColumnID = [ColumnId]
FROM #TEST
WHERE [nulls_percent] = -1;
--SELECT @DynamicTSQLSTatement
EXEC sp_executesql @DynamicTSQLSTatement, N'@nulls_percent DECIMAL(9,2) OUTPUT', @nulls_percent = @nulls_percent OUTPUT;
UPDATE #TEST
SET nulls_percent = ISNULL(@nulls_percent,0)
WHERE [TableName] = @TableName
AND [ColumnId] = @ColumnID;
END;
SELECT *
FROM #TEST;
当然,你需要改进它。例如,添加每个表的schema
名称。