Using LIKE to select all similar rows that may be duplicates



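After importing song information into my SQLite database, I want to use a SELECT statement to find all songs that may be duplicates, using the following criteria:

The songName is similar or equal to the songName of any other row in the same table (Songs), and the artistID is the same in both rows. This has to work without knowing the contents of songName in advance. If I wanted to compare one known song name against every other song name in the database, I could use "songName LIKE '%known name%'", but how do I find all duplicates without a known name?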

Example Songs table:

id  songName            artistID  duration
--------------------------------------------
0  This is a song       5         3:43
1  Another song         3         3:23
2  01-This is a song    5         3:42
3  song                 4         4:01
4  song                 4         6:33
5  Another record       2         2:45

Expected result:

id  songName            artistID  duration
--------------------------------------------
0   This is a song      5         3:43
2   01-This is a song   5         3:42
3   song                4         4:01
4   song                4         6:33
Edit:

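Since the idea of creating hashes and comparing them has been suggested, I am considering using this pseudo-function to create a hash for each song name: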

Public Function createHash(ByVal phrase As String) As String
    'convert to lower case
    phrase = LCase(phrase)
    'split the phrase into words
    Dim words() As String = phrase.Replace("_", " ").Split(" ")
    Dim hash As String = ""
    For w = 0 To words.Count - 1
        'remove noise words (a, an, the, etc.)
        words(w) = removeNoiseWords(words(w))
        'convert 1 or 2-digit numbers to corresponding words
        words(w) = number2word(words(w))
    Next
    'rebuild using replaced words and remove spaces
    hash = String.Join("", words)
    'convert upper ascii into alphabetic (ie. ñ = n, Ö = O, etc.)
    hash = removeUnsupChars(hash, True)
    'strip away all remaining non-alphanumeric characters
    hash = REGEX_Replace(hash, "[^A-Za-z0-9]", "")
    Return hash
End Function
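
For anyone who wants to try the idea outside .NET, here is a rough Python sketch of the same steps. The helpers removeNoiseWords, number2word and removeUnsupChars are not shown in the post, so the versions below (NOISE_WORDS, number_to_word, strip_accents) are my own assumptions about what they do, not the poster's code.

```python
import re
import unicodedata

# Assumed noise-word list; the post only says "a, an, the, etc."
NOISE_WORDS = {"a", "an", "the"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_word(token):
    """Spell out a token that is exactly 1 or 2 digits; leave others alone."""
    if not re.fullmatch(r"[0-9]{1,2}", token):
        return token
    n = int(token)
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + (ONES[n % 10] if n % 10 else "")


def strip_accents(text):
    """Map accented letters to their base letter (e.g. n-tilde becomes n)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def create_hash(phrase):
    # Lower-case, treat underscores as spaces, split into words.
    words = phrase.lower().replace("_", " ").split()
    # Drop noise words and spell out short numbers.
    words = [number_to_word(w) for w in words if w not in NOISE_WORDS]
    # Join without spaces, fold accents, keep only ASCII letters and digits.
    joined = strip_accents("".join(words))
    return re.sub(r"[^A-Za-z0-9]", "", joined)


print(create_hash("This is a song"))      # thisissong
print(create_hash("01-This is a song"))   # 01thisissong
```

Under these assumptions a leading track number such as "01-" survives in the hash, because "01-this" is not a pure number and so is not converted or removed. Matching the two sample rows would need an extra rule for that prefix.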

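Once the hash has been computed, I will store it in each record and then select duplicates using count(hash) > 1. After that, I will use .NET code to check whether the artistID of the returned records is the same.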

So far this solution seems to work well. Here is the SQLite statement I use to find duplicate songs:
SELECT count(*),hash from Songs GROUP BY hash HAVING count(hash) > 1 ORDER BY hash;

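This gives me a list of all hashes that occur more than once. I store these results in an array, then loop over the array and simply use the following statement to get the details: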

    For i = 0 To dupeHashes.Count - 1
        SQLconnect.Open()
        SQLcommand = SQLconnect.CreateCommand
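        'bind the hash as a parameter instead of concatenating it into the SQL text
        SQLcommand.CommandText = "SELECT * from Songs WHERE hash = @hash;"
        SQLcommand.Parameters.AddWithValue("@hash", dupeHashes(i))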
        SQLreader = SQLcommand.ExecuteReader()
        While SQLreader.Read()
            'get whatever data needed for each duplicate song
        End While
        SQLcommand.Dispose()
        SQLconnect.Close()
    Next
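
As a possible alternative to looping over the hashes, the grouping and the detail lookup can be combined into one statement, with the artistID check done in SQL as well. A minimal sketch using Python's sqlite3 module follows. The hash column values are hand-written placeholders standing in for whatever the hash function stores, and I am assuming the track-number prefix has already been normalized away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT, hash TEXT)"
)
# Sample rows from the question. The hash values are placeholders (assumption),
# not the output of any particular hash function.
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "3:43", "thisissong"),
        (1, "Another song", 3, "3:23", "anothersong"),
        (2, "01-This is a song", 5, "3:42", "thisissong"),
        (3, "song", 4, "4:01", "song"),
        (4, "song", 4, "6:33", "song"),
        (5, "Another record", 2, "2:45", "anotherrecord"),
    ],
)

# Group on hash AND artistID so the artist check happens in SQL,
# then join back to Songs to fetch the full rows in the same query.
dupes = conn.execute(
    """
    SELECT s.id, s.songName, s.artistID, s.duration
    FROM Songs AS s
    JOIN (SELECT hash, artistID
          FROM Songs
          GROUP BY hash, artistID
          HAVING COUNT(*) > 1) AS d
      ON d.hash = s.hash AND d.artistID = s.artistID
    ORDER BY s.id
    """
).fetchall()

for row in dupes:
    print(row)
conn.close()
```

With the sample data this returns rows 0, 2, 3 and 4, which is the expected result from the question.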

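Personally, I would add an extra field that stores some kind of "hash" of the title. A good function for this would remove any articles (such as "the", "a", "an"), strip every non-alphabetic character, including spaces, then compute the Soundex code of the title and prefix it with the artistId as a string.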

In your example, you would get:

id  songName            artistID  duration  Hash
----------------------------------------------------
0  This is a song       5         3:43      5.T0021
1  Another song         3         3:23      3.A9872
2  01-This is a song    5         3:42      5.T0021
3  song                 4         4:01      4.S0332
4  song                 4         6:33      4.S0332
5  Another record       2         2:45      2.A7622

From here, getting only the rows having … count(Hash) > 1 should be easy…

Also note that I suggest Soundex, but you can create your own function, or adapt an existing one, so that some elements carry more weight than others.
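
To make the suggestion concrete, here is a small Python sketch of an artist-prefixed Soundex hash. It uses the standard American Soundex rules. The function names soundex and title_hash, and the choice of articles to drop, are assumptions for illustration. The codes it produces are real Soundex codes, so they will not match the illustrative values in the table above.

```python
import re

# Standard American Soundex digit groups.
_CODES = {}
for digit, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


def title_hash(artist_id, title):
    """Drop articles and non-letters, then prefix the Soundex with the artist."""
    words = re.findall(r"[a-z]+", title.lower())
    words = [w for w in words if w not in {"a", "an", "the"}]
    return f"{artist_id}.{soundex(''.join(words))}"


print(title_hash(5, "This is a song"))
print(title_hash(5, "01-This is a song"))
print(title_hash(4, "song"))
```

One caveat with plain Soundex: it only encodes the first letter and the next three consonant groups, so long titles that start the same way collide. In this sketch "Another song" and "Another record" both reduce to A536 and are only kept apart by the artistID prefix.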

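I roughly understand the question, but one point needs clarifying: why doesn't the result include record 1 ("Another song", 3:23)? Couldn't it be considered a duplicate of records 3 ("song", 4:01) and 4 ("song", 6:33)?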

I just wrote a simple script in T-SQL to solve this. It is very inefficient and is for reference only.

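-- drop the temp tables only if they exist, so the first run does not fail
if object_id('tempdb..#t') is not null drop table #t;
if object_id('tempdb..#result') is not null drop table #result;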
create table #t 
(
id int ,
songName varchar(100),
artistID int,
duration varchar(20)
)
insert into #t
select '0',  'This is a song'   ,    '5'  ,       '3:43' union all
select '1',  'Another song'     ,    '3'  ,       '3:23' union all
select '2',  '01-This is a song',    '5'  ,       '3:42' union all
select '3',  'song'             ,    '4'  ,       '4:01' union all
select '4',  'song'             ,    '4'  ,       '6:33' union all
select '5',  'Another record'   ,    '2'  ,       '2:45'
select * from #t
select * into #result from #t where 1 = 0
declare @sName varchar(100)
declare @id int
declare @count int
declare c cursor for 
select id, songName from #t
open c
fetch next from c into @id, @sName
while @@FETCH_STATUS = 0
begin
    select @count = COUNT(*) from #result where id = @id
    if @count = 0 
    begin
        select @count = COUNT(*) from #t where songName like '%'+@sName+'%'
        --select @count , @sName
        if @count > 1
        begin
            insert into #result select *  from #t where songName like '%'+@sName+'%' and id not in (select id from #result)
        end
    end
    fetch next from c into @id, @sName
end
close c
deallocate c
select * from #result 
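
Since the question targets SQLite, the same substring comparison can also be sketched without a cursor, as a self-join on the Songs table. Unlike the script above, this version also requires the artistID to match, which is why record 1 drops out. It is shown through Python's sqlite3 module only to keep it runnable; the SQL itself is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "3:43"),
        (1, "Another song", 3, "3:23"),
        (2, "01-This is a song", 5, "3:42"),
        (3, "song", 4, "4:01"),
        (4, "song", 4, "6:33"),
        (5, "Another record", 2, "2:45"),
    ],
)

# Pair every row with every other row of the same artist and keep it
# when either name contains the other. DISTINCT removes repeats that
# arise when a row matches more than one partner.
similar = conn.execute(
    """
    SELECT DISTINCT a.id, a.songName, a.artistID, a.duration
    FROM Songs AS a
    JOIN Songs AS b
      ON a.artistID = b.artistID
     AND a.id <> b.id
     AND (a.songName LIKE '%' || b.songName || '%'
          OR b.songName LIKE '%' || a.songName || '%')
    ORDER BY a.id
    """
).fetchall()

for row in similar:
    print(row)
conn.close()
```

Two things to keep in mind: any % or _ inside a song name acts as a LIKE wildcard here, and the join compares every pair of rows per artist, so it gets slow on a large table. A stored hash column scales better.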
