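After importing song information into my SQLite database, I want to use a SELECT statement to find all songs that might be duplicates, using the following criterion:
songName is similar or equal to the songName of any other row in the same table (Songs), and artistID is the same in both rows. This should work without knowing the contents of songName in advance. If I wanted to compare one known song name against all the other song names in the database, I could use "songName LIKE '%known name%'", but how do I find all duplicates without it?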
Example Songs table:
id  songName           artistID  duration
--------------------------------------------
0   This is a song     5         3:43
1   Another song       3         3:23
2   01-This is a song  5         3:42
3   song               4         4:01
4   song               4         6:33
5   Another record     2         2:45
Expected result:
id  songName           artistID  duration
--------------------------------------------
0   This is a song     5         3:43
2   01-This is a song  5         3:42
3   song               4         4:01
4   song               4         6:33
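One way to state the criterion above directly in SQL is a self-join in which each row is compared with every other row of the same artist. The following is only a minimal sketch run through Python's sqlite3 module; it interprets "similar" as "one name contains the other", which is an assumption, and the helper names (demo_connection, find_duplicates) are made up for illustration.

```python
import sqlite3

# Sample rows copied from the Songs table in the question.
SAMPLE_ROWS = [
    (0, "This is a song", 5, "3:43"),
    (1, "Another song", 3, "3:23"),
    (2, "01-This is a song", 5, "3:42"),
    (3, "song", 4, "4:01"),
    (4, "song", 4, "6:33"),
    (5, "Another record", 2, "2:45"),
]

# Self-join: a row is reported when another row of the same artist has a
# name that contains, or is contained in, its own name (assumed meaning
# of "similar"). SQLite's LIKE is case-insensitive for ASCII letters.
FIND_DUPLICATES = """
SELECT DISTINCT a.id, a.songName, a.artistID, a.duration
FROM Songs AS a
JOIN Songs AS b
  ON a.artistID = b.artistID
 AND a.id <> b.id
 AND (a.songName LIKE '%' || b.songName || '%'
      OR b.songName LIKE '%' || a.songName || '%')
ORDER BY a.id
"""


def demo_connection():
    """Build an in-memory database holding the sample rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
        "artistID INTEGER, duration TEXT)"
    )
    conn.executemany("INSERT INTO Songs VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def find_duplicates(conn):
    """Return every row that has at least one look-alike by the same artist."""
    return conn.execute(FIND_DUPLICATES).fetchall()


if __name__ == "__main__":
    for row in find_duplicates(demo_connection()):
        print(row)
```

On the sample data this returns the rows with id 0, 2, 3 and 4. Two caveats: any % or _ characters inside a song name act as LIKE wildcards, and the join compares every pair of rows per artist, so it may be slow on large tables.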
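Edit: Since the idea of creating hashes and comparing them has been suggested, I am considering using this pseudo-function to create a hash for each song name: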
Public Function createHash(ByVal phrase As String) As String
    'convert to lower case
    phrase = LCase(phrase)
    'split the phrase into words
    Dim words() As String = phrase.Replace("_", " ").Split(" "c)
    Dim hash As String = ""
    For w = 0 To words.Length - 1
        'remove noise words (a, an, the, etc.)
        words(w) = removeNoiseWords(words(w))
        'convert 1 or 2-digit numbers to corresponding words
        words(w) = number2word(words(w))
    Next
    'rebuild using replaced words and remove spaces
    hash = String.Join("", words)
    'convert upper ascii into alphabetic (ie. ñ = n, Ö = O, etc.)
    hash = removeUnsupChars(hash, True)
    'strip away all remaining non-alphanumeric characters
    hash = REGEX_Replace(hash, "[^A-Za-z0-9]", "")
    Return hash
End Function
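Because the pseudo-function relies on helpers that are not shown (removeNoiseWords, number2word, removeUnsupChars), here is a hedged, runnable Python sketch of the same steps. The noise-word list, the number-to-word rule (whole words of one or two digits only) and the accent stripping via Unicode decomposition are all assumptions standing in for those helpers, not the original implementations.

```python
import re
import unicodedata

# Assumed noise-word list; the original removeNoiseWords helper is not shown.
NOISE_WORDS = {"a", "an", "the"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_word(word):
    """Spell out a word that consists solely of 1 or 2 digits (assumed rule)."""
    if not re.fullmatch(r"[0-9]{1,2}", word):
        return word
    n = int(word)
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (ONES[ones] if ones else "")


def create_hash(phrase):
    """Normalise a song name following the steps of the pseudo-function."""
    # Lower-case, treat underscores as spaces, split into words.
    words = phrase.lower().replace("_", " ").split(" ")
    # Drop noise words and spell out short numbers.
    words = ["" if w in NOISE_WORDS else number_to_word(w) for w in words]
    joined = "".join(words)
    # Replace accented letters with their base letter (ñ -> n, ö -> o).
    decomposed = unicodedata.normalize("NFKD", joined)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Strip everything that is still not alphanumeric.
    return re.sub(r"[^A-Za-z0-9]", "", ascii_only)


if __name__ == "__main__":
    for title in ("This is a song", "01-This is a song", "21 Guns"):
        print(title, "->", create_hash(title))
```

Under these assumptions "This is a song" becomes "thisissong" while "01-This is a song" becomes "01thisissong", because "01-this" is not a pure number. If track-number prefixes such as "01-" should be ignored, they would have to be removed in an extra step.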
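Once the hash has been computed, I will store it in each record and then select duplicates using count(hash)>1. After that, I will use .NET code to check whether the artistID of the returned records is the same.
So far this solution seems to work well. Here is the SQLite statement I use to find duplicate songs:
SELECT count(*), hash FROM Songs GROUP BY hash HAVING count(hash) > 1 ORDER BY hash;
This gives me a list of all hashes that occur more than once. I store these results in an array, then loop over the array and simply fetch the details with the following statement: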
For i = 0 To dupeHashes.Count - 1
    SQLconnect.Open()
    SQLcommand = SQLconnect.CreateCommand
    'hash contains only [A-Za-z0-9], so concatenating it cannot break the quoting
    SQLcommand.CommandText = "SELECT * from Songs WHERE hash = '" & dupeHashes(i) & "';"
    SQLreader = SQLcommand.ExecuteReader()
    While SQLreader.Read()
        'get whatever data needed for each duplicate song
    End While
    'release the reader before disposing of the command
    SQLreader.Close()
    SQLcommand.Dispose()
    SQLconnect.Close()
Next
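The two-step approach above (list the repeated hashes, then query once per hash) could probably be collapsed into a single statement, with the artistID check moved into SQL as well. The sketch below assumes a Songs table that already has a hash column; the hash values and the row with id 6 are hand-written placeholders for illustration, and the function names are hypothetical.

```python
import sqlite3

# Placeholder rows: (id, songName, artistID, hash). The hash values are
# written by hand, and row 6 is invented to show a same-hash song by a
# different artist.
PLACEHOLDER_ROWS = [
    (1, "Another song", 3, "anothersong"),
    (3, "song", 4, "song"),
    (4, "song", 4, "song"),
    (6, "Song", 9, "song"),
]


def build_demo_db():
    """Create an in-memory Songs table that includes the stored hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
        "artistID INTEGER, hash TEXT)"
    )
    conn.executemany("INSERT INTO Songs VALUES (?, ?, ?, ?)", PLACEHOLDER_ROWS)
    return conn


def duplicate_rows(conn):
    """Fetch full rows whose (hash, artistID) pair occurs more than once."""
    query = """
    SELECT s.id, s.songName, s.artistID, s.hash
    FROM Songs AS s
    JOIN (SELECT hash, artistID
          FROM Songs
          GROUP BY hash, artistID
          HAVING COUNT(*) > 1) AS d
      ON s.hash = d.hash AND s.artistID = d.artistID
    ORDER BY s.hash, s.id
    """
    return conn.execute(query).fetchall()


def rows_for_hash(conn, hash_value):
    """Per-hash lookup with a bound parameter instead of string concatenation."""
    return conn.execute(
        "SELECT id, songName, artistID, hash FROM Songs WHERE hash = ? ORDER BY id",
        (hash_value,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(duplicate_rows(conn))
    print(rows_for_hash(conn, "song"))
```

Grouping by both hash and artistID means the song by artist 9 is not reported even though its hash matches, so no follow-up check in application code is needed for that condition.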
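Personally, I would add an extra field holding some kind of "hash" of the title. A good function would strip every non-alphabetic character, including spaces, remove any articles (such as "the", "a", "an"), then compute the soundex code of the title and prefix it with the artistId string.
In your example, you would get: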
id  songName           artistID  duration  Hash
----------------------------------------------------
0   This is a song     5         3:43      5.T0021
1   Another song       3         3:23      3.A9872
2   01-This is a song  5         3:42      5.T0021
3   song               4         4:01      4.S0332
4   song               4         6:33      4.S0332
5   Another record     2         2:45      2.A7622
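From there, fetching only the rows with … count(Hash)>1 should be easy…
Note also that I suggest Soundex, but you could create your own function, or adapt an existing one, so that some elements weigh more than others.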
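As a rough illustration of the suggested key, the following Python sketch strips non-letters, drops articles, computes a standard American Soundex code and prefixes the artist ID. The article list and the function names are assumptions. Soundex is implemented by hand here because SQLite's built-in soundex() is only available when the library was compiled with SQLITE_SOUNDEX.

```python
import re

# Assumed article list, taken from the examples in the text.
ARTICLES = {"a", "an", "the"}

# Standard American Soundex digit for each consonant.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(text):
    """Return the 4-character Soundex code of text, or "" if it has no letters."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = CODES.get(c, "")  # vowels and Y map to "" and reset prev
        if digit and digit != prev:
            code += digit
        prev = digit
        if len(code) == 4:
            break
    return code.ljust(4, "0")


def song_key(artist_id, title):
    """Build an artistID-prefixed Soundex key for a title (hypothetical helper)."""
    words = re.findall(r"[A-Za-z]+", title)  # keeps letters only
    kept = [w for w in words if w.lower() not in ARTICLES]
    return f"{artist_id}.{soundex(''.join(kept))}"


if __name__ == "__main__":
    for artist_id, title in ((5, "This is a song"), (5, "01-This is a song"),
                             (4, "song"), (3, "Another song")):
        print(title, "->", song_key(artist_id, title))
```

With this sketch both "This is a song" and "01-This is a song" by artist 5 map to 5.T225, which differs from the illustrative values in the table above. Since Soundex keeps only the first letter and three digits, long titles that start alike ("Another song", "Another record") get the same code, so a custom function may be preferable where that is too coarse.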
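I roughly understand the question, but one point needs clarifying: why does the result not include the record "1 Another song 3 3:23"? It could be considered a duplicate of the records "3 song 4 4:01" and "4 song 4 6:33".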
I just wrote a simple T-SQL script to solve this; it is very inefficient and is for reference only.
-- drop the temp tables only if an earlier run left them behind
if object_id('tempdb..#t') is not null drop table #t;
if object_id('tempdb..#result') is not null drop table #result;
create table #t
(
    id int,
    songName varchar(100),
    artistID int,
    duration varchar(20)
)
insert into #t
select 0, 'This is a song'   , 5, '3:43' union all
select 1, 'Another song'     , 3, '3:23' union all
select 2, '01-This is a song', 5, '3:42' union all
select 3, 'song'             , 4, '4:01' union all
select 4, 'song'             , 4, '6:33' union all
select 5, 'Another record'   , 2, '2:45'
select * from #t
-- create an empty #result with the same columns as #t
select * into #result from #t where 1 = 0
declare @sName varchar(100)
declare @id int
declare @count int
declare c cursor for
    select id, songName from #t
open c
fetch next from c into @id, @sName
while @@FETCH_STATUS = 0
begin
    -- skip songs that were already collected as duplicates
    select @count = COUNT(*) from #result where id = @id
    if @count = 0
    begin
        -- count rows whose name contains the current name (the row itself always matches)
        select @count = COUNT(*) from #t where songName like '%'+@sName+'%'
        --select @count , @sName
        if @count > 1
        begin
            insert into #result select * from #t where songName like '%'+@sName+'%' and id not in (select id from #result)
        end
    end
    fetch next from c into @id, @sName
end
close c
deallocate c
select * from #result