Optimizing a foreach loop in C#: add threads?



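So I have written a mail filtering program, and it works fine in a "test" environment. But when I tried it on the real data set, I waited an hour, and I could probably wait another 10 hours before getting a result.

Here is my loop: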

foreach (var word in mail)
{
    foreach (var wordInSpam in countsWordOccurenceSpam)
    {
        foreach (var wordInOk in countsWordOccurenceOk)
        {
            if (countsWordOccurenceOk.ContainsKey(word.Key) && countsWordOccurenceSpam.ContainsKey(word.Key))
            {
                if (word.Key == wordInOk.Key && word.Key == wordInSpam.Key)
                {
                    //math
                }
            }
            else if (countsWordOccurenceOk.ContainsKey(word.Key) && (!countsWordOccurenceSpam.ContainsKey(word.Key)))
            {
                if (word.Key == wordInOk.Key)
                {
                    //math
                }
            }
            else if (countsWordOccurenceSpam.ContainsKey(word.Key) && (!countsWordOccurenceOk.ContainsKey(word.Key)))
            {
                if (word.Key == wordInSpam.Key)
                {
                    //math
                }
            }
            else
            {
                //math
            }
        }
    }
}

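mail is the dictionary for the mail being checked; it maps each word to the number of times it occurs in that mail. countsWordOccurenceSpam and countsWordOccurenceOk are dictionaries built from many mails; they map each word to its counter.

They are built like this: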

if (openFileDialog.ShowDialog() == true)
{
    foreach (string filename in openFileDialog.FileNames)
    {
        myOkMail.Add(filename);
    }
}
string[] okFiles = myOkMail.ToArray();

var logFile2 = okFiles
    .SelectMany(i => System.IO.File.ReadAllLines(i)).ToList();
countsWordOccurenceOk = okFiles
    .SelectMany(i => System.IO.File.ReadAllLines(i)
        .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', '.' }, StringSplitOptions.RemoveEmptyEntries))
        .Distinct())
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());

When I tested with 50 mails the program worked perfectly, but with 50k spam mails and 50k ham mails... it just doesn't finish. CPU usage is only around 10%.
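For illustration: the loop above visits every (spam word, ok word) pair for every word of the mail, so the work grows with the product of the three dictionary sizes, even though a dictionary can already answer "is this word present, and what is its count?" directly. Below is a minimal sketch of the same four-way branching done with lookups only. It assumes all three dictionaries are Dictionary<string, int>; the names LookupSketch and CountCases are made up for this example, and the case counters merely stand in for the elided math. Note that this runs the math once per mail word, whereas the nested version can run it many times for the same word.

```csharp
using System;
using System.Collections.Generic;

public static class LookupSketch
{
    // Classifies each word of the mail with two hash lookups instead of two inner loops.
    // Returns how many words hit each branch: [both, okOnly, spamOnly, neither].
    // The counters are placeholders for the "math" in the question.
    public static int[] CountCases(
        Dictionary<string, int> mail,        // assumed type, not confirmed by the question
        Dictionary<string, int> spamCounts,
        Dictionary<string, int> okCounts)
    {
        var cases = new int[4];
        foreach (var word in mail)
        {
            // TryGetValue reports presence and fetches the counter in one step.
            bool inOk = okCounts.TryGetValue(word.Key, out int okCount);
            bool inSpam = spamCounts.TryGetValue(word.Key, out int spamCount);

            if (inOk && inSpam) cases[0]++;   // math would use okCount and spamCount
            else if (inOk) cases[1]++;        // math would use okCount
            else if (inSpam) cases[2]++;      // math would use spamCount
            else cases[3]++;                  // math for a word seen in neither set
        }
        return cases;
    }

    public static void Main()
    {
        var mail = new Dictionary<string, int> { { "free", 2 }, { "meeting", 1 }, { "hello", 1 }, { "zzz", 1 } };
        var spam = new Dictionary<string, int> { { "free", 10 }, { "hello", 3 } };
        var ok = new Dictionary<string, int> { { "meeting", 5 }, { "hello", 7 } };

        Console.WriteLine(string.Join(",", CountCases(mail, spam, ok)));
    }
}
```

With this shape the cost per mail depends only on the number of words in the mail, not on the size of the two reference dictionaries.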

Also, it may be worth noting that the "math" part is almost the same in every branch, and looks like this:

else if (countsWordOccurenceSpam.ContainsKey(word.Key) && (!countsWordOccurenceOk.ContainsKey(word.Key)))
{
    if (word.Key == wordInSpam.Key)
    {
        totals = wordInSpam.Value;
        fprob_spam = ((double)wordInSpam.Value) / ile_spam;
        sum_spam = (((weight * probability) + (totals * fprob_spam)) / (totals + weight));
        sum_ok = ((weight * probability) / (totals + weight));
        sum_spam = Math.Pow(sum_spam, word.Value);
        sum_ok = Math.Pow(sum_ok, word.Value);
        cos = countsWordOccurenceOk.Count;
        wp_spam = Math.Pow(sum_spam, (1/cos));
        last_o = Math.Pow(sum_ok, (1 / cos));
        wp_spam_1 = wp_spam_1 * wp_spam;
        last_o_1 = last_o_1 * last_o;
    }
}

Yes, it looks awful. And one thing I still don't understand is why I have to use this to get correct results:

cos = countsWordOccurenceOk.Count;
wp_spam = Math.Pow(sum_spam, (1/cos));
last_o = Math.Pow(sum_ok, (1 / cos));

Because otherwise it gets multiplied by the number of words in the database.

Thanks for the help, 健一
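A possible explanation, reading the loop as posted: in the spam-only branch the test word.Key == wordInSpam.Key stays true for every iteration of the innermost loop over countsWordOccurenceOk, so the math runs countsWordOccurenceOk.Count times for the same word. Multiplying by the cos-th root that many times brings the product back to the intended single factor. The sketch below only demonstrates that arithmetic. It assumes cos is a double; if it were an int, 1 / cos would be integer division and evaluate to 0 for any cos greater than 1. RootSketch and MultiplyRepeatedly are names invented for this example.

```csharp
using System;

public static class RootSketch
{
    // Mimics a branch whose math runs `repeats` times for one word:
    // each pass multiplies the accumulator by value^(1/repeats).
    public static double MultiplyRepeatedly(double value, int repeats)
    {
        double cos = repeats;   // double on purpose, so 1 / cos is a real fraction
        double acc = 1.0;
        for (int i = 0; i < repeats; i++)
        {
            acc *= Math.Pow(value, 1 / cos);
        }
        return acc;             // approximately equal to value, up to rounding
    }

    public static void Main()
    {
        double once = 0.25;
        double viaLoop = MultiplyRepeatedly(0.25, 1000);
        Console.WriteLine("applied once: " + once);
        Console.WriteLine("root applied 1000 times: " + viaLoop);
    }
}
```

If each word's math ran exactly once, the root would presumably no longer be needed.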

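One simple thing you can try is Parallel.ForEach (MSDN), which runs the iterations of a loop on different threads.

You can try replacing the outer foreach and see whether you notice any difference in performance. It should look something like this: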

Parallel.ForEach(mail, this.DoWork);

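Then, in the DoWork method, you run the next loop:

// mail is a dictionary, so each element is a key/value pair rather than a string
// (the int value type is assumed here; match it to mail's actual declaration).
public void DoWork(KeyValuePair<string, int> word)
{
    foreach (var wordInSpam in countsWordOccurenceSpam)
    {
        ...
    }
}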
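One caveat with this approach: the math in the question multiplies into shared variables (wp_spam_1, last_o_1), and updating those from several threads at once is a data race. A rough sketch of one way around it follows, using the Parallel.ForEach overload that keeps a thread-local partial result and merges it under a lock. It assumes mail is a Dictionary<string, int>; ParallelSketch, ParallelProduct and perWordFactor are hypothetical names standing in for the real math. Keep in mind that parallelism only spreads the work across cores; it does not reduce the number of loop iterations.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelSketch
{
    // Multiplies a per-word factor across all words of the mail in parallel.
    // perWordFactor is a hypothetical stand-in for the "math" in the question.
    public static double ParallelProduct(
        Dictionary<string, int> mail,   // assumed type
        Func<KeyValuePair<string, int>, double> perWordFactor)
    {
        double total = 1.0;
        object gate = new object();

        Parallel.ForEach(
            mail,
            () => 1.0,                                        // each worker starts its own partial product
            (word, loopState, partial) => partial * perWordFactor(word), // touches no shared state
            partial => { lock (gate) { total *= partial; } }); // merge once per worker, under a lock

        return total;
    }

    public static void Main()
    {
        var mail = new Dictionary<string, int> { { "free", 2 }, { "meeting", 1 }, { "offer", 3 } };
        double result = ParallelProduct(mail, word => Math.Pow(0.5, word.Value));
        Console.WriteLine(result);
    }
}
```

The lock is taken only once per worker rather than once per word, so it should not become a bottleneck.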
