How do I count the words inside the BODY field of a file?



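The following code reads data from all the ".sgm" files. However, I need to count all the words that appear between the BODY tags in those files (see the sample further down).

How can I do that?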

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Serialization;

namespace Project2
{
    class Program
    {
        static void Main(string[] args)
        {
            string[] parcesPlaces = new string[] { "west-germany", "usa", "france", "uk", "canada", "japan" };
            DirectoryInfo filePaths = new DirectoryInfo(@"D:\project_IAD");
            FileInfo[] Files = filePaths.GetFiles("*.sgm");
            List<TotalBody> allNeedBody = new List<TotalBody>();
            foreach (FileInfo file in Files)
            {
                string fileContent = File.ReadAllText(file.FullName);
                string fileContentCleared = ReplaceHexadecimalSymbols(fileContent);
                string myRootedXml = "<root>" + fileContentCleared + "</root>";
                root result = (root)XmlDeserializeFromString(myRootedXml, typeof(root));
                Console.WriteLine(" Ilość potrzebnych słów: {0}", result.REUTERS.ToList().Count);
                foreach (rootREUTERS rootREUTERS in result.REUTERS)
                {
                    if (rootREUTERS.PLACES.Length != 1)
                    {
                        continue;
                    }
                    else if (!parcesPlaces.Contains(rootREUTERS.PLACES[0]))
                    {
                        continue;
                    }
                    else
                    {
                        if (rootREUTERS.TEXT.BODY != null)
                        {
                            allNeedBody.Add(new TotalBody(rootREUTERS.PLACES[0], rootREUTERS.TEXT.BODY));
                        }
                        else
                        {
                            continue;
                        }
                    }
                }
            }
            Console.WriteLine("Total count words: ");
            Console.WriteLine(allNeedBody.Count);
            Console.ReadKey();
        }

        private static object XmlDeserializeFromString(string v, Type type)
        {
            object result = null;
            using (TextReader reader = new StringReader(v))
            {
                result = new XmlSerializer(type).Deserialize(reader);
            }
            return result;
        }

        private static string ReplaceHexadecimalSymbols(string txt)
        {
            string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
            return Regex.Replace(txt, r, "", RegexOptions.Compiled);
        }
    }
}

File "reut2- 2000 .sgm":

Sample text:
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES> <UNKNOWN>  &#5;&#5;&#5;C T
&#22;&#22;&#1;f0704&#31;reute u f BC-BAHIA-COCOA-REVIEW   02-26
0105</UNKNOWN> <TEXT>&#2; <TITLE>BAHIA COCOA REVIEW</TITLE> <DATELINE>
SALVADOR, Feb 26 - </DATELINE><BODY>**Showers continued throughout the
week in the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao, although
normal humidity levels have not been restored, Comissaria Smith said
in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against
5.81 at the same stage last year. Again it seems that cocoa delivered earlier on consignment was included in the arrivals figures.
Comissaria Smith said there is still some doubt as to how much old crop cocoa is still available as harvesting has practically come to an
end. With total Bahia crop estimates around 6.4 mln bags and sales
standing at almost 6.2 mln there are a few hundred thousand bags still
in the hands of farmers, middlemen, exporters and processors.
There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing dificulties in obtaining
+Bahia superior+ certificates.
In view of the lower quality over recent weeks farmers have sold a good part of their cocoa held on consignment.
Comissaria Smith said spot bean prices rose to 340 to 350 cruzados per arroba of 15 kilos.
Bean shippers were reluctant to offer nearby shipment and only limited sales were booked for March shipment at 1,750 to 1,780 dlrs
per tonne to ports to be named.
New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under
New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB.
Routine sales of butter were made. March/April sold at 4,340, 4,345 and 4,350 dlrs.
April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
Destinations were the U.S., Covertible currency areas, Uruguay and open ports.
Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for
Oct/Dec.
Buyers were the U.S., Argentina, Uruguay and convertible currency areas.
Liquor sales were limited with March/April selling at 2,325 and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New York July,
Aug/Sept at 2,400 dlrs and at 1.25 times New York Sept and Oct/Dec at
1.25 times New York Dec, Comissaria Smith said.
Total Bahia sales are currently estimated at 6.13 mln bags against the 1986/87 crop and 1.06 mln bags against the 1987/88 crop.
Final figures for the period to February 28 are expected to be published by the Brazilian Cocoa Trade Commission after carnival which
ends midday on February 27.**  Reuter &#3;</BODY></TEXT> </REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5545" NEWID="2"> <DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS> <PLACES><D>usa</D></PLACES> <PEOPLE></PEOPLE>
<ORGS></ORGS> <EXCHANGES></EXCHANGES> <COMPANIES></COMPANIES>
<UNKNOWN>  &#5;&#5;&#5;F Y &#22;&#22;&#1;f0708&#31;reute d f
BC-STANDARD-OIL-&lt;SRD>-TO   02-26 0082</UNKNOWN>
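  • Only the words inside the BODY field need to be counted (the part marked in bold in the sample), without other characters and the like.

Sample file for testing suggestions.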

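From your question, it looks like you are building XML-formatted content and deserializing it just to count what is inside. That would be fine if you needed to collect the data, but if the goal is only to count the words between the BODY tags of the document, it is faster to parse the text directly and count on the fly.

My strategy is to take the substring of the content that starts at <BODY>, cut it off at </BODY>, and count the words by splitting on whitespace.

The solution is as follows: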

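// Example folder; adjust the path to wherever your .sgm files live.
DirectoryInfo filePaths = new DirectoryInfo(@"D:\Stackoverflow\SgmCount\docs");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
int wordCount = 0;
foreach (FileInfo file in Files)
{
    string content = File.ReadAllText(file.FullName);
    // Note: only the first <BODY>...</BODY> pair in each file is handled.
    int start = content.IndexOf("<BODY>", StringComparison.Ordinal);
    if (start < 0) continue; // no BODY tag in this file
    start += "<BODY>".Length; // skip past the opening tag itself
    int end = content.IndexOf("</BODY>", start, StringComparison.Ordinal);
    if (end < 0) continue; // BODY tag is never closed
    string body = content.Substring(start, end - start);
    char[] delimiters = { ' ', '\r', '\n' };
    // Add to the running total instead of overwriting it for each file.
    wordCount += body.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
}
Console.WriteLine($"Total count words: {wordCount} words");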

Output:

Total count words: 488 words
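
Since a real .sgm file holds many REUTERS articles, each with its own BODY, the same idea can be extended to every BODY block in a file. The following is only a rough sketch, not part of the original answer: the class name BodyWordCounter and its methods are made up for illustration, and the definition of a "word" (a run of ASCII letters, optionally with inner apostrophes, so numbers and entities such as &#3; are skipped) is an assumption you may want to change.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class BodyWordCounter
{
    // Matches each <BODY>...</BODY> block; Singleline lets '.' span line breaks,
    // and the lazy quantifier stops at the nearest closing tag.
    private static readonly Regex BodyPattern =
        new Regex("<BODY>(.*?)</BODY>", RegexOptions.Singleline);

    // Assumption: a word is a run of ASCII letters, optionally with inner apostrophes.
    private static readonly Regex WordPattern =
        new Regex(@"[A-Za-z]+(?:'[A-Za-z]+)*");

    // Returns the text of every BODY block found in the given content.
    public static IEnumerable<string> ExtractBodies(string content)
    {
        foreach (Match m in BodyPattern.Matches(content))
        {
            yield return m.Groups[1].Value;
        }
    }

    // Counts the words in all BODY blocks of one file's content.
    public static int CountWords(string content)
    {
        return ExtractBodies(content).Sum(body => WordPattern.Matches(body).Count);
    }

    // Sums the word counts of all *.sgm files in a folder.
    public static int CountWordsInFolder(string folder)
    {
        return Directory.EnumerateFiles(folder, "*.sgm")
                        .Sum(path => CountWords(File.ReadAllText(path)));
    }
}
```

It could be called as Console.WriteLine(BodyWordCounter.CountWordsInFolder(@"D:\project_IAD")); with the folder from the question. Because this sketch ignores numbers and punctuation-only tokens, its total will generally differ from a count obtained by splitting on whitespace.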
