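Dynamically populating HashMaps with human-language dictionaries for text analysis

I am writing a software project that takes text in a human language as input and determines which language it is written in.

My idea is to store the dictionaries in hash maps, with the word as the key and a bool as the value.

If the word occurs in the document, I flip its bool to true.

Right now I am trying to think of a good way to read in these dictionaries and put them into the hash maps. My current approach is naive and looks clunky. Is there a better way to populate these hash maps?

Moreover, these dictionaries are huge. Maybe populating all of them one after another like this is not the best way to do it.

I was thinking it might be better to consider just one dictionary at a time and compute a score, i.e. how many words of the input text matched that dictionary, save that score, and then process the next dictionary. That would save memory, wouldn't it? Is this a good solution?

The code so far looks like this: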

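import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

// The fields and main below live inside the program's class.
// One map per language: word -> "was this word seen in the input text?"
static HashMap<String, Boolean>  de_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean>  fr_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean>  ru_map = new HashMap<String, Boolean>();
static HashMap<String, Boolean> eng_map = new HashMap<String, Boolean>();

public static void main(String[] args) throws IOException
{
    ArrayList<File> sub_dirs = new ArrayList<File>();
    final String filePath = "/home/matthias/Desktop/language_detective/word_lists_2";
    // listf is defined elsewhere in the full program (not shown here);
    // it collects the word-list files under filePath into sub_dirs.
    listf( filePath, sub_dirs );
    for(File dir : sub_dirs)
    {
        // Lower-cased once here, so the checks below need no further toLowerCase()
        String word_holding_directory_path = dir.toString().toLowerCase();

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader( dir )))
        {
            String line = null;
            while ((line = br.readLine()) != null)
            {
                //System.out.println(line);
                // The language is taken from the directory name in the path
                if(word_holding_directory_path.contains("/de/") )
                {
                    de_map.put(line, false);
                }
                if(word_holding_directory_path.contains("/ru/") )
                {
                    ru_map.put(line, false);
                }
                if(word_holding_directory_path.contains("/fr/") )
                {
                    fr_map.put(line, false);
                }
                if(word_holding_directory_path.contains("/eng/") )
                {
                    eng_map.put(line, false);
                }
            }
        }
    }
}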

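So I am looking for advice on how to populate them one at a time, opinions on whether that is a good approach, or suggestions for possibly better ways to achieve this.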
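One way the one-dictionary-at-a-time idea could look, as a rough sketch rather than code from the project. It assumes each word list is a UTF-8 file with one word per line, and it uses a Set instead of a map of booleans, since only membership is needed for a score. DictionaryScorer, loadDictionary, score, and scoreAll are hypothetical names introduced here for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; the class and method names are not from the original program.
class DictionaryScorer {

    // Read one word list (assumed: one word per line, UTF-8) into a set.
    static Set<String> loadDictionary(Path wordList) throws IOException {
        try (Stream<String> lines = Files.lines(wordList, StandardCharsets.UTF_8)) {
            return lines.map(String::trim)
                        .filter(word -> !word.isEmpty())
                        .map(word -> word.toLowerCase(Locale.ROOT))
                        .collect(Collectors.toCollection(HashSet::new));
        }
    }

    // Count how many of the input words occur in the dictionary.
    static int score(List<String> inputWords, Set<String> dictionary) {
        int hits = 0;
        for (String word : inputWords) {
            if (dictionary.contains(word.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    // Score the input against each word list in turn. Only one dictionary is
    // referenced at a time, so earlier ones can be garbage-collected.
    static Map<String, Integer> scoreAll(List<String> inputWords,
                                         Map<String, Path> wordListByLanguage) throws IOException {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Path> entry : wordListByLanguage.entrySet()) {
            Set<String> dictionary = loadDictionary(entry.getValue());
            scores.put(entry.getKey(), score(inputWords, dictionary));
        }
        return scores;
    }
}
```

Calling scoreAll with the tokenized input and a map such as "de" to the German word-list path would return one hit count per language; the language with the highest count is the guess. Whether this actually saves memory depends on nothing else holding on to the loaded sets.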

The full program can be found on my GitHub page.


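Language identification is a well-studied task, and there are many good libraries for it. For Java, try TIKA, or the Language Detection Library for Java (they report "over 99% precision for 53 languages"), or TextCat, or LingPipe. I would suggest starting with the first one; it seems to have the most detailed tutorial.

If your task is too specific for the existing libraries (which I doubt, though), refer to this survey paper and adapt the closest techniques.

If you do want to reinvent the wheel, e.g. for self-learning purposes, note that identification can be treated as a special case of text classification, and read this basic tutorial on text classification.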
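To make the "special case of text classification" point concrete, here is a very small sketch: each language is represented by the character-trigram counts of some sample text, and the input is assigned to the language whose counts are most similar. The choice of trigrams and cosine similarity, the class name NgramLanguageGuesser, and its method names are all illustrative assumptions, not something taken from the linked tutorial or from any of the libraries above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy classifier; all names here are made up for illustration.
class NgramLanguageGuesser {

    // Count the overlapping character trigrams of the lower-cased text.
    static Map<String, Integer> profile(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors (0.0 if either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Return the language whose sample text is most similar to the input,
    // or null if the input shares no trigram with any sample.
    static String classify(String text, Map<String, String> sampleTextByLanguage) {
        Map<String, Integer> inputProfile = profile(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : sampleTextByLanguage.entrySet()) {
            double similarity = cosine(inputProfile, profile(e.getValue()));
            if (similarity > bestScore) {
                bestScore = similarity;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With realistic amounts of sample text per language this kind of profile comparison needs no full dictionary, but for anything beyond self-study the libraries above are the safer choice.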
