Extract sentences from text and store each sentence in a data structure



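I want to extract sentences from a text. The text is full of paragraphs and of `!`, `.`, or other sentence delimiters. I can do this with regular expressions, but I would like to do it without a regex library. Is there a C++ class that can split text into sentences?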

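Otherwise, the alternative is to compare each character against the delimiter characters, but I don't know how to do that with a vector. Any help is appreciated.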

Here is the version that works with a regular expression:

#include <algorithm>
#include <string>
#include <vector>
#include <iostream>
#include <iterator>
#include <boost/regex.hpp>
int main()
{
  /* Input. */
  std::string input = "Here is a short sentence. Here is another one. And we say \"this is the final one.\", which is another example.";
  /* Define sentence boundaries (mod_x lets the pattern contain layout whitespace). */
  boost::regex re("(?: [\\.\\!\\?]\\s+" // case 1: punctuation followed by whitespace
                  "|   \\.\\\",?\\s+"   // case 2: end of quotation (." with optional comma)
                  "|   \\s+\\\")",      // case 3: start of quotation (whitespace before ")
           boost::regex::perl | boost::regex::mod_x);
  /* Iterate through sentences; -1 selects the text between matches. */
  boost::sregex_token_iterator it(begin(input),end(input),re,-1);
  boost::sregex_token_iterator endit;
  /* Copy them onto a vector. */
  std::vector<std::string> vec;
  std::copy(it,endit,std::back_inserter(vec));
  /* Output the vector, one sentence per line, so we can check. */
  std::copy(begin(vec),end(vec),
            std::ostream_iterator<std::string>(std::cout,"\n"));
  return 0;
}

Using a brute-force approach... I hope I understood your requirement correctly...

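#include <cstddef>
#include <vector>
#include <string>
#include <iostream>
int main()
{
    std::string input = "Here is a short sentence. Here is another one. And we say \"this  is the final one.\", which is another example.";
    std::size_t i = 0;
    std::vector<std::string> sentences;
    std::string current;
    while(i < input.length())
    {
        current += input[i];
        if(input[i] == '"')
        {
            // Copy quoted text verbatim so punctuation inside quotes
            // does not end the sentence.
            std::size_t j = i + 1;
            while(j < input.length() && input[j] != '"')
            {
                current += input[j];
                j++;
            }
            if(j < input.length())
                current += input[j]; // closing quote
            i = j; // the i++ below then moves past the closing quote
        }
        else if(input[i] == '.' || input[i] == '!' || input[i] == '?')
        {
            // Sentence terminator: store the sentence and start a new one.
            sentences.push_back(current);
            current.clear();
        }
        i++;
    }
    // Keep any trailing text that has no terminator.
    if(!current.empty())
        sentences.push_back(current);
    for(i = 0; i < sentences.size(); i++)
    {
        std::cout << i << " -> " << sentences[i] << std::endl;
    }
}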

Obviously it needs further refinement, such as removing repeated spaces, etc.
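As one possible way to do the whitespace cleanup mentioned above, here is a minimal sketch. The helper name `normalize_whitespace` is hypothetical and not part of the original answer; it assumes that collapsing every run of whitespace to a single space and trimming both ends is the behavior you want.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the original answer): collapses each run of
// whitespace into one space and drops leading and trailing whitespace.
std::string normalize_whitespace(const std::string& text)
{
    std::string result;
    bool pending_space = false;
    for (unsigned char ch : text)
    {
        if (std::isspace(ch))
        {
            // Remember the gap, but only if some text has already been emitted,
            // so leading whitespace is never written.
            pending_space = !result.empty();
        }
        else
        {
            // Emit a single space for the whole preceding gap, then the character.
            if (pending_space)
                result += ' ';
            result += static_cast<char>(ch);
            pending_space = false;
        }
    }
    // A trailing gap is still pending here and is simply never written.
    return result;
}
```

One way to use it would be to call `sentences.push_back(normalize_whitespace(current));` instead of pushing `current` directly, so each stored sentence is already cleaned up.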
