Using jsoup to selectively extract text and some useful tags from XML



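I am trying to use jsoup to extract text from XML while keeping some of the tags, because they are useful. How can I do that?

Maybe something like iterating over the document, pulling out a component by its tag, then iterating over that component and extracting further based on the nested tags. But I can't work out how to do it.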

           for (Element item : doc.select("sentence"))
           {
               for (Element component : item.children())
               {
                   // get the tag of the sentence and the words of the
                   // sentence, as described below
               }
           }

I have an XML document marked up like this:

<sentences>
  <sentence id="1">
    <tokens>
      <token id="1">
        <word>The</word>
        <CharacterOffsetBegin>0</CharacterOffsetBegin>
        <CharacterOffsetEnd>3</CharacterOffsetEnd>
      </token>
      <token id="2">
        <word>newspaper</word>
        <CharacterOffsetBegin>4</CharacterOffsetBegin>
        <CharacterOffsetEnd>13</CharacterOffsetEnd>
      </token>
      <token id="3">
        <word>cartoons</word>
        <CharacterOffsetBegin>14</CharacterOffsetBegin>
        <CharacterOffsetEnd>22</CharacterOffsetEnd>
      </token>
      <token id="4">
        <word>here</word>
        <CharacterOffsetBegin>23</CharacterOffsetBegin>
        <CharacterOffsetEnd>27</CharacterOffsetEnd>
      </token>
      <token id="5">
        <word>often</word>
        <CharacterOffsetBegin>28</CharacterOffsetBegin>
        <CharacterOffsetEnd>33</CharacterOffsetEnd>
      </token>
      <token id="6">
        <word>portray</word>
        <CharacterOffsetBegin>34</CharacterOffsetBegin>
        <CharacterOffsetEnd>41</CharacterOffsetEnd>
      </token>
      <token id="7">
        <word>Per-Kristian</word>
        <CharacterOffsetBegin>42</CharacterOffsetBegin>
        <CharacterOffsetEnd>54</CharacterOffsetEnd>
      </token>
      <token id="8">
        <word>Foss</word>
        <CharacterOffsetBegin>55</CharacterOffsetBegin>
        <CharacterOffsetEnd>59</CharacterOffsetEnd>
      </token>
      <token id="9">
        <word>,</word>
        <CharacterOffsetBegin>59</CharacterOffsetBegin>
        <CharacterOffsetEnd>60</CharacterOffsetEnd>
      </token>
      <token id="10">
        <word>the</word>
        <CharacterOffsetBegin>61</CharacterOffsetBegin>
        <CharacterOffsetEnd>64</CharacterOffsetEnd>
      </token>
      <token id="11">
        <word>finance</word>
        <CharacterOffsetBegin>65</CharacterOffsetBegin>
        <CharacterOffsetEnd>72</CharacterOffsetEnd>
      </token>
      <token id="12">
        <word>minister</word>
        <CharacterOffsetBegin>73</CharacterOffsetBegin>
        <CharacterOffsetEnd>81</CharacterOffsetEnd>
      </token>
      <token id="13">
        <word>of</word>
        <CharacterOffsetBegin>82</CharacterOffsetBegin>
        <CharacterOffsetEnd>84</CharacterOffsetEnd>
      </token>
      <token id="14">
        <word>Norway</word>
        <CharacterOffsetBegin>85</CharacterOffsetBegin>
        <CharacterOffsetEnd>91</CharacterOffsetEnd>
      </token>
      <token id="15">
        <word>,</word>
        <CharacterOffsetBegin>91</CharacterOffsetBegin>
        <CharacterOffsetEnd>92</CharacterOffsetEnd>
      </token>
      <token id="16">
        <word>buoyed</word>
        <CharacterOffsetBegin>93</CharacterOffsetBegin>
        <CharacterOffsetEnd>99</CharacterOffsetEnd>
      </token>
      <token id="17">
        <word>by</word>
        <CharacterOffsetBegin>100</CharacterOffsetBegin>
        <CharacterOffsetEnd>102</CharacterOffsetEnd>
      </token>
      <token id="18">
        <word>a</word>
        <CharacterOffsetBegin>103</CharacterOffsetBegin>
        <CharacterOffsetEnd>104</CharacterOffsetEnd>
      </token>
      <token id="19">
        <word>spouting</word>
        <CharacterOffsetBegin>105</CharacterOffsetBegin>
        <CharacterOffsetEnd>113</CharacterOffsetEnd>
      </token>
      <token id="20">
        <word>geyser</word>
        <CharacterOffsetBegin>114</CharacterOffsetBegin>
        <CharacterOffsetEnd>120</CharacterOffsetEnd>
      </token>
      <token id="21">
        <word>of</word>
        <CharacterOffsetBegin>121</CharacterOffsetBegin>
        <CharacterOffsetEnd>123</CharacterOffsetEnd>
      </token>
      <token id="22">
        <word>oil</word>
        <CharacterOffsetBegin>124</CharacterOffsetBegin>
        <CharacterOffsetEnd>127</CharacterOffsetEnd>
      </token>
      <token id="23">
        <word>.</word>
        <CharacterOffsetBegin>127</CharacterOffsetBegin>
        <CharacterOffsetEnd>128</CharacterOffsetEnd>
      </token>
    </tokens>
  </sentence>

The ideal output would look like this:

<sentence id="1">
    The newspaper cartoons here often portray Per-Kristian Foss, the finance minister of Norway, buoyed by a spouting geyser of oil. 
</sentence>
and so on. The rest of the document may contain many sentences, or only one.

So far I have tried:

String sentence = doc.select("sentence").text();

But all I get is this mess:

The 0 3 newspaper 4 13 cartoons 14 22 here 23 27 often 28 33 portray 34 41 Per-Kristian 42

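           int index = 1; // running sentence number
           for (Element item : doc.select("sentence"))
           {
               System.out.println("<sentence> " + index);
               // text() on the selected <word> elements joins their text with spaces
               String words = item.select("word").text();
               System.out.println(words);
               System.out.println("</sentence>" + "\n");
               index++;
           }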
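
Joining the words with a single space puts a space before every punctuation token ("Foss , the"), while the ideal output above has "Foss, the". One possible way around this, sketched below, is to use the CharacterOffsetBegin and CharacterOffsetEnd values: insert a space only where there is a gap between the end of one token and the start of the next. The class SentenceRebuilder and its methods rebuild and wrap are hypothetical names made up for this illustration and are not part of jsoup. The sketch uses only the JDK and assumes the words and offsets have already been read out of the document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of jsoup): rebuilds a sentence from its tokens.
class SentenceRebuilder {

    // Assumes words.get(i) covers the character range
    // [begins.get(i), ends.get(i)) of the original text.
    static String rebuild(List<String> words, List<Integer> begins, List<Integer> ends) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            // A gap between the previous token's end and this token's begin
            // means the original text had whitespace at that position.
            if (i > 0 && begins.get(i) > ends.get(i - 1)) {
                sb.append(' ');
            }
            sb.append(words.get(i));
        }
        return sb.toString();
    }

    // Wraps the rebuilt text in a <sentence> tag that keeps the original id.
    static String wrap(String id, String text) {
        return "<sentence id=\"" + id + "\">\n    " + text + "\n</sentence>";
    }

    public static void main(String[] args) {
        // Tokens 7 to 10 of the sample sentence, copied from the XML above.
        List<String> words = Arrays.asList("Per-Kristian", "Foss", ",", "the");
        List<Integer> begins = Arrays.asList(42, 55, 59, 61);
        List<Integer> ends = Arrays.asList(54, 59, 60, 64);
        System.out.println(wrap("1", rebuild(words, begins, ends)));
    }
}
```

With jsoup, the three lists could presumably be filled inside the existing loop over doc.select("sentence") by iterating over item.select("token") and reading the text of each token's word, CharacterOffsetBegin and CharacterOffsetEnd children, with item.attr("id") supplying the id. That wiring is left out here so the sketch stays independent of any particular jsoup version.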
