StringTokenizer in JAVA



StringTokenizer is used to tokenize a tagged string in Java. The string is tagged with Stanford's part-of-speech MaxentTagger, and substrings of the tagged text are then used to iterate over it and print just the POS tag and the word.

Here is the text before tagging:

Man has always had this notion that brave deeds are manifest in physical actions. While it is not entirely erroneous, there doesn't lie the singular path to valor. From of old, it is a sign of strength to fight back a wild animal. It is understandable if fought in defense; however, to go the extra mile and instigate an animal and fight it is the lowest degree of civilization man can exhibit. More so, in this age of reasoning and knowledge. Tradition may call it, but adhering blindly to it is idiocy, be it the famed Jallikattu in Tamil Nadu (The Indian equivalent to the Spanish Bullfighting) or the cock-fights. Pelting stones at a dog and relishing it howl in pain is dreadful. If one only gave as much as a trickle of thought and conscience the issue would surface as deplorable in every aspect. Animals play a part along with us in our ecosystem. And, some animals are dearer: the stray dogs that guard our street, the intelligent crow, the beast of burden and the everyday animals of pasture. Literature has voiced in its own way: In The Lord of the Rings the fellowship treated Bill Ferny's pony with utmost care; in Harry Potter when they didn't heed Hermione's advice on the treatment of house elves they learned the hard way that it caused their own undoing; and Jack London, writes all about animals. Indeed, Kindness to animals is a virtue.

And here is the POS-tagged text:

Man_NN has_VBZ always_RB had_VBN this_DT notion_NN that_IN brave_VBP deeds_NNS are_VBP manifest_JJ in_IN physical_JJ actions_NNS ._. While_IN it_PRP is_VBZ not_RB entirely_RB erroneous_JJ ,_, there_EX does_VBZ n't_RB lie_VB the_DT singular_JJ path_NN to_TO valor_NN ._. From_IN of_IN old_JJ ,_, it_PRP is_VBZ a_DT sign_NN of_IN strength_NN to_TO fight_VB back_RP a_DT wild_JJ animal_NN ._. It_PRP is_VBZ understandable_JJ if_IN fought_VBN in_IN defense_NN ;_: however_RB ,_, to_TO go_VB the_DT extra_JJ mile_NN and_CC instigate_VB an_DT animal_NN and_CC fight_VB it_PRP is_VBZ the_DT lowest_JJS degree_NN of_IN civilization_NN man_NN can_MD exhibit_VB ._. More_RBR so_RB ,_, in_IN this_DT age_NN of_IN reasoning_NN and_CC knowledge_NN ._. Tradition_NN may_MD call_VB it_PRP ,_, but_CC adhering_JJ blindly_RB to_TO it_PRP is_VBZ idiocy_NN ,_, be_VB it_PRP the_DT famed_JJ Jallikattu_NNP in_IN Tamil_NNP Nadu_NNP -LRB-_-LRB- The_DT Indian_JJ equivalent_NN to_TO the_DT Spanish_JJ Bullfighting_NN -RRB-_-RRB- or_CC the_DT cock-fights_NNS ._. Pelting_VBG stones_NNS at_IN a_DT dog_NN and_CC relishing_VBG it_PRP howl_NN in_IN pain_NN is_VBZ dreadful_JJ ._. If_IN one_CD only_RB gave_VBD as_RB much_JJ as_IN a_DT trickle_VB of_IN thought_NN and_CC conscience_NN the_DT issue_NN would_MD surface_VB as_IN deplorable_JJ in_IN every_DT aspect_NN ._. Animals_NNS play_VBP a_DT part_NN along_IN with_IN us_PRP in_IN our_PRP$ ecosystem_NN ._. And_CC ,_, some_DT animals_NNS are_VBP dearer_RBR :_: the_DT stray_JJ dogs_NNS that_WDT guard_VBP our_PRP$ street_NN ,_, the_DT intelligent_JJ crow_NN ,_, the_DT beast_NN of_IN burden_NN and_CC the_DT everyday_JJ animals_NNS of_IN pasture_NN ._. Literature_NN has_VBZ voiced_VBN in_IN its_PRP$ own_JJ way_NN :_: In_IN The_DT Lord_NN of_IN the_DT Rings_NNP the_DT fellowship_NN treated_VBN Bill_NNP Ferny_NNP 's_POS pony_NN with_IN utmost_JJ care_NN ;_: in_IN Harry_NNP Potter_NNP when_WRB they_PRP did_VBD n't_RB heed_VB Hermione_NNP 's_POS advice_NN on_IN the_DT treatment_NN of_IN house_NN elves_NNS they_PRP learned_VBD the_DT hard_JJ way_NN that_IN it_PRP caused_VBD their_PRP$ own_JJ undoing_NN ;_: and_CC Jack_NNP London_NNP ,_, writes_VBZ all_DT about_IN animals_NNS ._. Indeed_RB ,_, Kindness_NN to_TO animals_NNS is_VBZ a_DT virtue_NN ._.

Below is the code that tries to obtain the substrings described above:

String line;
StringBuilder sb=new StringBuilder();
try(FileInputStream input = new FileInputStream("E:\\D.txt"))
    {
    int data = input.read();
    while(data != -1)
        {
        sb.append((char)data);
        data = input.read();
        }
    }
catch(IOException e)//read() throws IOException, which FileNotFoundException alone does not cover
{
    System.err.println("IOException : " + e.getMessage());
}
line=sb.toString();
String line1=line;//Copy for Tagger
line+=" T";       
List<String> sentenceList = new ArrayList<String>();//TAGGED DOCUMENT
MaxentTagger tagger = new MaxentTagger("E:\\Installations\\Java\\Tagger\\english-left3words-distsim.tagger");
String tagged = tagger.tagString(line1);
File file = new File("A.txt");
BufferedWriter output = new BufferedWriter(new FileWriter(file));
output.write(tagged);
output.close();
DocumentPreprocessor dp = new DocumentPreprocessor("C:\\Users\\Admin\\workspace\\Project\\A.txt");
int largest=50;
int m=0;
StringTokenizer st1;
for (List<HasWord> sentence : dp) 
{
   String sentenceString = Sentence.listToString(sentence);
   sentenceList.add(sentenceString.toString());
}
String[][] Gloss=new String[sentenceList.size()][largest];
String[] Adj=new String[largest];
String[] Adv=new String[largest];
String[] Noun=new String[largest];
String[] Verb=new String[largest];
int adj=0,adv=0,noun=0,verb=0;
for(int i=0;i<sentenceList.size();i++)
{
    st1= new StringTokenizer(sentenceList.get(i)," ,(){}[]/.;:&?!");
    m=0;//Count for Gloss 2nd dimension
    //GETTING THE POS's COMPARTMENTALISED
    while(st1.hasMoreTokens())
    {
        String token=st1.nextToken();
        if(token.length()>1)//TO SKIP PAST TOKENS FOR PUNCTUATION MARKS
        {
        System.out.println(token);
        String s=token.substring(token.lastIndexOf("_")+1,token.length());
        System.out.println(s);
        if(s.equals("JJ")||s.equals("JJR")||s.equals("JJS"))
        {
            Adj[adj]=token.substring(0,token.lastIndexOf("_"));
            System.out.println(Adj[adj]);
            adj++;
        }
        if(s.equals("NN")||s.equals("NNS"))
        {
            Noun[noun]=token.substring(0,  token.lastIndexOf("_"));
            System.out.println(Noun[noun]);
            noun++;
        }
        if(s.equals("RB")||s.equals("RBR")||s.equals("RBS"))
        {
            Adv[adv]=token.substring(0,token.lastIndexOf("_"));
            System.out.println(Adv[adv]);
            adv++;
        }
        if(s.equals("VB")||s.equals("VBD")||s.equals("VBG")||s.equals("VBN")||s.equals("VBP")||s.equals("VBZ"))
        {
            Verb[verb]=token.substring(0,token.lastIndexOf("_"));
            System.out.println(Verb[verb]);
            verb++;
        }
        }
    }
    i++;//TO SKIP PAST THE LINES WHERE AN EXTRA UNDERSCORE OCCURS FOR FULLSTOP
 }

D.txt contains the plain text.

About the problem:

Every word gets tokenized at the whitespace, except 'n't_RB', which gets tokenized separately as n't and RB.

The output looks like this:

Man_NN
NN
Man
has_VBZ 
VBZ
has
always_RB
RB
always
had_VBN
VBN
had
this_DT
DT
notion_NN
NN
notion
that_IN
IN
brave_VBP
VBP
brave
deeds_NNS
NNS
deeds
are_VBP
VBP
are
manifest_JJ
JJ
manifest
in_IN
IN
physical_JJ
JJ
physical
actions_NNS
NNS
actions
While_IN
IN
it_PRP
PRP
is_VBZ
VBZ
is
not_RB
RB
not
entirely_RB
RB
entirely
erroneous_JJ
JJ
erroneous
there_EX
EX
does_VBZ
VBZ
does
n't
n't
RB
RB

But if I run just 'there_EX does_VBZ n't_RB lie_VB' through the tokenizer on its own, 'n't_RB' stays together as a single token. When I run the full program I get a StringIndexOutOfBoundsException, which is understandable because there is no '_' in 'n't' or in 'RB'. Could someone take a look? Thanks a lot.

The DocumentPreprocessor documentation says:

NOTE: If a null argument is used, then the document is assumed to be already tokenized and DocumentPreprocessor performs no tokenization.

Since the document you load from the file was already tokenized in the first step of your program, you should do the following:

DocumentPreprocessor dp = new DocumentPreprocessor("./data/stanford-nlp/A.txt");
dp.setTokenizerFactory(null);
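
For reference, here is a minimal self-contained sketch of how that fix slots into the sentence-collection loop from the question. It assumes the Stanford CoreNLP classes are on the classpath; the class name TaggedFileReader is just for illustration, and "A.txt" is the tagged file written earlier in the question:

import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.process.DocumentPreprocessor;

public class TaggedFileReader {
    public static void main(String[] args) {
        // A.txt already contains tokens like "did_VBD n't_RB heed_VB ...",
        // so a null tokenizer factory tells DocumentPreprocessor to treat
        // the file as pre-tokenized and not to re-tokenize "n't_RB".
        DocumentPreprocessor dp = new DocumentPreprocessor("A.txt");
        dp.setTokenizerFactory(null);

        List<String> sentenceList = new ArrayList<String>();
        for (List<HasWord> sentence : dp) {
            sentenceList.add(Sentence.listToString(sentence));
        }
        System.out.println(sentenceList.size() + " tagged sentences read");
    }
}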

It then outputs the n't tokens correctly, e.g.

...
did_VBD
VBD
did
n't_RB
RB
n't
heed_VB
VB
heed
Hermione_NNP
NNP
's_POS
POS
...

The method lastIndexOf returns -1 when it fails to find the character. The exception you are getting comes from calling substring with that index when lastIndexOf does not find the '_' in the string.

What I think you can do is check whether the index is different from -1 and only then use it. With that check you avoid the nasty exception. Unfortunately, without the full input text it is hard to tell which strings do not contain the specific character you are looking for.
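
A minimal sketch of that check (the class name and the sample tokens are just for illustration):

public class TokenGuard {
    public static void main(String[] args) {
        // One well-formed word_TAG token plus the two pieces that the
        // re-tokenization produced from "n't_RB".
        String[] tokens = { "does_VBZ", "n't", "RB" };
        for (String token : tokens) {
            int idx = token.lastIndexOf('_');
            if (idx != -1) {
                String word = token.substring(0, idx);
                String tag = token.substring(idx + 1);
                System.out.println(word + " -> " + tag);
            } else {
                // Without this guard, substring(0, -1) is what throws
                // StringIndexOutOfBoundsException.
                System.out.println("skipping token without a tag: " + token);
            }
        }
    }
}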

For completeness, I think you also need to fix the way you collect all the POS elements. In my opinion the string matrix is error-prone (you have to work out how to manage the indices yourself) and also inefficient for this kind of task.

Perhaps you could use a Multimap to associate each POS type with all the elements that belong to it. I think you could manage everything better that way.
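
For instance, a hedged sketch using Guava's ArrayListMultimap (this assumes Guava is on the classpath; a plain Map<String, List<String>> from the JDK would work just as well):

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

public class PosBuckets {
    public static void main(String[] args) {
        // One bucket per POS tag instead of four parallel String arrays.
        Multimap<String, String> wordsByTag = ArrayListMultimap.create();
        String tagged = "Man_NN has_VBZ always_RB had_VBN this_DT notion_NN";
        for (String token : tagged.split("\\s+")) {
            int idx = token.lastIndexOf('_');
            if (idx == -1) {
                continue; // ignore tokens without a word_TAG separator
            }
            wordsByTag.put(token.substring(idx + 1), token.substring(0, idx));
        }
        System.out.println(wordsByTag.get("NN")); // [Man, notion]
        System.out.println(wordsByTag.get("RB")); // [always]
    }
}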

I would try String.split() instead of StringTokenizer:

String str = "Man_NN has_VBZ always_RB had_VBN this_DT notion_NN that_IN brave_VBP deeds_NNS are_VBP manifest_JJ in_IN physical_JJ actions_NNS ._. While_IN it_PRP is_VBZ not_RB entirely_RB erroneous_JJ ,_, there_EX does_VBZ n't_RB lie_VB the_DT singular_JJ path_NN to_TO valor_NN ._. From_IN of_IN old_JJ ,_, it_PRP is_VBZ a_DT sign_NN of_IN strength_NN to_TO fight_VB back_RP a_DT wild_JJ animal_NN ._. It_PRP is_VBZ understandable_JJ if_IN fought_VBN in_IN defense_NN ;_: however_RB ,_, to_TO go_VB the_DT extra_JJ mile_NN and_CC instigate_VB an_DT animal_NN and_CC fight_VB it_PRP is_VBZ the_DT lowest_JJS degree_NN of_IN civilization_NN man_NN can_MD exhibit_VB ._. More_RBR so_RB ,_, in_IN this_DT age_NN of_IN reasoning_NN and_CC knowledge_NN ._. Tradition_NN may_MD call_VB it_PRP ,_, but_CC adhering_JJ blindly_RB to_TO it_PRP is_VBZ idiocy_NN ,_, be_VB it_PRP the_DT famed_JJ Jallikattu_NNP in_IN Tamil_NNP Nadu_NNP -LRB-_-LRB- The_DT Indian_JJ equivalent_NN to_TO the_DT Spanish_JJ Bullfighting_NN -RRB-_-RRB- or_CC the_DT cock-fights_NNS ._. Pelting_VBG stones_NNS at_IN a_DT dog_NN and_CC relishing_VBG it_PRP howl_NN in_IN pain_NN is_VBZ dreadful_JJ ._. If_IN one_CD only_RB gave_VBD as_RB much_JJ as_IN a_DT trickle_VB of_IN thought_NN and_CC conscience_NN the_DT issue_NN would_MD surface_VB as_IN deplorable_JJ in_IN every_DT aspect_NN ._. Animals_NNS play_VBP a_DT part_NN along_IN with_IN us_PRP in_IN our_PRP$ ecosystem_NN ._. And_CC ,_, some_DT animals_NNS are_VBP dearer_RBR :_: the_DT stray_JJ dogs_NNS that_WDT guard_VBP our_PRP$ street_NN ,_, the_DT intelligent_JJ crow_NN ,_, the_DT beast_NN of_IN burden_NN and_CC the_DT everyday_JJ animals_NNS of_IN pasture_NN ._. Literature_NN has_VBZ voiced_VBN in_IN its_PRP$ own_JJ way_NN :_: In_IN The_DT Lord_NN of_IN the_DT Rings_NNP the_DT fellowship_NN treated_VBN Bill_NNP Ferny_NNP 's_POS pony_NN with_IN utmost_JJ care_NN ;_: in_IN Harry_NNP Potter_NNP when_WRB they_PRP did_VBD n't_RB heed_VB Hermione_NNP 's_POS advice_NN on_IN the_DT treatment_NN of_IN house_NN elves_NNS they_PRP learned_VBD the_DT hard_JJ way_NN that_IN it_PRP caused_VBD their_PRP$ own_JJ undoing_NN ;_: and_CC Jack_NNP London_NNP ,_, writes_VBZ all_DT about_IN animals_NNS ._. Indeed_RB ,_, Kindness_NN to_TO animals_NNS is_VBZ a_DT virtue_NN ._. ";
for(String word : str.split("\\s+")){
    String[] parts = word.split("_");
    if(parts.length==2){
        String filteredWord = parts[0];
        String wordType     = parts[1];
        System.out.println(word+" = "+filteredWord+ " - "+wordType );
    }
}

The output looks like:

Man_NN = Man - NN
has_VBZ = has - VBZ
always_RB = always - RB
had_VBN = had - VBN
this_DT = this - DT
notion_NN = notion - NN
that_IN = that - IN
brave_VBP = brave - VBP
deeds_NNS = deeds - NNS
are_VBP = are - VBP
manifest_JJ = manifest - JJ
in_IN = in - IN
physical_JJ = physical - JJ
actions_NNS = actions - NNS
......

As for why only 'n't_RB' gets split into n't and RB:

StringTokenizer stk = new StringTokenizer("n't_RB","_");
while(stk.hasMoreTokens()){
    System.out.println(stk.nextToken());
}

This splits it correctly:

n't
RB
