How to extract text from messy strings in Java



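I am reading a text file that contains movie titles, years, languages, and so on, and I am trying to extract those attributes.

Suppose some of the strings look like this: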

 String s = "\"A Fatal Inversion\" (1992)";
 String d = "(aka \"Verhngnisvolles Erbe\" (1992))    (Germany)";
 String f = "\"#Yaprava\" (2013) ";
 String g = "(aka \"Love Heritage\" (2002)) (International: English title)";

How can I grab the title, the year, the country (if specified), and the kind of title (if specified)?

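I'm not very good with regex and patterns, and I don't know how to work out which kind of attribute a piece of text is when it isn't labeled. I'm doing this because I'm trying to generate XML from the text file. I have a DTD for it, but I'm not sure whether I need it in this case.

Edit: here is what I have tried.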

    String pattern;
    // Text between the first pair of double quotes (the title)
    Pattern p = Pattern.compile("\"([^\"]*)\"");
    Matcher m;

    // First run of digits (the year)
    Pattern number = Pattern.compile("\\d+");
    Matcher num;
    m = p.matcher(s);
    num = number.matcher(s);
    if (m.find()) {
        System.out.println(m.group(1));
    }
    if (num.find()) {
        System.out.println(num.group(0));
    }

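I suggest you extract the year first, since its format looks fairly consistent. Then I would extract the country, if there is one, and treat whatever remains as the title.

To extract the country, I suggest hard-coding a regex pattern with the names of the known countries. It may take a few iterations to work out what they are, because they look very inconsistent.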
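As a minimal sketch of the hard-coded country pattern, the alternation can be built from a list so that it is easy to extend on each iteration. The class name `CountryFinder`, the list contents beyond "Germany", and the `International: ...` catch-all are assumptions made for illustration; they are not taken from the real data set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, shown only to illustrate the idea.
class CountryFinder {
    // Assumed list of known countries; extend it as new values turn up in the file.
    static final List<String> KNOWN_COUNTRIES = Arrays.asList("Germany", "France", "USA");

    // Matches "(Germany)", "(France)", "(USA)" or "(International: <anything>)".
    // Pattern.quote keeps names containing regex metacharacters literal.
    static final Pattern COUNTRY = Pattern.compile(
            "\\(("
            + KNOWN_COUNTRIES.stream().map(Pattern::quote).collect(Collectors.joining("|"))
            + "|International: [^)]*)\\)");

    // Returns the first bracketed country found in s, if any.
    static Optional<String> find(String s) {
        Matcher m = COUNTRY.matcher(s);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(find("(aka \"Verhngnisvolles Erbe\" (1992))    (Germany)"));
        System.out.println(find("\"#Yaprava\" (2013) "));
    }
}
```

Keeping the names in a list means a newly discovered country is a one-line change instead of an edit to the regex itself.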

The following code is a bit ugly (but so is the data!):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Extraction {
    public final String original;
    public String year = "";
    public String title = "";
    public String country = "";
    // Whatever is left of the input after each extraction step
    private String remaining;

    public Extraction(String s) {
        this.original = s;
        this.remaining = s;
        extractBracketedYear();
        extractBracketedCountry();
        // Anything not recognized as year or country is treated as the title
        this.title = remaining;
    }

    // Removes "(1992)"-style groups from remaining and records the last one as the year
    private void extractBracketedYear() {
        Matcher matcher = Pattern.compile(" ?\\(([0-9]+)\\) ?").matcher(remaining);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            this.year = matcher.group(1);
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        remaining = sb.toString();
    }

    // Removes a bracketed known country from remaining and records it
    private void extractBracketedCountry() {
        Matcher matcher = Pattern.compile("\\((Germany|International: English.*?)\\)").matcher(remaining);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            this.country = matcher.group(1);
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        remaining = sb.toString();
    }

    public static void main(String... args) {
        for (String s : new String[] {
                "A Fatal Inversion (1992)",
                "(aka \"Verhngnisvolles Erbe\" (1992))    (Germany)",
                "\"#Yaprava\" (2013) ",
                "(aka \"Love Heritage\" (2002)) (International: English title)"}) {
            Extraction extraction = new Extraction(s);
            System.out.println("title   = " + extraction.title);
            System.out.println("country = " + extraction.country);
            System.out.println("year    = " + extraction.year);
            System.out.println();
        }
    }
}

Output:

title   = A Fatal Inversion
country = 
year    = 1992

title   = (aka "Verhngnisvolles Erbe")    
country = Germany
year    = 1992

title   = "#Yaprava"
country = 
year    = 2013

title   = (aka "Love Heritage") 
country = International: English title
year    = 2002

Once you have this data, you can manipulate it further (for example, "International: English title" -> "England").
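One possible further step, sketched here under assumptions, is to clean up the leftover title: strip the `(aka ...)` wrapper and the surrounding double quotes, and use the presence of `aka` as a hint that this is an alternate title. The `TitleCleaner` class and its method names are hypothetical, and treating `aka` as "alternate title" is an interpretation of the sample strings, not something the data format is known to guarantee.

```java
// Hypothetical post-processing helper for the title left over after extraction.
class TitleCleaner {
    // Assumption: a leftover title that starts with "(aka " is an alternate title.
    static boolean isAlternate(String raw) {
        return raw.trim().startsWith("(aka ");
    }

    // Strips an enclosing "(aka ...)" wrapper and one pair of surrounding double quotes.
    static String clean(String raw) {
        String t = raw.trim();
        if (t.startsWith("(aka ") && t.endsWith(")")) {
            // Drop the leading "(aka " (5 characters) and the trailing ")"
            t = t.substring(5, t.length() - 1).trim();
        }
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1);
        }
        return t;
    }

    public static void main(String[] args) {
        for (String raw : new String[] {
                "A Fatal Inversion",
                "(aka \"Verhngnisvolles Erbe\")    ",
                "\"#Yaprava\"",
                "(aka \"Love Heritage\") "}) {
            System.out.println(clean(raw) + " (alternate: " + isAlternate(raw) + ")");
        }
    }
}
```

The closing parenthesis is removed only when the `(aka ` prefix is present, so a title that legitimately ends in a parenthesis is left alone.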
