OutOfMemoryError in StringBuilder and HashSet



I have a JSON file (.json) in Amazon S3. I need to read it and create a new field called Hash_index for each JsonObject. The file is very large, so I am using the GSON library to avoid out-of-memory errors while reading it. My code is below. Note that I am using GSON.

//Create the hashed JSON
public void createHash() throws IOException
{
    System.out.println("Hash Creation Started");
    strBuffer = new StringBuffer("");

    try
    {
        //List all the buckets
        List<Bucket> buckets = s3.listBuckets();
        for (int i = 0; i < buckets.size(); i++)
        {
            System.out.println("- " + buckets.get(i).getName());
        }

        //Download the object
        System.out.println("Downloading Object");
        S3Object s3Object = s3.getObject(new GetObjectRequest(inputBucket, inputFile));
        System.out.println("Content-Type: " + s3Object.getObjectMetadata().getContentType());

        //Read the JSON file with GSON's streaming reader
        JsonReader reader = new JsonReader(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
        reader.beginArray();
        int gsonVal = 0;
        while (reader.hasNext())
        {
            JsonParser _parser = new JsonParser();
            JsonElement jsonElement = _parser.parse(reader);
            JsonObject jsonObject1 = jsonElement.getAsJsonObject();

            StringBuffer hashIndex = new StringBuffer("");
            //Join title and body into one string
            String titleAndBodyContainer = jsonObject1.get("title") + " " + jsonObject1.get("body");

            //Remove full stops and commas (the backslashes must be doubled in a Java string literal)
            titleAndBodyContainer = titleAndBodyContainer.replaceAll("\\.(?=\\s|$)", " ");
            titleAndBodyContainer = titleAndBodyContainer.replaceAll(",", " ");
            titleAndBodyContainer = titleAndBodyContainer.toLowerCase();

            //Create a word list without duplicated words
            StringBuilder result = new StringBuilder();
            HashSet<String> set = new HashSet<String>();
            for (String s : titleAndBodyContainer.split(" ")) {
                if (!set.contains(s)) {
                    result.append(s);
                    result.append(" ");
                    set.add(s);
                }
            }

            //Re-arrange everything into alphabetical order
            String[] finalWordHolder = result.toString().split(" ");
            Arrays.sort(finalWordHolder);

            //Walk the sorted words and build the hash
            for (int arrayCount = 0; arrayCount < finalWordHolder.length; arrayCount++)
            {
                if (wordMap.containsKey(finalWordHolder[arrayCount]))
                {
                    hashIndex.append((String) wordMap.get(finalWordHolder[arrayCount]));
                }
            }
            jsonObject1.addProperty("hash_index", hashIndex.toString().trim());
            jsonObject1.addProperty("primary_key", gsonVal);
            jsonObjectHolder.add(jsonObject1); //Add the JSON object to the JSON collection
            jsonHashHolder.add(hashIndex.toString().trim());
            System.out.println("Primary Key: " + jsonObject1.get("primary_key"));
            gsonVal++;
        }
        System.out.println("Hash Creation Completed");
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}

When I execute this code, I get the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at HashCreator.createHash(HashCreator.java:252)
at HashCreator.<init>(HashCreator.java:66)
at Main.main(Main.java:9)
[root@ip-172-31-45-123 JarFiles]#

Line 252 is result.append(s);. It is inside the HashSet loop.

Previously it threw the OutOfMemoryError at line 254. Line 254 is set.add(s);, which is also inside the same HashSet loop.

My JSON file is really, really large: gigabytes, even terabytes. I don't know how to avoid the problem described above.

Use a streaming JSON library such as Jackson. Read a few JSON objects, add the hashes, and write them out. Then read some more, process them, and write those out too. Keep going until every object has been processed.
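Even without switching libraries, the same read-process-write pattern works with Gson itself: pair the JsonReader the question already uses with a JsonWriter, and emit each object as soon as it has been processed instead of collecting everything in jsonObjectHolder. A minimal sketch (class and method names are invented for illustration; it assumes the input is a top-level JSON array, and adds only a primary_key field for brevity):

```java
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class GsonStreamRewrite {

    // Read ONE object at a time from the array and write it straight back out,
    // so only a single JsonObject is ever held in memory at once.
    static void rewrite(Reader in, Writer out) throws IOException {
        Gson gson = new Gson();
        try (JsonReader reader = new JsonReader(in);
             JsonWriter writer = new JsonWriter(out)) {
            reader.beginArray();
            writer.beginArray();
            int primaryKey = 0;
            while (reader.hasNext()) {
                JsonObject obj = new JsonParser().parse(reader).getAsJsonObject();
                obj.addProperty("primary_key", primaryKey++); // hash_index would be added here too
                gson.toJson(obj, writer);                     // written out immediately, not collected
            }
            reader.endArray();
            writer.endArray();
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        rewrite(new StringReader("[{\"title\":\"a\"},{\"title\":\"b\"}]"), out);
        System.out.println(out);
    }
}
```

Because each JsonObject becomes garbage as soon as it is written, heap usage stays roughly constant no matter how large the file is; the Reader and Writer could just as well wrap the S3 object stream and an upload stream.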

http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example

(See also this StackOverflow post: Is there a streaming API for JSON?)
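A sketch of that pattern with the Jackson 2.x streaming API (class and method names are invented for illustration; it assumes a top-level JSON array and, for brevity, adds only a primary_key field, reading from and writing to strings rather than S3 streams):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.IOException;
import java.io.StringWriter;

public class JacksonStreamRewrite {

    // Stream the top-level array: pull one object into memory, modify it,
    // write it to the generator, and let it be garbage collected.
    static String addKeys(String inputJson) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory();
        StringWriter out = new StringWriter();
        try (JsonParser parser = factory.createParser(inputJson);
             JsonGenerator gen = factory.createGenerator(out)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("Expected a top-level JSON array");
            }
            gen.writeStartArray();
            int primaryKey = 0;
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                ObjectNode obj = (ObjectNode) mapper.readTree(parser); // reads exactly one object
                obj.put("primary_key", primaryKey++);
                mapper.writeTree(gen, obj);                            // emitted immediately
            }
            gen.writeEndArray();
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(addKeys("[{\"title\":\"a\"},{\"title\":\"b\"}]"));
    }
}
```

readTree consumes exactly one value starting at the current token, so each iteration materializes a single object; everything before and after it stays on disk (or in the network stream) rather than on the heap.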

Latest update