Reading a large txt file (2 GB) into a string takes too long



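I have a large text file (2 GB) that contains several books. I want to create a char** that holds every word in the whole file. But first I read all of the file's data into one huge string, and then build the char** variable from that string.

The problem is that it takes far too long (hours) for the getline() loop to finish. I ran it for 30 minutes and in that time the program read 500,000 lines. The whole file has 43 million lines.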

int main() {
    ifstream book;
    string sbook, str;
    book.open("gutenberg.txt"); // the huge file
    cout << "Reading the file ....." << endl;
    while (!book.eof()) {
        getline(book, sbook); // read the next line into sbook
        if (str.empty()) {
            str = sbook;
        }
        else
            str = str + " " + sbook; // append sbook to str until the end of the file
    } // I never managed to get out of this loop
    cout << "Done reading the file." << endl;
    cout << "Removal....." << endl;
    removal(str); // removes all punctuation and converts each uppercase letter to lowercase
    cout << "done removal" << endl;
    cout << "Removing doublewhitespaces...." << endl;
    int whitespaces = removedoublewhitespace(str); // collapses runs of spaces so only one separates each pair of words
    // and returns the number of whitespace characters that remain
    cout << "doublewhitespaces removed." << endl;
    cout << "initiating leksis....." << endl;
    char **leksis = new char*[whitespaces + 1]; // whitespaces+1 is how many words are left in the file
    for (int i = 0; i < whitespaces + 1; i++) {
        leksis[i] = new char[30];
    }
    cout << "done initiating leksis." << endl;
    int y = 0, j = 0;
    cout << "constructing leksis,finding plithos...." << endl;
    for (int i = 0; i < str.length(); i++) {
        if (isspace(str[i])) {
            y++;
            j = 0;
            leksis[y][j] = ' ';
            j++;
        }
        else {
            leksis[y][j] = str[i];
            j++;
        }
    }
    cout << "Done constructing leksis,finding plithos...." << endl;

The removal() function:

void removal(string &s) {
    for (int i = 0, len = s.size(); i < len; i++)
    {
        if (isupper(s[i])) {
            s[i] = tolower(s[i]);
        }
        int flag = ispunct(s[i]);
        if (flag) {
            s.erase(i--, 1); // erase the punctuation character and re-check this index
            len = s.size();
        }
    }
}

The removedoublewhitespace() function:

int removedoublewhitespace(string &str) {
    int wcnt = 0;
    for (int i = str.size() - 1; i > 0; i--) // stop at 1 so that str[i-1] stays in bounds
    {
        if (str[i] == ' ' && str[i] == str[i-1])
        {
            str.erase(str.begin() + i);
        }
    }
    for (int i = 0; i < str.size(); i++) {
        if (isspace(str[i])) {
            wcnt++;
        }
    }
    return wcnt;
}

This loop

while (!book.eof()) {
    getline(book, sbook); // read the next line into sbook
    if (str.empty()) {
        str = sbook;
    }
    else
        str = str + " " + sbook;
}

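is very inefficient. Concatenating onto a large string like this is terrible: every `str = str + " " + sbook` builds a new temporary that copies everything read so far, so the total work grows roughly with the square of the file size. If you must hold the whole file in memory at once, put it in a linked list of strings, one string per line. Or use a vector of strings, which is also a huge block of memory, but it will be allocated much more efficiently.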
