C/C90: Counting words in a large text file



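I have a text file of roughly 30000 words. My goal is to count the actual number of words (bear in mind that it contains multiple punctuation marks and consecutive spaces, as well as words joined by - (e.g. three-legged), so counting spaces alone is not correct).

I have managed to count the total number of characters, but I am struggling with the words. Any help?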

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 50

char *getfile(void);
void stats(char *filename);

int main() {
    char *file;
    file = getfile();
    stats(file);
    return 0;
}

char *getfile(void) {
    char *filename;
    FILE *fp;
    filename = malloc(SIZE);
    printf("Enter the name of the text file: ");
    scanf("%49s", filename);
    fp = fopen(filename, "r");
    printf("\n");
    if (fp == NULL) {
        printf("The entered file does not exist.");
        printf("\n");
    } else {
        printf("The file exists.");
        fclose(fp);
    }
    return filename;
}

void stats(char *filename) {
    int cnt = 0, space = 0, lines = 0;
    int c;
    int count = 0;
    FILE *fp;
    fp = fopen(filename, "r");
    while (((c = fgetc(fp)) != EOF)) {
        cnt++;
        if (c == ' ') {
            space++;
        }
        if (c == '\n' || c == '\0') {
            lines++;
        }
    }
    printf("\nTotal characters in file: %d", cnt);
    printf("\nTotal characters (excluding spaces) in file: %d", cnt - space);

    fclose(fp);
    return;
}

You should list all the characters that can separate words, and count each run of separator characters.
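A minimal sketch of that idea, assuming the text is already in memory as a NUL-terminated string. The names `SEPARATORS`, `is_separator` and `count_words` are hypothetical, and the separator list is only a guess at what the file contains. Rather than counting separator runs directly, it counts the starts of non-separator runs, which gives the same result without needing special cases for text that begins or ends with a separator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical separator set: adjust it to the text being counted.
 * '-' and '\'' are deliberately absent, so "three-legged" and "don't"
 * each count as one word. */
static const char SEPARATORS[] = " \t\n\r\f\v.,;:!?\"()[]{}";

/* Returns nonzero if c is one of the separator characters. */
static int is_separator(int c)
{
    /* strchr also matches the terminating '\0', so exclude it first. */
    return c != '\0' && strchr(SEPARATORS, c) != NULL;
}

/* Counts words in a NUL-terminated string: a word starts wherever a
 * non-separator follows a separator (or the start of the string). */
size_t count_words(const char *text)
{
    size_t words = 0;
    int in_word = 0;

    for (; *text != '\0'; text++) {
        if (is_separator((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

Because the hyphen is not a separator here, a free-standing dash surrounded by spaces would be counted as a word; whether that matters depends on the text.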

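The reason you are having trouble is that you have no state, that is, no classification of what came before. You could use other methods to break the file into words, but a state machine is simple and fast. As the comments and the other answer suggest, you need two states: white-space and word. It is a bit like a one-bit derivative, where the rising edge, white-space to word, is the thing you count.

Stripping out most of the extraneous stuff, this is how you might do a state machine.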

#include <stdio.h>

int main(void) {
    unsigned char buf[16384 /*50*/]; /* 50 is small. */
    enum { WHITE, WORD } state = WHITE;
    size_t cnt = 0, lines = 0, words = 0, nread, i;
    do { /* Fill `buf`. */
        nread = fread(buf, 1, sizeof buf, stdin);
        if(ferror(stdin)) { perror("wc"); return 1; }
        cnt += nread;
        for(i = 0; i < nread; i++) { /* Char-by-char in `buf`. */
            unsigned char c = buf[i];
            /* https://en.cppreference.com/w/cpp/string/byte/isspace */
            switch(c) {
            case '\n':
                lines++; /* Fall-through. Doesn't handle CRs properly. */
            case '\0': case ' ': case '\f': case '\r': case '\t': case '\v':
                state = WHITE;
                break;
            default:
                if(state == WORD) break;
                state = WORD;
                words++;
                break;
            }
        }
    } while(nread == sizeof buf);
    printf("Total characters in file: %lu\n", (unsigned long)(cnt - lines));
    printf("Total lines in file: %lu\n", (unsigned long)lines);
    printf("Total words in file: %lu\n", (unsigned long)words);
    return 0;
}

For brevity, I offloaded some of the work to the hosted environment (./wc < file.txt), and I used a buffer.
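If the file has to be opened by name, as in the question, the same two-state idea can be applied to a `FILE *`. The following is only a sketch under that assumption: `count_words_stream` and `count_words_in_file` are hypothetical names, and it classifies characters with `isspace` from `<ctype.h>`, which, unlike the switch above, does not treat '\0' as white-space.

```c
#include <ctype.h>
#include <stdio.h>

/* Counts white-space-delimited words on an already opened stream. */
unsigned long count_words_stream(FILE *fp)
{
    unsigned long words = 0;
    int in_word = 0;
    int c;

    /* fgetc returns an unsigned char value or EOF, so c is safe to
     * pass to isspace. */
    while ((c = fgetc(fp)) != EOF) {
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1; /* Rising edge: white-space to word. */
            words++;
        }
    }
    return words;
}

/* Opens `filename` and stores its word count in *out.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
int count_words_in_file(const char *filename, unsigned long *out)
{
    FILE *fp = fopen(filename, "r");
    int failed;

    if (fp == NULL) return -1;
    *out = count_words_stream(fp);
    failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : 0;
}
```

Checking the result of `fopen` before reading avoids passing a null pointer to `fgetc` when the entered name does not exist.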
