Why are some words printed twice when counting word frequency?

I read words from a file and print the 30 most frequent ones, but some words are printed twice, as you can see in the output.

#include <algorithm>  // sort
#include <cctype>     // isalpha
#include <iostream>
#include <vector>
#include <map>
#include <iterator>
#include <fstream>
#include <string>
using namespace std;

int main(){
  
  fstream fs, output;
  fs.open("/Users/brah79/Downloads/skola/c++/inlämningsuppgifter/labb4/L4_wc/hitchhikersguide.txt");
  output.open("/Users/brah79/Downloads/skola/c++/inlämningsuppgifter/labb4/labb4/output.txt");
  if(!fs.is_open() || !output.is_open()){
    cout << "could not open file" << endl; 
  }

  map <string, int> mp; 
  string word; 
  while(fs >> word){

    for(int i = 0; i < word.length(); i++){
        if(!isalpha(word[i])){
        word.erase(i--, 1);
      }
    }
    if(word.empty()){
        continue; 
    }

  
    mp[word]++; 
  }
  vector<pair<int, string>> v;
  v.reserve(mp.size());

  for (const auto& p : mp){
    v.emplace_back(p.second, p.first);
  }

  sort(v.rbegin(), v.rend()); 

  cout << "Theese are the 30 most frequent words: " << endl; 
  for(int i = 0; i < 30; i++){
      cout << v[i].second << " : " << v[i].first << " times" << endl;
  }


  output << "Theese are the 30 most frequent words: " << endl; 
  for(int i = 0; i < 30; i++){
      cout << v[i].second << " : " << v[i].first << " times" << endl;
  }
 

  return 0; 
}

output:

the : 2230 times !!!
of : 1254 times
to : 1177 times
a : 1121 times
and : 1109 times
said : 680 times
it : 665 times
was : 605 times
in : 590 times
he : 546 times
that : 520 times
you : 495 times
I : 428 times
on : 349 times
Arthur : 332 times
his : 324 times
Ford : 314 times
The : 307 times !!!
at : 306 times
for : 284 times
is : 281 times
with : 273 times
had : 252 times
He : 242 times
this : 220 times
as : 207 times
Zaphod : 206 times
be : 188 times
all : 186 times
him : 182 times

"the" is printed twice. Also, "could not open file" is printed at the top even though the file was opened and its content is stored in the map.

There is 1 answer

user12002570 (best answer)

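Because you've written your program in a case-sensitive manner.

In particular, "The" and "the" are considered different from each other: they are distinct keys in the map, because std::string comparison is case-sensitive, and so they have separate frequencies. For example, "the" appears 2230 times while "The" appears 307 times. The same happens with "he" (546 times) and "He" (242 times) in your output.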
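One way to merge those entries is to fold every word to lower case before using it as a map key. The following is only a minimal sketch of that idea, not the asker's code: the helper names normalize, count_words and top_words are made up for illustration, and it assumes plain ASCII text (accented letters such as "ä" would be dropped by isalpha in the default locale).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: keep only letters and fold them to lower case,
// so "The", "the" and "the," all become the key "the".
std::string normalize(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {  // unsigned char keeps the <cctype> calls well-defined
        if (std::isalpha(c)) {
            out.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    return out;
}

// Hypothetical helper: count normalized words read from any input stream.
std::map<std::string, int> count_words(std::istream& in) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word) {
        const std::string key = normalize(word);
        if (!key.empty()) {  // skip tokens that were pure punctuation or digits
            ++counts[key];
        }
    }
    return counts;
}

// Hypothetical helper: at most n (count, word) pairs, highest count first.
std::vector<std::pair<int, std::string>> top_words(
        const std::map<std::string, int>& counts, std::size_t n) {
    std::vector<std::pair<int, std::string>> v;
    v.reserve(counts.size());
    for (const auto& p : counts) {
        v.emplace_back(p.second, p.first);
    }
    std::sort(v.rbegin(), v.rend());  // descending by count, then by word
    if (v.size() > n) {
        v.resize(n);  // also safe when there are fewer than n distinct words
    }
    return v;
}
```

To use it with the program from the question, pass the opened file stream to count_words and loop over top_words(counts, 30). Note that "I" will then be reported as "i". As for the "could not open file" message, one possible cause (an assumption, since the question does not say whether output.txt already existed) is that a std::fstream opened with the default mode fails when the file does not exist, whereas std::ofstream would create it.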