How can I count words or tokens in my code?


There are all sorts of tools for counting lines of code in a source file or directory tree (e.g. cloc). There are also tools for counting words in a plain text file (wc).

How would I go about counting words or tokens in my code, though? Is this feasible without writing a full-fledged program of my own to do it, using some generic programming language parsing mechanism like tree-sitter? More specifically, can I do this with shell tools or a simple(ish) script?

Note: Only words/tokens outside of comments must be counted. For general word counting I'm sure there are other questions on SO...

Example: Suppose my code is in the C language, and my foo.c file contains

int /* this is
a multi-line
comment!
*/
foo(int x) { 
    /* comment 1 */
    return 123;  // comment 2
}

The exact number expected here would depend on whether we think of braces and semicolons as words/tokens to count. If we do, then this should be 11 tokens: int, foo, (, int, x, ), {, return, 123, ;, }. If we ignore them (which I would rather not, but it could still be a legitimate approach) then we have 6 words: int, foo, int, x, return, 123.


There are 2 answers

Gilles Quénot

File

$ cat file
foo bar base base
lorem ipsum doloris
qux aze qwe base

With perl, consider this simple, concise snippet:

$ perl -snE '$c += s/\bbase\b/$&/g;END{say $c}' file
3

With a bash loop:

for word in $(< file); do
    [[ $word == base ]] && ((c++))
done
echo "$c"

With grep:

printf '%s\n' $(< file) | grep -wc base 

With tr and awk:

tr ' ' $'\n' < file | awk '$1=="base"{c++}END{print c}'
David C. Rankin

Total Non-Comment Tokens Per-Line

Edit: my bad, I went off @Gilles's example and missed the comment part. Per your example, using C/C++ comments and ignoring everything between /* and */, the per-line non-comment tokens can be obtained with awk using a counter tokens and a flag skip, checking whether a field consists of "//", "/*" or "*/" (as you show whitespace surrounding each). A simple awk script to process the file into non-comment, whitespace-separated tokens could be:

#!/bin/awk -f

{
  tokens = 0
  skip = 0
  for (i=1; i<=NF; i++) {
    if ($i == "//") {
      break
    }
    if ($i == "/*") {
      skip = 1
    }
    if (!skip) {
      tokens++
    }
    if ($i == "*/") {
      skip = 0
    }
  }
  printf "line %d: %d tokens\n", FNR, tokens
}

(Note: splitting individual tokens out of non-whitespace-separated C, e.g. "foo(int", isn't addressed. If parsing at that level is needed, then reinventing the wheel with awk may not be your best choice. However, adding conditions to ignore fields comprised solely of (, {, [ or ], }, ) is easy to do.)
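The bracket-ignoring condition mentioned above could look like this (a sketch using a character class that matches a field consisting of exactly one bracket character):

```shell
# Sketch: count fields, skipping any field that is exactly one of ( { [ ] } )
echo 'foo ( bar ) { baz }' \
  | awk '{ c = 0; for (i = 1; i <= NF; i++) if ($i !~ /^[][(){}]$/) c++; print c }'
```

This prints 3 (foo, bar, baz); note the `]` must come first inside the bracket expression to be taken literally.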

The single rule iterates over each field and checks for a comment opener. In the case of "//", the remainder of the line is ignored. In the case of "/*", the skip flag is set and no more tokens are counted until a closing "*/" is encountered on that line.

Example Use/Output

Modified example file:

$ cat file
foo bar // base base
lorem ipsum doloris
qux /* aze */ qwe base

If you named your awk script noncmttokens.awk and made it executable with chmod +x noncmttokens.awk, then all you need to do is run it providing file as the argument, e.g.

$ ./noncmttokens.awk file
line 1: 2 tokens
line 2: 3 tokens
line 3: 3 tokens
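If a whole-file total is wanted rather than per-line counts, the per-line output can be summed with a second awk stage (an addition of mine, not part of the script itself; the count is field 3 of each "line N: C tokens" record):

```shell
# Sum the per-line token counts emitted by the script above
./noncmttokens.awk file | awk '{ total += $3 } END { print total, "tokens total" }'
```

With the output shown above, this prints `8 tokens total`.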

Sorry about overlooking the comment verbiage in the question, I got off track using the example file from the other answer -- happens...


Adding Multi-line Comment Handling and Splitting on "("

To process your file into the tokens you desire, while assuming that all comment open/close markers are whitespace-separated and only splitting non-whitespace-separated tokens on "(", you can do:

#!/bin/awk -f

BEGIN {
  tokens_in_file = 0    # initialize vars that are persistent across records
  skip = 0
}

{
  tokens_in_line = 0    # per-record reset of variables
  ndx = 1
}

skip {  # if in multi-line comment
  for (ndx=1; ndx<=NF; ndx++) {   # iterate fields
    if ($ndx == "*/") {           # check for multi-line close
      skip = 0;                   # unset skip flag
      ndx++                       # increment field index
      break
    }
  }
  if (skip) {   # still in multi-line comment
    ndx = 1
    printf "line %d: %d tokens\n", FNR, tokens_in_line
    next
  }
}

{
  for (i=ndx; i<=NF; i++) {   # process fields from ndx to last
    if ($i ~/^[({})]$/) {     # ignore "(, {, }, )" fields
      continue
    }
    if ($i == "//") {         # C++ rest of line comment
      break
    }
    if ($i == "/*") {         # multi-line opening
      if (skip) {             # handle malformed multi-line error
        print "error: duplicate multi-line comment entry tokens"
      }
      skip = 1                # set skip flag
    }
    if (!skip) {              # if not skip, process toks, split on "("
      tokens_in_line += split ($i, tok_arr, "(")
    }
    if ($i == "*/") {         # check if last field multi-line close
      skip = 0
    }
  }
  # output per-line stats, add tokens_in_line to tokens_in_file
  printf "line %d: %d tokens\n", FNR, tokens_in_line
  tokens_in_file += tokens_in_line
}

END { # output file stats
  printf "\nidentified %d tokens in %d lines\n", tokens_in_file, FNR
}

Example Use/Output

With the sample file you provide in file2.c, e.g.

$ cat file2.c
int /* this is
a multi-line
comment!
*/
foo(int x) {
    /* comment 1 */
    return 123;  // comment 2
}

Providing that file as the argument to the expanded awk script you would get:

$ ./noncmttokens2.awk file2.c
line 1: 1 tokens
line 2: 0 tokens
line 3: 0 tokens
line 4: 0 tokens
line 5: 3 tokens
line 6: 0 tokens
line 7: 2 tokens
line 8: 0 tokens

identified 6 tokens in 8 lines

awk can handle just about anything you need to do in a highly efficient manner, but as mentioned in the comments, I suspect that as more detail is added it will become more of a job of reinventing what the compiler does in one of its compilation phases. This splitting of tokens is rudimentary, and the number of corner cases that would need to be handled, e.g. for obfuscated C/C++ code, grows rapidly.

Hopefully this provides what you need.