There are all sorts of tools for counting lines of code in a source file or directory tree (e.g. cloc). There are also tools for counting words in a plain text file (wc).
How would I go about counting words or tokens in my code, though? Is this feasible without writing a full-fledged program of my own to do it, using some generic programming language parsing mechanism like tree-sitter? More specifically, can I do this with shell tools or a simple(ish) script?
Note: Only words/tokens outside of comments must be counted. For general word counting I'm sure there are other questions on SO...
Example: Suppose my code is in the C language, and my foo.c file contains
int /* this is
a multi-line
comment!
*/
foo(int x) {
/* comment 1 */
return 123; // comment 2
}
The exact number expected here would depend on whether we think of braces and semicolons as words/tokens to count. If we do, then this should be 11 tokens: int, foo, (, int, x, ), {, return, 123, ;, }. If we ignore them (which I would rather not, but it could still be a legitimate approach) then we have 6 words: int, foo, int, x, return, 123.
File
Consider this concise Perl snippet:
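A minimal sketch of what such a snippet could look like, wrapped in a shell function so it can be reused. The function name `count_tokens_perl` is hypothetical, and the token definition (a run of word characters, or any single punctuation character) is an assumption chosen to match the 11-token count in the question. It reads the whole file at once, removes both comment styles, then counts the matches.

```shell
# Hypothetical helper: count C tokens outside comments with a Perl one-liner.
# Assumption: a token is a run of word characters or one punctuation character.
count_tokens_perl() {
  perl -0777 -ne '
    s{/\*.*?\*/}{ }gs;             # drop /* ... */ comments, even across lines
    s{//[^\n]*}{}g;                # drop // comments up to the end of the line
    my @tokens = /\w+|[^\w\s]/g;   # collect every remaining token
    print scalar(@tokens), "\n";
  ' "$1"
}
```

Calling `count_tokens_perl foo.c` on the example from the question should print 11; changing the token pattern to `/\w+/g` counts words only and should give 6. This is not a real lexer: a multi-character operator such as `==` is counted as two tokens, and comment markers inside string literals are treated as comments.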
With bash:
With grep:
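One way this might be done, assuming GNU grep built with PCRE support (`-P`), which is not available in every grep. The function name `count_tokens_grep` is hypothetical. With `-z` the file is treated as a single record, so the pattern can skip multi-line comments; the `(*SKIP)(*F)` verbs make grep discard whatever the comment alternative matched, and `-o` prints each remaining token followed by a NUL byte.

```shell
# Hypothetical helper: count C tokens outside comments with GNU grep (needs -P).
# Assumption: a token is a run of word characters or one punctuation character.
count_tokens_grep() {
  # First alternative: match a comment, then throw it away with (*SKIP)(*F).
  # Remaining alternatives: a word/number, or a single punctuation character.
  grep -aPzo '(?:/\*[\s\S]*?\*/|//[^\n]*)(*SKIP)(*F)|\w+|[^\w\s]' "$1" |
    tr '\0' '\n' |   # -z terminates each match with NUL; turn that into newlines
    wc -l            # one token per line
}
```

On the `foo.c` example from the question, `count_tokens_grep foo.c` should report 11. The same caveats apply as for any regex-based approach: string literals containing `/*` or `//` are mishandled, and operators such as `->` are counted character by character.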
With awk:
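A rough sketch of an awk version, using only features that common awk implementations share. The function name `count_tokens_awk` and the variable names are hypothetical, and the token definition is again an assumption picked to reproduce the 11-token count. It slurps the file into one string, cuts out the comments, and then repeatedly matches the next token.

```shell
# Hypothetical helper: count C tokens outside comments with awk.
# Assumption: a token is a run of [A-Za-z0-9_] or one punctuation character.
count_tokens_awk() {
  awk '
    { src = src $0 "\n" }          # slurp the whole file into one string
    END {
      # Cut out each /* ... */ comment, leaving a space in its place.
      while ((start = index(src, "/*")) > 0) {
        rest = substr(src, start + 2)
        stop = index(rest, "*/")
        if (stop == 0) {           # unterminated comment: drop the remainder
          src = substr(src, 1, start - 1)
          break
        }
        src = substr(src, 1, start - 1) " " substr(rest, stop + 2)
      }
      gsub(/\/\/[^\n]*/, "", src)  # remove // comments up to end of line
      n = 0
      # Consume the string one token at a time.
      while (match(src, /[A-Za-z0-9_]+|[^A-Za-z0-9_ \t\r\n]/)) {
        n++
        src = substr(src, RSTART + RLENGTH)
      }
      print n
    }
  ' "$1"
}
```

For the example in the question, `count_tokens_awk foo.c` should print 11. Repeatedly trimming the string is quadratic in the worst case, which is fine for ordinary source files but not for very large inputs.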