Text file normalization and pattern matching

I have a file written in a kind of metalanguage that describes the procedure needed to validate some data, and I need to generate validation functions from it. The data are already stored in a structure.

The steps I take:

  1. Split the text into a string[], using delimiters such as ' ', '.', ',', ';', '==', '>=' (see the sketch after this list)
  2. Remove articles, prepositions, and other stop words
  3. Normalize the text (how?)
  4. Match words to tokens using Regex or plain text matching
  5. Match a pattern using the token types
  6. Generate functions based on the matched pattern rule
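
For reference, here is a minimal C# sketch of step 1 (C# is assumed from the string[] and Regex mentions; the delimiter set is illustrative). Multi-character operators such as == and >= are listed before the single characters so they survive as whole tokens rather than being split apart:

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Tokenizer
    {
        // Captured delimiters are returned as tokens by Regex.Split;
        // the un-captured whitespace alternative is discarded.
        static readonly Regex TokenPattern = new Regex(@"(==|>=|<=|[.,;()])|\s+");

        public static string[] Tokenize(string line) =>
            TokenPattern.Split(line)
                .Where(t => !string.IsNullOrWhiteSpace(t))
                .ToArray();
    }

    // Tokenize("age >= 18; name == valid")
    //   -> ["age", ">=", "18", ";", "name", "==", "valid"]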


What would you use in step 3, or in general, to improve this procedure?

1 Answer

Quinn:

As Wikipedia puts it, regular expressions are one technique for achieving text normalization:

For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions would suffice. For example, the sed script sed -e "s/\s+/ /g" inputfile would normalize runs of whitespace characters into a single space. More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text[5] and as a special case of machine translation.[6][7]
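
In the same spirit, here is a short C# sketch (assuming the question's .NET context) that collapses whitespace the way the quoted sed script does, and additionally strips diacritical marks via Unicode decomposition, the other context-independent normalization the quote mentions:

    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;

    static class Normalizer
    {
        // Equivalent of the quoted sed script: collapse runs of
        // whitespace into a single space.
        public static string CollapseWhitespace(string s) =>
            Regex.Replace(s, @"\s+", " ").Trim();

        // Decompose to FormD, drop the combining marks, recompose,
        // and lower-case; turns e.g. "Validé" into "valide".
        public static string StripDiacritics(string s)
        {
            var sb = new StringBuilder();
            foreach (var c in s.Normalize(NormalizationForm.FormD))
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    sb.Append(c);
            return sb.ToString().Normalize(NormalizationForm.FormC).ToLowerInvariant();
        }
    }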

It seems to me that the data involves linguistic annotations. You could look at tools like the IMS Open Corpus Workbench (CWB). There is also another site (with sample code) that you may find useful: What Is Text Normalization?
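
For steps 4-6, once tokens are typed, each recognized pattern can be compiled into a delegate. This is only a hypothetical sketch (the field-operator-value rule shape and all names are assumptions, since the question does not show the metalanguage):

    using System;
    using System.Collections.Generic;

    static class RuleCompiler
    {
        // Turn one recognized "field operator value" pattern into a
        // reusable validation function over the data structure.
        public static Func<IDictionary<string, double>, bool> Compile(
            string field, string op, double value) =>
            data => op switch
            {
                ">=" => data[field] >= value,
                "<=" => data[field] <= value,
                "==" => data[field] == value,
                _    => throw new NotSupportedException(op)
            };
    }

    // var adult = RuleCompiler.Compile("age", ">=", 18);
    // adult(new Dictionary<string, double> { ["age"] = 21 })  -> true

Compiling to closures like this keeps step 6 simple; for a richer metalanguage you would build a small AST or expression tree instead.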