Text file normalization and pattern matching

I have a file written in a kind of metalanguage that describes the procedure needed to validate some data, and I need to generate validation functions from it. The data are already stored in a structure.

The steps I take:

  1. Split the text into a string[], using delimiters such as ' ', '.', ',', ';', '==', '>=' (see the sketch after this list)
  2. Remove articles, prepositions, and other stop words
  3. Normalize the text (how?)
  4. Match words to tokens using Regex or plain text matching
  5. Match a pattern using the token types
  6. Generate functions based on the matched pattern rule
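
For reference, here is a minimal C# sketch of step 1 (C# is assumed from the string[] and Regex mentions; the delimiter set is illustrative). Multi-character operators such as == and >= are listed before the single characters so they survive as whole tokens rather than being split apart:

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Tokenizer
    {
        // Captured delimiters are returned as tokens by Regex.Split;
        // the un-captured whitespace alternative is discarded.
        static readonly Regex TokenPattern = new Regex(@"(==|>=|<=|[.,;()])|\s+");

        public static string[] Tokenize(string line) =>
            TokenPattern.Split(line)
                .Where(t => !string.IsNullOrWhiteSpace(t))
                .ToArray();
    }

    // Tokenize("age >= 18; name == valid")
    //   -> ["age", ">=", "18", ";", "name", "==", "valid"]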


What would you use in step 3, or in general, to improve this procedure?

1 Answer

Quinn:

As Wikipedia puts it, regular expressions are one technique for achieving text normalization:

For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions would suffice. For example, the sed script sed -e "s/\s+/ /g" inputfile would normalize runs of whitespace characters into a single space. More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text[5] and as a special case of machine translation.[6][7]
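
In the same spirit, here is a short C# sketch (assuming the question's .NET context) that collapses whitespace the way the quoted sed script does, and additionally strips diacritical marks via Unicode decomposition, the other context-independent normalization the quote mentions:

    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;

    static class Normalizer
    {
        // Equivalent of the quoted sed script: collapse runs of
        // whitespace into a single space.
        public static string CollapseWhitespace(string s) =>
            Regex.Replace(s, @"\s+", " ").Trim();

        // Decompose to FormD, drop the combining marks, recompose,
        // and lower-case; turns e.g. "Validé" into "valide".
        public static string StripDiacritics(string s)
        {
            var sb = new StringBuilder();
            foreach (var c in s.Normalize(NormalizationForm.FormD))
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    sb.Append(c);
            return sb.ToString().Normalize(NormalizationForm.FormC).ToLowerInvariant();
        }
    }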

It seems to me that the data involves linguistic annotations. You could look at tools like the IMS Open Corpus Workbench (CWB). There is also another site (with sample code) that you may find useful: What Is Text Normalization?
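
For steps 4-6, once tokens are typed, each recognized pattern can be compiled into a delegate. This is only a hypothetical sketch (the field-operator-value rule shape and all names are assumptions, since the question does not show the metalanguage):

    using System;
    using System.Collections.Generic;

    static class RuleCompiler
    {
        // Turn one recognized "field operator value" pattern into a
        // reusable validation function over the data structure.
        public static Func<IDictionary<string, double>, bool> Compile(
            string field, string op, double value) =>
            data => op switch
            {
                ">=" => data[field] >= value,
                "<=" => data[field] <= value,
                "==" => data[field] == value,
                _    => throw new NotSupportedException(op)
            };
    }

    // var adult = RuleCompiler.Compile("age", ">=", 18);
    // adult(new Dictionary<string, double> { ["age"] = 21 })  -> true

Compiling to closures like this keeps step 6 simple; for a richer metalanguage you would build a small AST or expression tree instead.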