How to keep punctuation in Elasticsearch's Thai tokenizer


I'm working with Elasticsearch 7.17.1 to analyze Thai text. My goal is to tokenize Thai text while also retaining punctuation as separate tokens. However, I've encountered a challenge: the default behavior of most Elasticsearch analyzers, including the Thai tokenizer, is to discard punctuation, and I haven't found a way to configure them to do otherwise.

I attempted to create a custom analyzer in hopes of achieving this, but so far, I've had no success. Below is my latest attempt:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "thai_with_punctuation": {
          "tokenizer": "thai",
          "filter": ["punctuation_filter"]
        }
      },
      "filter": {
        "punctuation_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([\\p{Punct}])"
          ]
        }
      }
    }
  }
}
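As far as I can tell, the punctuation is discarded by the tokenizer itself, before any token filter runs. Testing the `thai` tokenizer in isolation (no filters) should confirm this:

```json
POST _analyze
{
  "tokenizer": "thai",
  "text": "(เปิด) ไม่ เป็น"
}
```

If the parentheses are already missing from this output, then no downstream `pattern_capture` filter can bring them back, since token filters only ever see the tokens the tokenizer emits.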

When analyzing text with the custom analyzer:

POST /my_thai_index/_analyze
{
  "analyzer": "thai_with_punctuation",
  "text": "(เปิด) ไม่ เป็น??? ??lol."
}

The response contains no punctuation tokens:

{
    "tokens": [
        {
            "token": "เปิด",
            "start_offset": 1,
            "end_offset": 5,
            "type": "word",
            "position": 0
        },
        {
            "token": "ไม่",
            "start_offset": 7,
            "end_offset": 10,
            "type": "word",
            "position": 1
        },
        {
            "token": "เป็น",
            "start_offset": 11,
            "end_offset": 15,
            "type": "word",
            "position": 2
        },
        {
            "token": "lol",
            "start_offset": 21,
            "end_offset": 24,
            "type": "word",
            "position": 3
        }
    ]
}

Retaining punctuation is crucial for my application: another part of the system adjusts its behaviour based on the punctuation in the text.
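One direction I'm considering is a multi-field mapping (just a sketch; `content` and `punct` are placeholder names): keep the standard `thai` analyzer on the main field, and add a sub-field whose `pattern` tokenizer emits only the punctuation. The tokenizer's `group` parameter makes the regex matches themselves the tokens, rather than split delimiters:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "punctuation_only": {
          "tokenizer": "punct_tokenizer"
        }
      },
      "tokenizer": {
        "punct_tokenizer": {
          "type": "pattern",
          "pattern": "(\\p{Punct})",
          "group": 1
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "thai",
        "fields": {
          "punct": {
            "type": "text",
            "analyzer": "punctuation_only"
          }
        }
      }
    }
  }
}
```

This would keep the punctuation information queryable via `content.punct`, but it loses the interleaving between word tokens and punctuation tokens, so I'm not sure it satisfies the downstream requirement.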

Is there a workaround or a different approach to achieve this without creating a custom Elasticsearch plugin?


There are 0 answers