Elasticsearch: Indexing .srt files

1
00:02:17,440 --> 00:02:20,375
Hello Bob,

2
00:02:20,476 --> 00:02:22,501
how are you doing today?
...

Consider a standard .srt file, which contains subtitle text together with timestamp information so the client can display each line at the correct interval.

I need to index this text data into Elasticsearch while retaining the timestamp information. I am currently using a custom formatter that includes the timestamps within the sentence. For example:

(137) Hello Bob, how are you doing today? (142)

This indicates that the sentence starts at second 137 and ends at second 142.
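The conversion behind this inline format can be sketched in Python (the function names are illustrative, not from the actual formatter):

```python
def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:02:17,440' to seconds."""
    hours, minutes, rest = ts.split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

def inline_format(start: str, end: str, text: str) -> str:
    """Embed whole-second timestamps around the sentence, as described above."""
    return f"({int(srt_time_to_seconds(start))}) {text} ({int(srt_time_to_seconds(end))})"

print(inline_format("00:02:17,440", "00:02:22,501", "Hello Bob, how are you doing today?"))
# (137) Hello Bob, how are you doing today? (142)
```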

However, I'm not sure if this approach is the best way to handle the timestamps. Any help would be appreciated.


There are 2 answers

Ankit

You can create separate fields for the start and end timestamps, and then use range queries to retrieve the relevant text data. This approach allows for more complex queries and filtering whenever you need to work with the timestamp information.

You could also consider using the Elasticsearch "date" data type for the timestamps, allowing you to perform date-based queries and aggregation on the data.
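As a sketch (the index name, field names, and query value below are placeholders, shown in Kibana Dev Tools console syntax): a mapping with `date` fields for the timestamps, and a range query that finds every subtitle visible at a given instant.

```
PUT subtitles
{
  "mappings": {
    "properties": {
      "text":  { "type": "text" },
      "start": { "type": "date", "format": "HH:mm:ss,SSS" },
      "end":   { "type": "date", "format": "HH:mm:ss,SSS" }
    }
  }
}

GET subtitles/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "start": { "lte": "00:02:18,000" } } },
        { "range": { "end":   { "gte": "00:02:18,000" } } }
      ]
    }
  }
}
```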

input {
  file {
    path => "/path/to/your/file.srt"
    # join each subtitle block (index line, timestamp line, text lines)
    # into a single event; blocks are separated by blank lines
    codec => multiline {
      pattern => "^\s*$"
      negate => true
      what => "next"
      charset => "UTF-8"
    }
  }
}

filter {
  grok {
    # (?m) lets the pattern span the multi-line subtitle block
    match => { "message" => "(?m)%{POSINT:subtitle_index}\s*\n(?<start_timestamp>\d{2}:\d{2}:\d{2},\d{3}) --> (?<end_timestamp>\d{2}:\d{2}:\d{2},\d{3})\s*\n(?<text>.*)" }
  }
  # parse both timestamps into proper date fields; note that mutate's
  # convert option has no "date_time" type, so use the date filter instead
  date {
    match => ["start_timestamp", "HH:mm:ss,SSS"]
    target => "@timestamp"
  }
  date {
    match => ["end_timestamp", "HH:mm:ss,SSS"]
    target => "end_timestamp"
  }
  mutate {
    remove_field => ["message"]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "your_index_name"
    .
    .
    .
  }
}
Paulo

Another way to go is with Filebeat:

filebeat.inputs:
- type: filestream
  id: srt
  paths:
    - /usr/share/filebeat/*.srt
  parsers:
    - multiline:
        type: pattern
        pattern: '^\d+$'
        negate: true
        match: after

processors:
  - dissect:
      tokenizer: "%{index}\n%{start} --> %{stop}\n%{text}"
      field: "message"
      target_prefix: "dissect"
      trim_chars: "\n"
      trim_values: "right"
  - replace:
      fields:
        - field: "dissect.start"
          pattern: "^(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})"
          replacement: "$1 h$2 m$3 s$4 ms"
        - field: "dissect.start"
          pattern: " "
          replacement: ""
        - field: "dissect.stop"
          pattern: "^(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})"
          replacement: "$1 h$2 m$3 s$4 ms"
        - field: "dissect.stop"
          pattern: " "
          replacement: ""
  - decode_duration:
      field: "dissect.start"
      format: "seconds"
  - decode_duration:
      field: "dissect.stop"
      format: "seconds"
      
output.console:
  pretty: true
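To sanity-check what the `replace` and `decode_duration` processors produce, here is a small Python sketch of the same transformation (a regex rewrite into a Go-style duration string, then conversion to seconds). It is an illustration of the pipeline's logic, not Filebeat code:

```python
import re

def srt_to_go_duration(ts: str) -> str:
    """Mirror the 'replace' processors: '00:02:17,440' -> '00h02m17s440ms'."""
    rewritten = re.sub(r"^(\d{2}):(\d{2}):(\d{2}),(\d{3})", r"\1 h\2 m\3 s\4 ms", ts)
    return rewritten.replace(" ", "")

def duration_to_seconds(dur: str) -> float:
    """Mirror decode_duration with format 'seconds' for the h/m/s/ms units used here."""
    h, m, s, ms = re.fullmatch(r"(\d+)h(\d+)m(\d+)s(\d+)ms", dur).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

# '00:02:17,440' ends up as roughly 137.44 seconds, matching the
# (137) ... (142) example from the question
print(duration_to_seconds(srt_to_go_duration("00:02:17,440")))
```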