Update: added the Elasticsearch query generated by Kibana (see the bottom of the question).
I want to perform a unique count on my Elasticsearch cluster. The cluster contains about 50 million records.
I've tried the following methods:
First method
Mentioned in this section:
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
Second method
Mentioned in this section:
Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.
My property mapping
"my_prop": {
"index": "not_analyzed",
"fielddata": {
"format": "doc_values"
},
"doc_values": true,
"type": "string",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
The problem
When I use a unique count on my_prop.hash in Kibana, I receive the following error:
Data too large, data for [my_prop.hash] would be larger than limit
Elasticsearch has a 2g heap size. The above also fails for a single index with 4 million records.
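One way to see how much heap fielddata is actually consuming, and how close the fielddata circuit breaker is to its limit (a diagnostic sketch, assuming Elasticsearch 1.x and a node reachable on localhost:9200):

# Fielddata memory used per node, broken down by field
curl 'localhost:9200/_cat/fielddata?v&fields=my_prop,my_prop.hash'

# Circuit-breaker limits and current estimates per node
curl 'localhost:9200/_nodes/stats/breaker?pretty'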
My questions
- Am I missing something in my configurations?
- Should I scale up my machine? This does not seem like a scalable solution.
ElasticSearch query
The query was generated by Kibana: http://pastebin.com/hf1yNLhE
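For reference, the unique count boils down to a cardinality aggregation along these lines (a simplified sketch, not the exact pastebin request; my_index is a placeholder and the time filter is omitted):

curl -XPOST 'localhost:9200/my_index/_search' -d '{
  "size": 0,
  "aggs": {
    "unique_my_prop": {
      "cardinality": {
        "field": "my_prop.hash"
      }
    }
  }
}'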
That error says you don't have enough memory (more specifically, memory for fielddata) to store all the values of hash, so you need to take them out of the heap and put them on disk, meaning using doc_values.
Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (and, no, the settings from the main field are not inherited by the sub-fields):
"hash": {
  "type": "murmur3",
  "index": "no",
  "doc_values": true
}
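Putting that together with the question's mapping, the sub-field would end up looking something like this (a sketch; my_index and my_type are placeholders, and since doc_values generally cannot be switched on for an already-indexed field, the data usually has to be reindexed for the change to take effect):

curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "properties": {
    "my_prop": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true,
      "fielddata": {
        "format": "doc_values"
      },
      "fields": {
        "hash": {
          "type": "murmur3",
          "index": "no",
          "doc_values": true
        }
      }
    }
  }
}'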