I'm trying to plan a SolrCloud implementation, and given current index sizes from testing, my estimated physical index size for 1 billion documents is roughly 20 terabytes. So far, I've been unable to find a cloud host that can support a single volume of this size. I was hoping somebody could provide some guidance with regard to managing an index this large. Is a 20TB index absurd? Is there something I'm missing with regard to SolrCloud architecture? Most of the guidelines I've seen indicate that the entire index, regardless of shard count, should be replicated on every machine to guarantee redundancy, so every node would require a 20TB storage device. If there's anyone out there who can shed some light, I would greatly appreciate it.
I'm not sure where you read those guidelines.
It is totally normal to keep only a portion of the index in each shard (each shard having one leader, which SolrCloud's older master/slave terminology would call the master, and a number of replicas).
You would need to study how to shard your index, either using the built-in routing based on a hash of the document ID or providing your own routing.
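To make the idea of hash-based routing concrete, here is a minimal sketch of how a document ID can be mapped to one of several shards. It is only an illustration: the function name `shard_for` is made up, and MD5 is used as a stand-in hash because it ships with Python, whereas Solr's default `compositeId` router uses MurmurHash3 and assigns each shard a range of the 32-bit hash space.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard index in [0, num_shards).

    Illustrative only: MD5 is a stand-in for the hash Solr really uses.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Take the first 4 bytes as an unsigned 32-bit hash value.
    h = int.from_bytes(digest[:4], "big")
    # Split the 32-bit hash space into num_shards contiguous ranges
    # and return the index of the range that h falls into.
    return h * num_shards // 2**32


if __name__ == "__main__":
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        print(doc_id, "-> shard", shard_for(doc_id, 8))
```

Because the shard is derived from the ID alone, any node can work out where a document lives without consulting a lookup table, and each shard ends up with roughly an equal slice of the documents.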
Edit: So, if I understand correctly, you are assuming that every node in the cluster must hold either the leader or a replica of EVERY shard, correct? If so, the answer is no. To provide resilience, every shard needs a leader and at least one replica somewhere in the cluster, but a node N can hold nothing at all from shard S, as long as S has its leader and at least one replica on other nodes.
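As a rough back-of-the-envelope sketch of what that means for storage, the snippet below spreads the copies of each shard across nodes and computes how much disk the fullest node needs. The 20 TB total comes from the question; the 10 shards, 2 copies per shard, 10 nodes, and the simple round-robin placement are assumptions chosen for illustration, not how Solr itself decides placement.

```python
def place_replicas(num_shards: int, copies_per_shard: int, num_nodes: int) -> dict:
    """Round-robin placement: copy r of shard s goes to node (s + r) % num_nodes.

    Returns a dict mapping node index -> list of shard indexes stored there.
    Hypothetical helper for illustration only.
    """
    if copies_per_shard > num_nodes:
        raise ValueError("need at least as many nodes as copies per shard")
    placement = {node: [] for node in range(num_nodes)}
    for shard in range(num_shards):
        for copy in range(copies_per_shard):
            # Consecutive offsets guarantee the copies land on distinct nodes.
            placement[(shard + copy) % num_nodes].append(shard)
    return placement


def max_storage_per_node_tb(total_index_tb: float, num_shards: int, placement: dict) -> float:
    """Disk needed on the fullest node, assuming equally sized shards."""
    shard_size_tb = total_index_tb / num_shards
    return max(len(shards) for shards in placement.values()) * shard_size_tb


if __name__ == "__main__":
    placement = place_replicas(num_shards=10, copies_per_shard=2, num_nodes=10)
    print(placement[0])                                   # shards held by node 0
    print(max_storage_per_node_tb(20.0, 10, placement))   # 4.0
```

Under these assumptions each node stores 2 of the 10 shards, so it needs about 4 TB rather than 20 TB, and every shard still survives the loss of any single node.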