Elasticsearch node out of space due to large index size - indexing

I have run out of space on the machine that is running my Graylog server.
A lot of the space is taken up with files in the
/var/lib/elasticsearch/graylog2/nodes/0/indices/graylog2_0/0/index folder.
Is it safe to remove the files in this folder?
Is this a problem with Elasticsearch?
How can I prevent this happening in the future?
Thanks,
Seán

I have recently gone through this problem. I have a few suggestions.
Is it safe to remove the files in this folder?
No, it's not a safe way to free up space. Rather, as @Val suggested, you can delete old indices using Curator.
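For reference, Curator ultimately issues the same index deletion that the plain REST API exposes. A minimal Python sketch, assuming a local Elasticsearch endpoint and a hypothetical old index name; only delete indices that the Graylog deflector no longer points to:

import requests

ES = "http://localhost:9200"  # adjust to your Elasticsearch host

# Deleting an index removes its segment files from
# /var/lib/elasticsearch/.../indices/<index>/ and frees the disk space.
OLD_INDEX = "graylog2_0"  # hypothetical: use a rotated-out index, not the active one
resp = requests.delete(f"{ES}/{OLD_INDEX}")
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}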
But if you don't have older indices to remove, I have a few suggestions.
Solution 1) If your index has replicas and you have multiple Elasticsearch nodes, you can drop the replicas using the live cluster settings and re-route the shards onto the other nodes. Once the routing is done and space has been freed, you can increase the replica count again (see the sketch after Solution 2).
Solution 2) Add a new Elasticsearch node.
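A minimal sketch of the Solution 1 settings calls, assuming the index name from the question and a default local endpoint; the replica count is a live index setting, so no restart is needed:

import requests

ES = "http://localhost:9200"  # adjust to your Elasticsearch host
INDEX = "graylog2_0"

# Drop the replicas to free space on the full node.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 0}}).raise_for_status()

# Later, once the shards have re-routed and disk has been freed,
# raise the replica count back to its previous value.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 1}}).raise_for_status()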
Is this a problem with Elasticsearch?
No, it's not an Elasticsearch problem; you simply have a large amount of data being indexed into your Elasticsearch cluster.
How can I prevent this happening in the future?
Set a monitoring alert on your Elasticsearch server, for example at a 90% disk-usage limit. If usage rises above 90%, it should send you an alert message.
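If you don't have a monitoring stack in place, even a small cron-driven script covers the 90% check. A minimal sketch using only the Python standard library; the path and threshold are just examples:

import shutil

DATA_PATH = "/var/lib/elasticsearch"  # example path from the question
THRESHOLD = 0.90                      # alert above 90% usage

usage = shutil.disk_usage(DATA_PATH)
used_fraction = usage.used / usage.total

if used_fraction > THRESHOLD:
    # Replace this print with mail/Slack/PagerDuty or whatever you use for alerts.
    print(f"ALERT: {DATA_PATH} is {used_fraction:.0%} full")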

Related

Is Infinispan an improvement of JBoss Cache?

According to this link from the JBoss documentation, I understood that Infinispan is a better product than JBoss Cache and a kind of improvement on it, which is why they recommend migrating from JBoss Cache to Infinispan, which is also supported by JBoss. Am I right in what I understood? If not, what are the differences?
One more question: talking about replication and distribution, can either one be better than the other depending on the need?
Thank you
Question:
Talking about replication and distribution, can either one be better than the other depending on the need?
Answer:
I am taking this reference directly from Clustering modes - Infinispan:
Distributed:
Number of copies represents the tradeoff between performance and durability of data
The more copies you maintain, the lower performance will be, but also the lower the risk of losing data due to server outages
Uses a consistent hash algorithm to determine where in the cluster entries should be stored (see the sketch after this list).
No need to replicate data to every node, which takes more time than just computing the hash.
Best suited when the number of nodes is high.
Best suited when the amount of data stored in the cache is large.
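To make the consistent-hash point concrete, here is a small illustrative Python sketch of picking the owner nodes for a key on a hash ring. It is not Infinispan's actual implementation, just the general idea of why no full replication is needed:

import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, virtual_points=100):
        # Each node gets several points on the ring to spread keys evenly.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_points)
        )
        self.hashes = [h for h, _ in self.ring]

    def owners(self, key, num_owners=2):
        # Walk clockwise from the key's position and collect distinct nodes
        # (primary owner plus backups); only those nodes store the entry.
        start = bisect.bisect(self.hashes, _hash(key)) % len(self.ring)
        found, i = [], start
        while len(found) < num_owners:
            node = self.ring[i % len(self.ring)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found

ring = ConsistentHashRing(["node1", "node2", "node3", "node4"])
print(ring.owners("some-cache-key"))  # e.g. ['node3', 'node1']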
Replicated:
Entries added to any of these cache instances will be replicated to all other cache instances in the cluster
This clustered mode provides a quick and easy way to share state across a cluster
Replication practically only performs well in small clusters (under 10 servers), because the number of replication messages that need to happen grows as the cluster size increases.
Practical Experience:
We are using the Infinispan cache in a live application running on a JBoss server with 8 nodes. Initially I used a replicated cache, but it took much longer to respond due to the large size of the data. We eventually went back to distributed mode, and now it is working fine.
Use a replicated or distributed cache only for data specific to a user session. If data is common regardless of the user, prefer a local cache that is created separately on each node.

how to keep visited urls and maintain the job queue when writing a crawler

I'm writing a crawler. I keep the visited URLs in a Redis set and maintain the job queue using a Redis list. As the data grows, memory is used up; my machine has 4G of memory. How can I maintain these without Redis? I have no idea; if I store them in files, they also need to fit in memory.
If I use MySQL to store them, I think it may be much slower than Redis.
I have 5 machines with 4G of memory each; if anyone has some material on setting up a Redis cluster, that would also help a lot. I have material on setting up a failover cluster, but what I need is a load-balanced cluster.
thx
If you are just doing the basic operations of adding/removing from sets and lists, take a look at twemproxy/nutcracker. With it you can use all of the nodes.
Regarding your usage pattern itself, are you removing or expiring jobs and URLs? How much repetition is there in the system? For example, are you repeatedly crawling the same URLs? If so, perhaps you only need a mapping of URLs to their last crawl time, and instead of a job queue you pull URLs that are new or outside a given window since their last run.
Without the details on how your crawler actually runs or interacts with Redis, that is about what I can offer. If memory grows continually, it likely means you aren't cleaning up the DB.
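To illustrate the last-crawl-time idea, here is a minimal redis-py sketch, assuming a local Redis instance and hypothetical key names. A sorted set of URL-to-last-crawl-timestamp replaces the ever-growing visited set, and jobs are only queued when a URL is new or outside the re-crawl window:

import time
import redis

r = redis.Redis(host="localhost", port=6379)

RECRAWL_WINDOW = 24 * 3600  # example: re-crawl a URL at most once a day

def schedule(url):
    # Queue a URL only if it was never crawled or its last crawl is stale.
    last = r.zscore("last_crawl", url)
    if last is None or time.time() - last > RECRAWL_WINDOW:
        r.rpush("jobs", url)

def next_job():
    # Blocking pop from the job queue; returns None if nothing arrives in 5s.
    item = r.blpop("jobs", timeout=5)
    return item[1].decode() if item else None

def mark_crawled(url):
    # Storing only the last crawl time keeps memory bounded by the URL count
    # rather than growing with every visit.
    r.zadd("last_crawl", {url: time.time()})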

Prevent Redis from saving to disc

I'm building a Redis db which consumes nearly all of my machine's memory. If Redis starts to save to disc while heavy inserting is going on, the memory consumption is more or less doubled (as described in the documentation). This leads to terrible performance in my case.
I would like to force Redis not to store any data while I'm inserting and then trigger the save manually afterwards. That should solve my problem, but however I configure the "save" setting, at some point in time Redis starts to save to disc. Any hint on how to prevent Redis from doing so?
You can disable saving by commenting out all the "save" lines in your redis.conf.
Alternately, if you don't want to edit any .conf files, run Redis with:
redis-server --save ""
As per the example config (search for save):
It is also possible to remove all the previously configured save
points by adding a save directive with a single empty string argument
like in the following example:
save ""
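You can also flip this at runtime and then trigger the dump yourself once the heavy inserting is done. A minimal redis-py sketch, assuming a local instance:

import redis

r = redis.Redis(host="localhost", port=6379)

# Clear all RDB save points so Redis never snapshots on its own.
r.config_set("save", "")

# ... do the heavy inserting here ...

# Trigger the snapshot manually once inserting is finished.
r.bgsave()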
I would also suggest looking at persistence-only slaves (master/slave replication), i.e. having the slaves persist data instead of the master.
Take a look at this LINK

RavenDB disk storage

I have a requirement to keep the RavenDB database running when the disk for the main database and index storage is full. I know I can provide a drive for index storage with the config option Raven/IndexStoragePath.
But I need to design for the corner case where this disk is full. What is the usual pattern in this situation? One way is to halt all access while shutting the service down, updating the config file programmatically, and then starting the service again, but that is a bit risky.
I am aware of sharding, and this question is not related to that; assume that sharding is enabled and I have multiple shards, and I want to increase storage for each shard by adding a new drive to each. Is there an elegant solution to this?
user544550,
In a disk full scenario, RavenDB will continue to operate, but will refuse to accept further writes.
Indexing will fail as well and eventually mark the indexes as permanently failing.
What is your actual scenario?
Note that in RavenDB, indexes tend to be significantly smaller than the actual data size, so the major cause for disk space utilization is actually the main database, not the indexes.

Solr approaches to re-indexing large document corpus

We are looking for some recommendations around systematically re-indexing in Solr an ever-growing corpus of documents (tens of millions now, hundreds of millions in less than a year) without taking the currently running index down. Re-indexing is needed on a periodic basis because:
New features are introduced around searching the existing corpus that require additional schema fields which we can't always anticipate in advance.
The corpus is indexed across multiple shards. When it grows past a certain threshold, we need to create more shards and re-balance documents evenly across all of them (which SolrCloud does not seem to yet support).
The current index receives very frequent updates and additions, which need to be available for search within minutes. Therefore, approaches where the corpus is re-indexed in batch offline don't really work as by the time the batch is finished, new documents will have been made available.
The approaches we are looking into at the moment are:
Create a new cluster of shards and batch re-index there while the old cluster is still available for searching. New documents that are not part of the re-indexed batch are sent to both the old cluster and the new cluster. When ready to switch, point the load balancer to the new cluster.
Use CoreAdmin: spawn a new core per shard and send the re-indexed batch to the new cores. New documents that are not part of the re-indexed batch are sent to both the old cores and the new cores. When ready to switch, use CoreAdmin to dynamically swap cores.
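For reference, the swap in the second approach is a single CoreAdmin call per shard. A minimal sketch, with placeholder core names and the default Solr port:

import requests

SOLR = "http://localhost:8983/solr"

# Atomically exchange the live core and the freshly re-indexed core.
resp = requests.get(
    f"{SOLR}/admin/cores",
    params={"action": "SWAP", "core": "shard1", "other": "shard1_reindexed", "wt": "json"},
)
resp.raise_for_status()
print(resp.json())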
We'd appreciate it if folks could either confirm or poke holes in either or both of these approaches. Is one more appropriate than the other? Or are we completely off? Thank you in advance.
This may not be applicable to you guys, but I'll offer my approach to this problem.
Our Solr setup is currently a single core. We'll be adding more cores in the future, but the overwhelming majority of the data is written to a single core.
With this in mind, sharding wasn't really applicable to us. I looked into distributed searches - cutting up the data and having different slices of it running on different servers. This, to me, just seemed to complicate things too much. It would make backup/restores more difficult and you end up losing out on certain features when performing distributed searches.
The approach we ended up going with was a very simple clustered master/slave setup.
Each cluster consists of a master database, and two solr slaves that are load balanced. All new data is written to the master database and the slaves are configured to sync new data every 5 minutes. Under normal circumstances this is a very nice setup. Re-indexing operations occur on the master, and while this is happening the slaves can still be read from.
When a major re-indexing operation is happening, I remove one slave from the load balancer and turn off polling on the other. So the customer-facing Solr database is now not syncing with the master, while the other one is being updated. Once the re-index is complete and the offline slave database is in sync, I add it back to the load balancer, remove the other slave database from the load balancer, and re-configure it to sync with the master.
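If the slaves use Solr's built-in replication, the polling toggle in the step above can be done over HTTP through the ReplicationHandler. A minimal sketch, with a placeholder slave host and core name:

import requests

SLAVE = "http://slave1:8983/solr/mycore"  # placeholder host and core

def set_polling(enabled):
    # Enable or disable this slave's periodic poll of the master index.
    command = "enablepoll" if enabled else "disablepoll"
    resp = requests.get(f"{SLAVE}/replication",
                        params={"command": command, "wt": "json"})
    resp.raise_for_status()

set_polling(False)   # before the big re-index on the master
# ... re-index on the master, then let this slave catch up ...
set_polling(True)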
So far this has worked very well. We currently have around 5 million documents in our database and this number will scale much higher across multiple clusters.
Hope this helps!