In Redis, a value can have a maximum length of 512 MB. According to this link, the limit can be increased, but the procedure for increasing it isn't described properly. Can anyone give the steps to increase the maximum value size? I am using Redis version 5.
The link is to an issue that discusses the possibility of increasing the maximum string size in Redis when you build it from source.
There is a link to a patch that does just that (https://github.com/antirez/redis/pull/4568) in one of the comments. Feel free to apply the patch and build your own Redis version that supports large values.
But there appears to be no way to just “configure” this without building Redis from source. I'm not sure what you need these large values for, but beware of building from source and deploying forked/untested code to production, especially if you don't feel confident in your understanding of the changes, the build process, and the process of keeping up with upstream changes afterwards.
Problem
Currently I have an Elasticsearch cluster that is running out of file descriptors. Checking the Elasticsearch setup documentation page, I saw that it is recommended to set the number of file descriptors on the machine to 32K or even 64K, and digging a bit through search results I found some people who set this limit even higher (128K or unlimited).
The exception I'm getting is typical of file descriptor exhaustion:
Caused by: org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component
Question
Is there an equation for the number of file descriptors we should expect Elasticsearch / Lucene to require, based on the number of indexes, shards, replicas and/or documents? Or even for the number of files across all Elasticsearch indexes?
I wouldn't like to set it by trial and error, and an unlimited number of file descriptors isn't possible in my situation.
I know very little about Elasticsearch, but I will try to answer this from a Lucene perspective.
I'm afraid there's no easy way of finding out how many descriptors you really need.
First, this depends on the Directory implementation (which itself depends on the underlying OS, if you use FSDirectory.open(File)).
Secondly, it also depends on your merge policy (which may depend on the Lucene version, unless Elasticsearch overrides it).
Finally, it can even depend on various exotic circumstances, such as garbage collection behaviour (if certain bits depend on finalizers to free resources). We even had an instance of Lucene which was leaking file descriptors until we manually switched -d64 mode on.
That said, I would recommend setting up a monitoring script that gathers some stats over a week or so, and coming up with a range that fits your typical usage. Add some headroom for unexpected cases.
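For example, here is a rough sketch of such a script, assuming a Linux host, the psutil package, and that matching the process by looking for "elasticsearch" on the command line is good enough for your setup:

    #!/usr/bin/env python
    # Sketch only: sample the open file descriptor count of the Elasticsearch
    # process(es) once a minute and append it to a log, so you can look at the
    # observed range after a week or so.
    import time
    import psutil

    def es_processes():
        # Hypothetical matching rule: anything with "elasticsearch" on its command line.
        for p in psutil.process_iter(['cmdline']):
            cmdline = ' '.join(p.info['cmdline'] or [])
            if 'elasticsearch' in cmdline.lower():
                yield p

    with open('es_fd_usage.log', 'a') as log:
        while True:
            for p in es_processes():
                try:
                    # num_fds() reports the number of open file descriptors (Linux/OSX only)
                    log.write('%d %d %d\n' % (time.time(), p.pid, p.num_fds()))
                except (psutil.NoSuchProcess, psutil.AccessDenied):
                    pass
            log.flush()
            time.sleep(60)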
P.S. I am struggling to imagine a case these days where file descriptors would be a genuine problem. Is this a C10K problem? Can you elaborate on this?
I have been using PyMC in an analysis of some high energy physics data. It has worked to perfection, the analysis is complete, and we are working on the paper.
I have a small problem, however. I ran the sampler with the RAM database backend. The traces have been sitting around in memory in an IPython kernel process for a couple of months now. The problem is that the workstation support staff want to perform a kernel upgrade and reboot that workstation. This will cause me to lose the traces. I would like to keep these traces (as opposed to just generating new), since they are what I've made all the plots with. I'd also like to include a portion of the traces (only the parameters of interest) as supplemental material with the publication.
Is it possible to take an existing chain in a pymc.MCMC object created with the RAM backend, change to a different backend, and write out the traces in the chain?
The trace values are stored as NumPy arrays, so you can use numpy.savetxt to send the values of each parameter to a file. (This is what the text backend does under the hood.)
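For instance, something along these lines should work, assuming your sampler object is called M and that 'alpha' and 'beta' are the (hypothetical) names of the parameters you care about:

    import numpy as np

    # Sketch only: pull each traced parameter out of the in-memory (RAM) backend
    # and write it to a plain-text file, one sample per line.
    for name in ['alpha', 'beta']:      # hypothetical parameter names
        values = M.trace(name)[:]       # trace values come back as a NumPy array
        np.savetxt('%s_trace.txt' % name, values)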
While saving your current traces is a good idea, I'd suggest taking the time to make your analysis repeatable before publishing.
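As a sketch of what that could look like with PyMC 2, assuming your model lives in a module called mymodel (hypothetical name): fix the random seed and use a database backend that writes to disk instead of RAM, so the traces survive a reboot and the run can be reproduced.

    import numpy as np
    import pymc
    import mymodel  # hypothetical module containing your model definition

    np.random.seed(42)  # fixed seed so the run can be repeated
    M = pymc.MCMC(mymodel, db='pickle', dbname='traces.pickle')
    M.sample(iter=100000, burn=50000, thin=10)
    M.db.close()        # traces now live on disk rather than in an IPython kernel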
Ayende, my emails are not being delivered to your mailing list, so I'll ask here; maybe someone else has a solution to my problem.
I'm testing RavenDB again and again :) and I think I found a little bug. On your documentation page I read:
Raven/MaxNumberOfParallelIndexTasks
The maximum number of indexing tasks allowed to run in parallel. Default: the number of processors in the current machine.
But despite that, it looks like RavenDB is using only one core for indexing tasks, and with a single core it takes too long to finish indexing a large dataset. I tried overriding that configuration and set MaxNumberOfParallelIndexTasks to 3, but it still uses only a single core.
Take a look at this screenshot: http://dl.dropbox.com/u/3055964/Untitled.gif
CPU utilization is only at 25%, and I have a quad-core processor. I didn't modify the affinity mask.
Am I doing something wrong, or have I come across a bug?
Thanks
Davita,
I fixed the mailing list issue.
The problem you are seeing is most likely that only one index has work to do. The work for a single index is always done on a single CPU.
We spread indexing work across multiple CPUs at index boundaries.
My Rails application keeps hitting the disk I/O rate threshold set by my VPS at Linode. It's set at 3000 (I bumped it up from 2000), and every hour or so I get a notification that it has reached 4000-5000+.
What are the methods that I can use to minimize the disk IO rate? I mostly use Sphinx (Thinking Sphinx plugin) and Latitude and Longitude distance search.
What are the methods to avoid?
I'm using Rails 2.3.11 and MySQL.
Thanks.
Did you check whether your server is swapping itself to death? What does "top" say?
Your Linode may have limited RAM, and it is quite likely that it is swapping like crazy to keep things running.
If you see red in the IO graph, that is swapping activity! You need to upgrade your Linode to more RAM, or limit the number/size of the processes that are running. You should also add approximately 2x the RAM size as swap space (swap partition).
http://tinypic.com/view.php?pic=2s0b8t2&s=7
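If you'd rather check from a script than eyeball top, here is a quick sketch using the psutil package (assuming it is installed on the box):

    import psutil

    # Quick check of memory pressure; large and growing swap usage tells the
    # same story as red in the Linode IO graph: the box is short on RAM.
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print('RAM used:  %.0f%%' % mem.percent)
    print('Swap used: %.0f%% (sin=%d bytes, sout=%d bytes)'
          % (swap.percent, swap.sin, swap.sout))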
Your question is too vague to answer precisely, but high disk I/O is generally a sign of one of a few things:
Your data set is too large because of historical data that you could prune. Delete what is no longer relevant.
Your tables are not indexed properly and you are hitting a lot of table scans. Check each of your slow queries with EXPLAIN (see the sketch after this list).
Your data structure is not optimized for the way you are using it, and you are doing too many joins. Some tactical de-normalization would help here. Make sure all your JOIN queries are strictly necessary.
You are retrieving more data than is required to service the request. It is, sadly, all too common that people load enormous TEXT or BLOB columns from a user table when displaying only a list of user names. Load only what you need.
You're being hit by some kind of automated scraper or spider robot that's systematically downloading your entire site, page by page. You may want to alter your robots.txt if this is an issue, or start blocking troublesome IPs.
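Purely as an illustration of that EXPLAIN check (you would normally just run it from the mysql console; the table and column names below are made up), here is a small Python sketch using the PyMySQL driver:

    import pymysql

    # Illustrative only: run EXPLAIN on a suspect query and inspect the plan.
    conn = pymysql.connect(host='localhost', user='rails', password='secret',
                           database='app_production')
    with conn.cursor() as cur:
        cur.execute("EXPLAIN SELECT * FROM locations "
                    "WHERE latitude BETWEEN %s AND %s", (40.0, 41.0))
        for row in cur.fetchall():
            # 'ALL' in the type column means a full table scan -- a likely
            # candidate for a missing index.
            print(row)
    conn.close()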
Is it going high and staying high for a long time, or is it just spiking temporarily?
There aren't going to be specific methods to avoid (other than not writing to disk).
You could try using a profiler in production, such as New Relic, to get more insight into your performance. A profiler will highlight the actions that are taking a long time, and when you examine the specific algorithm you're using in each of those actions, you might discover what's inefficient about it.
Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(They can't modify a file that has already been written; only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of the design?
P.S. I know Hadoop will support modifying data that has already been written.
But they say its performance will not be very good. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation is going to have to find each replicated section across the network and update the file. This will heavily increase the time for the operation. Updating the file could push it over the block size and require the file to be split into 2 blocks, and then the 2nd block to be replicated. I don't know the internals and when/how it would split a block... but it's a potential complication.
What if a job that already did an update fails or gets killed and is re-run? It could update the file multiple times.
The advantage of not updating files in a distributed system is that you don't have to know who else is using the file when you update it, or where all the pieces are stored. There are potential timeouts (the node holding a block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop; an update with a node down might be handled, this is just something I'm brainstorming).
There are a lot of potential issues with updating files on HDFS (a few laid out above). None of them are insurmountable, but checking for and accounting for them would come with a performance hit.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move data around; instead, you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking for ways to implement something more like a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.