I'm piping the output of mysqldump into gzip, and gzip seems to lag far behind:
gzip: 34.9MiB 0:01:54 [ 218kiB/s]
mysqldump: 735MiB 0:01:54 [5.73MiB/s]
2 questions:
1. Would this eventually break the pipe if gzip can't keep up? Does the pipe hold all this data in memory?
2. How would I speed up gzip (I already know about the -9 vs. -1 compression levels)?
gzip is CPU-bound, so you can lower the compression level as you said, but you won't gain much speed. Take a look at LZO instead, which is a lot faster (the compression ratio is not as good, but I find it a good trade-off).
You can find a good benchmark here: http://stephane.lesimple.fr/blog/2010-07-20/lzop-vs-compress-vs-gzip-vs-bzip2-vs-lzma-vs-lzma2xz-benchmark-reloaded.html
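As a rough sketch of what that swap looks like (lzop is the command-line front end for LZO; mydb and the file names here are placeholders):
# pipe the dump through lzop instead of gzip; -c writes to stdout
mysqldump --single-transaction mydb | lzop -c > dump.sql.lzo
# decompress later with -d
lzop -dc dump.sql.lzo | mysql mydb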
As for your first question: no, the pipe won't break. The kernel gives the pipe a small, fixed buffer (64 KiB by default on Linux); once it fills up, mysqldump simply blocks until gzip drains it, so nothing piles up in memory or spills to disk. The whole pipeline just runs at gzip's speed.
As for speeding up gzip, you could try pigz, which uses multiple processors/cores.
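A minimal sketch of that, assuming pigz is installed (it is a drop-in replacement for gzip, so only the command name really changes; mydb and the file name are placeholders):
# compress on several cores; -p caps the number of threads
mysqldump --single-transaction mydb | pigz -p 4 > dump.sql.gz
# the output is ordinary gzip, so zcat/gunzip can read it as usual
zcat dump.sql.gz | head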
We are working on choosing a better compression technique. We tried bzip2, but it takes too long to compress.
I don't think there is a direct answer to your question. What is better or right depends on your infrastructure, requirements, and data flow.
You may want to look at "Performance comparison of different file formats and storage engines in the Hadoop ecosystem" or "Hadoop Compression. Choosing compression codec.".
Purely from a speed perspective, Snappy might be worth a try.
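For instance, if the jobs run on Hadoop streaming, something along these lines (the property names are the Hadoop 2.x ones; the jar path, input/output paths, mapper, and reducer are placeholders) turns on Snappy for the intermediate map output:
# hypothetical streaming job with Snappy-compressed map output
hadoop jar hadoop-streaming.jar \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -input /data/in -output /data/out \
  -mapper cat -reducer wc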
I have a directory with ~50k files. Each file has ~700,000 lines. I have written an awk program that reads each line and prints it only if there is an error. Everything runs fine, but it takes a huge amount of time: about 4 days! Is there a way to reduce this time? Can we use multiple cores (processes)? Has anyone tried this before?
awk and gawk will not fix this for you by themselves. There is no magic "make it parallel" switch. You will need to rewrite to some degree:
shard by file - the simplest fix is to run multiple awk processes in parallel, one per file (see the sketch after this list). You will need some sort of dispatch mechanism; Parallelize Bash script with maximum number of processes shows how to write this yourself in shell. It will take more reading, but if you want more features, check out gearman or celery, which should be adaptable to your problem.
better hardware - it sounds like you probably need a faster CPU to make this go faster, but it could also be an I/O issue. Graphs of CPU and I/O from munin or some other monitoring system would help isolate which is the bottleneck here. Have you tried running this job on an SSD-based system? That is often an easy win these days.
caching - there are probably some duplicate lines or files. If there are enough duplicates, it would be helpful to cache the processing in some way: if you calculate the CRC/md5sum of each file and store it in a database, you can compute the md5sum of a new file and skip processing when you have already seen it.
complete rewrite - scaling this with awk is going to get ridiculous at some point. Using some map-reduce framework might be a good idea.
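Here is a minimal sketch of the "shard by file" option, assuming the inputs live under logs/, your script is errors.awk, and you have 8 cores (all of those names are placeholders):
# run one awk per file, at most 8 at a time, writing per-file results
find logs/ -type f -print0 |
  xargs -0 -n 1 -P 8 sh -c 'awk -f errors.awk "$1" > "$1.err"' sh
# gather the non-empty per-file results into a single report
find logs/ -name '*.err' -size +0 -exec cat {} + > all_errors.txt
GNU parallel can replace the xargs line with nicer job control if it is available.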
I wrote a JSON-API in NodeJS for a small project, running behind an Apache webserver. Now I'd like to improve performance by adding caching and compression. Basically, the question is what should be done in NodeJS itself and what is better handled by Apache:
a) The API calls have unique URLs (e.g. /api/user-id/content) and I want to cache them for at least 60 seconds.
b) I want the output to be served gzipped (if the client understands it). NodeJS's HTTP module usually delivers content "chunked". Since I only write the response in one place, is it enough to set the Content-Encoding header and serve the response as one piece so that it can be compressed and cached?
a) I recommend caching, but without a timer; just let the replacement strategy remove entries. I don't know what you are actually serving; maybe caching the actual JSON or its source data would be useful. Here is a simple cache I wrote, including a small unit test, to give you some inspiration:
Simple Cache
b) How big is your JSON data? You would have to compress it yourself, and keep in mind not to do it in a blocking way. You can stream-compress it and deliver it as it is produced. I have never done that with Node, though.
> I wrote a JSON-API in NodeJS for a small project, running behind an
> Apache webserver.
I would just run the API on a different port and not behind Apache (as a proxy?). If you do want a proxy, I would advise you to use NGINX. See Ryan Dahl's slides discussing Apache vs. NGINX (slides 8+). NGINX can also do compression and caching, and it is fast. Maybe you should not compress all your JSON (how big is it? a few KB?). I recommend reading the "Minimum payload size" section of Google's Page Speed documentation (a good read!), which explains this and which I quote below:
Note that gzipping is only beneficial for larger resources. Due to the
overhead and latency of compression and decompression, you should only
gzip files above a certain size threshold; we recommend a minimum
range between 150 and 1000 bytes. Gzipping files below 150 bytes can
actually make them larger.
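You can see that effect for yourself from the shell; gzip's header and trailer alone add roughly 18 bytes, so a tiny body comes out bigger than it went in:
# compare the raw and gzipped sizes of a tiny payload; the gzipped one is larger
printf 'hello world' | wc -c
printf 'hello world' | gzip -c | wc -c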
> Now I'd like to improve performance by adding caching and compression
You could do compression and caching via NGINX (plus memcached), which is going to be very fast. Even better for static files would be a CDN, which is optimized for exactly this purpose. I don't think you should be doing any compressing in node.js itself, although some modules are available through NPM's search (search for "gzip"), for example https://github.com/saikat/node-gzip
For caching I would advise you to have a look at Redis, which is extremely fast. The node.js client library (node_redis) becomes even faster when it uses hiredis (a C binding), so it is worth installing hiredis via npm as well:
npm install hiredis redis
Some benchmarks with hiredis
PING: 20000 ops 46189.38 ops/sec 1/4/1.082
SET: 20000 ops 41237.11 ops/sec 0/6/1.210
GET: 20000 ops 39682.54 ops/sec 1/7/1.257
INCR: 20000 ops 40080.16 ops/sec 0/8/1.242
LPUSH: 20000 ops 41152.26 ops/sec 0/3/1.212
LRANGE (10 elements): 20000 ops 36563.07 ops/sec 1/8/1.363
LRANGE (100 elements): 20000 ops 21834.06 ops/sec 0/9/2.287
> The API calls have unique URLs (e.g. /api/user-id/content) and I want
> to cache them for at least 60 seconds.
You can achieve this caching easily with Redis's SETEX command, and it is going to be extremely fast.
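For illustration, here is the same idea from the command line with redis-cli (the key name and JSON body are made up; in the app you would issue the equivalent setex/get calls through node_redis):
# cache a response body under the request path for 60 seconds
redis-cli SETEX "/api/42/content" 60 '{"user":42,"content":"..."}'
# later requests read it back; the key disappears once the 60-second TTL expires
redis-cli GET "/api/42/content"
redis-cli TTL "/api/42/content"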
OK, since my API has only very basic usage, I'll go with a little in-memory key/value store as a basic cache (based on the inspiration Simple Cache gave me). For this little development experiment, that should be enough. For an API in production use, I'd stick to Alfred's tips.
For the compression I'll use Apache's mod_deflate. It's robust, and I don't need async gzipping at this point. Furthermore, compression settings can be changed without touching the app itself.
Thank you both for your help!
I am trying to speed up my file I/O using MPI-2, but there doesn't appear to be any way to read/write formatted files. Many of my I/O files are formatted for ease of pre- and post-processing.
Any suggestions for an MPI-2 solution for formatted I/O?
The usual answer for using MPI-IO while still producing a portable, sensible file format is to use HDF5 or NetCDF4. There is a real learning curve to both (but also lots of tutorials out there), but the result is portable, self-describing files for which there are a zillion tools for accessing, manipulating, and inspecting the data.
If by "formatted" output you mean plain human-readable text, then as someone who does a lot of this stuff, I wouldn't be doing my job if I didn't urge you to start moving away from that approach. We all more or less start that way, dumping plain text so we can quickly see what's going on, but it's just not a good approach for production runs. The files are bloated and the I/O is far slower (I routinely see a 6x slowdown using ASCII versus binary, partly because you're writing small chunks at a time and partly because of the string conversions), and for what? If so little data is being output that you can feasibly read and understand it all, you don't need parallel I/O; if there are so many numbers that you can't plausibly flip through them all, then what is the text format buying you?
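To give a flavour of the "zillion tools" point: once the data is in HDF5, the command-line utilities that ship with the library can already inspect it without any custom reader (output.h5 is a placeholder file name):
# list the datasets stored in the file
h5ls output.h5
# print the file structure and dataset metadata without dumping the data itself
h5dump -H output.h5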
Following the Yahoo performance team's advice, I decided to enable mod_deflate on Apache. Checking the results (using HTTPWatch), the gzipped responses took on average about 100 milliseconds longer than the non-gzipped ones.
The server is under average load, using less than 5% CPU, and the compression level is set to the minimum.
Have you experienced results like this, or read about it? I very much appreciate any input. Thanks.
What kind of responses are you sending? You won't notice any benefits in compressing certain kinds of binary data, e.g. images, Flash animations and other such assets; GZip works best for text.
Also, compressing data will incur a slight performance overhead on both server and client, but you expected that, right?
I don't think Yahoo's point is that gzipping will be faster. It's that if you look at the marginal cost of bandwidth versus CPU power, you're better off using more CPU if it allows you to use less bandwidth.
I'd agree with Rob that you need to figure out whether the delay is due to Apache serving the file more slowly because it has to run it through compression, or whether it's something else. Just watching the HTTP response is not going to tell you WHY it's slower, just that it is.
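One way to narrow it down from the command line is to time the same resource with and without compression and compare where the time goes; curl's -w timing variables are handy for this (the URL is a placeholder):
# time to first byte and total time without compression
curl -s -o /dev/null -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' http://example.com/page
# same request, advertising gzip support
curl -s -o /dev/null --compressed -w 'ttfb=%{time_starttransfer}s total=%{time_total}s\n' http://example.com/page
If the time to first byte grows by roughly the same 100 ms, the extra cost is on the server side (compression before the first byte goes out); if only the total grows, look at the transfer or client side instead.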