Supporting LZ4 compression with JSON format - google-bigquery

It's more of a feature request than a question.
LZ4 is one of the fastest known LZ77 compressors today and is especially efficient when decompressing data. Its compression ratio is not as high as gzip's, but it is very good on text data. It is so fast that it can actually save user time when writing a file to a persistent disk, compared to a raw write, in I/O-scarce environments like Google Cloud.
It could also save lots (lots!) of CPU cycles for Google and its clients if you supported LZ4 for uploading data in addition to gzip.
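For a sense of that trade-off, here is a minimal sketch comparing gzip and LZ4 on newline-delimited JSON, assuming the third-party lz4 Python package (pip install lz4). Since BigQuery does not accept LZ4 uploads (hence this request), it is purely illustrative:

```python
# A minimal sketch comparing gzip and LZ4 on newline-delimited JSON.
# Assumes the third-party "lz4" package is installed (pip install lz4).
import gzip
import json
import time

import lz4.frame

# Build some repetitive JSON text, the kind of data the question is about.
rows = [{"id": i, "name": f"user_{i}", "active": i % 2 == 0} for i in range(100_000)]
payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

for name, compress in (("gzip", gzip.compress), ("lz4", lz4.frame.compress)):
    start = time.perf_counter()
    blob = compress(payload)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(blob) / len(payload):.2%} of original size in {elapsed:.3f}s")
```

On typical text data LZ4 trades a larger output for much lower CPU time, which is the point being made above.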

This question is a feature request, which would belong in the Cloud Platform Public Issue Tracker [1]. As a Stack Overflow question, though, it's somewhat out of place. I encourage you to file a feature request explaining your use case in the public issue tracker, while asking specific technical questions on Stack Overflow. If you have any questions, you should refer to the Help Center [2] [3] here on Stack Overflow, or check out the kinds of issues and feature requests that get reported in the App Engine public issue tracker [4], for example.

Related

Is ManifoldCF a good option for Google Drive indexing?

I am using the Apache ManifoldCF open source project to index documents from Google Drive into my Solr instance. I have often found it quite inconsistent in indexing the data. It also takes time to reflect even a small number of documents in Solr. Do you really think it is a good option for indexing Google Drive?
It is currently a bit on the slow side, due to response-time and throttling constraints from Google Drive itself. This limit can probably be relieved if you buy additional bandwidth from Google. With the current setup, if you are looking to index a large set of documents in Google Drive, it may not be as quick as you expect.
ManifoldCF is good for crawling through a file system. You can go for Apache Nutch if you are interested in web crawling.
Yes, ManifoldCF does take a lot of time to reflect even a small number of documents. It also has very little documentation. However, you can join the mailing list, where you can ask questions of the lead developer, Karl. He is very helpful and usually answers within a few hours.
P.S.: I have worked with ManifoldCF on a project for a span of 10 months.

Concurrent page request comparisons

I have been hoping to find out what different server setups equate to, in theory, for concurrent page requests, and the answer always seems to be soaked in voodoo and sorcery. What is the approximate maximum number of concurrent page requests for the following setups?
apache+php+mysql (1 server)
apache+php+mysql+caching (like memcached or similar; still one server)
apache+php+mysql+caching+dedicated database server (2 servers)
apache+php+mysql+caching+dedicated DB+load balancing (multi webserver / single DB server)
apache+php+mysql+caching+dedicated DB+load balancing (multi webserver / multi DB server)
+distributed (Amazon elastic cloud) -- I know this one is "as much as you can afford", but it would be nice to know when to move to it.
I appreciate any constructive criticism. I am just trying to figure out when it's time to move from one implementation to the next, because each comes with its own implementation feats, either programming-wise or setup-wise.
In your question you talk about caching, and this is probably one of the most important factors in a web architecture with regard to performance and capacity.
Memcache is useful, but actually, before that, you should be ensuring proper HTTP cache directives on your server responses. This does two things: it reduces the number of requests and speeds up server response times (if you have Apache configured correctly). This can be improved further by using an HTTP accelerator like Varnish and a CDN.
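As a concrete illustration of "proper HTTP cache directives", here is a minimal sketch using only the Python standard library; the one-hour max-age is an arbitrary example value, and in practice you would set the equivalent headers in your Apache or Varnish configuration:

```python
# A minimal sketch of serving a response with explicit HTTP cache directives,
# using only the Python standard library. The one-hour max-age is arbitrary.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    body = b"<html><body>cacheable page</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Tells browsers, Varnish and CDNs that this response may be reused
        # for an hour without re-contacting the origin server.
        ("Cache-Control", "public, max-age=3600"),
    ]
    start_response("200 OK", headers)
    return [body]

if __name__ == "__main__":
    make_server("127.0.0.1", 8000, app).serve_forever()
```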
Another factor to consider is whether your system is stateless. Stateless usually means that it doesn't store sessions on the server and reference them with every request. A good systems architecture relies on state as little as possible: the less state, the more horizontally scalable the system. Most people introduce state when confronted with issues of personalisation, i.e. serving up different content for different users. In such cases you should first investigate using HTML5 session storage (i.e. store the complete user data in JavaScript on the client, obviously over HTTPS) or, if the data set is smaller, secure JavaScript cookies. That way you can still serve up cached resources and then personalise with JavaScript on the client.
Finally, your stack includes a database tier, another potential bottleneck for performance and capacity. If you are only reading data from the system, then again it should be quite easy to scale horizontally. If there are both reads and writes, it's typically better to send the writes to one database and serve the reads from another (read-only) database. You can then use more relevant methods to scale each side.
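A minimal sketch of that read/write split, with sqlite3 in-memory connections standing in for a writable primary and a read replica purely so the example runs as-is:

```python
# Sketch of read/write routing: writes go to the primary database, reads go to
# a read-only replica. sqlite3 connections are stand-ins for real servers.
import sqlite3

class ReadWriteRouter:
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def execute(self, sql, params=()):
        # Treat anything that is not a plain SELECT as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        conn = self.replica if is_read else self.primary
        return conn.execute(sql, params)

if __name__ == "__main__":
    primary = sqlite3.connect(":memory:")   # stands in for the writable primary
    replica = sqlite3.connect(":memory:")   # stands in for a read replica
    router = ReadWriteRouter(primary, replica)
    router.execute("CREATE TABLE t (x INTEGER)")   # routed to the primary
    print(router.execute("SELECT 1").fetchone())   # routed to the replica
```

In a real deployment the routing also has to account for replication lag: a write followed immediately by a read may not yet be visible on the replica.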
These setups do not spit out a single answer that you can then compare across options. The answer will vary on far more factors than you have listed.
Even if they did spit out a single answer, it would be just one metric out of dozens. What makes this the most important metric?
Even worse, each of these alternatives is not free. There is engineering effort and maintenance overhead in each of them, which cannot be analysed without understanding your organisation, your app and your cost/revenue structures.
Options like AWS not only involve development effort but may "lock you in" to a solution so you also need to be aware of that.
I know this response is not complete, but I am pointing out that this question touches on a large complicated area that cannot be reduced to a single metric.
I suspect you are approaching this from exactly the wrong end. Do not go looking for technologies and then figure out how to use them. Instead profile your app (measure, measure, measure), figure out the actual problem you are having, and then solve that problem and that problem only.
If you understand the problem and you understand the technology options then you should have an answer.
If you have already done this and the problem is concurrent page requests then I apologise in advance, but I suspect not.

Looking for a lossless image compression API similar to Smush.it

Does anyone know of a lossless image compression API/service similar to Smush.it from Yahoo?
From their own FAQ:
WHAT TOOLS DOES SMUSH.IT USE TO SMUSH IMAGES?
We have found many good tools for reducing image size. Often times these tools are specific to particular image formats and work much better in certain circumstances than others. To "smush" really means to try many different image reduction algorithms and figure out which one gives the best result.
These are the algorithms currently in use:
ImageMagick: to identify the image type and to convert GIF files to PNG files.
pngcrush: to strip unneeded chunks from PNGs. We are also experimenting with other PNG reduction tools such as pngout, optipng, pngrewrite. Hopefully these tools will provide improved optimization of PNG files.
jpegtran: to strip all metadata from JPEGs (currently disabled) and try progressive JPEGs.
gifsicle: to optimize GIF animations by stripping repeating pixels in different frames.
More information about the smushing process is available at the Optimize Images section of Best Practices for High Performance Web pages.
It mentions several good tools. By the way, the very same FAQ mentions that Yahoo will make Smush.it a public API sooner or later so that you can run it on your own. Until then you can just upload images separately to Smush.it here.
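If you want to run the same kind of lossless passes locally rather than through a service, here is a rough sketch that shells out to the tools the FAQ lists; it assumes jpegtran, pngcrush and gifsicle are installed and on your PATH:

```python
# A rough sketch of running the lossless tools the FAQ lists locally.
# Assumes the jpegtran, pngcrush and gifsicle binaries are on PATH.
import subprocess
from pathlib import Path

def smush(path: str) -> Path:
    src = Path(path)
    dst = src.with_name(src.stem + ".min" + src.suffix)
    ext = src.suffix.lower()
    if ext in (".jpg", ".jpeg"):
        # Strip metadata and rewrite as an optimized progressive JPEG.
        cmd = ["jpegtran", "-copy", "none", "-optimize", "-progressive",
               "-outfile", str(dst), str(src)]
    elif ext == ".png":
        # Remove all ancillary chunks from the PNG.
        cmd = ["pngcrush", "-rem", "alla", str(src), str(dst)]
    elif ext == ".gif":
        # Re-optimize GIF frames at the highest optimization level.
        cmd = ["gifsicle", "-O3", str(src), "-o", str(dst)]
    else:
        raise ValueError(f"unsupported image type: {ext}")
    subprocess.run(cmd, check=True)
    return dst
```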
Try Kraken Image Optimizer: https://kraken.io/signup
The developer plan is free but only returns dummy results. You must subscribe to one of the paid plans to use the API; however, the web interface is free and unlimited for images of up to 1 MB.
Find out more in the Kraken documentation.
See this:
http://github.com/thebeansgroup/smush.py
It's a Python implementation of Smush.it that can be run offline to optimise your images without uploading them to Yahoo's service.
As far as I know, the best image compression for me is TinyPNG.
They also have an API: https://tinypng.com/developers
Once you retrieve your key, you can immediately start shrinking images. Official client libraries are available for Ruby, PHP, Node.js, Python and Java. You can also use the WordPress plugin, the Magento 1 extension or improved Magento 2 extension to compress your JPEG and PNG images.
And the first 500 images per month are free.
Tip: when using their API there is no file-size limit (unlike their online tool, which caps each image at 5 MB).
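For example, with the official Python client (pip install tinify), compressing a file looks roughly like this; the API key and file names are placeholders:

```python
# A minimal sketch of compressing an image through the TinyPNG API using the
# official "tinify" Python client (pip install tinify).
import tinify

tinify.key = "YOUR_API_KEY"  # obtained from https://tinypng.com/developers

source = tinify.from_file("unoptimized.png")  # uploads the image for compression
source.to_file("optimized.png")               # downloads the compressed result
```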

Random-access data object in J2ME

I'm planning to develop a small J2ME utility for viewing local public transport schedules using a mobile phone. The data part for those is mostly a big bunch of numbers representing the times when the buses arrive or leave.
What I'm trying to figure out is the best way to store that data. The representation needs to:
be considerably small (because of the persistent storage limitations of a mobile phone)
fit into a constant number of files (for the ease of updating the schedule database afterwards over HTTP), i.e. (routes.dat, times.dat, ..., agencies.dat), and not (schedule_111.dat, schedule_112.dat, ...)
have random-access ability (deserializing the whole data object into memory would be just too much for a mobile phone :))
if there's some library for accessing that data format, a Java implementation should be present
In other words, if you had to squeeze a big part of GTFS-like data into a mobile device, how would you do that?
Google Protocol Buffers seemed like a good candidate for defining the data, but it doesn't provide random access.
What would you suggest?
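One way to get the random-access requirement without deserializing everything is a fixed-width binary record layout that can be jumped into by index. The sketch below is plain desktop Python, just to illustrate the layout; a J2ME reader would do the same offset arithmetic over an InputStream (skipping bytes) or over RMS records. The field widths and the times.dat name are assumptions:

```python
# Sketch of a fixed-width record layout for departure times: every record has
# the same size, so record i starts at HEADER.size + i * RECORD.size and can
# be read without loading the whole file. Field widths are assumptions.
import struct

RECORD = struct.Struct(">HHH")   # route_id, stop_id, minutes since midnight
HEADER = struct.Struct(">I")     # number of records in the file

def write_times(path, rows):
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(rows)))
        for route_id, stop_id, minutes in rows:
            f.write(RECORD.pack(route_id, stop_id, minutes))

def read_time(path, index):
    # Random access: seek straight to the record instead of parsing the file.
    with open(path, "rb") as f:
        f.seek(HEADER.size + index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))

write_times("times.dat", [(4, 12, 7 * 60 + 15), (4, 13, 7 * 60 + 21)])
print(read_time("times.dat", 1))   # -> (4, 13, 441)
```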
Persistent storage on J2ME is a tricky business; see this related question for more general background: Best practice for storing large amounts of data with J2ME
In my experience, J2ME persistent storage tends to work best/most reliably with many small records rather than a few monolithic ones. Think about how the program is going to want to access the data, then try to store it in those increments in the J2ME persistent store.
I'd generally recommend decoupling your client-server protocol for downloading updates from the on-device storage format. You can change the latter with every code update, but you're pretty much stuck supporting a client-server protocol forever, unless you want to break older clients out in the field.
Finally, I know there are some people on the Transit Developers group who have built offline transit apps in J2ME, so it's worth asking for tips there.
I made an app like this and used XML files generated with PHP. This enabled us to have a single provider for three presentation layers:
the J2ME app
a website for mobile phones
the usual website
We used XSLT to convert the XML to HTML on the websites, and kXML, a very lightweight pull parser, to do it in the J2ME app. This worked well even on very old phones with black-and-white screens and small amounts of memory.
Besides, on J2ME there is no concept of a file; you have the record store in which you can store information.
This is a link to the "mobile" website:
http://mobi.krakow.pl/rozklady/
and here is the app:
http://www.mobi.krakow.pl/rozklady/j2me/rjk.jar
It's in Polish, but I think it's not hard to figure out what's what.
If you want, I can provide you with more help and advice, or if this is a commercial product then I think we can figure something out too ;)
I think your issue is requirement 2.
Updating 10MB of data just because 4 digits changed somewhere in the middle of the file seems highly inefficient.
Splitting the data into several files allows for better update granularity that will be well worth the added code complexity.
Real-time public transport schedules are usually modified one bus/train/tram line at a time.
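To make that concrete, here is a hypothetical sketch of the update flow: the server publishes a small manifest of per-file checksums and the client re-downloads only the files that changed. The URL, file names and manifest format are invented for illustration, and a real J2ME client would use HttpConnection rather than urllib:

```python
# Hypothetical update-granularity sketch: compare a server-side manifest of
# per-file checksums against local copies and fetch only the changed files.
import hashlib
import json
import urllib.request
from pathlib import Path

BASE_URL = "http://example.com/schedules/"   # placeholder server
LOCAL_DIR = Path("schedules")

def local_checksum(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest() if path.exists() else ""

def update():
    LOCAL_DIR.mkdir(exist_ok=True)
    with urllib.request.urlopen(BASE_URL + "manifest.json") as resp:
        manifest = json.load(resp)           # e.g. {"route_4.dat": "9a0364b9..."}
    for name, checksum in manifest.items():
        target = LOCAL_DIR / name
        if local_checksum(target) != checksum:
            with urllib.request.urlopen(BASE_URL + name) as resp:
                target.write_bytes(resp.read())   # only changed routes re-downloaded
```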

ArXiv replication brainstorming

The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. They are looking to reduce bandwidth load using some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.
The full PDF content is in the Amazon cloud.
While there are more than 600k papers on arXiv, the total size of the PDFs is less than 1/2 TB:
http://arxiv.org/help/bulk_data_s3
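For the bulk S3 route, a rough boto3 sketch looks like the following; the arxiv bucket and pdf/ prefix follow the linked help page, and because the bucket is requester-pays the transfer costs land on your own AWS account:

```python
# A rough sketch of listing and fetching arXiv's bulk PDF archives from the
# requester-pays S3 bucket with boto3 (pip install boto3, AWS credentials set up).
import boto3

s3 = boto3.client("s3")

# List a few of the tar archives holding the PDFs.
listing = s3.list_objects_v2(Bucket="arxiv", Prefix="pdf/", RequestPayer="requester")
for obj in listing.get("Contents", [])[:5]:
    print(obj["Key"], obj["Size"])

# Download one archive locally; the transfer is billed to the requester.
first_key = listing["Contents"][0]["Key"]
s3.download_file("arxiv", first_key, "arxiv_pdfs.tar",
                 ExtraArgs={"RequestPayer": "requester"})
```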
arXiv recommends Squid in httpd accelerator mode for precisely this purpose. Any particular reason why that is not good enough?
My first idea is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well it works with PDFs, though.