Store tesseract OCR traineddata on S3?

I have an app hosted on Heroku. I seek to extract text from various PDFs. I'm currently using tesseract for this.
Since Heroku does not offer much storage space and the .traineddata files are large (and I need all of them), is it possible to somehow store the tessdata language files on S3? I was not able to find any solution for this yet.
All I could find is that I can define the --tessdata-dir PATH, but that expects a local directory.
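Roughly, what I have in mind is downloading the files into such a directory at startup, something like the sketch below (assuming boto3 and pytesseract; the bucket name, language list, and key paths are placeholders):

```python
import os

import boto3
import pytesseract

BUCKET = "my-tessdata-bucket"   # placeholder bucket name
LANGS = ["eng", "deu", "fra"]   # placeholder language list
TESSDATA_DIR = "/tmp/tessdata"  # writable (but ephemeral) on a Heroku dyno

def fetch_traineddata():
    """Download the .traineddata files from S3 into a local directory at startup."""
    os.makedirs(TESSDATA_DIR, exist_ok=True)
    s3 = boto3.client("s3")
    for lang in LANGS:
        target = os.path.join(TESSDATA_DIR, f"{lang}.traineddata")
        if not os.path.exists(target):
            s3.download_file(BUCKET, f"tessdata/{lang}.traineddata", target)

fetch_traineddata()

# Point tesseract at the downloaded files via --tessdata-dir.
text = pytesseract.image_to_string(
    "page.png",  # a page rendered from one of the PDFs
    lang="+".join(LANGS),
    config=f"--tessdata-dir {TESSDATA_DIR}",
)
```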

Sadly, I'm not sure Heroku is a good fit for your needs if you can't make all the data fit within the Heroku slug. Even if you could get it to work, it would be quite a performance hit.
You'd probably be better off setting up Tesseract as an API with its own server(s), then sending whatever you need to that API from Heroku (or moving the entire app over). Depending on the size of the rest of your app and how quickly Tesseract is growing in size, that might just mean Tesseract gets its own Heroku app with absolutely minimal dependencies, or it might mean moving that part of the app to AWS or something.
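For illustration, such a dedicated OCR endpoint could be as small as this sketch (assuming Flask and pytesseract; the route and field names are made up):

```python
# A minimal OCR-as-a-service sketch: run this as its own app
# and POST images to it from the main Heroku app.
from flask import Flask, jsonify, request
from PIL import Image
import pytesseract

app = Flask(__name__)

@app.route("/ocr", methods=["POST"])
def ocr():
    # Expects a multipart upload with an "image" file field,
    # plus an optional "lang" form field (defaults to English).
    image = Image.open(request.files["image"].stream)
    text = pytesseract.image_to_string(image, lang=request.form.get("lang", "eng"))
    return jsonify({"text": text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```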

Related

Slow loading of images from Amazon S3 and CloudFront

I am using Amazon S3 for hosting images. I have a lot of images on my website.
I am also using a CloudFront distribution as a CDN.
The image URLs are fine.
But my images still load slowly compared to some top competitors' websites.
Is there any way to load the images faster?
Thanks
There could be a number of other problems with the images:
Loading too many images on the page. Make sure you lazy-load the images that are not visible on the initial render.
Using the wrong image sizes. This can be fixed by resizing images to the correct size. Also, do not forget about responsive images. You can read more about them here.
Not using next-generation formats. For instance, look at using WebP for Chrome and JPEG 2000 for Safari.
You can use the Lighthouse tool to test your website for all the problems listed above.
Also, it might be worth considering a specialized CDN for images like pixboost.com.
Using a CDN like CloudFront is the first step towards accelerating images. It addresses the challenge of global distribution (your website is hosted in Europe but you have visitors from Australia => images will load from CloudFront's node in Australia and be obviously faster than traveling from Europe). It also helps absorb traffic peaks, for example during sales, Christmas, ...
To go further with image acceleration, you need to work on the images themselves and focus on 2 things:
resize the images to the target size (thumbnail, preview, full size, ...) and have different sizes for different screen sizes.
use image compression algorithms to "shrink" your images. You can use JPEG compression or alternative image formats like WebP, JPEG 2000, JPEG XR, ... These formats usually compress better than JPEG; however, they come with a big limitation: each is only supported by specific browsers. Check caniuse.com for browser support information: https://caniuse.com/#feat=webp
Bottom line, you will end up needing 15-20 versions of the same image to get the maximum optimisation across all browsers, device screen sizes, use cases, ...
There are multiple ways to automate this, for example by using ImageMagick. It's a great lib, but it requires coding and maintenance, as it evolves quite dynamically.
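For example, with Pillow (a comparable library, used here purely for illustration; the widths, quality setting, and file names are placeholders), generating the resized and WebP variants takes only a few lines:

```python
from pathlib import Path
from PIL import Image

# Illustrative target widths (thumbnail, preview, full size).
WIDTHS = [160, 480, 1200]

def make_variants(src: str, out_dir: str = "variants") -> None:
    """Generate resized JPEG and WebP versions of a single source image."""
    Path(out_dir).mkdir(exist_ok=True)
    img = Image.open(src).convert("RGB")
    stem = Path(src).stem
    for width in WIDTHS:
        height = round(img.height * width / img.width)  # preserve aspect ratio
        resized = img.resize((width, height), Image.LANCZOS)
        resized.save(f"{out_dir}/{stem}-{width}.jpg", quality=80)   # baseline JPEG
        resized.save(f"{out_dir}/{stem}-{width}.webp", quality=80)  # next-gen format

make_variants("hero.jpg")
```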
Another option is to use a cloud-based image acceleration and delivery service. These services usually bundle image resizing and CDN delivery together, and can probably get you better CDN pricing, since they negotiate big contracts with multiple CDN vendors.
We use https://cloudimage.io, but there are other great tools out there. Google is your best friend :).
Good luck with accelerating your page, faster images will definitely have a great impact.

Storing NLP Models in Git Repo vs S3?

What is the best way to store NLP models? I have multiple NLP models which are about 800MB in total. My code loads the models into memory at startup. However, I am wondering what the best way to store them is. Should I store them in the git repo and load them directly from the local file system, or should I store them in an external location like S3 and load them from there? What are the advantages/disadvantages of each? Or do people use some other method which I haven't considered?
Do your NLP models need to be version controlled? Do you ever need to revert back to a previous NLP model? If not, storing your artifacts in an S3 bucket is certainly sufficient. If you are planning on storing many NLP models for a long period of time, I also recommend AWS Glacier. Glacier is extremely cost-effective for long-term storage.
Very good question, though very few people pay attention to it.
Here are a few factors I point out:
Cost: the cost of (1) storing the files and (2) bandwidth, i.e. downloading/uploading the resources (models, etc.).
Lazy download: not all the resources are required for running an NLP system. It's a headache for the end user to download many resources that are of no use for their purpose. In other words, the system should (ideally by itself) download any given resource only when it's actually required.
Convenience.
And options are:
S3: The benefit is that once you have it working, it's convenient. The issue is that someone familiar with S3 and Amazon AWS has to monitor the system for failures/payments/etc. And it's often expensive: not only do you pay for the storage space; more importantly, you also pay for bandwidth. If you have resources like word embeddings or dictionaries (in addition to your models), each of which takes a few GB, it's not hard to hit terabytes of bandwidth usage. AI2 uses S3 and they have a simple Scala system for their usage. Their system is "lazy", i.e. your program downloads (and caches) a given resource only when it's required.
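For illustration, that lazy, cached behavior can be as simple as this sketch (assuming boto3; the bucket and key names are placeholders):

```python
import os

import boto3

CACHE_DIR = os.path.expanduser("~/.cache/nlp-resources")
BUCKET = "my-nlp-resources"  # placeholder bucket name

def get_resource(key: str) -> str:
    """Return a local path for a resource, downloading it from S3 only on first use."""
    local_path = os.path.join(CACHE_DIR, key)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        boto3.client("s3").download_file(BUCKET, key, local_path)
    return local_path

# The embeddings file is fetched (and cached) only when this line runs.
embeddings_path = get_resource("embeddings/glove.840B.300d.txt")
```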
Keep it in the repo: checking big binary files into the repo is certainly not a good idea, unless you use Git LFS to keep the big files outside your git history. Even with this, I'm not sure how you'd make programmatic calls to your files; you'd have to have scripts and instructions for users to manually download the files, etc. (which is ugly).
I'm adding these two options too:
Maven dependency: basically package everything in jar files, deploy them, and add them as dependencies. We used to use this, and some people still do (e.g. the StanfordNLP people ask you to add models as a Maven dependency). I personally do not recommend it, mainly because Maven is not designed to handle big resources (sometimes it hangs, etc.). And this approach is not lazy, meaning that Maven downloads EVERYTHING at once at compile/run time (e.g. when trying StanfordCoreNLP for the first time, you HAVE TO download a few gigabytes of files that you might never need, which is a headache). Also, if you're a Java user you know that working with the classpath is a BIGx10 headache.
Your own server: install a file server (like Minio), store your files there, and whenever required, make programmatic calls to the server in your desired language (client APIs are available for different languages on their GitHub page). We've written a convenient Java client for it that might come in handy for you. This gives you the lazy behavior (like S3) while not being expensive (unlike S3); basically you get all the benefits of S3.
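The same lazy pattern against a self-hosted Minio server might look like this (a sketch using the minio Python client; the host, credentials, bucket, and object names are placeholders):

```python
import os

from minio import Minio

client = Minio(
    "models.example.com:9000",  # placeholder host
    access_key="ACCESS_KEY",    # placeholder credentials
    secret_key="SECRET_KEY",
    secure=True,
)

def get_resource(bucket: str, name: str, cache_dir: str = ".cache") -> str:
    """Download an object from Minio only if it is not cached locally yet."""
    local_path = os.path.join(cache_dir, name)
    if not os.path.exists(local_path):
        client.fget_object(bucket, name, local_path)  # fetch and write to local_path
    return local_path

model_path = get_resource("nlp-models", "ner/model.bin")
```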
Just to summarize my opinion: I've tried S3 in the past, and it was pretty convenient, but expensive. Since we have a server that's often idle, we are using Minio and we're happy with it. I'd go with this option if you have a reliable remote server to store your files on.

Post Processing of Resized Image In clustered environment

I've been playing with ImageResizer for a bit now, and I'm trying to do something but having trouble understanding the way to go about it.
Mainly, I would like to stick to the idea of using the pipeline, and not try to cheat it.
So... let's say I have a pretty standard use of ImageResizer, for something like:
giants_logo.jpg?w=280&h=100
The file giants_logo.jpg is requested as a resized version, 'w=280&h=100'.
In a clustered environment, this same request might be served by 3 machines.
All 3 would end up doing the resize, and then storing their cached version in a local folder on disk. I could leverage a shared drive or something, but that has its own limitations.
What I am looking to do is get the processed file and then copy it back up to the DB or S3, where the main images are served from.
My thought is... I might have to write something like DiskCache, but with completely different guts, using the DB or S3 as the back end instead of the file system.
I realize the point of caching is speed, and what I am suggesting negates that aspect... but maybe that's not the case if we layer things.
Anyway, what I am focused on is trying to keep track of the files generated, as well as avoiding processing on multiple servers.
Any thoughts on the route I should look at to accomplish this?
TLDR; When DiskCache actually stops working well (usually between 1 and 20 million unique images), then switch to a CDN (unless it's too expensive), or a reverse proxy (unless your data set is really too huge to be bound by mortal infrastructure).
For petabyte data sets on the cheap when performance isn't king, it's a good plan. But for most people, it's premature. Even users with upwards of 20TB (source images) still use DiskCache. Really. Terabyte drives are cheap.
Latency is the killer.
To make this work you would need a central Redis server. MSSQL won't cut it (at least not on a VM or commodity hardware, we've tried). Given a Redis server, you can track what is done and stored (and perhaps even what is in progress, to de-duplicate effort in real time, as DiskCache does).
If you can track it, you can reuse it, and you can delete it. Reuse will be slower, since you're doubling the network traffic, moving the result twice. (But also decreasing it linearly with the number of servers in the cluster for source image fetches).
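A rough sketch of that bookkeeping with Redis (the Python redis client is used purely for illustration; the key naming scheme is made up):

```python
import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder host

def claim_or_wait(variant_key: str, ttl: int = 60) -> bool:
    """Try to claim the work of generating one resized variant.

    Returns True if this server should do the resize; False if another
    server already claimed it (so we should wait/poll for the result).
    """
    # SET ... NX EX atomically claims the key only if nobody else has,
    # with a TTL so a crashed server's claim eventually expires.
    return bool(r.set(f"inprogress:{variant_key}", "1", nx=True, ex=ttl))

def mark_done(variant_key: str, blob_url: str) -> None:
    """Record where the finished variant was stored (e.g. a DB/S3 location)."""
    r.set(f"done:{variant_key}", blob_url)
    r.delete(f"inprogress:{variant_key}")

def lookup(variant_key: str):
    """Return the stored location of a variant, or None if not generated yet."""
    url = r.get(f"done:{variant_key}")
    return url.decode() if url else None
```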
If bandwidth saturation is your bottleneck (very common), this could make performance worse. In fact, unless your workload is write- and CPU-heavy, you'll likely see worse performance than with duplicated CPU effort under individual disk caches.
If you have the infrastructure to test it, put DiskCache on a SAN or shared drive; this will give you a solid estimate of the performance you can expect (assuming said drive and your blob storage system have comparable IO perf).
However, it's a fair amount of work, and you're essentially duplicating a subset of the functionality of a reverse proxy (but with worse performance, since every response has to be proxied through the unlucky cluster server, instead of being spooled directly from disk).
CDNs and Reverse proxies to the rescue
Amazon CloudFront or Varnish can serve quite well as reverse proxies/caches for a web farm or cluster. Now, you'll have a bit less control over the 'garbage collection' process, but... also less code to maintain.
There's also ARR, but I've heard neither success nor failure stories about it.
But it sounds fun!
Send me a Github link and I'll help out.
I'd love to get a Redis-coordinated, cloud-agnostic poor-man's blob cache system out there. You bring the petabytes and infrastructure, I'll help you with the integration and troublesome bits. Efficient HTTP proxying is probably the hardest part; the rest is state management and basic threading.
You might want to have a look at a modified AzureReader2 plugin at https://github.com/orbyone/Sensible.ImageResizer.Plugins.AzureReader2
This implementation stores the transformed image back to the Azure blob container on the initial requests, so subsequent requests are redirected to that copy.

Bigger cookie-like files for local data storage (browser "caching" of complex structures)

I am developing a browser-based game, and I have a big map there. The terrain of the map is static, so I have some thousands of tiles that will not change (whether they represent a forest, a desert, whatever); only the players on top of them can change.
Hence, I wanted to store my whole map on the player's computer. I am working with Ruby on Rails, and this map information is passed from the server to the JavaScript that runs in the user's browser, in order to render a pretty map. But it makes me pretty sad to have a 200kb .html file containing all that map-related information.
What would be the simplest way to solve this issue? Cookies! Well. That's what I thought. A complete map's information can get to almost 200kb (they are pretty big). A cookie can hold at most 4kb... I don't feel that the right way to achieve my objective is to create tons of cookies, one for each row of the map, for instance. Is there a more elegant way to keep this static information in the player's browser, without creating lots of cookies? A way to cache it in the browser? I mean... I can cache a 400kb image, so why can't I cache a 200kb map structure?
Thanks in advance!
Fernando.
Well, HTML local storage gives you 5 MB (though data is stored as strings, so the actual amount of data you can fit in the container is likely a lot less than 5 MB).
This limit is oddly fluid. For one thing, it's just a recommended limit; for another, WebKit-based browsers, for example, use UTF-16, which immediately cuts that in half (2.5 MB).
Browser support for local storage is good: IE, Firefox, Safari 4.0+, Chrome 4.0+, and Opera 10.5+. Both iPhone and Android are supported above versions 3.0 and 2.0, respectively.
Using local storage to preserve game state appears to be a prototypical use case.
Finally, Paul Kinlan published an excellent step-by-step tutorial on HTML5Rocks, which I highly recommend (though it's a little more than a year old).
Have you considered storing it in a .js file? Most browsers will cache linked js files, allowing you to serve it only once in a while. It would be very simple to deploy.

Looking for a lossless compression api similar to smushit

Anyone know of a lossless image compression API/service similar to Smush.it from Yahoo?
From their own FAQ:
WHAT TOOLS DOES SMUSH.IT USE TO SMUSH IMAGES?
We have found many good tools for reducing image size. Often times these tools are specific to particular image formats and work much better in certain circumstances than others. To "smush" really means to try many different image reduction algorithms and figure out which one gives the best result.
These are the algorithms currently in use:
ImageMagick: to identify the image type and to convert GIF files to PNG files.
pngcrush: to strip unneeded chunks from PNGs. We are also experimenting with other PNG reduction tools such as pngout, optipng, pngrewrite. Hopefully these tools will provide improved optimization of PNG files.
jpegtran: to strip all metadata from JPEGs (currently disabled) and try progressive JPEGs.
gifsicle: to optimize GIF animations by stripping repeating pixels in different frames.
More information about the smushing process is available at the Optimize Images section of Best Practices for High Performance Web pages.
It mentions several good tools. By the way, the very same FAQ mentions that Yahoo will make Smush.it a public API sooner or later, so that you can run it on your own. Until then you can just upload images separately to Smush.it here.
Try Kraken Image Optimizer: https://kraken.io/signup
The developer plan is free, but only returns dummy results. You must subscribe to one of the paid plans to use the API; however, the web interface is free and unlimited for images of up to 1MB.
Find out more in the Kraken documentation.
See this:
http://github.com/thebeansgroup/smush.py
It's a Python implementation of smushit that can be run off-line to optimise your images without uploading them to Yahoo's service.
As far as I know, the best image compression for me is TinyPNG.
They also have an API: https://tinypng.com/developers
Once you retrieve your key, you can immediately start shrinking images. Official client libraries are available for Ruby, PHP, Node.js, Python and Java. You can also use the WordPress plugin, the Magento 1 extension or improved Magento 2 extension to compress your JPEG and PNG images.
And the first 500 images per month are free.
Tip: when using their API, there is no file-size limit (unlike the 5MB maximum per image in their online tool).
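For example, with their official Python client (tinify), compressing an image takes only a few lines (the API key is a placeholder; see their docs for the other client libraries):

```python
import tinify

tinify.key = "YOUR_API_KEY"  # obtained from https://tinypng.com/developers

# Compress a local PNG/JPEG and write the optimized version to a new file.
tinify.from_file("unoptimized.png").to_file("optimized.png")
```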