Optimal EC2 Instance to handle/edit PDF files - pdf

So I have two applications that are processing PDF files:
The first one combines two files and then captures digital signature and fingerprint to then return a single file with the images of the captures.
The second one is a cloud-signer that inserts qr codes for all of the pages of the pdfs and then digitally signs the document placing also some bar codes and legend. This application will eventually receive requests with 100 documents.
The applications both work now with t2.micro instances (some operations take a while e.g. signing 200 docs around 4 minutes) but I would like to know the optimal type to go smoothly without trouble, would M or C instances make a significant difference?
Which one do you recommend, M, C or other?
Thanks,
Felipe

t-type instances are "burstable performance" which means you cannot really utilize them 100% over a long period of time.
I worked on a project where I extracted text from a PDF and it seemed that it was pretty much a CPU-bound task, so c-type instances may be a good choice. However, what you really should take in the account (especially if the workload will be uneven let's say during the day) is elasticity.
It may take some effort, but let you consider using either lambda functions or an auto-scaling groups so you can start more servers when you need it and stop them later. Actually, it may be a significantly more work, but it may save lots of money if you will use this function for a long time.
Also, I would invest some time to make it simple to deploy with various parameters. For example if you create a CloudFormation template and you need only occasionally sign 200 documents fast, you can deploy 24 t2.micro and run them over the period when you need it (like 1 hour) and then shut them down. It will cost you about the same money as running 1 t2.micro the full day.
One service which may help with it is SQS - you can put documents into it and then just let the machines pick up a document, sign it and put it to its destination.

Related

Static files as API GET targets

I'm creating a RESTful backend API for eventual use by a phone app, and am toying with the idea of making some of the API read functions nothing more than static files, created and periodically updated by my server-side code, that the app will simply GET directly.
Is this a good idea?
My hope is to significantly reduce the CPU and memory load on the server by not requiring any code to run at all for many of the API calls. However, there could potentially be a huge number of these files (at least one per user of the phone app, which will be a public app listed in the app stores that I naturally hope will get lots of downloads) and I'm wondering if that alone will lead to latency issues I'm trying to avoid.
Here are more details:
It's an Apache server
The hardware is a hosting provider's VPS with about 1gb memory and 20gb free disk space
The average file size (in terms of content and not disk footprint) will probably be < 1kb
I imagine my server-side code might update a given user's data once a day or so at most.
The app will probably do GETs on these files just a few times a day. (There's no real-time interaction going on.)
I might password protect the directory the files will be in at the .htaccess level, though there's no personal or proprietary information in any of the files, so maybe I don't need to, but if I do, will that make a difference in terms of the main question of feasibility and performance?
Thanks for any help you can give me.
This is generally a good thing to do: anything that can be static rather than dynamic is a win for performance and cost (it's why we do caching!), but the main issue with with authorization (which you'll still need to do for each incoming request).
You might also want to consider using a cloud service for storage of the static data (e.g., Amazon S3 or Google Cloud Storage). There are neat ways to provide temporary authorized URLs that you can pass to users so that they can read the data for a short time and then must re-authorize to continue having access.

Good idea to host data that will be downloaded internationally using S3?

I don't have any experience regarding server hosting performance and how slow it gets so I wanted to ask this question.
My situation is, I want to host a ~1MB data file that needs to be downloaded by clients occasionally (once every 2-3 days). Of course I would like to minimize costs as long as it does not hurt user experience too much. I have data to indicate that I have clients globally.
I wanted to ask what the ballpark figure would be for the amount of time it would take to download a file of this size from other parts of the world (data is hosted in the US). Does anyone have any idea, for instance, how long it would take to download a 1MB file from locations such as Japan?
In case people are wondering, I personally would consider it OK if it takes under 10s to download in most parts of the world.
The first thing to do when you don't know how well something works... is to try it. Create buckets in all of the regions, store a file, and then download it and see.
The official AWS-centric answer for global content distribution is to connect a CloudFront distribution to an S3 bucket, and set things up so that your content is downloaded from S3 via CloudFront. This will tend to improve download speeds more when the user is distant from the bucket, even if the content isn't cached at a CloudFront edge, because most of the distance the download has to travel, it will be traveling on the AWS "Edge Network," a global network connecting CloudFront to the AWS regions, with fewer unknowns than the Internet at large between here and wherever.
I have a global client base, but -- for example -- my shopping pages' catalog images are stored in S3 in Oregon (us-west-2), but with links pointing to CloudFront.
Interestingly, the pricing for using both services together sometimes works out a little bit less expensive than using only S3. A possible explanation for this is that edge network egress traffic represents a lower cost to AWS and the rates are set accordingly. It's not a major difference, but once you understand the pricing tables, you'll see it.
1MB in 10s equals 800kbps. I'd be very surprised if any reputable hosting provider couldn't keep up with that speed of delivery. Looking at Akamai rankings (2015)*, in Japan (as in your example) the average user's speed is 15Mbps: your file would then be downloaded in 0.53 seconds.
( *Looking at the rankings, keep in mind that in countries where fast internet infra is yet to be ubiquitous, the "average speed" will be an average of fast corporate pipes and other premium links, with actual mainstream users having substantially slower speeds.)
Then in most cases, this will be up to the user's connection speed, and further, their ISP's international links, which can be much slower than their national or regional pipes. More so in countries with less developed internet infrastructure, where operators are cutting costs and corners.
In deciding if you need to deploy S3 or other CDN solutions, or no extra solutions at at all, you'll have to start with mapping up your user demographics. If there's a substantial sector from far-away countries with weak net infra, it makes sense. Otherwise, it doesn't seem likely that your target speed of 1MB/10s wouldn't be matched even without a special means of delivery.
If you have some but not substantial traffic from countries/regions where you reckon int'l traffic might be slower, and if you want to eliminate extra costs, I figure your users will survive even if it takes 15-20 seconds once in a blue moon as their speeds fluctuate. (This is opinion-based relative to how picky your users are!) In such a case, I'd only bother with a CDN if I wanted to improve speeds across the board, e.g. for all requests for static resources, not just a single file requested every couple of days. Would make a more substantial contribution towards the general user experience.

Ruby script to count access and complete downloads of the file

I am open for the suggestions on the following:
have a file on S3
this file will be randomly downloaded by random people
the volume of downloads is low, maybe 200-300 at most per day, on a spike, but usually might be as low as 5-10.
file size is ~10-20mb.
I need somehow to count a) how many accesses to the file happened b) how many full (completed) downloads happened.
I believe the only good day is just have some Ruby or Node.js script. It'll count accesses, then somehow supply the file, and on final byte do the completed count.
Unfortunately, doesn't seem like a too nice of the approach.
Any better ideas?
I was also thinking about enabling access logging on S3 and then parsing logs, but that doesn't seem too good neither, as requires downloading and parsing logs.
I would stick with your first idea. Having some sort of server-side logic handling the counting stuff.
I don't know which type of clients are accessing your system, but with this approach you can parse additional data coming from your clients, like HTTP headers (if applicable), etc, and that can help you identifying the profile of your clients. That might not be useful at all to you, though.
Also, if you ever need to add more complicated logic (authentication, privileges, permissions, uploading files, etc), it will be much easier once you already have a backend application/script in place.

Simple DB and Amazon S3 performance tools

Is there any benchmark tools that i can use to test the Amazon Simple DB performance and Amazon S3 performance?
help needed please.
Its going to depend on you usage and whether you're running in EC2 or not. There are some benchmarks somewhere for S3 access from EC2, but your mileage will vary with object sizes, the SDK library you're using and where you're accessing from.
Roll your own tests and then you'll know that you're testing something close to your end goal...
You need to write your own code that approximates what you want to do.
Having said that: In my experience, S3 is about as fast as your connection. You may have to upload/download more than one item at a time to hit your local bandwidth limit, but you can get there.
Listing performance is also pretty good on S3, but the results are uncompressed XML, so they are little large. If you want to do 'something' to say a million files, you need to run several requests in parallel. This goes for SimpleDb too. The number of requests 'in flight' that works best is a mix of ping, bandwidth, AWS service response and other factors.
SimpleDB on the other hand I find to be pretty slow for many tasks. It totally depends on your needs, though. Selecting a record and getting back the attributes when you know the db item name is usually ping time limited, but searching with the %like% operator is usually quite slow (seconds is easy to hit).
Add to this that its all much faster if you are running on EC2 vs a local machine, and also add in the delay/bandwidth if your app is in say Singapore and you are trying to use the US Standard location to store everything. There is just too much to figure in.

Distributed datastore

We're trying to add some kind of persistence in our app.
The app generates about 250 entries per second. Each of these entries belong to one of 2M files. For each file, we want to keep the last 10 entries, so we can look them up later.
The way our client application works :
it gets a stream of all the data
it fetches the right file (GET)
it adds the new content
it saves the file back (PUT)
We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks.
We initially looked at S3. It works fine, but becomes very expensive very fast (>$1000 monthly just in PUT operations!)
We then gave a shot at Riak. But it seems we can't get more than 60 write/sec on each node, which is very very slow.
Any other solution out there?
There are lots of knobs you can turn in Riak - ask the mailing list if you haven't already and we'll figure out a sane configuration for you. 60 writes/sec is not within the norm.
See: http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
What about Hadoop's HDFS spread over Amazon EC2 instances? I know each instance has a good amount of storage space, and you don't have to pay for put/get, only the inbound transfer.
I would suggest looking at CloudIQ Storage from Appistry. Its a fully distributed file store. Its accessible via a REST-based API, and can run on commodity hardware. You can define the number of copies retained on a file by file basis. It supports an Eventually Consistent model so you can balance file consistency with performance.