Protect from bots creating multiple free accounts and uploading files - amazon-s3

I am developing a web app for my university where users can create an account and upload images. Images are private and can only be seen by the person who uploaded them. In essence, it is like a cloud file system.
Each user has a free account with 500MB. I am using Amazon S3 to store the images, which means storage costs money.
How can I prevent bots from uploading millions of MB? How can I prevent a bot from creating millions of new accounts and uploading 500MB per account, without hurting the user experience?
On one hand, I definitely don't want to put a CAPTCHA in the registration form because it negatively affects the conversion rate. On the other, I don't want to pay thousands of dollars because a bot uploaded millions of dummy images.
Does anyone know whether Dropbox, Google Drive, etc., suffer from this (content uploaded by bots)? It seems it is not a problem, because I couldn't find anything about it. All the spam-related problems I could read about only covered spam in forums. That makes sense, too: spam in forums can be read by other users, while spam in a service like Dropbox or Google Drive reaches no one. Nonetheless, I have to protect against it to avoid cost surprises.

As far as I can see, without using CAPTCHAs this can be done:
Set up monitoring systems that warn about specific abuse patterns (the same IP uploading lots of data or creating new accounts repeatedly).
Throttle users that follow those patterns; this will hopefully make them notice and render the abuse worthless (a minimal sketch of the monitoring and throttling ideas follows this answer). If this fails, then disable those accounts and have their owners mail/talk to you to explain what's happening.
Since you say it's a system for your university, make users provide proof of enrollment (e.g. a university e-mail address) in case of abuse.
Make this forbidden usage explicit in your terms of use.
Of course, a smart enough bot can work around all of those measures.
For a more advanced solution, you might try some machine learning or AI that learns about normal and abnormal usage patterns, then applies that information to judge a possible abuser.
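As a rough illustration of the monitoring and throttling points above, here is a minimal Python sketch that tracks per-IP signups and upload volume over a sliding one-hour window. The thresholds, window size and in-memory storage are illustrative assumptions, not recommendations; a real deployment would keep these counters in a shared store such as a database or Redis.

```python
import time
from collections import defaultdict, deque

WINDOW = 3600                          # sliding window: the last hour
MAX_SIGNUPS_PER_IP = 3                 # assumed threshold
MAX_UPLOAD_BYTES_PER_IP = 200 << 20    # assumed threshold: 200 MB

_signups = defaultdict(deque)          # ip -> timestamps of account creations
_uploads = defaultdict(deque)          # ip -> (timestamp, size_bytes) per upload

def record_signup(ip):
    """Return False if this IP created too many accounts within the window."""
    now = time.time()
    q = _signups[ip]
    q.append(now)
    while q and q[0] < now - WINDOW:
        q.popleft()
    return len(q) <= MAX_SIGNUPS_PER_IP       # False -> throttle or flag for review

def record_upload(ip, size_bytes):
    """Return False if this IP uploaded too much data within the window."""
    now = time.time()
    q = _uploads[ip]
    q.append((now, size_bytes))
    while q and q[0][0] < now - WINDOW:
        q.popleft()
    total = sum(size for _, size in q)
    return total <= MAX_UPLOAD_BYTES_PER_IP   # False -> slow responses or alert an admin
```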

I would recommend that you:
make users register using their email
don't allow multiple accounts for a single email
send them a registration confirmation email, and deactivate "unconfirmed" accounts after a short amount of time (e.g. 3 days; see the sketch below)
AFAIK, Drupal provides these kinds of controls out of the box or with little effort (and no programming).
This won't solve all your problems, but it will reduce the risk of bot exploits.
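A minimal, self-contained sketch of that confirmation flow, using an in-memory dict in place of a real user table; the confirmation URL is a placeholder and `send_mail` is a caller-supplied function, not part of any particular framework.

```python
import secrets
from datetime import datetime, timedelta

CONFIRMATION_TTL = timedelta(days=3)    # deactivate unconfirmed accounts after 3 days
_users = {}                             # email -> account record (stand-in for a real database)

def create_account(email, send_mail):
    """Register an email at most once and send a confirmation link."""
    if email in _users:                 # one account per email address
        raise ValueError("an account already exists for this email")
    token = secrets.token_urlsafe(32)   # unguessable confirmation token
    _users[email] = {"confirmed": False, "token": token, "created": datetime.utcnow()}
    send_mail(email, f"Click to confirm: https://example.test/confirm?token={token}")

def confirm(token):
    """Activate the account if the token matches and is still within the TTL."""
    for account in _users.values():
        if account["token"] == token and datetime.utcnow() - account["created"] <= CONFIRMATION_TTL:
            account["confirmed"] = True
            return True
    return False

def purge_unconfirmed():
    # Run periodically (e.g. a daily cron job) to drop stale, unconfirmed accounts.
    cutoff = datetime.utcnow() - CONFIRMATION_TTL
    for email in [e for e, a in _users.items() if not a["confirmed"] and a["created"] < cutoff]:
        del _users[email]
```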

Since you said you need registration, there are two angles for tackling this problem - make sure no bots register, and/or limit the number of uploads.
I personally would use both. For the user signup, design a registration form where the user has to enter their email address, send them an email with a link, and activate the account only after they click that link. Or let the user solve a simple math question on signup.
For the second point, you can store the number of uploaded bytes per user and time period. You can then set a quota on allowed upload volume per time period, for example no more than 10MB per hour. If a user hits this limit more than n times, you can deactivate their account (see the sketch below).
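A minimal sketch of that hourly quota with a strike count, using in-memory counters; the 10MB limit mirrors the example above, while the 3-strike threshold and the `deactivate` callback are assumptions (a real system would persist this state per user).

```python
import time
from collections import defaultdict

HOURLY_LIMIT_BYTES = 10 * 1024 * 1024   # 10 MB per hour, as in the example above
MAX_STRIKES = 3                          # assumption: deactivate after 3 violations

_usage = defaultdict(lambda: {"hour": None, "bytes": 0, "strikes": 0})

def allow_upload(user_id, size_bytes, deactivate):
    """Return True if the upload fits this user's hourly quota."""
    current_hour = int(time.time()) // 3600
    u = _usage[user_id]
    if u["hour"] != current_hour:        # a new hour starts a fresh bucket
        u["hour"], u["bytes"] = current_hour, 0
    if u["bytes"] + size_bytes > HOURLY_LIMIT_BYTES:
        u["strikes"] += 1
        if u["strikes"] > MAX_STRIKES:
            deactivate(user_id)          # too many violations: disable the account
        return False
    u["bytes"] += size_bytes
    return True
```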
Also, set up an alerting and monitoring system. For example, monitor the number of non-activated users, the volume of uploads, etc., and set up alerts if these exceed certain thresholds.
The methods mentioned above are not perfect and probably won't block all bots, but they will at least make it much harder for bots to upload unwanted data. They are also quite simple, so you can start off with your project and see whether this is really a problem. And if bots do upload data, you will at least receive alerts and can devise a better solution afterwards.

Related

Youtube API's maximum number of video uploads per day

We are building an app with a video upload functionality. We were wondering if we could use a Youtube account to upload all of our user videos. They should only be accessible via our app... we don't mind if ads show up while viewing them.
If the app grows, we're looking at potential thousands of uploads per day.
Does Youtube support this? If a few videos get flagged, will the "master" account be shut down?
Finally, if Youtube is not the right choice, do you have any recommendation? We would like to avoid hosting the videos ourselves as much as possible, since streaming large amounts of video is an enormous challenge for a startup.
Thank you!
Some information on the video uploads:
https://developers.google.com/youtube/v3/docs/videos/insert
This method supports media upload. Uploaded files must conform to these constraints:
Maximum file size: 128GB
Accepted Media MIME types: video/*, application/octet-stream
You can get the quota information here: https://developers.google.com/youtube/v3/getting-started#quota
Projects that enable the YouTube Data API have a default quota allocation of 1 million units per day, an amount sufficient for the overwhelming majority of our API users.
...
Different types of operations have different quota costs. A simple read operation that only retrieves the ID of each returned resource has a cost of approximately 1 unit. A write operation has a cost of approximately 50 units. A video upload has a cost of approximately 1600 units.
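Putting the quoted figures together gives a rough upper bound on uploads per day under the default quota, which matters for the "thousands of uploads per day" estimate in the question:

```python
# Rough arithmetic from the quota figures quoted above.
DAILY_QUOTA_UNITS = 1_000_000     # default daily allocation
UPLOAD_COST_UNITS = 1_600         # approximate cost of one video upload

print(DAILY_QUOTA_UNITS // UPLOAD_COST_UNITS)   # 625 uploads/day at most,
                                                # before counting any read or write calls
```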
Yes, YouTube can block API access, not only for flagged videos but at any time, as described here: https://developers.google.com/youtube/terms/api-services-terms-of-service#termination
24.2 Termination by YouTube. Notwithstanding anything to the contrary, YouTube reserves the right to (i) suspend or terminate access to, or use of, any aspects of the YouTube API Services by you, your API Client(s) and those acting on your behalf), and (ii) terminate the Agreement (or any portion thereof), as applied to any specific user or API Client, category of users or API Clients, or all users or API Clients at any time. For example, we may need to exercise such rights in instances of your breach of this Agreement, court order, when we believe there to have been misconduct or conduct which may create potential liability for YouTube or its Affiliates. Although we will try to give you reasonable notice, we have no obligation to do so.

How can I acquire data from social media to analyze it using machine learning?

I have a project where I'm required to predict a user's future location so that we can provide them with location-specific services, as well as collect data from their device that would be used to provide a service for another user, etc.
I have already developed an Android app that collects some data, but as social media is the richest source of information, I would like to make use of that. For example, if a user checks in at a restaurant and gives it a good review (on Facebook, for example), then they are likely to go back there. Or if they tweet something negative about a place, then they are unlikely to go back there... these are just examples I thought of.
So my main issue is: how do I even get access to that information? It's not as if the user is going to send me a copy of every social media activity they have, so how do I get it, and is that even possible? I know Facebook, Twitter and other social media platforms have security policies, so I initially thought it couldn't be done and that only Facebook gets access to its users' information to predict their likes and dislikes and show them ads and sponsored posts accordingly. But when googling it, I found a lot of tools that claim to be able to provide that sort of data. How did they even acquire it, and is it possible for me to do the same?
Facebook, Twitter, etc. have well-documented APIs that may or may not allow you to access the data.
For the APIs, see the official documentation of each, because anything I write here will likely be outdated in a year or two, as their APIs change.
Don't rely on web scraping. The sites change their design more often than their APIs, and you would likely violate the terms of service.

Solutions for protecting game high scores

My friend proved it to me by taking the WP7 Paper Toss game, extracting the .xap from it, and then posting his own high scores.
Is there any foolproof way to stop this? (I think Xbox Live integration makes hacking the high scores impossible, but that is only for a select few.)
It depends first of all on how the high scores are sent. I can only assume that what your friend did was take the XAP and modify some internal file, or track the HTTP requests that are used to send the scores to the centralized location. I have two recommendations for you.
Encrypt. Don't keep scores in plaintext. There are plenty of strong encryption methods that you can take advantage of that will render the scoreboard useless unless the person who tries to read it has the key.
If you send the scores to a web service, again, never send them in plaintext. From my own experience I can say that web requests can easily be altered and sniffed. So if I see that the app sends http://yourservice/sendscore?user=Den&score=500, I might as well invoke http://yourservice/sendscore?user=Den&score=99999999. The same applies if you plan on using headers (a minimal signing sketch follows this answer).
Be aware that using the Xbox Live services is only possible if you are a registered Xbox developer, and that status is not easy to get.
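One common way to act on the "never send scores in plaintext" advice is to sign the payload with an HMAC using a secret shared between the app and the server, so tampered requests can be rejected. This is only a sketch: the secret and field names are made up, and a secret shipped inside the client can still be extracted by a determined attacker, which is where the obfuscation advice below comes in.

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"replace-with-a-real-secret"   # assumption: baked into the app (still extractable)

def sign_score(user, score):
    """Client side: serialize the score and attach an HMAC-SHA256 signature."""
    payload = json.dumps({"user": user, "score": score}, sort_keys=True)
    signature = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload, signature                    # send both to the score endpoint

def verify_score(payload, signature):
    """Server side: recompute the signature and compare in constant time."""
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```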
First of all, is a high score list really so critical that you're worried about an edge case? (The average person isn't going to have a dev-unlocked phone with the ability to modify the *.xap file.)
Second of all, no. There's no foolproof way to protect your high score list if it is being stored locally on the device. The only way to protect the high score list would be to store it in the cloud via a web service or some other mechanism.
It is tricky to have a secure high score system, since users can always modify information on the client side. It's impossible to prevent a determined hacker from looking at your code, but you can make it more difficult through obfuscation. PreEmptive's Dotfuscator is currently free for Windows Phone 7 developers and also has analytics built in if you want to use it. It will obfuscate your code and make it harder to read. Although it's not foolproof, it's an extra hurdle for hackers to overcome.
The obfuscation would make it harder to find the encryption key you're using to authenticate the high score.

How do "professional" IM bots avoid being kicked off line or locked out?

I'm looking to develop a scalable IM bot (aka Automated Service Agent). It's been done before and I'm wondering what methods are used to maintain reliability. I see two immediate problems with scaling:
1) On AIM, you can be kicked off if too many users warn you. My bot does not spam or do anything malicious but the vulnerability is still there.
2) If there are network problems, and the bot signs on/off too many times in a row, AOL will lock it out for an unknown period of time.
Here are some preventative measures against detection:
A bot can use several user accounts, so that its activity is less likely to be detected.
A bot can use proxy servers to evade detection even further by obscuring its real IP address.
A bot can be programmed with the network's rules in mind, and is simply prevented from breaking those rules in its logic.
Also, in response to your first problem, fewer people than you might expect will actually report a problem.
Additionally, and this is purely speculative, depending on the network's rules, it could be possible to simulate enough legitimate activity between two or more bots (and several user accounts), so as to offset the actual reports that are made.
In response to problem number two, with multiple accounts, the bot will just move to the next account when a failure occurs.
Just some thoughts.
Regarding #1, you're dealing with human interaction. If your bot doesn't annoy or piss people off, then I doubt that most people will care. The #1 rule with chat bots (IMHO) is to test it with a number of people from different backgrounds. Record their responses and how they feel about interacting with the bot. You can also collect good data to improve your bot's comprehension skills this way.
Regarding #2, you need to code an effective rate limiter. If there is a small number of flake-outs in a short period of time, it's probably OK to reconnect right away, but if they become more frequent, you'll need to back off more (a sketch of such a backoff loop follows). This is actually good for the service in general: if they're experiencing server-side problems, a horde of bots pouncing on them as they try to bring things back up is a pain.
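A minimal sketch of such a backoff loop, assuming a caller-supplied `connect` function that raises `ConnectionError` on failure; the starting delay and cap are illustrative.

```python
import random
import time

def reconnect_with_backoff(connect, max_delay=300):
    """Retry `connect` with exponential backoff plus jitter."""
    delay = 1
    while True:
        try:
            return connect()
        except ConnectionError:
            # Sleep a randomized interval so a fleet of bots doesn't reconnect in
            # lockstep, then double the base delay up to a cap.
            time.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, max_delay)
```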

Design an API for a web service without "selling the farm"?

I'm going to try to phrase this as a generic question.
A company runs a website that has a lot of valuable information on it. This information is queried from an internal private database. So technically, the information in the database is the valuable part.
If this company wished to develop an API that developers could use to access their database of valuable & useful information, what approach should the company take?
It's important to give developers what they need, but it is also important to keep competing websites from using the API to scrape everything and steal all the traffic from the company's website.
Is there some way the API could be designed so that it drives traffic back to the original company's website? Something that gives users a reason to keep going there.
This is a design consideration that my company is struggling with that I can imagine other web-based services have come across before.
Institute API keys - don't make the API public. Maybe make the signup process more complex than "anyone with an e-mail address".
Rate limit the API based on keys. If a key is making more than X requests a minute, it's likely mining the database (a sketch follows this list).
Don't provide a "fetch everything" API. Require callers to already know something specific in order to get information about it. Don't reveal everything you know.
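A minimal sketch of key-based rate limiting using an in-memory token bucket; the 60-requests-per-minute figure and the in-process storage are assumptions (a real deployment would use a shared store and issue keys through a signup flow).

```python
import time

RATE_PER_MINUTE = 60    # assumed limit per API key
BURST = 60              # bucket capacity

_buckets = {}           # api_key -> (tokens_remaining, last_refill_timestamp)

def allow_request(api_key, valid_keys):
    """Reject unknown keys outright; throttle known keys with a token bucket."""
    if api_key not in valid_keys:
        return False
    now = time.time()
    tokens, last = _buckets.get(api_key, (BURST, now))
    tokens = min(BURST, tokens + (now - last) * RATE_PER_MINUTE / 60.0)   # refill
    if tokens < 1:
        return False            # over the limit: possibly mining the database
    _buckets[api_key] = (tokens - 1, now)
    return True
```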
I've seen a lot of companies giving out API keys and stating a TOS that all developers must adhere to. For example, any page that uses data from the API must include your logo and a link back to your website. If any developer is found breaking the rules, the API key can be cancelled and your data is safe again.
Who is meant to use the API?
A good general method of solving this problem is to limit access to the data to end users (rather than letting applications or developers at it in bulk). Give applications and users their own identification, and require a combination of both the user key and the application key to access a subset of the data (see the sketch after this answer).
Following this pattern, each user will have access to a very limited subset of the data (presumably, the data that they require for their own specific use), and you can put measures in place to enforce this. Any attempts at data-mining will become obvious.
This type of approach meshes well with capability-type security models on the server side.
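As a sketch of the combined user-key/application-key idea: a request is honoured only if both keys are known and that particular pair has been granted the requested data subset. The keys, subset names and in-memory tables below are placeholders; a real system would keep this state server-side and issue the keys through registration flows.

```python
# Placeholder key and grant tables (illustrative only).
VALID_APP_KEYS = {"app-abc"}
VALID_USER_KEYS = {"user-123"}
GRANTS = {("app-abc", "user-123"): {"own_records"}}   # (app, user) -> allowed data subsets

def can_access(app_key, user_key, subset):
    """Both keys must be known, and the pair must be granted the requested subset."""
    if app_key not in VALID_APP_KEYS or user_key not in VALID_USER_KEYS:
        return False
    return subset in GRANTS.get((app_key, user_key), set())

# can_access("app-abc", "user-123", "own_records") -> True
# can_access("app-abc", "user-123", "everything")  -> False
```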