Why randomize your file names for cloud storage/CDN? - amazon-s3

When you look at a profile picture on a social networking site like Twitter, they store image files like:
http://a1.twimg.com/profile_images/1082228637/a-smile_twitter_100.jpg
or even with a date somewhere in the path, like 20110912. The only immediate benefit I can think of is preventing a bot from going through and downloading all the files in your storage in a linear fashion. Am I missing any other benefits? What is the best way to go about randomizing the names?
I am using Amazon S3, so I will have one subdomain serving all my static content. My plan was to store an integer ID in my database and then just concatenate the URL with the ID to form the location.

One reason I cryptographically scramble identifiers in public URLs is so that the business' rate of growth is not always public.
If the current ids can be deduced simply by creating a new user account or uploading an image, then an outside person can calculate the growth rate (or an upper limit) by doing this on a regular basis and seeing how many ids were used during the elapsed time.
Whether it's stagnating or whether it's exploding exponentially, I want to be able to control the release of this information instead of letting competitors or business analysts be able to deduce it for themselves.
Offline examples of this are invoice and check numbers. If you get billed by or paid by a company on a regular basis, then you can see how many invoices or checks they write in that time period.
Here's a CPAN (Perl) module I maintain that scrambles 32-bit ids using two-way encryption based on Skipjack:
http://metacpan.org/pod/Crypt::Skip32
It's a direct translation of the Skip32 algorithm written in C by Greg Rose:
http://www.qualcomm.com.au/PublicationsDocs/skip32.c
Use of this approach maps each 32-bit id into an (effectively random) corresponding 32-bit number which can be reversed back into the original id. You don't have to save anything extra in your database.
I convert the scrambled id into 8 hex digits for displaying in URLs.
Once your ids approach 4.29 billion (32-bits) you'll need to plan for extending the URL structure to support more, but I like having shorter URLs for as long as possible.
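If you're not on Perl, the same idea can be sketched elsewhere. Below is a rough Python illustration - a toy Feistel network standing in for Skip32, with made-up round keys and a made-up mixing function - just to show how a 32-bit id can be reversibly scrambled and rendered as 8 hex digits. It is not Skip32 and shouldn't be treated as a replacement for Crypt::Skip32.
# Toy illustration only: a 4-round Feistel permutation over 32-bit ids.
# The round keys and mixing function are arbitrary placeholders, not the
# real Skip32 key schedule or F-table.
ROUND_KEYS = [0x9E37, 0x79B9, 0x7F4A, 0x7C15]

def _mix(half, key):
    # Cheap 16-bit mixing function (placeholder for Skip32's F-table).
    return ((half * 0x6255 + key) ^ (half >> 7)) & 0xFFFF

def scramble(id32):
    left, right = (id32 >> 16) & 0xFFFF, id32 & 0xFFFF
    for key in ROUND_KEYS:
        left, right = right, left ^ _mix(right, key)
    return (left << 16) | right

def unscramble(scrambled):
    left, right = (scrambled >> 16) & 0xFFFF, scrambled & 0xFFFF
    for key in reversed(ROUND_KEYS):
        left, right = right ^ _mix(left, key), left
    return (left << 16) | right

token = format(scramble(12345), "08x")        # 8 hex digits for the URL
assert unscramble(int(token, 16)) == 12345    # reversible, nothing extra stored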

Changing URLs is a safe way to invalidate outdated assets.
It is also a necessity if you want to allow users to store private images. Using a path that can be deduced from the user's account name or ID would render privacy settings useless as soon as you store assets on a CDN.
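A minimal sketch of that second point, assuming Python and an S3-style key layout (the prefix and token length here are arbitrary choices):
import secrets

def private_asset_key(filename):
    # 24 random bytes -> ~32 url-safe chars: the path cannot be guessed
    # from any user id, and issuing a fresh token effectively invalidates
    # previously shared or cached URLs.
    return f"private/{secrets.token_urlsafe(24)}/{filename}"

# Store the returned key in your database against the owning user;
# the CDN URL itself reveals nothing about who owns the asset.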

Mainly, it prevents name collisions. More than one person might upload "IMG_0001.JPG", for example. You also avoid limits on the number of files in one directory, and you can shard images across multiple servers - there's no way a huge site like Twitter or Facebook could store all photos on one server, no matter how large.
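A rough sketch of how a randomized name addresses both points, in Python (the key layout is a made-up example, not how Twitter or Facebook actually do it):
import os
import uuid

def storage_key(original_name):
    # Two users uploading "IMG_0001.JPG" can never collide, because the
    # stored key is derived from a fresh UUID, not from the upload name.
    name = uuid.uuid4().hex                      # 32 hex chars
    ext = os.path.splitext(original_name)[1].lower()
    shard = name[:2]                             # 256 possible prefixes
    return f"images/{shard}/{name}{ext}"

# images/3f/3fa9...jpg, images/a0/a04e...jpg, ... - the two-character
# prefix spreads objects across 256 "directories" (or buckets/servers).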

Related

Are Firebase dynamic links short url info exposable?

Can an attacker view the query parameters of a shortened Firebase Dynamic Link?
If so, is it secure enough to use, say, invite links that contain a group ID to grant access to that group?
In that case, wouldn't there technically be the security issue of someone writing a program that tries every ID until it hits a correct one?
After some research: yes, the URL parameters are indeed exposed and viewable.
Secondly, Firebase document IDs consist of 20 characters, and each character can be one of 62 values (26 uppercase letters + 26 lowercase letters + 10 digits), meaning an ID has 62^20 possible combinations. Good luck to anyone trying that amount out.
Thirdly, I believe App Attest would block a user who is abusing resources.
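For a sense of scale, simple arithmetic (nothing Firebase-specific) shows why brute force is hopeless:
keyspace = 62 ** 20                 # 20 chars, 62 choices each
print(keyspace)                     # ~7.04e35 possible IDs
guesses_per_second = 1_000_000_000  # an absurdly generous attacker
years = keyspace / guesses_per_second / (60 * 60 * 24 * 365)
print(f"{years:.2e} years to enumerate")   # on the order of 1e19 years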

real estate large scale 301 redirect

Trying to work out what to do in regard to a redirect of a new client real estate website.
We have no access at all to the old site, and the URL structure on the new site is necessarily different due to randomly generated property IDs (our system generates different IDs than the old one did).
The old url structure is www.mydomain.com/property/view?=1111
The new url structure is www.mydomain.com/property/street-name/2222
My instinct is to do manual 301s for every property (about 6,000), matching by page title, but sadly I can't, as I have no access to the structure of the old website, and despite spidering it numerous times I can't get a complete pull of all the properties.
If anyone could give me any advice on what best to do to avoid a bad user experience and a frying from Google, I would really appreciate it.
Thanks in advance.
Mark
It depends on what the 1111 is. If it corresponds to an MLS ID number (some sort of UID), then you should be able to use a regex to get it to work. Most of the IDX vendors offer a way to grab listings via an MLS ID.
If that 1111 is instead just a GUID from the previous IDX vendor, then you might be out of luck and would need to do everything manually.
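If the 1111 really is an MLS ID you can look up, the ~6,000 redirects can be generated rather than written by hand. A hedged Python sketch - lookup_new_slug is hypothetical and stands in for whatever your IDX vendor or new system actually provides:
import re

OLD_URL = re.compile(r"/property/view\?=(\d+)")

def lookup_new_slug(mls_id):
    # Hypothetical: resolve the MLS ID against your IDX vendor / new system
    # and return e.g. "street-name/2222", or None if the listing is gone.
    ...

def redirect_for(old_path):
    match = OLD_URL.search(old_path)
    if not match:
        return None
    slug = lookup_new_slug(match.group(1))
    if slug is None:
        return None                      # listing gone: let it 404 or send to search
    return f"/property/{slug}"           # emit into whatever redirect map your server uses

# Run this over a crawl or log export of old URLs to generate the 301 rules
# instead of matching 6,000 pages by title manually.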

How do I store 3rd-party API data after user interaction?

The project that I'm currently on consumes a large volume of 3rd-party information exposed via APIs. These datasets change constantly and are on the order of millions of entries each.
Users are to denote their favorites and recall that data when they need it. An example may be that the user wants to "bookmark" an inventory level to their "analyze later" list.
My current thinking is that during actions like searching, users are presented with "live" data from the 3rd parties. If they flag something they're interested in, I copy that data to a database I control. Subsequent views of that info are served from my database, not the 3rd party, since the 3rd-party entry may change (or cease to exist entirely).
Is this good API practice? Which object keys are sent to the client-facing application on search - the 3rd-party keys? Do I preprocess the search results to determine which items I already have locally, and return local keys in those cases? Do I completely abstract the 3rd-party sources and generate a unique local key for every returned item, to be used later if someone saves it (that seems really heavy, though)? Or do I defer that processing and only check whether something exists locally after someone bookmarks it?
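One possible sketch of the "snapshot on bookmark" flow described above, in Python-flavoured pseudocode (db.insert and fetch_from_provider are hypothetical placeholders, and this isn't a ruling on which keys to expose to the client):
def bookmark(user_id, provider, provider_key):
    # Snapshot the third-party record at the moment the user flags it,
    # so later views don't depend on the provider still having it.
    record = fetch_from_provider(provider, provider_key)   # hypothetical API call
    local_id = db.insert(
        "saved_items",
        user_id=user_id,
        provider=provider,
        provider_key=provider_key,   # kept so you can refresh or diff later
        snapshot=record,             # served locally on subsequent views
    )
    return local_id                  # the key the client uses from now on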

Prevent multiple purchases with the same credit card?

I'm thinking through how to develop a validation in my Rails app that essentially checks that the credit card used for any given transaction, by any user, is unique in our system - such that any given credit card may be used to purchase an item only once across the entire application, for all users, for all time.
The thinking behind this restriction is that this app will sometimes run time-sensitive promotional deals, and we want to do our best to institute a "one purchase per credit card" system for these deals.
I was thinking of hashing the credit card number and just storing that hash in the db, then cross-referencing it at the time of each new purchase (so my payment gateway keeps the actual number, and I just keep a hash in the DB), but on further research, this seems like a bad idea.
So I'm back to the drawing board and looking for new ideas. Anyone know a good approach to this problem, while keeping as PCI-compliant as I can be?
I'm developing with Rails 3 and using ActiveMerchant to integrate with my payment gateway, Authorize.net, if that helps at all.
Certainly some hashing is a bad idea - either because the algorithm is weak, has known attacks, or is so commonly used that precomputed rainbow tables exist. That doesn't mean all hashing is a bad idea - the only way to cross-reference is going to be some way of uniquely and predictably identifying the information. Unless PCI specifically prohibits it, hashing is still the way to go.
Salt
Make sure you salt your hash - this prevents rainbow attacks, or at least forces the attacker to build a table with your salt in mind. That holds especially if you can keep the salt reasonably secure (I say only "reasonably" because in order to generate the hash you need the salt, which means it'll be in code or config somewhere).
Choose a Good Algorithm
While MD5 is now infamous, and implemented in all kinds of languages, it's also so common that you can find pre-made rainbow tables for it. It's also extremely quick to compute. Your system can probably tolerate a small amount of delay, so use a much more processor-intensive hash; that makes the cost of generating a rainbow table much higher. Check out the Tiger algorithm, for example.
Hash more than once
If you have multiple related data points, multiple hashes make a rainbow attack much harder. For example: Hash(Hash(Card# + salt1) + expireDate + salt2) requires knowledge of both the card number and the expiry date to generate (easy for you), but can't easily be reverse-engineered (a rainbow table would need every card number times every plausible expiry date, plus knowledge of both salts).
Edit: (your comments)
Reasonably secure: Only transmit it over an encrypted connection (SFTP, SSH), don't store it unencrypted - including live/iterative and backup copies, keep the file with the salt outside of the web tree (cannot be directly accessed/accidentally released), make sure permissions on the file are as restrictive as possible (don't allow group/global file access).
Dynamic salt: throwing a random value into the hash is great for defeating rainbow attacks - you store that random piece in the table next to the hashed value, so an attacker has to build a separate rainbow table for every dynamic salt. However, for your needs you can't do this - you need to know the right salt the second time you create the hash (otherwise the hash will never match on the second card use). For that to be predictable/repeatable you'd have to base the dynamic salt on some part of the number, which is effectively what multiple hashing with another data point does. The more data points you have, the further you can take this - if you have the CVV, for example (3 hashes), or perhaps you hash 8 digits at a time (for a total of 3 hashes: hash(hash(hash(1..8+salt1)+9..16+salt2)+expDate+salt3)).
Best hash: it's a moving target, but there's a good discussion on security.stackexchange, which points to SHA-512.
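A minimal sketch of the nested-hash idea above, in Python with SHA-512 (the salts are placeholders, and whether storing even a hash of the card number is acceptable should be confirmed against PCI DSS):
import hashlib

SALT1 = b"replace-with-long-secret-salt-1"   # keep outside the web root
SALT2 = b"replace-with-long-secret-salt-2"

def card_fingerprint(card_number, expire_date):
    # Hash(Hash(card + salt1) + expiry + salt2): a rainbow table now needs
    # every card number x every plausible expiry date, plus both salts.
    inner = hashlib.sha512(card_number.encode() + SALT1).hexdigest()
    outer = hashlib.sha512((inner + expire_date).encode() + SALT2).hexdigest()
    return outer

# Store card_fingerprint(...) with each order and reject a new purchase
# when the same fingerprint already exists.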
Faking your true credit card number online is the best way to prevent this from happening. Citibank clients can log in and make use of a tool provided with all accounts: just generate a number and expiration date for use online, and all is fine, for now.
I think you are looking in the wrong direction. I would just check the last 4 digits of the card, the IP, and the shipping address. The risk of storing full card data is not worth it compared to the damage done if a small number of users game the last-4-plus-IP check. (He says, not knowing the nature of the purchases.)
Since the address isn't collected... the first 4 digits, last 4 digits, and the 4-digit expiration (all hashed, of course) should provide the uniqueness you need to ensure that a card is only used once.
If you want a "one purchase per user" system then why don't you just check the user's purchase history whenever they try to buy a special-purchase item to ensure that they haven't bought it previously?
A user could register for multiple accounts.
Although by checking user history, as well as enforcing one item per address for each purchase, you will likely be fine, you could also limit things by the user's name, birthday, or other identifying information.
Credit card information can also change, by the way - it's actually very easy to purchase 100 gift credit cards with unique numbers - so if you want to police things down to the most minute level... I don't think you'll be able to do it by card numbers alone.

eCommerce Third Party API Data Best Practice

What would be best practice for the following situation? I have an ecommerce store that pulls down inventory levels from a distributor. Should the site call the third-party API every time a user loads a product detail page, to get the most up-to-date data? Or should the site call the third-party API, store that data in its own system for a certain amount of time, and update it periodically?
To me it seems obvious that it should be updated every time the product detail page is loaded, but what about high-traffic ecommerce stores? Are completely different solutions used in that case?
In this case I would definitely cache the results from the distributor's site for some period of time, rather than hitting them on every request. However, I would not simply use a blanket 5-minute or 30-minute timeout for all cache entries. Instead, I would use some heuristics. If possible - for instance, if your application is written in a language like Python - you could attach a simple script to every product which implements the timeout.
This way, if it is an item that is requested infrequently, or one that has a large amount in stock, you could cache for a longer time.
if product.popularityrating > 8 or product.lastqtyinstock < 20:
    # Popular or nearly out of stock: drop the cached value and re-check now.
    cache.expire(productnum)
    distributor.checkstock(productnum)
This gives you flexibility that you can call on if you need it. Initially, you can set all the rules to something like:
# Default rule: let any cached value live for 3 minutes before re-checking.
cache.expireover("3m", productnum)
distributor.checkstock(productnum)
In actual fact, the script would probably not include the checkstock function call, because that would be in the main app, but it is included here for context. If Python seems too heavyweight to include just for this small amount of flexibility, then have a look at Tcl, which was specifically designed for this type of job. Both can be embedded easily in C, C++, C#, and Java applications.
Actually, there is another solution: your distributor keeps the product catalog on their servers and gives you access to it via an Open Catalog Interface. When a user wants to place an order, they get redirected in-place to the distributor's catalog, choose items, then transfer the selection back to your shop.
This approach is widely used in the SRM (Supplier Relationship Management) space.
It depends on many factors: the traffic to your site, how often the inventory levels change, the business impact of displaying outdated data, how often the suppliers allow you to call their API, the API's SLA in terms of availability and performance, and so on.
Once you have these answers, there are of course many possibilities here. For example, for a low-traffic site where getting the inventory right is important, you may want to call the 3rd-party API on every call, but revert to some alternative behavior (such as using cached data) if the API does not respond within a certain timeout.
Sometimes, well-designed APIs include hints as to the validity period of the data. For example, some REST-over-HTTP APIs support the various HTTP cache-control headers, which can be used to specify a validity period, or to retrieve data only if it has changed since the last request.
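A small sketch of that last point, assuming Python with the requests library and a supplier API that honours ETags (the endpoint URL is made up):
import requests

_etags, _cache = {}, {}

def get_stock(product_id):
    url = f"https://api.example-distributor.com/stock/{product_id}"   # hypothetical endpoint
    headers = {}
    if product_id in _etags:
        headers["If-None-Match"] = _etags[product_id]   # conditional GET
    try:
        resp = requests.get(url, headers=headers, timeout=2)
    except requests.RequestException:
        return _cache.get(product_id)    # supplier slow or down: fall back to cache
    if resp.status_code == 304:
        return _cache[product_id]        # unchanged since the last request
    resp.raise_for_status()
    _etags[product_id] = resp.headers.get("ETag", "")
    _cache[product_id] = resp.json()
    return _cache[product_id]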