How do URL shorteners guarantee unique URLs when they don't expire? [closed] - url-shortener

There are a lot of questions about URL shorteners here on Stack Overflow, as well as elsewhere on the internet, e.g.
How to code a URL shortener?
How do URL shortener calculate the URL key? How do they work?
http://www.codinghorror.com/blog/2007/08/url-shortening-hashes-in-practice.html
There is one thing I don't understand, though. For instance, http://goo.gl uses four characters at the moment, yet they claim their short URLs don't expire. As the Coding Horror article mentions, if URLs can't be recycled, the only option is to add an additional character at some point.
OK, so far so good. With 4 characters that means about 15 million unique addresses. For something like Google Maps I don't think that is very much, and if you can't recycle, my guess is they run out of available addresses fairly quickly.
Now for the part I don't get. While handing out addresses, they start to run out of unused ones, so they have to check whether a newly generated address has already been issued, and the chance that it is already in use keeps increasing. The straightforward solution, of course, is to generate a new address over and over until they find a free one, or until they have generated all 15 million alternatives. But this surely can't be how they actually do it, because it would be far too time-consuming. So how do they manage this?
Also, there are probably several visitors at once asking for a short URL, so they must have some synchronization going on as well. But how should the situation be managed when the fifth character needs to be added?
Finally, when doing some research on how the URLs from http://goo.gl work, I of course requested a short URL for a map on Google Maps several times. None of them will ever be used. However, if Google strictly enforces the policy that URLs never expire once issued, this means there are lots and lots of dormant URLs in the system. Again, I assume Google (and the other services as well) have come up with a solution to this problem too. I could imagine a clean-up service that recycles URLs which have not been visited in the first 48 hours after creation, or fewer than 10 times in the first week. I hope someone can shed some light on this issue as well.
In short, I get the general principle of URL shorteners, but I see several problems when these URLs cannot expire. Does anyone know how the problems mentioned above might be resolved and are there any other problems?
EDIT
Ok, so this blog post sheds some light on things. These services don't randomly generate anything. They rely on the underlying database's auto-increment functionality and apply a simple conversion on the resulting id. That eliminates the need to check if an id already exists (it doesn't) and the database handles synchronization. That still leaves one of my three questions unanswered. How do these services "know" if a link is actually used once created?
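For illustration, here is a minimal sketch of that scheme in Python, assuming a 62-character alphabet of digits plus lower- and upper-case letters (the function names are my own, not from any particular shortener). Because every auto-increment id maps to a distinct key, no collision check is ever needed:

# Sketch: turn a database auto-increment id into a short key and back.
# The id itself guarantees uniqueness; the alphabet below is an assumption.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62

def encode(row_id: int) -> str:
    """Convert an id such as 125 into a short key such as '21'."""
    if row_id == 0:
        return ALPHABET[0]
    chars = []
    while row_id > 0:
        row_id, remainder = divmod(row_id, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))

def decode(key: str) -> int:
    """Convert a short key back into the numeric id used for the database lookup."""
    row_id = 0
    for ch in key:
        row_id = row_id * BASE + ALPHABET.index(ch)
    return row_id

print(encode(14776336))    # 62**4, the first id that needs a fifth character
print(decode(encode(42)))  # 42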

Why URL shorteners don't delete entries
I once wrote to TinyURL (a decade ago) to give back an entry that I didn't need. Their reply made me realize how pointless that was: they told me "just create all the URLs you need". And the figures speak for themselves:
A - With 26 lowercase letters + 26 uppercase letters + 10 digits (the choice of reasonable sites), ONE character gives you 62 positions (i.e. 62 shortened URLs), and each additional character multiplies the number of positions by 62:
0 chars = 1 URL
1 char = 62 URLs
2 chars = 3,844 (1 URL for each human in a village)
3 chars = 238,328 (idem, in a city)
4 chars = 14,776,336 (in Los Angeles area)
5 chars = 916,132,832 (in Americas, N+Central+S)
6 chars ~ 56,800,235,580 (8 URLs for each human in the world)
7 chars ~ 3,521,614,606,000 (503 for each human, 4 for each web page in the world)
8 chars ~ 218,340,105,600,000 (31,191 URLs for each human)
9 chars ~ 13,537,708,655,000,000 (~2 million URLs for each human)
10 chars ~ 839,299,365,900,000,000 (~120 million URLs for each human)
11 chars ~ 52,036,560,680,000,000,000
B - Actually the needs and uses are lower than one might expect. Few people create short URLs, and each person creates few of them; original URLs are enough in most cases. The result is that the most popular shorteners, after years, still cover today's needs with just 4 or 5 chars, and adding another one when needed will cost nearly nothing. Apparently goo.gl and goo.gl/maps each use 5 chars, and YouTube uses 11 (using the 62 chars above, plus the dash and the underscore).
C - The cost of hosting (storing + operating) a URL is, say, $1000/year for 1 terabyte, with each TB able to hold 5 billion URLs, hence 1 URL costs about 0.2 micro-dollars/year in hosting. However, the benefit for the shortener is also very thin, which is why the business is not very strong. For the user, the benefit of a URL is hard to evaluate, but a dead link would cost far more than the hosting.
D - There is no point for the user in creating a short URL if it risks becoming inoperative within a few years, so persistence is a major appeal of a shortener, and a serious shortener will probably never stop serving its URLs unless it is forced out of business; yet that has already happened. In any case, short URLs have their share of drawbacks as well as benefits, as explained in the Wikipedia article "URL shortening" (risks of all sorts of attacks against the user, the target sites, or the shortener; e.g. one could attack a shortener by bot-requesting huge numbers of URLs, a threat most shorteners surely fend off).
Versailles, Tue 12 Mar 2013 20:48:00 +0100, edited 21:01:25

Related

Primary Key Type Guid or Int? [closed]

I am wondering what the recommended type for a PK in SQL Server is. I remember reading this article a long time ago, but now I am wondering whether it is still a wise decision to use a GUID.
One reason that got me thinking about it is that these days many sites use the ID in the URL; for instance, Course/1 would get the information about that record.
You can't really do that with a GUID, which would mean you would need some new column that is unique and use that, which is more work since you have to make sure each record gets a unique number.
There is never a "one size fits all" solution. You have to carefully design your architecture and select the best options for your scenario. Both INT and GUID are valid options, as they've always been.
You can absolutely use a GUID in a URL. In fact, in most scenarios it is better to use a GUID (or another random ID) in the URL than a sequential numeric ID, for security reasons. If you use sequential IDs, your site visitors can easily guess other users' IDs and potentially access their content. For example, if my profile URL is /Profiles/111, I can try /Profiles/112 and see if I can access it. If my reservation URL is Reservation/444, I can try Reservation/441 and see what happens. I can easily guess other IDs in the system. Of course, you must have strong permissions, so I should not be able to see pages that don't belong to my account, but if there are any holes in your permissions and security, a breach can happen. With GUIDs and other random IDs there is no way to guess other IDs in the system, so such a breach is much more difficult.
Another issue with sequential IDs is that your users can guess how many accounts or records you have and their order in your database. If my ID is 50269, I know that you must have roughly that many records. If my ID is 4, then I know you had very few accounts when I registered. For that reason, many developers start the first ID at some random high number like 1529 instead of 1. It doesn't solve the issue entirely, but it avoids the problem of obviously small IDs. How important all this guessing is depends on the system, so you have to evaluate your scenario carefully.
That's on top of the benefits mentioned in the article you linked in your question. But still, an integer is better in some areas, so choose the best option for your scenario.
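To make the guessing argument concrete, here is a small sketch (the route paths are just illustrative, not any particular framework's):

import uuid

# Sequential ids are trivially enumerable: a visitor can walk /Profiles/111, /Profiles/112, ...
profile_id = 112
print(f"/Profiles/{profile_id}")

# uuid4() gives ~122 random bits, so neighbouring ids cannot be guessed from your own.
profile_guid = uuid.uuid4()
print(f"/Profiles/{profile_guid}")  # e.g. /Profiles/8f3c2d1e-5a7b-4c9d-9e2f-0a1b2c3d4e5f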
EDIT To answer the point you raised in your comment about user-friendly URLs: in those scenarios, sequential numbers are the wrong answer. A better solution is a unique string in the URL which is linked to your numeric ID. For example, the Cars movie has this URL on IMDb:
https://www.imdb.com/title/tt0317219/
Now, compare that to the URL of the same movie on Wikipedia, Rotten Tomatoes, Plugged In, or Facebook:
https://en.wikipedia.org/wiki/Cars_(film)
https://www.rottentomatoes.com/m/cars/
https://www.pluggedin.ca/movie-reviews/cars/
https://www.facebook.com/PixarCars
We must agree that those URLs are much friendlier than the one from IMDb.
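A minimal sketch of that pattern, assuming a simple slugify helper and an in-memory slug-to-id mapping (illustrative only, not how any of those sites actually implement it):

import re

def slugify(title: str) -> str:
    """Turn a title like 'Cars (film)' into a URL-friendly slug like 'cars-film'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# The slug is stored alongside the sequential primary key and used for lookups.
slug_to_id = {}

def register(record_id: int, title: str) -> str:
    slug = slugify(title)
    slug_to_id[slug] = record_id
    return f"/m/{slug}"

print(register(317219, "Cars"))  # /m/cars
print(slug_to_id["cars"])        # 317219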
I've worked on small, medium, and large-scale implementations (100k+ users) with SQL Server and Oracle. The majority of the time, a PK of type INT is used. GUIDs were more popular 10-15 years ago, but even at their height they were not as popular as INT. Unless you see a need for one, I would recommend INT.
My experience has been that the only time a GUID is needed is if your data is on the move or merged with other databases. For example, say you have three sites running the same application and you merge those three systems for reporting purposes.
If your data is stationary or you are running a single instance, INT should be sufficient.
According to the article you mention:
GUIDs are unique across every table, every database, every server
Well... this is a great promise, but it can fail to deliver. GUIDs are supposed to be unique snowflakes. However, reality is much more complicated than that, and there are numerous reasons why they end up not being unique.
One of the main reasons is not the UUID/GUID specification itself but poor implementations of it. For example, some JavaScript implementations rank among the worst, using pseudo-random numbers that are quite predictable. Other implementations are much more decent.
So, bottom line: study the specific implementation of UUID/GUID you are using and will be using. Don't just read and trust the specification. Otherwise you may be in for a surprise when you get called at 3 am on a Saturday night by angry customers.
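As an illustration of the gap between a predictable generator and a proper one (a sketch; the "naive" version just mimics the kind of seeded-PRNG implementations mentioned above):

import random
import uuid

# A naive GUID built from a seeded, predictable PRNG: anyone who can guess the seed
# (often the current time) can reproduce every "unique" id you will ever generate.
def naive_guid(rng: random.Random) -> str:
    return "%08x-%04x-%04x-%04x-%012x" % (
        rng.getrandbits(32), rng.getrandbits(16), rng.getrandbits(16),
        rng.getrandbits(16), rng.getrandbits(48))

print(naive_guid(random.Random(1234)))  # same seed, same "GUID", every time

# uuid4() draws from the operating system's CSPRNG, which is what you actually want.
print(uuid.uuid4())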

reason for SAP length restrictions

I am currently working at a company which uses SAP ERP. The average organization name length of our business partners is about 100 characters, which may be domain-specific. The SAP way now seems to be that users have to split such names into 40-character chunks (no joke). This certainly does not reflect reality at all and leads to the weirdest problems.
When I ask people about the reason for this restriction, they say things like "you know, the core of this software is from the 1970s, when every bit was expensive". Answers like these are not satisfying at all.
So what is the real reason?

url keywords as parameter or part of the url seo [closed]

When creating a URL, I am thinking of these ways of doing it:
example.com/restaurants-in-new-york
example.com/restaurants/in-new-york
example.com/restaurants/in/new-york
So the question is: how does Google consider these? If I search for "restaurants in new york", which type is given preference?
Which will have the higher ranking?
They'll all probably have the exact same ranking, no matter what. More importantly: don't screw up real users when doing SEO.
Simple reasons:
You might do everything for nothing (i.e. no or no significant change).
Talking (readable) URLs are for users, so they know what content to expect; they're not for search engines. If you optimize your overall appearance just for a high search engine ranking, you might profit at first even if people don't actually want your page content, but they won't return.
When using path separators (/) in your URL, expect users to use them. If they can't, you create a bad user experience and they might leave your page and look elsewhere for what they're looking for.
Similar reason: only use path separators if there's a real reason for them, for example if they represent some logical grouping. In your example above, the second and third variations would be bad, because the separation doesn't make any sense at all.
How I'd do it:
If we're talking about a single blog post or something like that, don't use any path separators at all: example.com/restaurants-in-new-york. That's it. Don't touch it or change it.
If we're talking about some listing, you can provide the user hints on how to quickly change their actual query or results. Here I'd use something like example.com/restaurants/new-york.
I hope you understand my intention here. Let's assume the user now wants to know what restaurants there are in Chicago. He won't have to navigate or look for some link; he can just edit the URL: example.com/restaurants/chicago. There he is. Good user experience. If you look at your third example, you've got that unnecessary filler "in": it doesn't serve any purpose at all (hint: Google most likely ignores such fillers anyway).
TL;DR: Don't optimize for Google or any other search engine. Optimize for your users. Modern search engines are built to see pages the way users do. You can have the prettiest keywords in your URLs, yet they won't help you at all if they're confusing or obviously just made for SEO.

Get word frequency using search engines [closed]

Are there any good services which can give me the number of web pages a word occurs on?
I need this to calculate the Normalized Google Distance. A few years ago there was the Google Web Search API, which one could call to get the occurrence counts and the search results (which I don't actually need).
This web search API has now been replaced by the Google Custom Search API, but the cost of that service is too high for my purpose.
The Bing Search API and the Yahoo! BOSS Search API are not options either, since they only return a maximum of 50 search results and not an estimate of word occurrences.
I have already done quite some searching on the internet, but I can't seem to find anything that gives me the information I want.
Thanks for any suggestions.
First you should read:
http://searchengineland.com/why-google-cant-count-results-properly-53559
I would offer you blekko API results to do this, but I consider the entire technique invalid because of the inaccuracy of the counts any major search engine provides.
(Late I know, but I only just found this while trying to solve the same problem)
Maybe a decent substitute would be Google's Web1T corpus. It's definitely not perfect for your use case, but it's probably better than nothing. In particular, since the corpus only includes 5-grams, the f(x, y) counts will only be derivable for words separated by at most three other words, which will most likely mean you'll get counts of zero in many cases where you would expect a higher count from the actual Google results (assuming such a number exists, which, as Greg's link told us, it may not). Another potential problem is that it only includes data up to 2006 (you may not care, though), and it only covers English (although a version with 10 European languages is also available). Oh, and it's $150, which is not obscene, although it could mean you have to deal with the accounts department.
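For reference, here is a small sketch of how those counts feed into the Normalized Google Distance (the formula is the standard one from Cilibrasi and Vitányi; the example counts below are made up):

from math import log

def ngd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
    """Normalized Google Distance from page counts:
    f_x, f_y - pages containing each term,
    f_xy     - pages containing both terms,
    n        - total number of pages indexed."""
    return (max(log(f_x), log(f_y)) - log(f_xy)) / (log(n) - min(log(f_x), log(f_y)))

# Hypothetical counts, just to show the calculation.
print(ngd(f_x=2.9e8, f_y=1.2e8, f_xy=3.5e6, n=5e10))  # ~0.73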
I do it in R using RCurl:
library(RCurl)

# Build the Google search URL for the expression and fetch the page source.
search_result_address <- sprintf("http://www.google.com/search?q=%s", searched_expression)
result_page_source_as_string <- getURL(search_result_address, .opts = list(ssl.verifypeer = FALSE))[[1]]
Then your result is located in the string between "About" and "results", and I'm too ashamed of my regex skills to display my own solution, but I'm sure you'll figure it out :).
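For what it's worth, a sketch of that extraction step in Python rather than R, assuming the count still appears in the page as "About N results" (Google's markup changes often, so treat this as illustrative only):

import re

def extract_result_count(page_source: str) -> int:
    """Pull the estimated count out of text like 'About 290,000,000 results'."""
    match = re.search(r"About ([\d,]+) results", page_source)
    if match is None:
        raise ValueError("result count not found in page source")
    return int(match.group(1).replace(",", ""))

print(extract_result_count("... About 290,000,000 results (0.52 seconds) ..."))  # 290000000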
The counts of pages are indeed not accurate, but you can get more stable results by excluding from the search a word that doesn't exist anyway, so Google will search harder. I'd tend to trust those more.
Example: searching "character"
character returns 290,000,000 results.
character -potato returns 931,000,000
character -hincbhjvmzsslzlkjed returns 1,780,000,000
character -zzzanjbedlkjzd returns 1,780,000,000 too, showing a stabilization
For less general queries the estimates are better.
"google frustrates me" returns 3,920 results.
"google frustrates me" -potato returns 2,870.
"google frustrates me" -hincbhjvmzsslzlkjed returns 2,860.

SEO URL Hierarchy

Currently I have a site developed in cakephp that has the following type of URL's:
http://www.travelenvogue.com/clubs/page/accommodations/1-Ritz_Carlton_Club_Bachelor_Gulch
I have heard that because our most valuable keywords, "Ritz Carlton Club Bachelor Gulch", are so far from the beginning of the URL, they may not be helping us for SEO purposes. My first question is whether this is accurate.
Secondly, my programmer told me he could change it for less time/money to:
Ex:travelenvogue.xxx/1-Ritz_Carlton_Club_Bachelor_Gulch/accommodations
(with the 1 before the keywords)
or (for significantly more time/money) to:
Ex:travelenvogue.xxx/Ritz_Carlton_Club_Bachelor_Gulch/accommodations
Is the URL without the 1 in front of the keywords much more helpful than the one with the 1 in front of the keywords?
Any help is appreciated, I'm so confused! :)
The problem with rewriting the URLs in backwards order like this is that it makes less sense to humans, especially since CakePHP's pretty-URL structure is designed to conform to the accepted informal standard.
Here are Google's own recommendations: http://www.google.com/support/webmasters/bin/answer.py?answer=76329&hl=en
A site's URL structure should be as simple as possible. Consider organizing your content so that URLs are constructed logically and in a manner that is most intelligible to humans (when possible, readable words rather than long ID numbers). For example, if you're searching for information about aviation, a URL like http://en.wikipedia.org/wiki/Aviation will help you decide whether to click that link. A URL like http://www.example.com/index.php?id_sezione=360&sid=3a5ebc944f41daa6f849f730f1, is much less appealing to users.
The thing to remember is that Google is good at picking up keywords from your URLs and from your pages. So long as your pages and URLs follow a semantic, logical structure, there is very little to worry about.
Edit: As an addendum to the above - the 1 is redundant as far as both users and search engines are concerned, since it doesn't add any keyword value and is apparently some kind of identifier. It's the sort of thing that should be separated from the keywords somehow (usually by using a directory structure - http://example.com/accommodations/1/hotel-name ). Probably too late to change it now if it's a mature app, though. It would be better if it were a real keyword, say a particular country name or a location group or similar.
Yes, that is right. The closer your main keyword is to the root folder, the more weight it will get in search engines.
This is not the only SEO factor, though.
In on-page optimisation, your main keyword should be present in the following:
Page title
H1 tag
URL (in the domain if possible)
Image alt tags
Links on your home page
Meta keywords and description (some search engines still count them)
First sentence of each paragraph
End of the page
Your keyword should be spread through the whole page content in different places, at roughly 20% density.
In off-page optimisation, what matters is how popular your site is, with your keyword, on other sites.
Generally, there is more SEO weight for a page higher in the site hierarchy. For example, in order from good to bad:
www.mysite.com/page1
www.mysite.com/sub/page2
www.mysite.com/sub/sub/page3
Exactly how much weight depends on the search engine. But keep in mind there are other factors.
In my opinion, the 1 before the title would not hurt you any more or less than the other example.
I will say the best would be: travelenvogue.com/1-Ritz_Carlton_Club_Bachelor_Gulch
In the end, SEO can be a bit of black magic. That is to say, this particular optimization doesn't guarantee your page will appear ahead of another page that is buried under several subdirectories. So you will have to weigh the time and budget.