Best methods to make urls friendly? - seo

We're working on revising the URL structure for some of our movie content, but we aren't quite sure about the best way to handle odd characters. For example,
'303/302'
'8 1/2 Women'
'Dude, Where's My Car?'
'9-1/2 Weeks'
So far, we're thinking:
/movies/303-302
/movies/8-1-2-women
/movies/dude-wheres-my-car
/movies/9-1-2-weeks
Is this the best solution? Is there anything we're forgetting?

Use this format: /movies/123456/8-1-2-women
Set up your web server so that movies are identified by the numeric id (123456), and the rest of the path is ignored (only serves for SEO).
(Stack Overflow uses this approach.)
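A minimal sketch of the ID-based lookup described above, assuming paths shaped like /movies/<id>/<slug>. The names MOVIE_PATH and movie_id_from_path are hypothetical and not tied to any particular web framework; only the numeric segment is used to identify the movie.

```python
import re

# Hypothetical route pattern: numeric id, then an optional slug that is ignored.
MOVIE_PATH = re.compile(r"^/movies/(\d+)(?:/[^/]*)?/?$")


def movie_id_from_path(path):
    """Return the numeric movie id from the path, or None if it does not match."""
    match = MOVIE_PATH.match(path)
    return int(match.group(1)) if match else None


print(movie_id_from_path("/movies/123456/8-1-2-women"))
print(movie_id_from_path("/movies/123456/some-old-slug"))
```

Because the slug is ignored, old or mistyped slugs still resolve. One possible refinement is to redirect to the current canonical slug whenever the supplied one differs.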

We always use dashes.
I don't have a source offhand, but I have heard that the dash character is good for SEO purposes, better than something like camel case (e.g. dudeWheresMyCar), though I'm not sure how it compares to underscores, ampersands, or percent signs. Apparently with dashes (and maybe other separator characters too) search bots can "read" the words in the link and use them as one more factor in determining content relevance.

From Seomoz: "When creating URLs with multiple words in the format of a phrase, hyphens are best to separate the terms (e.g. /brands/dolce-and-gabbana/), followed (in order) by, underscores (_), pluses (+) and nothing."
This has also been confirmed by Matt Cutts of Google.
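As an illustration of the hyphen-separated scheme, here is a small sketch that reproduces the slugs proposed in the question. The function name slugify is hypothetical, and the rule of dropping apostrophes rather than replacing them is an assumption inferred from the "dude-wheres-my-car" example.

```python
import re


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    text = title.lower()
    # Drop straight and curly apostrophes so "where's" becomes "wheres".
    text = text.replace("'", "").replace("\u2019", "")
    # Collapse every other run of non-alphanumeric characters into one hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")


for title in ["303/302", "8 1/2 Women", "Dude, Where's My Car?", "9-1/2 Weeks"]:
    print("/movies/" + slugify(title))
```

This sketch only handles ASCII letters and digits; titles with accented or non-Latin characters would need an extra transliteration step.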

Related

How can you query a SQL database for malicious or suspicious data?

Lately I have been doing a security pass on a PHP application and I've already found and fixed one XSS vulnerability (both in validating input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't have any special html characters (<, ", ', >, etc).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?
If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
Updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE char since you want to allow dashes -.
For the same reason that you shouldn't be validating input against a black-list (i.e. list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (i.e. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database then identify the exceptions.
The reason is that there are simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly once you get into character encoding, just looking for things like angle brackets and quotes is not going to get you very far.
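The whitelist idea can also be sketched outside the database, for example in a script that scans exported rows. This is only an illustration: the allowed set (letters, digits, spaces, # and -) comes from the question, while the names ALLOWED_VALUE and suspicious_rows and the (id, value) row shape are assumptions.

```python
import re

# Whitelist taken from the question: letters, digits, spaces, '#' and '-'.
ALLOWED_VALUE = re.compile(r"[A-Za-z0-9 #-]*")


def suspicious_rows(rows):
    """Return the (id, value) pairs whose value contains anything off the whitelist."""
    return [
        (row_id, value)
        for row_id, value in rows
        if ALLOWED_VALUE.fullmatch(value) is None
    ]


sample = [
    (1, "Apt #4-B"),
    (2, "<script>alert(1)</script>"),
    (3, "O'Brien"),
]
print(suspicious_rows(sample))
```

Row 3 shows why the whitelist has to reflect the real intent of the field: a legitimate name is flagged because apostrophes are not in the allowed set.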

How to convert foreign characters to English characters in SQL Query?

I have to create a SQL function that converts special characters and international characters (French, Chinese, ...) to English.
Is there a built-in SQL function I can use for this?
Thanks for your help.
If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task then is simply to turn the list of Unicode characters into a table with 100,000 or so rows. Unfortunately the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
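To get a feel for what such a name table would contain, the published character names can be looked up directly. The sketch below uses Python's standard unicodedata module rather than a SQL table, purely as an illustration; the helper name describe is hypothetical.

```python
import unicodedata


def describe(text):
    """Pair each character with its official Unicode name."""
    # The second argument is a fallback for characters that have no name.
    return [(ch, unicodedata.name(ch, "<unnamed>")) for ch in text]


for ch, name in describe("\u00e9\u4e2d"):
    print(repr(ch), name)
```

The names identify characters; they do not translate them, which is why this route is rarely what people mean by "convert to English".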
On the other hand, if you want to actually translate the meaning, you need to be looking at machine translation software. Both Microsoft and Google have well-known cloud translation offerings, and there are several other well-regarded products too.
I think the short answer is you can't unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support you still have homographs, words that are spelled using exactly the same characters, but that have different pronunciations (and thus romanization) depending on what was meant - for example "sow" meaning a pig, or "sow" (where the w is silent) meaning to plant seeds.
And then you get into the problem of which language you are romanizing: characters and even words are not unique to one language, but the actual meaning and romanization can vary. The fact that many languages include loan words from the languages they share characters with complicates any attempt to automatically determine which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English - yet French (with its accented characters) is already written in the roman alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation is understood even if they are omitted (ex. résumé).
Is your problem really that you can't store Unicode/extended ASCII characters? There are numerous ways to correct or work around that.
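If the narrowed requirement turns out to be "drop the accents from Latin-script text", one common approach is Unicode decomposition. The sketch below is in Python rather than SQL and is an assumption about what the asker needs; strip_accents is a hypothetical name. It deliberately leaves non-Latin scripts such as Chinese untouched, since those need real romanization or translation.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (accents) while keeping the base characters."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("r\u00e9sum\u00e9"))
print(strip_accents("\u4e2d\u6587"))
```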

Accented character regex

I'm trying to create a regex that will look for French words whether a user specifies the accented characters or not. So if the user is searching for "déclaré" but types in "declare" instead, I would still like to be able to match the text. I'm having difficulty making this more dynamic so that it can match any French word...
Closest example from another user from a different post was:
d[eèéê]cl[aàáâ]r[eèéê]
Is it even possible to write a regex for something like this?
Any advice would be much appreciated.
I once had to create something like that.
The best thing I could come up with was to keep a dictionary of known letters with diacritics and replace those letters in the search terms before creating the pattern for the regular expression.
Pretty much like you did on your own example.
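A rough sketch of that dictionary approach follows. The VARIANTS table is a hypothetical, deliberately incomplete mapping that a real application would extend, and the function name is made up for illustration. It also assumes the searched text uses precomposed (NFC) characters.

```python
import re

# Hypothetical, partial table of base letters and their accented variants.
VARIANTS = {
    "a": "a\u00e0\u00e1\u00e2\u00e4",
    "c": "c\u00e7",
    "e": "e\u00e8\u00e9\u00ea\u00eb",
    "i": "i\u00ec\u00ed\u00ee\u00ef",
    "o": "o\u00f2\u00f3\u00f4\u00f6",
    "u": "u\u00f9\u00fa\u00fb\u00fc",
}
# Reverse lookup so that accented input ("é") maps back to its base letter ("e").
BASE_OF = {variant: base for base, variants in VARIANTS.items() for variant in variants}


def accent_insensitive_pattern(term):
    """Compile a regex that matches the term with or without diacritics."""
    parts = []
    for ch in term.lower():
        base = BASE_OF.get(ch, ch)
        if base in VARIANTS:
            parts.append("[" + VARIANTS[base] + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


pattern = accent_insensitive_pattern("declare")
print(pattern.pattern)
print(pattern.search("Il a d\u00e9clar\u00e9 la guerre").group(0))
```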

Accented character replacement for search then reinserted afterwards

Basically my issue is that users would like to search for a French word that has accented characters without typing in the accented characters, and then have the actual accented word appear highlighted if found... So for example they would type in "declare", but in the result set it would look like "déclare", and if found, "déclare" would be highlighted.
My first thought was to simply replace the characters with a regex, but then I remembered that I would need to re-insert the replaced characters after the search... I was then thinking of using some sort of character map that would track each position and character, so that when the search was finished I could put the result set back the way it was. This seems a little brute force to me, and I was wondering if anyone had a better alternative? I'm using Visual Studio 2005 with this app.
Any advice would be much appreciated!
Thanks
A regular expression by default matches text. The "replacement" mode is not the normal mode. So what you want is in fact the default: match against the original text and leave it unmodified. The precise syntax will depend on your regex engine, e.g. in .NET you'd use Regex.IsMatch().
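To illustrate matching without a lossy replacement, here is one possible sketch in Python (not the asker's .NET environment). It is a swapped-in technique based on Unicode normalization rather than anything stated in the answer: the text is decomposed so accents become separate combining marks, the pattern allows optional marks after each letter, and the result is recomposed. The function name highlight and the bracket markers are assumptions.

```python
import re
import unicodedata

# Optional run of combining diacritical marks (U+0300 to U+036F) after a letter.
COMBINING_MARKS = "[\u0300-\u036f]*"


def highlight(text, query, open_tag="[", close_tag="]"):
    """Wrap accent-insensitive matches of query, keeping the accents in text."""
    decomposed_text = unicodedata.normalize("NFD", text)
    decomposed_query = unicodedata.normalize("NFD", query)
    base_chars = [ch for ch in decomposed_query if not unicodedata.combining(ch)]
    if not base_chars:
        return text
    pattern = "".join(re.escape(ch) + COMBINING_MARKS for ch in base_chars)
    marked = re.sub(
        pattern,
        lambda m: open_tag + m.group(0) + close_tag,
        decomposed_text,
        flags=re.IGNORECASE,
    )
    # Recompose so the output uses the same precomposed characters as typical input.
    return unicodedata.normalize("NFC", marked)


print(highlight("Il a d\u00e9clar\u00e9 la guerre", "declare"))
```

Since the accents are never removed from the text, no character map is needed to restore them afterwards.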

Special Characters in URL

We're currently replacing all special characters and spaces in our URLs with hyphens (-). From an SEO and readability point of view this works fine. However, in some cases, we are feeding parts of the URL into a search after stripping the hyphens out. The problem occurs when the search term should contain hyphens, as the search returns no results once they are stripped. We could modify the search algorithm we're using, but this would slow it down (especially bad as we're using it with an AJAX-ed search box and this needs to be fast).
The best option to deal with this, as far as we can tell, is to replace pre-existing hyphens with pipes (|). I have a feeling that this will have a negative impact on SEO for those terms, as the pipe character will be treated as part of the word and not as a separator. As far as I can tell, the only characters that are considered to be separators are hyphens and forward slashes (/).
So my questions are:
Are there alternative characters we can use to represent hyphens?
If we can't use any other characters, how much impact will using a pipe character have on a search engine?
Cheers,
Zac
Would ~ (tilde) work?
Edit: Google now treats underscores and dashes as word separators so you can use dashes as dashes and underscores as spaces.
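A small sketch of that convention, assuming underscores stand in for spaces and real hyphens are kept as they are. The function names are hypothetical, and the sketch assumes the original terms contain no underscores of their own.

```python
import re


def to_slug(term):
    """Encode spaces as underscores and keep hyphens that belong to the term."""
    text = term.strip().lower()
    text = re.sub(r"\s+", "_", text)
    # Drop anything that is not a letter, digit, underscore or hyphen.
    return re.sub(r"[^a-z0-9_-]", "", text)


def to_search_term(slug):
    """Recover the search term: underscores become spaces, hyphens survive."""
    return slug.replace("_", " ")


slug = to_slug("Spider-Man 2")
print(slug)
print(to_search_term(slug))
```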
Why not use URL encoding? Most frameworks have built-in utilities to do this.
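For instance, Python's standard library exposes this through urllib.parse. The sketch below is only an illustration of the round trip; the sample title is arbitrary. Hyphens pass through unchanged, so they can be told apart from encoded spaces.

```python
from urllib.parse import quote, unquote

title = "8 1/2 Women"

# safe="" makes quote() encode "/" as well, so the title fits in one path segment.
encoded = quote(title, safe="")
print(encoded)

# Decoding restores the original term exactly, including any hyphens.
print(unquote(encoded))
print(quote("Spider-Man", safe=""))
```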
I was going to say the same thing about URL encoding, but if you're trying to get rid of the special characters, I suppose you don't want URLs with percent signs, right?
What about altering the algorithm that "feeds parts of the URL into a search"? Couldn't you add some logic to not replace hyphens within the search query part of the URL?