We're currently replacing all special characters and spaces in our URLs with hyphens (-). From an SEO and readability point of view this works fine. However, in some cases we are feeding parts of the URL into a search after stripping the hyphens out. The problem occurs when the search term should contain hyphens: it returns no results once they get stripped. We could modify the search algorithm we're using, but this will slow it down (especially bad as we're using it with an AJAX-ed search box and this needs to be fast).
The best option to deal with this, as far as we can tell, is to replace pre-existing hyphens with pipes (|). I have a feeling that this will have a negative impact on SEO for those terms, as the pipe character will be treated as part of the word and not as a separator. As far as I can tell, the only characters that are considered to be separators are hyphens and forward slashes (/).
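To make it concrete, the replacement we're doing (and the change we're considering) looks roughly like this - a simplified sketch, not our actual code:

// Simplified sketch of the current slug generation and the proposed
// hyphen -> pipe substitution (not our real code).
function make_slug($title) {
    $slug = strtolower($title);
    // spaces and special characters (including real hyphens) collapse to hyphens
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

function make_slug_keeping_hyphens($title) {
    // turn genuine hyphens into pipes first so they survive the next step
    $slug = strtolower(str_replace('-', '|', $title));
    $slug = preg_replace('/[^a-z0-9|]+/', '-', $slug);
    return trim($slug, '-');
}

echo make_slug("X-Files: Fight the Future");                 // x-files-fight-the-future
echo make_slug_keeping_hyphens("X-Files: Fight the Future"); // x|files-fight-the-future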
So my questions are:
Are there alternative characters we can use to represent hyphens?
If we can't use any other characters, how much impact will using a pipe character have on a search engine?
Cheers,
Zac
Would ~ (tilde) work?
Edit: Google now treats underscores and dashes as word separators so you can use dashes as dashes and underscores as spaces.
Why not use URL encoding? Most frameworks have built-in utilities to do this.
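For example, in PHP (assuming that's what you're building the URLs with) the round trip is just:

// Hyphens are unreserved characters, so they survive percent-encoding
// untouched, while slashes and spaces become %2F and %20.
$term    = "9-1/2 Weeks";
$encoded = rawurlencode($term);    // "9-1%2F2%20Weeks"
$decoded = rawurldecode($encoded); // "9-1/2 Weeks"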
I was going to say the same thing about URL encoding, but if you're trying to get rid of the special characters, I suppose you don't want URLs with percent signs, right?
What about altering the algorithm that "feeds parts of the URL into a search"? Couldn't you add some logic to not replace hyphens within the search query part of the URL?
I have a lot of Lucene queries that contain a lot of characters with special meaning, like colons, slashes, quotation marks, etc.
I am aware that it is possible to escape a single character by using '\', but is it possible to enclose a whole sentence in something so that it is matched exactly in a query, without any of the symbols being interpreted?
Thanks.
Yes, QueryParser.escape escapes everything in the string passed to it.
Also, using phrase queries generally makes most query syntax irrelevant (myfield:"I +do +not have:to /worry/ about^22 -query -syntax here~2"), with the exception of quotes. If a phrase is what you are attempting to search for, that is.
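If you're building the query string outside Java, a rough sketch of what QueryParser.escape does - just backslash-escaping Lucene's special characters, not the exact implementation - could look like this in PHP:

// Backslash-escape the characters Lucene's query parser treats as special:
// + - ! ( ) { } [ ] ^ " ~ * ? : \ / and the operators && and ||.
function lucene_escape($query) {
    return preg_replace('/([+\-!(){}\[\]^"~*?:\\\\\/]|&&|\|\|)/', '\\\\$1', $query);
}

echo lucene_escape('field:"value" AND 1/2');
// field\:\"value\" AND 1\/2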
Using prepared statements with PDO, as I understand it there are two options:
either ? or :name.
What are the limitations regarding the named parameters? White space? Non-ASCII chars?
(I'm well acquainted with the hell of non-ASCII in field names. So please stick to the topic.)
Those are tokens. The limits are probably the characters A-Z and 0-9, not starting with 0-9.
From that switch starting on line 304, I would say it is [a-zA-Z0-9_].
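So in practice you stick to names in that set. A minimal PHP sketch (the DSN, table and column names here are made up):

// Named placeholders stick to [a-zA-Z0-9_]; anything else (spaces,
// hyphens, non-ASCII) won't be recognised as part of the token.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$stmt = $pdo->prepare('SELECT * FROM movies WHERE title = :movie_title');
$stmt->execute([':movie_title' => "Dude, Where's My Car?"]);

// This would NOT work as intended - ":movie-title" stops being a token at the hyphen:
// $pdo->prepare('SELECT * FROM movies WHERE title = :movie-title');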
Lately I have been doing a security pass on a PHP application and I've already found and fixed one XSS vulnerability (both by validating the input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't contain any special HTML characters (<, ", ', >, etc.).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?
If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
Updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE character since you want to allow literal dashes (-) inside the bracketed set.
For the same reason that you shouldn't be validating input against a black-list (i.e. a list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (i.e. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database and then flag the exceptions.
The reason is that there are simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly when you get into character encoding, just looking for things like angle brackets and quotes is not going to get you very far.
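If it's easier to run the check from the application side, the same "describe what you do want" idea might look like this in PHP - a rough sketch with made-up connection details, table and column names:

// Pull the rows and flag anything containing characters outside the
// allowed set (letters, digits, spaces, hyphens and '#').
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

foreach ($pdo->query('SELECT id, description FROM listings') as $row) {
    if (preg_match('/[^A-Za-z0-9 #\-]/', $row['description'])) {
        echo "Suspicious value in row {$row['id']}: {$row['description']}\n";
    }
}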
I am able to store values in couchdb-lucene with whatever key I like, but it seems that if the key includes any chars outside of [0-9a-zA-Z_], any search fails.
Does anyone know what chars are valid and/or how to properly escape special chars in searches such that special chars can be used?
This shows how to escape special characters and also gives a list of such characters.
All UTF-8 characters should work. I've just verified that I can search for items with é, for example.
A little more information on how you're querying would help, though given the age of this ticket perhaps you've moved on.
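For what it's worth, if you're querying it over HTTP from PHP, make sure the q parameter is percent-encoded; the URL path below is only an example and will depend on your couchdb-lucene setup:

// Example only - adjust the search URL for your installation. The important
// bit is percent-encoding the query so characters outside [0-9a-zA-Z_]
// (including UTF-8 like "é") reach Lucene intact.
$base  = 'http://localhost:5984/mydb/_fti/_design/search/by_title';
$query = 'title:"Amélie"';

$url      = $base . '?' . http_build_query(['q' => $query]);
$response = file_get_contents($url);
echo $response;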
We're working on revising the URL structure for some of our movie content, but we aren't quite sure of the best way to handle odd characters. For example,
'303/302'
'8 1/2 Women'
'Dude, Where's My Car?'
'9-1/2 Weeks'
So far, we're thinking:
/movies/303-302
/movies/8-1-2-women
/movies/dude-wheres-my-car
/movies/9-1-2-weeks
Is this the best solution? Is there anything we're forgetting?
Use this format: /movies/123456/8-1-2-women
Set up your web server so that movies are identified by the numeric id (123456) and the rest of the path is ignored (it only serves SEO purposes).
(Stackoverflow uses this approach)
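A rough sketch of that approach in PHP (the slug rules and routing details here are made up, just to show the idea):

// The numeric id is what actually identifies the movie; the slug is purely
// cosmetic/SEO and can change without breaking the URL.
function movie_url($id, $title) {
    $slug = strtolower(str_replace("'", '', $title)); // drop apostrophes entirely
    $slug = trim(preg_replace('/[^a-z0-9]+/', '-', $slug), '-');
    return "/movies/{$id}/{$slug}";
}

echo movie_url(123456, '8 1/2 Women');           // /movies/123456/8-1-2-women
echo movie_url(123457, "Dude, Where's My Car?"); // /movies/123457/dude-wheres-my-car

// When routing a request, only the id matters:
// preg_match('#^/movies/(\d+)#', $_SERVER['REQUEST_URI'], $m);
// $movieId = (int) $m[1];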
We always use dashes.
I don't have a source offhand, but I have heard that the dash character is good for SEO purposes, better than something like camel caps (e.g. dudeWheresMyCar), though I'm not sure how it compares to underscores, ampersands, or percentage signs. Apparently with dashes (and maybe other separator characters too) search bots can "read" the links and add that as just one more factor in determining content relevance.
From SEOmoz: "When creating URLs with multiple words in the format of a phrase, hyphens are best to separate the terms (e.g. /brands/dolce-and-gabbana/), followed (in order) by underscores (_), pluses (+) and nothing."
This has been confirmed by Matt Cutts of Google, too.