I'm using the Lucene.Net implementation packaged with the Kentico CMS. The site that we're indexing has articles in various languages. If a user is viewing the Japanese version of the site (for example) and runs a search for 'VPN', we'd like them to see Japanese articles about VPN first, but also see other language articles in the results.
I'm trying to achieve this with query-time boosting of the _culture field. Since we're using the standard analyzer (really don't want to change that), and the standard analyzer treats hyphens as whitespace, I thought I'd try appending '(_culture:jp)^4' to the user's query. As you can see from the Luke tool's Explain output, that isn't doing anything to boost the documents with 'jp' in the field. What gives?
I've also tried:
_culture:"en-jp"
_culture:en AND _culture:jp
_culture:"en jp"
Update: It's something with the field. There's another field in the index named 'documentculture' that contains the same data (don't know why). But when I try '(documentculture:jp)^4', it works as I expect. That solves my problem, but I still have an academic question of how the fields are different.
Even though the standard analyzer ignores hyphens, I don't believe it will treat the two parts of your culture code as separate terms. Therefore, under normal circumstances a wildcard would help you here. For example, the query vpn (_culture:en*)^4 would boost all documents with a culture starting with en.
However, in your case you want to match the end of the term. Unfortunately, Lucene's query syntax doesn't allow a wildcard at the start of a term by default (according to this reference). Therefore I think you're going to have to consider changing the analyzer you're using. I generally find the Whitespace analyzer fits my needs best. I've just tried your scenario using the Whitespace analyzer and have found that vpn (_culture:en-jp)^4 will give you what you need.
I understand if you don't accept this answer though since you stated you didn't want to change the analyzer!
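For illustration, here is a minimal sketch of appending the boosted clause to whatever the user typed. It only builds the query string; it does not talk to Lucene. The helper name with_culture_boost is hypothetical, and the field names and boost of 4 are simply the ones from this thread.

```python
def with_culture_boost(user_query: str, culture: str,
                       field: str = "_culture", boost: int = 4) -> str:
    """Append an optional, boosted clause to the user's query.

    Hypothetical helper: documents matching field:culture score higher,
    while documents in other cultures can still match the original query.
    """
    return f"{user_query} ({field}:{culture})^{boost}"


if __name__ == "__main__":
    # The form reported to work with the Whitespace analyzer.
    print(with_culture_boost("vpn", "en-jp"))
    # The form the asker reported working against the other field.
    print(with_culture_boost("vpn", "jp", field="documentculture"))
```

Whether the clause matches anything still depends on how the field was indexed, which is the open question in this thread.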
All the articles I Googled on this subject are dated back in 2004-2005.
Basically I am structuring precanned searches, and it is based off of categories the client will input.
Example
content/(term name)/index.htm
Does it matter if I used the raw term with a space, which is converted to %20 in the URL, or should I convert the link to '-' and remove that before querying for results?
I already have it working, but does anyone know if this definitely has a negative impact on SEO and ranking?
No impact on SEO. A hyphen (-) just looks nicer, that's all.
You'd use %20 if you needed to preserve the exact term including a proper space when you read it back from the URL. Probably you don't.
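To make the trade-off concrete, here is a small Python sketch of both options, assuming the content/(term name)/index.htm layout from the question. The function name hyphenated is made up for the example; quote and unquote are from the standard library.

```python
import re
from urllib.parse import quote, unquote


def hyphenated(term: str) -> str:
    """Replace runs of whitespace with a single hyphen (lossy)."""
    return re.sub(r"\s+", "-", term.strip())


term = "term name"

# Percent-encoding keeps the exact term: decoding gives the space back.
encoded = quote(term)
print(f"content/{encoded}/index.htm")
print(unquote(encoded) == term)

# The hyphenated form reads better, but on the way back you can no longer
# tell a hyphen that replaced a space from one that was in the term.
print(f"content/{hyphenated(term)}/index.htm")
```

If the term itself can contain hyphens, the hyphenated form needs a lookup table or a second separator to be reversible.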
I personally think it should be "-"
I don't remember seeing a website that was using %20
"-" is one character and %20 is three, so you can put more stuff visible in the address bar
for an example, what is better
Do spaces in your URL (%20) have a negative impact on SEO?
or
Do spaces in your URL (%20) have a negative impact on SEO?
Yes, don't use them. Google, Yahoo and Bing do not know how to leverage the spaces and, more importantly, you are wasting a good opportunity to tell both the consumer and the search engines more about your product or page URL and what the content is about.
However, sometimes it can't be helped because you have a website / ecommerce site for years and the site is indexed and already on good page ranking.
In that case, if you do want a better naming convention, you will want to rename the URLs, then take all of the existing URLs with spaces, put them into 301 redirects, and map them to the new URLs.
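A rough sketch of building that old-to-new mapping, assuming the only change is swapping %20 for a hyphen. The function name redirect_map and the sample paths are invented for the example; how the 301s are actually served depends on your web server or CMS.

```python
def redirect_map(old_paths):
    """Map each old path containing %20 to its hyphenated replacement.

    Paths without %20 are left out because they need no redirect.
    """
    mapping = {}
    for old in old_paths:
        new = old.replace("%20", "-")
        if new != old:
            mapping[old] = new
    return mapping


if __name__ == "__main__":
    paths = [
        "/content/term%20name/index.htm",
        "/content/widgets/index.htm",
    ]
    for old, new in redirect_map(paths).items():
        print(f"301: {old} -> {new}")
```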
%20 does not affect SEO, but it will destroy the readability of your URL. Since CMSs have taken this into account, it's now easy to set up a dynamic URL structure. I recently read an article on SEO-friendly URLs which will help you avoid a Google penalty, improve your chances to rank, and make your links meaningful. Hope it helps.
As mentioned, it really doesn't matter from a search engine perspective. With that being said, however, it's generally not good practice to use spaces in URLs (%20). Replace it with a dash or concatenate it.
I use Blogger, and when I add labels to a blog post, the link to that label page has a space which is converted to "%20"; I have no control over that with Blogger. When I try to make the labels with '-' instead of a space, they are not nice for humans to read, so I go with spaces and "%20" in URLs. I think this should not affect SERPs.
We use "%20" all over the place on our website and have not experienced any negative effects. We began doing this about two years ago, and at that time a few search engines had problems, but they have since disappeared. Some browsers will display a "%20" in the address bar, while others will display an empty space, but this really doesn't matter.
We're not so sure though that this has any positive effect on ranking, though it definitely has no negative effect. The thing to remember about Google is that while having a keyword as part of the base url, such as www.greatwidgets.com, is very helpful, using keywords as part of the page url, example: www.myexample.com/widgets.htm does not appear to result in any advantage. What matters is the page content and how many other pages out there have the exact same content. Also, incoming links from relevant websites with high rankings, without the rel="nofollow" tag are extremely important.
You cannot "trick" Google with fancy-looking URLs and h1 headers. That's right, h1 headers mean nothing, because Google doesn't require your input to tell them what's important.
Remember, if you're selling products and copying content from the manufacturer's website (or the competitor's website), Google's PANDA is going to be very angry. You'll need to reword your content so that it's not a verbatim copy from some other website. Google rewards originality, and severely punishes plagiarism. Seriously, PANDA will put the offending page on page 50 until it's brought into conformity with Google's policy on duplicate content.
Always use sitemaps to help the search engines.
I believe it looks better in a link if an underscore (_) is used.
content/term_name/index.htm
content/term-name/index.htm
content/term%20name/index.htm
It's better to use "-" instead of %20, since %20 looks like unprofessional coding to the search engines and to the visitors. Do you really think a visitor could remember a URL with %20? Make the pages for the users and not for the search engines. You will get the most benefit from this and search engines will appreciate it.
In my view, spaces should not be in a URL, as this is not good practice. We should use hyphens between the words in URLs. The website should also have a sitemap.xml file.
In my view, spaces do have a negative impact on SEO. Secondly, when creating a URL structure, hyphens should be used instead of underscores.
Yes, they do have a negative effect, as they affect the user experience. Users would like to have easy-to-remember URLs. Google suggests you should separate your words with '-' and ideally not use '_' or spaces ('%20').
Something else to consider is that if you use spaces in your URLs, it will break automatic URL detection in a lot of software (e.g. email, chat, etc.), which treats a space as the end of the URL. This might negatively impact the "sharability" of your URLs.
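A quick way to see the problem, using a deliberately naive detector of the kind many auto-linkers approximate. The pattern and the example.com URLs are assumptions for the demo, not what any particular mail or chat client uses.

```python
import re

# Naive rule: a URL runs from the scheme to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def detect_urls(text: str):
    """Return the substrings a whitespace-delimited detector would link."""
    return URL_PATTERN.findall(text)


if __name__ == "__main__":
    # The raw space cuts the link short after "term".
    print(detect_urls("See http://example.com/content/term name/index.htm today"))
    # With a hyphen the whole URL is picked up.
    print(detect_urls("See http://example.com/content/term-name/index.htm today"))
```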
Using spaces in URLs is still not common practice in 2020, and Google still recommends using - instead:
https://support.google.com/webmasters/answer/76329?hl=en
Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search ("window.window->window", but it doesn't seem to return any relevant results for this request)
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
edit: It is very unlikely you'll find a general purpose search engine which will allow you to search for something like "window.window->window" because most search engines will do some processing on the document before storing it. For instance they might represent it internally as vectors of words (a vector space model) and use that to do the search, not the actual original string. And creating such a vector involves first cutting the document according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about. My bad memory did a pretty good job since I studied it at school!
BTW, they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its peers are doing but can give you a hint about what happens to your query.
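As a toy illustration of the processing described above (not what any real search engine does), here is a sketch that splits text on punctuation and computes a basic tf-idf weight. The tokenizer rule and the two sample documents are assumptions made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(term: str, doc_tokens, all_docs_tokens) -> float:
    """Term frequency in one document times log inverse document frequency."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for tokens in all_docs_tokens if term in tokens)
    if docs_with_term == 0:
        return 0.0
    return tf * math.log(len(all_docs_tokens) / docs_with_term)


if __name__ == "__main__":
    # After tokenizing, the punctuation is gone and only the words remain.
    print(tokenize("window.window->window"))
    docs = [tokenize("window.window->window"), tokenize("open the door")]
    print(tf_idf("window", docs[0], docs))
```

Once the index only holds the word "window", a query for the exact string has nothing left to match against except that word.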
There is no way to do that in the main Google engine by itself, as you discovered -- however, if you are looking for information about Mozilla, then the best bet would be to structure your query more like this:
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are
trying to do.
SymbolHound is a web search that does not remove punctuation from the queries. There is an option to search source code repositories (like the now-discontinued Google Code Search), but it also has the option to search the Internet for special characters (primarily programming-related sites such as StackOverflow).
try it here: http://www.symbolhound.com
-Tom (co-founder)
Does Apache's Solr search engine provide approximate string matching, e.g. via the Levenshtein algorithm?
I'm looking for a way to find customers by last name. But I cannot guarantee the correctness of the names. How can I configure Solr so that it would find the person
"Levenshtein" even if I search for "Levenstein" ?
Typically this is done with the SpellCheckComponent, which internally uses the Lucene SpellChecker by default, which implements Levenshtein.
The wiki really explains very well how it works, how to configure it and what options are available, no point repeating it here.
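For reference, the distance itself is easy to compute. This is a plain dynamic-programming sketch in Python for illustration only; it is not the Lucene SpellChecker code, and the function name is just a label for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # The two spellings from the question differ by one missing letter.
    print(levenshtein("Levenshtein", "Levenstein"))
```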
Or you could just use Lucene's fuzzy search operator.
Another option is using a phonetic filter instead of Levenshtein.
Great answer by Mauricio. My only "cheapo" addition is to just append the ~ character to every term you want to fuzzy match on the way in to Solr. If you are using the default setup, this will give you a fuzzy match.
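A minimal sketch of that "append ~" trick, assuming the input is a plain list of words with no quotes, operators or field prefixes. The helper name fuzzify is made up, and a real query would need proper escaping first.

```python
def fuzzify(query: str) -> str:
    """Append ~ to every whitespace-separated term that lacks it."""
    terms = query.split()
    return " ".join(t if t.endswith("~") else t + "~" for t in terms)


if __name__ == "__main__":
    print(fuzzify("Levenstein"))
    print(fuzzify("john smith"))
```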
I am looking for url encoding tips for SEO compliant site.
I have a list of variables I need!
hyphen = used to split locations, Leeds-UK-England
space = underscore for where spaces occur
hyphen = plus sign used in some British locations (stafford-upon-avon)
forward slash = exclamation mark, used in-house for names of things.
Are the ones chosen bad or good? Are there any better ones? I'm pretty sure I need all the data in order to decode the URLs properly.
My "SEO" gave me a list of things which are bad, but not good. I've searched these and google seems to give the same type of results.
Cheers, Sarkie
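The scheme listed in the question can be sketched as a reversible mapping like this. It is only one reading of that list: the function names are invented, and it assumes the original values never contain '_', '+' or '!' themselves, otherwise decoding is ambiguous.

```python
# Character substitutions inside a single location part, per the list above.
ENCODE = str.maketrans({" ": "_", "-": "+", "/": "!"})
DECODE = str.maketrans({"_": " ", "+": "-", "!": "/"})


def encode_location(parts):
    """Join location parts with '-' after substituting reserved characters."""
    return "-".join(part.translate(ENCODE) for part in parts)


def decode_location(segment: str):
    """Split on '-' and undo the substitutions to recover the parts."""
    return [part.translate(DECODE) for part in segment.split("-")]


if __name__ == "__main__":
    print(encode_location(["Leeds", "UK", "England"]))
    encoded = encode_location(["stafford-upon-avon", "UK"])
    print(encoded)
    print(decode_location(encoded))
```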
Google used not to recognise underscores as word separators - see this article from 2005. This has entered into received wisdom and most of the 'experts' and articles you will find on SEO will still be recommending this.
However, last year this changed: underscores are now recognised as word separators so it opens things up for URL design. This now allows using dashes as dashes and underscores as spaces which some consider more natural. I've not found many people who have caught up with this, including SEO consultants I deal with professionally.
As to a good system for your use case, I would recommend asking around some non technical people (colleagues, friends, family, etc) to see what they like.
Hyphens for spaces is the usual and preferred method.