SEO and magic numbers in URLs

Which URL is more relevant, 1 or 2?
1: http://site.com/language/country/city/category/title
2: http://site.com/language/country/city/category/articleId(number)/title
The thing is, for (1) I have to design my DB in an inefficient way, doing textual search and table joins, but I'm not sure how much relevance (2), where I'm just putting in a direct table ID, loses in search results.

The first would be the more relevant, as it doesn't contain any irrelevant data, such as the articleId.
If you are concerned about keeping titles unique, have a second database column, called filename for example, which holds a URL-encoded version of the title. If the title is already in use, append an incremented value at the end.
For example, if the title 'SEO' was already in use by another article, call the new one SEO-1, and so on.
That way you are only appending irrelevant values when two titles clash.
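As a minimal sketch of the collision check (assuming a hypothetical articles table with a slug column; the LIKE is deliberately rough and would also count slugs such as 'seo-tips'):

-- How many articles already use this slug, exactly or with a suffix?
SELECT COUNT(*) AS n
FROM articles
WHERE slug = 'seo'
   OR slug LIKE 'seo-%';
-- If n = 0 store 'seo'; otherwise store 'seo-' || n (e.g. 'seo-1').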

How do I perform a query in Postgres using a URL slug?

Let's say I have a URL as follows:
www.somewebsite.com/dining/caseys+grille
I have a business_listings table in Postgres that contains a column business_name. I have a record in the table with 'Casey's Grille'.
How can I query 'caseys+grille' against 'Casey's Grille'?
Would I need to use full text search? How would I go about doing this?
Since you are not searching for regular words, but for proper names, and you probably also want to find results that are similar in spelling, you should use trigram GIN indexes and similarity search.
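A minimal sketch of that approach, using the table and column from the question (pg_trgm is a standard Postgres extension):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram index so the similarity search stays fast.
CREATE INDEX business_name_trgm_idx
ON business_listings
USING gin (business_name gin_trgm_ops);

-- Turn the slug back into words, then rank rows by trigram similarity.
SELECT business_name,
       similarity(business_name, replace('caseys+grille', '+', ' ')) AS score
FROM business_listings
WHERE business_name % replace('caseys+grille', '+', ' ')
ORDER BY score DESC
LIMIT 5;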
This problem looks simple at first, but it is a can of worms.
The solution should consider all the use cases:
Is it only a matter of removing/rewriting special characters?
Do you need to consider typos (is casey grill the same)?
Do you need to consider distinctive marks (is Casey's Grill #2 the same)?
Do you need to consider abbreviations (is NY Grill the same as New-York Grill)?
Do you need to consider numbers (is 1st av. Grill the same as first avenue grill)?
If it is your database + website, the simplest is to record/compare the URL slug directly.
Otherwise, or if you don't control the URL (for example if it comes from a search box), you may want to store/compare a parsed name. For both the DB title and the URL slug, you transform the name into common elements: you expand common abbreviations to their full text, remove all special characters, remove/add spaces, strip accents if your language has them, standardize the casing, etc. Only you can find and apply the suitable transformations.
Then you can compare the two parsed names using any suitable comparison method (trigram, plain equality, LIKE queries, etc.).
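For example, a crude normalization in Postgres might look like this (unaccent is a standard extension; the exact rules are up to you):

CREATE EXTENSION IF NOT EXISTS unaccent;

-- Lower-case, strip accents, collapse anything non-alphanumeric to one space.
SELECT trim(regexp_replace(lower(unaccent('Casey''s Grille')), '[^a-z0-9]+', ' ', 'g'));
-- Result: 'casey s grille'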
I assume you actually want a single slug of the text value in business_name and you want this to be a unique identifier for this particular business.
You can create an additional column business_name_slug and create a unique index on this column.
Then you can create a before insert or update trigger that writes the slug created from business_name into this column.
The tricky part is to create logic that
generates a URL-friendly version of the business name (there are examples around in blog posts, GitHub gists, etc.), and
avoids naming collisions, so your unique constraint will not raise an error when inserting/updating (see the sketch below).
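A minimal sketch of such a trigger, assuming the business_listings table from the question (EXECUTE FUNCTION needs Postgres 11+; older versions use EXECUTE PROCEDURE; collision handling is left as a TODO):

-- Build the slug from business_name before every insert/update.
CREATE OR REPLACE FUNCTION set_business_name_slug() RETURNS trigger AS $$
BEGIN
    NEW.business_name_slug :=
        trim(both '-' from
             regexp_replace(lower(NEW.business_name), '[^a-z0-9]+', '-', 'g'));
    -- TODO: on collision, append a counter (e.g. '-2') before returning.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER business_name_slug_trg
BEFORE INSERT OR UPDATE ON business_listings
FOR EACH ROW EXECUTE FUNCTION set_business_name_slug();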

OpenRefine: cell.cross for similar but not identical values

I have two datasets.
One dataset has names of countries, but dirty ones, like:
Gaule Cisalpine (province romaine)
Gaule belgique
Gaule , Histoire
Gaule
etc.
The second dataset has two columns, with the (clean) names of countries and a code, like
Gaule | 1DDF
Is there a way to use cell.cross with value.contains()? I tried to use reconcile-csv but it didn't work properly (it only matches exact values).
I've not been able to think of a great way of doing this, but given that the substring you want to match between the two files is always the first thing in the 'messy' string, and if you want to do this in OpenRefine, I can see an approach that might work: create a 'match' column in each project for the cross-matching.
In the 'clean' project use 'Add column based on this column' on the 'Country name' column, and in the GREL transform use:
value.fingerprint()
The 'fingerprint' transformation is the same as the one used when clustering with key collision/fingerprint; basically I'm just using it here to get rid of any minor differences between country names (like upper/lower case or special characters).
Then in the 'messy' project, create a new column based on the dirty 'name of country' column, again using 'Add column based on this column', but in this case use a GREL transform something like:
value.split(/[\s,-\.\(\)]/)[0].fingerprint()
The first part of this, value.split(/[\s,-\.\(\)]/), splits the string into individual words (using whitespace, comma, hyphen, full stop, or open/close brackets as separators). Then the [0] takes the first string (so the first word in the cell), and then the fingerprint algorithm is applied again.
Now you have columns in each of the projects which should match on exact cell content. You can use these to do the lookup between the two projects.
This isn't going to be completely ideal; for example, it won't work for country names which consist of multiple words. However, you could add additional key columns to the 'messy' project which use the first 2, 3, 4, etc. words rather than just the first one as given here.
e.g.
filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,2).join(" ").fingerprint()
filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,3).join(" ").fingerprint()
etc. (I've done a bit more work here to make sure blank entries are ignored - it's the get() command that's the key bit for getting the different numbers of words).
I'm guessing that most country names are only a few words long, so only a few extra columns would be needed.
I've not been able to come up with a better approach so far. I'll post some more here if I come up with anything else. You could also try asking on the OpenRefine forum https://groups.google.com/forum/#!forum/openrefine

How to handle a "keyword search" via Stored Procedure?

I'm creating a self-help FAQ type application and one of the requirements is that the end user has to be able to search for FAQ topics. I have three models of note, listed below with their relevant (i.e. searchable) columns:
Topic: Name, Description
Question: Name, Answer
Problem: Name, Solution
All three tables are linked to Topic via a TopicID column. The idea is to provide a single textbox where the user can enter a search query, either as a sentence (e.g. "How do I perform X") or a phrase (e.g. "Performing X" or "Perform X"), and return all Topics/Questions/Problems that have any of the entered words in either the name or the description/answer/solution fields. The model will only ever have those columns searchable, and I don't have to worry about filtering out common words like "How" (it would be nice, but isn't a requirement, as it's a fuzzy match rather than an exact one).
For reasons outside of my control, I have to use a stored procedure. My question is what would be the most appropriate way to handle a search like this. I've seen similar questions regarding multiple columns, but here the number of columns isn't really variable: there are always two searchable columns per table. The issue is that the search query could, in theory, be nearly anything (a sentence, a phrase, a comma-separated list of terms like "x,y,z"), so I would have to split the search term into components (e.g. split on whitespace) and then search each pair of columns for every term? Is that reasonably easy to do in SQL Server? The alternative, a little messier, is to just pull all the data back, split the query, and filter the results in the server-side code, as there shouldn't ever be that many items entered, but I would feel a little dirty doing something like that ;-)
I suggest creating a new full-text catalog and assigning the table and columns to that catalog. Ensure your catalog is updated at the right frequency for your needs.
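A sketch of the setup, using the Problem table from the question (PK_Problem is a hypothetical name; full-text indexing requires a unique, single-column, non-nullable index on the table):

-- Create the catalog, then index the two searchable columns of Problem.
CREATE FULLTEXT CATALOG FaqCatalog;

CREATE FULLTEXT INDEX ON Problem (Name, Solution)
    KEY INDEX PK_Problem ON FaqCatalog
    WITH CHANGE_TRACKING AUTO;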
You can then query this catalog using the FREETEXT predicate. It sounds like you need to match suffixes like 'ing', so I suggest FREETEXT over CONTAINS in this case.
You can use a variable in this search, so it'll be easy to fit into a stored proc.
declare @token varchar(256);
select @token = 'perform';
select * from Problem
where freetext(Name, @token)
or freetext(Solution, @token);
-- this will match 'perform' and 'performing'

Tag suggestion (not tag autocomplete)

AJAX autocomplete is fairly simple to implement. However, I wonder how to handle smart tag suggestion like this on SO.
To clarify the difference between autocomplete and suggestion:
autocomplete: foo [foobar, foobaz]
suggestion: foo [barfoo, foobar, foobaz], or even better, with 'did you mean' feature: [barfoo, foobar, foobaz, fobar, fobaz]
I suppose I need some full-text search in tags (all letters indexed, not just words). It would be no problem to do it with regex or other patterns for a limited number of tags (even client-side).
But how do you implement this feature for a big number of tags?
Is there any particular reason (besides URLs) that the tags on SO are dash-separated? What about Unicode characters in tags?
I store the tags in the table with the following columns: id, tagname.
My SQL query returns objects with following fields: id, tagname, count
(I use Doctrine ORM and pgsql as default db driver.)
I would go with SELECTing them from the database with a REGEXP at every keypress. I did this on my sites and there was no performance problem (I do not have a heavily loaded server, though). If you do not like this idea, I would cache all the 1-5 letter combinations that users enter and refresh them on a daily basis in a separate table. If this table is indexed, then you have a very fast implementation.
To elaborate more on the second approach:
Briefly:
1. Make a table SEARCHTABLE representing a 1-n relationship between keywords (limit them to 3-4 letters) and the primary IDs of tags.
2. INDEX both fields.
3. Every time the user makes a search, look at SEARCHTABLE; if the combination is there, use it - very fast, as everything is indexed. If not, do the regexp search and put all results into SEARCHTABLE (a sketch follows after the notes).
Notes:
You should invalidate the table when you add tags, but this should happen much less often than a search. When invalidating the table you do not necessarily TRUNCATE it; you can easily rebuild it taking all keywords into account.
If you want to speed it up, you can "pregenerate" all two- or even three-letter searches.
If you care enough, you should use the information from the (n-1)-letter keywords to generate the n-letter keywords. It speeds things up tremendously. Imagine that the user has typed "mo" and you have shown them the appropriate results from SEARCHTABLE. Then, when she types "n", giving "mon", you only need to search through the already-selected items to generate the new response.
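A minimal sketch of SEARCHTABLE in SQL (the question mentions pgsql, so Postgres syntax; the table and column names beyond tags/id/tagname are hypothetical):

-- Cache table: each short keyword maps to the tags it matched.
CREATE TABLE searchtable (
    keyword varchar(4) NOT NULL,
    tag_id  integer    NOT NULL REFERENCES tags (id),
    PRIMARY KEY (keyword, tag_id)  -- the PK index also serves keyword lookups
);

-- Cache hit: look the typed prefix up directly.
SELECT t.id, t.tagname
FROM searchtable s
JOIN tags t ON t.id = s.tag_id
WHERE s.keyword = 'mon';

-- Cache miss: run the slow pattern search once and fill the cache.
INSERT INTO searchtable (keyword, tag_id)
SELECT 'mon', id
FROM tags
WHERE tagname ILIKE '%mon%';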
I hope it is more comprehensible now.

Need pattern for dynamic search of multiple SQL tables

I'm looking for a pattern for performing a dynamic search on multiple tables.
I have no control over the legacy (and poorly designed) database table structure.
Consider a scenario similar to a resume search, where a user may want to perform a search against any of the data in the resume and get back a list of resumes that match their search criteria. Any field can be searched at any time, in combination with one or more other fields.
The actual SQL query gets created dynamically depending on which fields are searched. Most solutions I've found involve complicated if blocks, but I can't help thinking there must be a more elegant solution, since this must be a solved problem by now.
Yeah, so I've started down the path of dynamically building the SQL in code. It seems god-awful. If I really try to support the requested ability to query any combination of any field in any table, this is going to be one MASSIVE set of if statements. (shiver)
I believe I read that COALESCE only works if your data does not contain NULLs. Is that correct? If so, no go, since I have NULL values all over the place.
As far as I understand (and I'm also someone who has written against a horrible legacy database), there is no such thing as a dynamic WHERE clause. It has NOT been solved.
Personally, I prefer to generate my dynamic searches in code. It makes testing convenient. Note: when you create your SQL queries in code, don't concatenate in user input; use your @variables!
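As a sketch, the same idea can also be done in T-SQL itself with sp_executesql, which keeps the inputs parameterized (the Users table and the @name/@nick parameters are from the example that follows):

-- Append a condition only for the inputs that were actually supplied.
declare @sql nvarchar(max) = N'SELECT Name, Nickname FROM Users WHERE 1 = 1';
if @name is not null set @sql += N' AND Name = @name';
if @nick is not null set @sql += N' AND Nickname = @nick';

-- The values are passed as parameters, never concatenated in.
exec sp_executesql @sql,
    N'@name nvarchar(20), @nick nvarchar(10)',
    @name = @name, @nick = @nick;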
The only alternative is to use the COALESCE operator. Let's say you have the following table:
Users
-----------
Name nvarchar(20)
Nickname nvarchar(10)
and you want to search optionally for name or nickname. The following query will do this:
SELECT Name, Nickname
FROM Users
WHERE
Name = COALESCE(@name, Name) AND
Nickname = COALESCE(@nick, Nickname)
If you don't want to search for something, just pass in a null. For example, passing in "brian" for @name and null for @nick results in the following query being evaluated:
SELECT Name, Nickname
FROM Users
WHERE
Name = 'brian' AND
Nickname = Nickname
The COALESCE turns the null into an identity comparison (Nickname = Nickname), which is true for every non-NULL value and so doesn't restrict the where clause. Be aware, though, that rows where the column itself is NULL fail that comparison and are filtered out, which matters given the NULLs mentioned above. A NULL-safe variant is sketched below.
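For instance, a minimal NULL-safe version of the same optional-filter query:

SELECT Name, Nickname
FROM Users
WHERE
(@name IS NULL OR Name = @name) AND
(@nick IS NULL OR Nickname = @nick)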
Search and normalization can be at odds with each other. So probably the first thing to do would be to create some kind of "view" that shows all the searchable fields as a single row, with a single key getting you the resume. Then you can throw something like Lucene in front of that to give you a full-text index of those rows. The way that works is: you ask it for "x" in this view and it returns the key to you. It's a great solution and comes recommended by Joel himself on the podcast, within the first 2 months IIRC.
What you need is something like SphinxSearch (for MySQL) or Apache Lucene.
As you said in your example, let's imagine a resume that is composed of several fields:
Name,
Address,
Education (this could be a table of its own), or
Work experience (this could grow into its own table where each row represents a previous job).
So searching for a word in all those fields with WHERE rapidly becomes a very long query with several JOINs.
Instead, you could change your frame of reference and think of the whole resume as what it is: a single document, and you just want to search that document.
This is what tools like Sphinx Search do. They create a FULL TEXT index of your "document", and then you can query Sphinx and it will tell you where in the database that record was found.
Really good search results.
Don't worry about these tools not being part of your RDBMS; it will save you a lot of headaches to use the appropriate model ("documents") instead of the incorrect one ("tables") for this application.