SQL - Referencing a table at a specific position in content? - sql

I am not sure if this is possible, but I'm also not sure if I am using the correct terminology here, so forgive and correct me if I don't. Also, this question is more about database design more generally.
Say I have something like Article:
Title: Stem cells are soon being used for stuff
Text: "Here is the content for an article about stuff. Here is some more info on stemm cells and stuff. [To the uninitiated, here comes an Info Box on Stem Cells in general, you can expand it!] Now some more text about stem cells and stuff"
In my app I would like to display the article, and then at an exact position (here after sentence no. 2, but this will vary from article to article) insert an info-box on stem cells, which is in its own SQL table.
I know that the idea of SQL in general is that I reference InfoBox in my Article and simply point to it. That would be the relation between article and infoBox.
But how do I specify that infoBox should come exactly after Sentence No. 2? (as in the example). And this will not always be the case. Sometimes there might be no infoBox for an article or multiple, sometimes it will come after sentence no. 25 or 100 -etc.
I don't want to mix relations/fields and content, but I don't understand how I would realise something like this in SQL.

A table (base table or query result) holds rows of values that participate in a relation(ship)/association. Those are the rows that make the table's associated (characteristic) predicate (sentence template parameterized by column names) into a true proposition (statement).
You seem to want a base table for triples where
infoBox [i] should come exactly after Sentence No. [n] of article [a]
PS Time to read a published academic textbook on information modeling & database design. (Manuals for languages & tools to record & use designs are not textbooks on doing information modeling & database design.)

Related

SQL Server - Creating a "Search library" of terms to use in a query

Firstly I apologise in advance if this question is a bit bare bones or has misleading/confusing terminology but I'm not sure how else to phrase it.
I have a few tables which capture the language of interactions based on a few different factors. What I would like to do is set up a sort of temporary library of language based terms that I can reference in a query so that I can search the various tables and find matches against the terms stored in the library.
I'll try and give an example:
The library might consist of the following terms:
English, German, French, Italian, Spanish
I then want to search these tables:
teacherSpokenLanguages, courseLanguages, studentLanguages
And find all the rows that contain the search terms in any particular field (and specify which field that term is being found).
I hope there's enough information to piece together my request. Is this even remotely possible? Could I create a temporary table to contain these values perhaps? I can't do anything permanent on the database, it all has to be housed within this one query and has to be non-destructive.

How to handle a "keyword search" via Stored Procedure?

I'm creating a self-help FAQ type application and one of the requirements is that the end user has to be able to search for FAQ topics. I have three models of note, listed below with their relevant (i.e. searchable) columns:
Topic: Name, Description
Question: Name, Answer
Problem: Name, Solution
All three tables are linked to Topic via a TopicID column. The idea is to provide a single textbox where the user can enter a search query, something either as a sentence (e.g. "How do I perform X") or a phrase (e.g. "Performing X" or "Perform X"), and provide all Topics/Questions/Problems that have any of the words they entered in either the name or description/answer/solution fields; the model will only ever have those columns searchable and I don't have to worry about filtering out the common words like "How" and such (It would be nice but isn't a requirement as it's not an exact match but a fuzzy match).
For reasons outside of my control, I have to use a Stored Procedure. My question is what would be the most appropriate way to handle a search like this; I've seen similar questions regarding multiple columns but really there is not a variable number of columns, there are always two columns per table that are actually searchable. The issue is that the search query could, in theory, be nearly anything - a sentence, a phrase, a comma-separated list of terms (e.g. "x,y,z"), so I would have to split the search term into components (e.g. split on whitespace) and then search each pair of columns for every term? Is that reasonably easy to do in SQL Server? The alternative, a little messier, is to just pull all the data back and then split the query and filter the results in the server-side code as there shouldn't ever be that many items entered, but I would feel a little dirty doing something like that ;-)
Suggest creating a new Full Text Catalog, and assign the table and columns to that catalog. Ensure your catalog is being updated at the right frequency for your needs.
You can then query this catalog using the FREETEXT predicate. It sounds like you need to match on those suffixes like 'ing', so suggest FREETEXT over CONTAINS in this case.
You can use a variable in this search, so it'll be easy to fit into a stored proc.
declare #token varchar(256);
select #token = 'perform';
select * from Problem
where freetext(Name, #token)
or freetext(Solution, #token);
--this will match 'perform' and 'performing'

Pattern for searching entire DB record, not specific field

More and more, I'm seeing searches that not only find a substring in a specific column, but they appear to search in all columns. An example is in Amazon, where you can search for "Arnold" and it finds both the movie Running Man starring Arnold Schwarzeneggar, and the Gund toy Arnold the Snoring Pig. I don't know what the term is for this type of search (Wide search? Global search?), and that bugs me. But what I really want to know is what is the normal pattern for accomplishing this type of search in a QUICK way.
The obvious, and slow, way to do it would be to search for the substring "Arnold" in the title, "Arnold" in the author, "Arnold" in the description, etc.
The first quick solution that comes to mind is to store a mapping for each word used to describe a product to the product itself, and then search that word mapping. That could be quick, but doesn't seem very space-efficient to me.
There are probably a hundred ways to accomplish this, some of which probably don't even use a database. But what is the norm?
I've done this in the past by storing an XML version of items in an XML column in the table, then searching in that column instead of the others.
Maybe they're not storing the data the way you expect.
They could, for example, store all titles, authors, descriptions, and every other searchable field in one table with an attribute to distinguish the field's type.

Searching with words one character long (MySQL)

I have a table Books in my MySQL database which has the columns Title (varchar(255)) and Edition (varchar(20)). Example values for these are "Introduction to Microeconomics" and "4".
I want to let users search for Books based on Title and Edition. So, for example they could enter "Microeconomics 4" and it would get the proper result. My question is how I should set this up on the database side.
I've been told that FULLTEXT search is generally a good way to do things like this. However, because the edition is sometimes just a single character ("4"), full text search would have to be setup to look at individual characters (ft_min_word_len = 1).. This, I've heard, is very inefficient.
So, how should I setup searches of this database?
UPDATE: I'm aware the CONCAT/LIKE could be used here.. My question is whether it would be too slow. My Books database has hundreds of thousands of books and a lot of users are going to be searching it..
here are the steps for solution
1) read the search string from user.
2) make the string in to parts according to space(" ") between the words.
3) use following query for getting the result
SELECT * FROM books WHERE Title LIKE '%part[0]%' AND Edition LIKE '%part[1]%';
here part[0] and part[1] are separated words from the given word
the PHP code for the above could be
<?php
$string_array=explode(" ",$string); //$string is the value we are searching
$select_query="SELECT * FROM books WHERE Title LIKE '%".$string_array[0]."%' AND Edition LIKE '%".$string_array[1]."%';";
$result=mysql_fetch_array(mysql_query($select_query));
?>
for $string_array[0] it could be extended to get all the parts except last one which can be applied for the case "Introduction to Microeconomics 4"
For your application, where you're interested in just title and edition, I suspect that using a FULLTEXT index with MATCH/AGAINST and reducing the ft_min_word_len to 1 would not have that much impact performance-wise (if you were data was more verbose or user written content, then I might hesitate).
The easiest way to check is to change the value, REPAIR the table to account for the new ft_min_word_len and rebuild the index, and do some simple benchmarking.
Having said that, for your application, I might consider looking into Sphinx. It's definitely going to be magnitudes faster, and your content is relatively static, so a delay between re-indexing (Sphinx's main drawback IMO) isn't an issue. Plus, with careful usage of the wordforms and exceptions, you could map things like 4/four/fourth/IV all to the same token for improved searching.

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name heirarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exists, either in a database table, or as a thesaurus. Neither have an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
http://en.wikipedia.org/wiki/Hypocorism
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
id int primary key,
name text not null unique
);
create table hypocorisms (
id int references given_names(id),
name text not null,
primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have large amount of the data you want.
Are you using SQl Server 2005 Express with Advanced Services as to me it sounds you would benefit from the Full Text indexing and more specifically Contains and Containstable which you can use with specific instructions here is a link for the uses of Containstable:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
and here is the download link for SQL Server 2005 With Advanced Services:
http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en
Hope this helps,
Andrew
You can use the SQL Server Full Text Search and do an inflectional search.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
-Adam
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use a AJAX based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.