How to efficiently build and store a semantic graph? - sql

Surfing the net I ran into Aquabrowser (no need to click, I'll post a pic of the relevant part).
It has a nice way of presenting search results and discovering semantically linked entities.
Here is a screenshot taken from one of the demos.
On the left side you have the word you typed and related words.
Clicking them refines your results.
Now, as an example project, I have a data set of film entities and subjects (like world-war-2 or prison-escape) and their relations.
Now I imagine several use cases, first where a user starts with a keyword.
For example "world war 2".
Then I would somehow like to calculate related keywords and rank them.
I am thinking about an SQL query like this:
Let's assume "world war 2" has id 3.
SELECT keywordId, COUNT(keywordId) AS total
FROM keywordRelations
WHERE movieId IN (SELECT movieId
                  FROM keywordRelations
                  JOIN movies USING (movieId)
                  WHERE keywordId = 3)
GROUP BY keywordId
ORDER BY total DESC
which basically should select all movies that also have the keyword world-war-2, then look up the keywords those films have as well, and select the ones that occur most often.
I think with these keywords I can select the movies that match best and build a nice tag cloud of similar movies and related keywords.
I think this should work, but it's very, very, very inefficient.
And it's also only one level of relation.
There must be a better way to do this, but how?
I basically have a collection of entities. They could be different kinds of entities (movies, actors, subjects, plot-keywords), etc.
I also have relations between them.
It must somehow be possible to efficiently calculate "semantic distance" for entities.
I also would like to implement more levels of relation.
But I am totally stuck. I have tried different approaches, but everything ends up in algorithms that take ages to compute, with runtimes that grow exponentially.
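For illustration, multiple levels of relation could at least be expressed in plain SQL with a recursive CTE; this is only a sketch, assuming a hypothetical relations(fromId, toId) edge table and a hard depth cap (all names here are invented):

WITH RECURSIVE reachable(entityId, depth) AS (
    -- level 1: everything directly related to the seed keyword (id 3)
    SELECT toId, 1 FROM relations WHERE fromId = 3
    UNION
    -- level n+1: follow edges out of what has been reached so far
    SELECT r.toId, reachable.depth + 1
    FROM relations r
    JOIN reachable ON r.fromId = reachable.entityId
    WHERE reachable.depth < 3
)
SELECT entityId, MIN(depth) AS distance
FROM reachable
GROUP BY entityId
ORDER BY distance;

The depth cap keeps the recursion bounded even on a cyclic graph, and MIN(depth) gives a crude "semantic distance".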
Are there any database systems available optimized for that?
Can someone point me in the right direction?

You probably want an RDF triplestore. Redland is a pretty commonly used one, but it really depends on your needs. Queries are done in SPARQL, not SQL. Also... you have to drink the Semantic Web Kool-Aid.

From your tags I see you're more familiar with SQL, and I think it's still possible to use it effectively for your task.
I have an application with a custom-made full-text search that uses SQLite as the database. In the search field I can enter terms, and a popup list shows suggestions for the current word; for each subsequent word, only the words that appear in articles containing the previously entered words are shown. So it's similar to the task you described.
To keep things simple, let's assume we have only three tables. I suppose you have a different schema and the details may differ, but my explanation is just to give the idea.
Words
[Id, Word]
This table contains the words (keywords).
Index
[Id, WordId, ArticleId]
This table (also indexed by WordId) lists the articles in which each term appears.
ArticleRanges
[ArticleId, IndexIdFrom, IndexIdTo]
This table lists the range of Index.Id values for any given article (obviously also indexed by ArticleId). It requires that for any new or updated article the Index table contain entries within a known from-to range. I suppose this can be achieved in any RDBMS with a little help from the autoincrement feature.
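To illustrate that last point, the range for a freshly indexed article could be captured right after its Index rows are inserted; a sketch only ("Index" is quoted because INDEX is a reserved word, and :wordId/:articleId are assumed bind parameters):

-- one insert per word of the article; the autoincrement Ids
-- of a single batch form a contiguous range
INSERT INTO "Index" (WordId, ArticleId) VALUES (:wordId, :articleId);

-- record the resulting range for the article
INSERT INTO ArticleRanges (ArticleId, IndexIdFrom, IndexIdTo)
SELECT :articleId, MIN(Id), MAX(Id)
FROM "Index"
WHERE ArticleId = :articleId;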
So for any given string of words you:
Intersect the sets of articles in which all previous words appear; this narrows the search: SELECT ArticleId FROM "Index" WHERE WordId = ... INTERSECT ...
For that list of articles, get the record ranges from the ArticleRanges table.
Within those ranges, query the WordId lists from Index, grouping the results to get a count and finally sorting by it.
Although I listed them as separate actions, the final query can be just one big SQL statement built from the parsed query string, as sketched below.
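For instance, with two words already entered (their ids looked up in the Words table beforehand), such a combined query might look like this; it is only an illustration of the schema above:

SELECT i.WordId, COUNT(*) AS total
FROM ArticleRanges ar
JOIN "Index" i ON i.Id BETWEEN ar.IndexIdFrom AND ar.IndexIdTo
WHERE ar.ArticleId IN (
    SELECT ArticleId FROM "Index" WHERE WordId = 1   -- first entered word
    INTERSECT
    SELECT ArticleId FROM "Index" WHERE WordId = 2   -- second entered word
)
GROUP BY i.WordId
ORDER BY total DESC;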

Related

How do I use sqlite to search for one or more word prefixes

I want to allow users to search a sqlite-backed music database. The search should include both artist and song title, and should be word-based (searching for one or more word prefixes) and case-insensitive. The solution should only use plain SQL, as it needs to work with different language bindings.
For example: searching for "hea" should find both "Heart of Gold" and "Stairway to Heaven", but "day" won't find "Yesterday".
Combining search terms should find only entries that contain both word prefixes (e.g. "hea led" will only find "Stairway to Heaven" by Led Zeppelin).
So how do I build the database to allow efficient queries, and what should the (language-independent) queries look like?
I read about SQLite's LIKE optimization: I can perform efficient LIKE queries if there is an index on the column and I query for a prefix. As I want to find all words of both the title and the artist, I thought about a table with two columns: one for the words, and one for the corresponding id in the music table. When building the search index I would extract all words from the artist and title and create a new row in the query table for each.
What do you think about this approach? Will it work? It would lead to many duplicate entries, both for common words and for words from artist names. But I don't care much about disk-space efficiency if it leads to better performance.
When I search for a single word prefix, I could create a query like
SELECT id FROM query_table WHERE word LIKE 'prefix%'
But how can I generate a reliable and fast query when I search for e.g. 5 word prefixes?
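One plain-SQL possibility (untested, and assuming the two-column query_table described above) is to intersect one prefix query per term:

SELECT id FROM query_table WHERE word LIKE 'hea%'
INTERSECT
SELECT id FROM query_table WHERE word LIKE 'led%';
-- add one more INTERSECT block per additional prefix

Whether SQLite's LIKE optimization kicks in still depends on the index on the word column and the case-sensitivity settings mentioned in the question.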

Searching efficiently with keywords

I'm working with a big table (millions of rows) in a PostgreSQL database; each row has a name column, and I would like to perform searches on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDb search engine.
I've looked up full-text search, but I believe it is more intended for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of LIKE, I would suggest the following: put the title into a separate table that basically has two columns, an id and the title. The goal is to make scanning the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure whether Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
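To make the full-text suggestion concrete, a prefix-capable query in Postgres could look like the following sketch (mytable and name are assumed names, matching the LIKE example below):

-- index the tsvector expression so the search can use it
CREATE INDEX mytable_name_fts ON mytable
    USING gin (to_tsvector('simple', name));

-- ':*' requests a prefix match, so 'Dja:*' finds "Django"
SELECT *
FROM mytable
WHERE to_tsvector('simple', name) @@ to_tsquery('simple', 'Dja:*');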
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, when you have millions of rows it will be slow, because indexes offer no significant help with a leading-wildcard pattern.
You might want to dabble with multiple strategies, such as first retrieving records where the value of name starts with the search string (which can benefit from an index on the name column: LIKE 'Unchai%') and then adding middle-of-the-string hits in a second, non-indexed pass. Humans tend to be significantly slower than computers at interpreting strings, so the user may not suffer.
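A sketch of that two-pass idea, with the same assumed table as above:

-- pass 1: anchored prefix, can use an index on name
SELECT * FROM mytable WHERE name LIKE 'Unchai%'
UNION ALL
-- pass 2: mid-string hits only, excluding pass-1 rows; needs a scan
SELECT * FROM mytable
WHERE name LIKE '%Unchai%' AND name NOT LIKE 'Unchai%';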
This question is very much related to autocomplete in forms. You will find several threads about that.
Basically, you will need a special kind of index: a space-partitioning tree. Postgres supports such index structures through SP-GiST. You will find a bunch of useful material if you google for that.
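Creating such an index is a one-liner, though you should verify which SP-GiST operator classes your Postgres version offers for text; again the table and column names are only assumptions:

CREATE INDEX mytable_name_spgist ON mytable USING spgist (name);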

Should I worry about optimizing a large Solr field, with lots of duplicate terms?

I found an easy way to search through relational data in Solr, but I am not sure whether I should optimize it further.
Let me give you an example: say that we have a system where users organize books into personal collections. A book has a genre, e.g. "Drama", "Thriller", "Horror", etc. A user collection may, and in most cases does, contain books from different genres.
If I want to create a search, where users can search through collections by genre, I'd like to return the results which contain books most relevant to the genre query. What I did was a simple trick - I added a search field for the collection, named "genres", which is a concatenated string of the genres of all books in that collection. This string field is created at index time. It makes a lot of sense, because, if a collection contains 30 "Thriller" and 20 "Comedy" books, in a search for "Thriller" it will appear as a more relevant result than in a search for "Comedy".
As you can guess, however, the "genres" field ends up containing a lot of duplicate terms. Since it is only used behind the scenes and not displayed anywhere, this is not so much a data-integrity problem as an optimization problem, IMHO.
I am quite new to Solr. I am aware of how it works, and I assume that when the inverted index is built, each term gets associated with a simple frequency count. Technically, whether the "genres" field consists of 100 terms or 10000 terms, 9500 of which are "Thriller", it should not matter much for indexing and querying speed, right?
If I am wrong, is there a syntax where boosts can be given in the input text itself? Say, instead of 10000 terms, the "genres" field looked like:
"Thriller^8500 Comedy^125 Drama^12"
You should use the payloads feature of Solr, which allows boosting individual words in a text field.
For an example, check http://sujitpal.blogspot.ru/2011/01/payloads-with-solr.html
Regarding your approach: all will be fine as long as the stored, termPositions, and termOffsets field attributes are set to false.

Search for multiple words, across multiple models

I'm trying to create search functionality in a site, and I want the user to be able to search for multiple words, performing substring matching against criteria which exist in various models.
For the sake of this example, let's say I have the following models:
Employee
Company
Municipality
County
A county has multiple municipalities, which has multiple companies, which have multiple employees.
I want the search to be able to search against a combination of Employee.firstname, Employee.lastname, Company.name, Municipality.name and County.name, and I want the end result to be Employee instances.
For example a search for the string "joe tulsa" should return all Employees where both words can be found somewhere in the properties I named in the previous sentence. I'll get some false positives, but at least I should get every employee named "Joe" in Tulsa county.
I've tried a couple of approaches, but I'm not sure I'm going down the right path. I'm looking for a nice RoR-ish way of doing this, and I'm hoping someone with more RoR wisdom can help outline a proper solution.
What I have tried:
I'm not very experienced with this kind of search, but outside RoR I'd manually create an SQL statement joining all the tables together, with WHERE clauses for each separate search word covering the different tables (perhaps using a builder), then just execute the query, loop through the results, instantiate Employee objects manually, and add them to an array; roughly as sketched below.
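That hand-written statement might look roughly like the following; this is a sketch only, and every table and column name in it is guessed from the model names:

SELECT e.*
FROM employees e
JOIN companies c      ON c.id = e.company_id
JOIN municipalities m ON m.id = c.municipality_id
JOIN counties co      ON co.id = m.county_id
-- one OR-group per search word, spanning all the searchable columns
WHERE (e.firstname LIKE '%joe%' OR e.lastname LIKE '%joe%'
       OR c.name LIKE '%joe%' OR m.name LIKE '%joe%' OR co.name LIKE '%joe%')
  AND (e.firstname LIKE '%tulsa%' OR e.lastname LIKE '%tulsa%'
       OR c.name LIKE '%tulsa%' OR m.name LIKE '%tulsa%' OR co.name LIKE '%tulsa%');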
To solve this in RoR, I've been:
1) Dabbling with named scopes in what in my project corresponds to the Employee model, but I got stuck when I needed to join in tables two or more "steps" away (Municipality and County).
2) Created a view (called "search_view") joining all the tables together, to simplify the query. Then thought I'd use Employee.find_by_sql() on this table, which would yield me these nice Employee objects. I thought I'd use a builder to create the SQL, and it seemed that Arel was the thing to use, so I tried doing something like:
view = Arel::Table.new(:search_view)
But the resulting Arel::Table does not contain any columns, so it's not usable for building my query. At this point I'm a bit stuck, as I don't know how to get a working query builder.
I strongly recommend using a proper search engine for something like this, it will make life a lot easier for you. I had a similar problem and I thought "Boy, I bet setting up something like Sphinx means I have to read thousands of manuals and tutorials first". Well, that's not the case.
Thinking Sphinx is a Rails gem that makes it very easy to integrate Sphinx, and I recommend it. You don't need to have much experience at all to get started:
http://freelancing-god.github.com/ts/en/
I haven't tried other search engines, but I'm very satisfied with Sphinx. I managed to set up a relatively complex real-time search in less than a day.

which is faster, mysql database with one table or multiple tables?

On my website you can search 'ads' or 'classifieds'. There are different categories.
Would the searches be faster with multiple tables, one for each category, or wouldn't it matter?
We are talking about around 500 thousand ads.
If it won't slow down the search, please explain yourself so that I understand why it won't, because it seems like common sense that the more ads you have, the slower the search!
Thanks
Your question is a little unclear. I'm assuming this scenario:
table: ads

id  category  ad_text
--  --------  -----------
1   pets      sample text
2   family    sample ad
If you are making one search of ads, then searching multiple tables on each search is slower than searching one table.
HOWEVER, if you're proposing to break "ads" into multiple tables according to "category", leaving you with table names like
pets-ads
family-ads
programmer-ads
And, programmatically, you know you're looking for programmer-ads so you can just go search the programmer-ads table, then breaking them out is faster. Barely.
Breaking them out, though, has many drawbacks. You'll need:
some cute code to know which table to search
a new table each time you create a new category
to rename a table if you decide a category name is wrong
Given the limited info we have, I would strongly advise one table with a category column, and then go ahead and normalize the category out into its own table. Slap an index on that column. Databases are built to handle tons of rows of correctly organized data, so don't worry about that so much.
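In SQL terms, that advice amounts to something like the sketch below (MySQL flavor; all table and column names are invented for illustration):

CREATE TABLE categories (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

CREATE TABLE ads (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    category_id INT NOT NULL,
    ad_text     TEXT,
    -- the index that keeps per-category searches fast
    INDEX idx_ads_category (category_id),
    FOREIGN KEY (category_id) REFERENCES categories (id)
);

-- one indexed query then serves every category
SELECT * FROM ads WHERE category_id = 1;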
Obviously, it will be nominally faster to search a smaller table (one category) than a larger table. The larger table is probably still the correct design, however. Creating multiple identical tables will simply make the developer's and manager's lives miserable. Furthermore, certain kind of searches are more difficult if you segment the data (for instance, searches across two categories).
Properly indexed, the single-table approach will yield results almost as good as the segmented approach while providing the benefits of proper design.
(Of course, when you say "single table", I assume that you mean a single table to hold the core attributes of the Advertisement entities. Presumably there will be other tables as well.)
It depends.
If you've built a single denormalised table containing text, it'll get progressively slower for a number of reasons. Indexes help to a certain point.
If you have a normalised structure with multiple tables, primary and foreign keys, indexes, etc., it can be more robust and scalable.
A database is very well equipped to deal with 500k ads. Add an index on the category, and you should be fine.
If you add the table definition and the distribution of categories to your question, you'd probably get a better answer :)