Clickstream keyword dimension - SQL

I'm currently trying to determine how to build out a keyword dimension table. We're tracking visits to our website, and we would like to be able to find the most-used keywords used to reach the site via a search engine, as well as any search terms used during the visit on the site itself (price > $100, review > 4 stars, etc.). Since the keywords are completely dynamic and can be used in an infinite number of combinations, I'm having a hard time determining how to store them. I have a pageview fact table that includes a record every time a page is viewed. The source I'm pulling from includes all the search terms in a delimited list I am able to parse with a regular expression; I just don't know how to store it in the database, since the number of keywords can vary so widely from pageview to pageview. I'm thinking this may be more suited to a NoSQL solution than trying to cram it into an MSSQL table, but I don't know. Any help is greatly appreciated!

Depending on how you want to analyze the data, there are a few solutions.
But for the amount of data you are probably analyzing, I'd just create a table that uses the PK of the fact table to store each keyword:
FACT_PAGEVIEW_ID bigint -- Surrogate key of the fact table, or the natural key if you don't have a surrogate.
KEYWORD varchar(255) -- or whatever the maximum keyword length is
VALUE varchar(255)
The granularity of this table is one row per ID/keyword combination. You may need to include VALUE in the grain as well if you allow the same keyword to appear multiple times in a querystring.
This allows you to group the keywords by pageview, or start with the pageview fact, filter it, then join to this to identify keywords.
The other option would be a keyword dimension and a many-to-many bridge table with a "keyword group", but since any number of combinations can be used, the simple table is probably the quicker way and will likely get you 90% of the way there. Most questions, like "what combination of keywords is used most frequently" and "what keywords are most used by the top 10% of the user base", can be answered with this structure.
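
As a minimal sketch (the table name and the "uses" alias are hypothetical; the columns are the ones listed above), the structure and a sample "most frequent keywords" query might look like this:

CREATE TABLE PAGEVIEW_KEYWORD (
    FACT_PAGEVIEW_ID bigint       NOT NULL,  -- FK to the pageview fact table
    KEYWORD          varchar(255) NOT NULL,
    VALUE            varchar(255) NULL
);

-- Most frequently used keywords across all pageviews:
SELECT KEYWORD, COUNT(*) AS uses
FROM PAGEVIEW_KEYWORD
GROUP BY KEYWORD
ORDER BY uses DESC;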

Related

Searching efficiently with keywords

I'm working with a big table (millions of rows) in a PostgreSQL database. Each row has a name column, and I would like to perform a search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked at full-text search, but I believe it is more intended for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution. (It is documented in the PostgreSQL manual.)
It does allow you to search for any complete words. It allows you to search for prefixes of words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
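A minimal sketch of such a prefix search (assuming a movies table with a name column; the 'simple' configuration avoids language-specific stemming, and the index name is made up):

SELECT *
FROM movies
WHERE to_tsvector('simple', name) @@ to_tsquery('simple', 'Dja:*');

-- A GIN index on the same expression makes this fast:
CREATE INDEX idx_movies_name_fts
    ON movies USING gin (to_tsvector('simple', name));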
If this doesn't meet your needs and you need the capabilities of LIKE, I would suggest the following: put the title into a separate table that basically has two columns, an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure whether Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
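For what it's worth, Postgres does ship a trigram (3-gram) contrib extension, pg_trgm; a minimal sketch (using the same hypothetical mytable/name as the LIKE example below) could be:

CREATE EXTENSION pg_trgm;

-- A trigram index lets Postgres use an index even for
-- leading-wildcard LIKE/ILIKE patterns:
CREATE INDEX idx_mytable_name_trgm
    ON mytable USING gin (name gin_trgm_ops);

SELECT * FROM mytable WHERE name ILIKE '%Unchai%';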
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, if you have millions of rows it will be slow, because a leading wildcard prevents any significant use of indexes.
You might want to experiment with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column: LIKE 'Unchai%') and then adding middle-of-the-string hits in a second, non-indexed pass. Humans tend to be significantly slower than computers at interpreting strings, so the user may not suffer.
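A minimal sketch of that two-pass idea (same hypothetical mytable/name as above):

-- Pass 1: indexed prefix matches.
SELECT name FROM mytable WHERE name LIKE 'Unchai%'
UNION ALL
-- Pass 2: middle-of-string matches not already found.
SELECT name FROM mytable
WHERE name LIKE '%Unchai%' AND name NOT LIKE 'Unchai%';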
This question is very much related to autocomplete in forms. You will find several threads on that.
Basically, you will need a special kind of index: a space-partitioning tree. Postgres supports such index structures through its SP-GiST index type. You will find a bunch of useful stuff if you google for that.

Building a MySQL database that can take an infinite number of fields

I am building a MySQL-driven website that will analyze customer surveys distributed by a variety of clients. Generally, these surveys are structured fairly consistently, and most of our clients' data can be reduced to the same normalized database structure.
However, every client inevitably ends up including highly specific demographic questions for their customers that are irrelevant to every other one of our clients. For instance, although all of our clients will ask about customer satisfaction, only our auto clients will ask whether the customers know how to drive manual transmissions.
Up to now, I have been adding columns to a respondents table for all general demographic information, with a lot of default NULLs mixed in. However, as we add more clients, it's clear that this will end up with a massive number of columns which are almost always NULL.
Is there a way to do this consistently? I would rather keep as much of the standardized data as possible in the respondents table, since our import script is already written for that table. One thought of mine is to build a respondent_supplemental_demographic_info table that has the columns response_id, demographic_field, demographic_value (so the manual transmissions example might become: 'ID999', 'can_drive_manual_indicator', true). This could hold an infinite number of demographic_fields, but would be incredibly painful to work with from both a processing and a programming perspective. Any ideas?
Your solution to this problem is called entity-attribute-value (EAV). This "unpivots" columns so they are rows in a table and then you tie them together into a single view.
EAV structures are a bit tricky to learn how to deal with. They require many more joins or aggregations to get a single view out. Also, the types of the values become challenging. Generally there is one value column, so everything is stored as a string. You can, of course, have a type column with different types.
They also take up more space, because the entity id is repeated on each row (I think that is the response_id in your case).
Although not ideal in all situations, they are appropriate in a situation such as you describe. You are adding attributes indefinitely. You would quickly run up against the maximum number of columns allowed in a single table (typically between 1,000 and 4,000, depending on the database). You can also keep track of each value in each column separately; if they are added at different times, for instance, you can keep a timestamp on when they go in.
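A minimal sketch of the EAV table the question proposes, plus the kind of pivot needed to get a single view back out (names are taken from the question; the respondents table and the column widths are assumptions):

CREATE TABLE respondent_supplemental_demographic_info (
    response_id        VARCHAR(20)  NOT NULL,
    demographic_field  VARCHAR(64)  NOT NULL,
    demographic_value  VARCHAR(255) NULL,
    PRIMARY KEY (response_id, demographic_field)
);

-- Pivot one attribute back into a column:
SELECT r.response_id,
       MAX(CASE WHEN d.demographic_field = 'can_drive_manual_indicator'
                THEN d.demographic_value END) AS can_drive_manual
FROM respondents r
LEFT JOIN respondent_supplemental_demographic_info d
       ON d.response_id = r.response_id
GROUP BY r.response_id;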
Another alternative is to maintain a separate table for each client, and then use some other process to combine the data into a common data structure.
Do not fall for a table with key-value pairs (field id, field value) as that is inefficient.
In your case I would create a table per customer, and metadata tables (in a separate DB) describing these tables. With the metadata you can generate SQL, etcetera. That is definitely superior to having many NULL columns, or to copied, adapted scripts. It requires a bit of programming, where an application uses the metadata to generate SQL, collect the data (without customer-specific semantic knowledge), and generate reports.

How to efficiently build and store semantic graph?

Surfing the net I ran into Aquabrowser (no need to click, I'll post a pic of the relevant part).
It has a nice way of presenting search results and discovering semantically linked entities.
Here is a screenshot taken from one of the demos.
On the left side you have the word you typed and related words.
Clicking them refines your results.
Now as an example project I have a data set of film entities and subjects (like world-war-2 or prison-escape) and their relations.
Now I imagine several use cases, first where a user starts with a keyword.
For example "world war 2".
Then I would somehow like to calculate related keywords and rank them.
I'm thinking about an SQL query like this (let's assume "world war 2" has id 3):
SELECT keywordId, COUNT(keywordId) AS total
FROM keywordRelations
WHERE movieId IN (SELECT movieId
                  FROM keywordRelations
                  JOIN movies USING (movieId)
                  WHERE keywordId = 3)
GROUP BY keywordId
ORDER BY total DESC
This basically selects all movies which also have the keyword world-war-2, then looks up the keywords those films have as well, and selects the ones which occur the most.
I think with these keywords I can select the movies which match best and build a nice tag cloud containing similar movies and related keywords.
I think this should work, but it's very, very, very inefficient.
And it's also only one level of relation.
There must be a better way to do this, but how?
I basically have a collection of entities. They could be different entities (movies, actors, subjects, plot-keywords), etc.
I also have relations between them.
It must somehow be possible to efficiently calculate a "semantic distance" between entities.
I would also like to implement more levels of relation.
But I am totally stuck. I have tried different approaches, but everything ends up in algorithms that take ages to calculate, with runtime growing exponentially.
Are there any database systems available optimized for that?
Can someone point me in the right direction?
You probably want an RDF triplestore. Redland is a pretty commonly used one, but it really depends on your needs. Queries are done in SPARQL, not SQL. Also... you have to drink the semantic web koolaid.
From your tags I see you're more familiar with SQL, and I think it's still possible to use it effectively for your task.
I have an application with a custom-made full-text search implemented using SQLite as the database. In the search field I can enter terms, and a popup list shows suggestions for the word; for each following word, only those words are shown that appear in the articles where the previously entered words appeared. So it's similar to the task you described.
To keep things simple, let's assume we have only three tables. I suppose you have a different schema and even the details may differ, but my explanation is just to give the idea.
Words
[Id, Word]
This table contains the words (keywords).
Index
[Id, WordId, ArticleId]
This table (indexed also by WordId) lists the articles where each term appeared.
ArticleRanges
[ArticleId, IndexIdFrom, IndexIdTo]
This table lists the range of Index.Id values for any given article (obviously also indexed by ArticleId). It requires that for any new or updated article the Index table contains entries within a known from-to range. I suppose this can be achieved in any RDBMS with a little help from an autoincrement feature.
So for any given string of words you:
1. Intersect the sets of articles where all the previously entered words appeared. This narrows the search: SELECT ArticleId FROM Index WHERE WordId = ... INTERSECT ...
2. For that list of articles, get the ranges of records from the ArticleRanges table.
3. For those ranges, efficiently query the WordId lists from Index, grouping the results to get a count and finally sorting by it.
Although I listed them as separate actions, the final query can be just one big SQL statement built from the parsed query string.
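A sketch of what that combined query could look like in SQLite (table and column names from the schema above; the two WordId values stand in for the parsed search terms, and Index is quoted because it is a reserved word):

SELECT i.WordId, COUNT(*) AS total
FROM "Index" i
JOIN ArticleRanges r
  ON i.Id BETWEEN r.IndexIdFrom AND r.IndexIdTo
WHERE r.ArticleId IN (
    SELECT ArticleId FROM "Index" WHERE WordId = 1
    INTERSECT
    SELECT ArticleId FROM "Index" WHERId = 2
)
GROUP BY i.WordId
ORDER BY total DESC;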

Sphinx question: Structuring database

I'm developing a job service that has features like radial search, full-text search, and the ability to do a full-text search while disabling certain job listings (such as un-checking a checkbox so full-time jobs are no longer returned).
The developer who is working on Sphinx wants the database information to all be stored as integers with a key (so in the table "Job Type" values might be stored such that 1 = "part-time" and 2 = "full-time")... whereas the other developers want to keep the database as strings (so in the table "Job Type" it says "part-time" or "full-time").
Is there a reason to keep the database as ints? Or should strings be fine?
Thanks!
Walker
Choosing your key can have a dramatic performance impact. Whenever possible, use ints instead of strings. This is called using a "surrogate key", where the key provides a unique and quick way to find the data, rather than the data standing on its own.
String comparisons are resource-intensive, potentially orders of magnitude worse than comparing numbers.
You can drive your UI off the surrogate key but show another column (such as job_type). This way, when you hit the database you pass the int in, and avoid scanning the table for a row with a matching string.
When it comes to joining tables in the database, the joins will run much faster if you have ints or other numbers as your primary keys.
Edit: In the specific case you have mentioned, if you only have two options for what your field may be, and it's unlikely to change, you may want to look into something like a bit field, and you could name it IsFullTime. A bit or boolean field holds a 1 or a 0, and nothing else, and typically isn't related to another field.
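A minimal sketch of the lookup-table approach (all names here are hypothetical, just to make the int-key idea concrete):

CREATE TABLE job_type (
    job_type_id INT PRIMARY KEY,      -- surrogate key passed around by the app
    name        VARCHAR(50) NOT NULL  -- display string, e.g. 'part-time'
);

INSERT INTO job_type (job_type_id, name)
VALUES (1, 'part-time'), (2, 'full-time');

CREATE TABLE job (
    job_id      INT PRIMARY KEY,
    title       VARCHAR(255) NOT NULL,
    job_type_id INT NOT NULL REFERENCES job_type (job_type_id)
);

-- Joins and filters compare small ints rather than strings:
SELECT j.title, t.name
FROM job j
JOIN job_type t ON t.job_type_id = j.job_type_id
WHERE t.job_type_id = 2;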
If you are normalizing your structure (I hope you are), then numeric keys will be most efficient.
Aside from the usual reasons to use integer primary keys, the use of integers with Sphinx is essential, as the result set returned by a successful Sphinx search is a list of document IDs associated with the matched items. These IDs are then used to extract the relevant data from the database. Sphinx does not return rows from the database directly.
For more details, see the Sphinx manual, especially 3.5. Restrictions on the source data.

Add record only if it doesn't exist in Access 2007

I have a table, let's say it stores staff names (this isn't the exact situation, but the concept is the same), and the primary key is set to an autonumber. When adding a new member of staff I want to check whether that name already exists in the database before it is added, and maybe give an error if it already exists. How can I do this from the normal add form for the table?
I tried creating a query for it, but that won't work because the form is based on the table and can't use a query as the control source. I saw some examples online showing how to do something like this with VB code, but I couldn't get it to work, as it wasn't a simple example and some lines were left out.
Is there any simple way in which this can be done?
In the table design view, you could make the Name column Indexed, with No Duplicates.
Then Access itself will reject the entry. I think it will, however, use up one of the autonumbers before rejecting the input.
You're dealing with the issue of pre-qualifying records before inserting them into the database. Simple and absolute rules that you will never violate (like never, ever, ever allowing two records with the same name) can be dealt with through database constraints, in this case creating an index on the column in question with AllowDuplicates set to No.
However, in the real world pre-qualification is generally more complex. You may need to simply warn the user of a possible duplicate, but allow them to add the record anyway. And you may need to check other tables for certain conditions or to collect information for more than one table at a time.
In these cases you need to write your interface so it is not directly bound to the table (in Access terms, create a form with an empty record source), collect the information in various controls, perform your checks in code (often using DCount and DLookup), and then issue a series of INSERT and UPDATE statements in code using DoCmd.RunSQL.
You can occasionally use some tricks to get around having to code this in the front end, but sooner rather than later you'll encounter cases that require this level of coding.
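As a rough sketch of the idea in plain SQL (the Staff table and its columns are hypothetical; in Access the count check would typically be a DCount call in VBA and the INSERT run via DoCmd.RunSQL):

-- 1. Pre-qualify: is the name already present?
SELECT COUNT(*) AS matches
FROM Staff
WHERE StaffName = 'John Smith';

-- 2. Only if the count is 0 (or the user confirms), insert:
INSERT INTO Staff (StaffName)
VALUES ('John Smith');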
I'll put my vote in on the side of using an unbound form to collect the required fields and presenting possible duplicates. Here's an example from a recent app:
[Screenshot of the duplicate-search form omitted; source: dfenton.com]
(I edited real people's names out and put in fake stuff, and my graphics program's anti-aliasing is different from ClearType's, hence the weirdness)
The idea here is that the user puts data in any of the four fields (no requirement for all of them) and clicks the ADD button. The first time, it populates the possible matches. Then the user has to decide whether one of the matches is the intended person or not, and either click ADD again (to add it, even if it's a duplicate), or click the button at the bottom to go to the selected customer.
The colored indicators are intended to convey how close the match is. In this case, the email address put in is an exact match for the first person listed, and exact match on email by itself is considered an exact match. Also, in this particular app, the client wants to minimize having multiple people entered at the same company (it's the nature of their business), so an exact match on Organization is counted as a partial match.
In addition to that, there's matching using Soundex, Soundex2 and Simil, along with substrings and substrings combined with Soundex/Soundex2/Simil. In this case, the second entry is the duplicate, but Soundex and Soundex2 don't catch it, while Simil returns 67% similarity, and I've set the sensitivity to greater than 50%, so "Wightman" shows up as a close match for "Whiteman". Last of all, I'm not sure why the last two are in the list, but there's obviously some reason for it (probably Simil and initials).
I run the names, the company and the email through scoring routines and then use the combination to calculate a final score. I store Soundex and Soundex2 values in each person record. Simil, of course, has to be calculated on the fly, but it works out OK because the Jet/ACE query optimizer knows to restrict on the other fields and thus calls Simil for a much-reduced data set (this is actually the first app I've used Simil for, and it is working great so far).
It takes a bit of a pause to load the possible matches, but is not inordinately slow (the app this version is taken from has about 8K existing records that are being tested against). I created this design for an app that had 250K records in the person table, and it worked just fine when the back end was still Jet, and still works just great after the back end was upsized to SQL Server several years ago.
The best solution is to get the user to enter a few characters of the first and last name and show a continuous form of all the individuals matching those search criteria. Also display relevant information such as middle name (if available), phone number, and address to weed out potential duplicates. Then, if a duplicate isn't found, they can add the person.
There will always be two John Smiths or Jane Joneses in every town.
I read of a situation where two women with identical first, middle, and last names and birth dates were in a hospital at the same time. Truly scary, that.
Thanks for all the info here. It is great info and I will use it, as I am self-taught.
The easiest way I found, as alluded to here, is to use an index on all the fields I want (with No Duplicates). The trick is to use a multiple-field index (this basically gives a compound index, or a "virtual" index made up of more than one field).
The method can be found here: http://en.allexperts.com/q/Using-MS-Access-1440/Creating-Unique-Value-check.htm, but I would repeat it in the event that the link gets removed.
From Access Help:
Prevent duplicate values from being entered into a combination of fields
Create a multiple-field index using the fields you want to prohibit duplicate values for. Leave the Indexes window open when you have finished defining the index.
How?
1. Open the table in Design view.
2. Click Indexes on the toolbar.
3. In the first blank row in the Index Name column, type a name for the index. You can name the index after one of the index fields, or use another name.
4. In the Field Name column, click the arrow and select the first field for the index.
5. In the next row in the Field Name column, select the second field for the index. (Leave the Index Name column blank in that row.) Repeat this step until you have selected all the fields you want to include in this index. The default sort order is Ascending; select Descending in the Sort Order column of the Indexes window to sort the corresponding field's data in descending order.
6. In the upper portion of the Indexes window, click the new index name.
7. In the lower portion of the Indexes window, click the Unique property box, and then click Yes.
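
For reference, the same multiple-field unique index can also be created with DDL (table and field names here are hypothetical):

CREATE UNIQUE INDEX uq_staff_fullname
ON Staff (FirstName, LastName);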
You should now be unable to enter records which have the same values in the indexed fields. I ran into problems where I could still enter values if one of the indexed fields contained a space (there is an option to check/ignore null values when you set up the index, but it didn't work for me); the solution still worked for my case because I won't have null values anyway.
This is called an 'UPSERT' in the SQL world. The ISO/ANSI SQL Standard defines a MERGE syntax (SQL:2003), which was recently added to SQL Server 2008 with proprietary extensions to the Standard. Happily, this is the way the SQL world is going, following the trail blazed by MySQL.
Sadly, the Access Database Engine is an entirely different story. Even the simple UPDATE does not support the SQL-92 scalar subquery syntax; instead it has its own proprietary syntax with arbitrary (unpredictable? certainly undocumented) results. The Windows team scuppered the SQL Server team's attempts to fix this in Jet 4.0. Even now that the Access team has its own format for ACE, they seem uninterested in making changes to the SQL syntax. So the chances of the product embracing a Standard SQL construct -- or even their own alternative -- are very remote :(
One obvious workaround, assuming performance is not an issue (as always, this needs to be tested), is to do the INSERT, ignoring any key-failure errors, then immediately afterwards do the UPDATE. Even if you are of the (IMO highly dubious) 'autonumber PK on every table' persuasion, you should have a unique key on your natural key, so all should be fine.
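
A minimal sketch of that workaround (continuing the hypothetical Staff table from above, with a unique index on StaffName):

-- Attempt the INSERT; if StaffName already exists, the unique
-- index makes this fail with a key violation, which the calling
-- code ignores:
INSERT INTO Staff (StaffName, Phone)
VALUES ('John Smith', '555-0100');

-- Then unconditionally UPDATE; this refreshes the data for a
-- pre-existing row and is a harmless no-op for a brand-new one:
UPDATE Staff
SET Phone = '555-0100'
WHERE StaffName = 'John Smith';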