Is there a way to compare the similarity between sentences in sql? - sql

Is there a way to compare the similarity between sentences in sql? I have large dataset and I need to identify instances where there are similar words in a two or more setences.
How do I tell SQL to only return the values below?
From what I have googled, there may be a way to do this using a Full-Text Search and Semantic Search, but I have been able to find an article that addresses what I am trying to achieve.
Could someone in the group, provide me example or point to an article that could help me? Better yet, is what I am trying to do even achievable in SQL.

No, there is not.
Part of the problem is that "similarity" is a complex setup and this requires a program to analyze the sentence POSSIBLY with months of programming. You give pretty simplistic examples - grats. Even that is not as easy as you think. What about "the small boy wear red t-shirt" - would small boy be a difference or not?
This requires a LOT of work, and a LOT of definition, or a LOT of training of possibly a multi layer neural network.
SQL generally is awful at string manipulation - the best you get is SOUNDEX and that just compares 4 letters of the first word (RTFM, it is actually QUITE interesting how it works, but it makes it absolutely unsuitable for anything like comparing sentences.
So, no - this is simply way outside the scope of anything in SQL, you will have to download the data and use an out of SQL approach (which is also a LOT more fit for this type of work).
You can obviously work around that with simplistic SQL such as #ASH was suggesting - but this is not looking for "similar sentences" but working around specific markers that ARE SPECIFIC FOR YOUR DATA SET. THis is overfitting and bypassing answering the question you have asked.

You can try SOUNDEX function. Google SOUNDEX and then understand if this works for your case. The query is:
SELECT *
FROM your_table
WHERE SOUNDEX(Sentence) = SOUNDEX(Sentence);

Related

Using Lucene QueryAPI to access SQL

Can you advise on whether I can use just the Query functionality from Lucene to generate SQL queries? Something like an SQLQueryBuilder?
I have a massive SQL database of logs from a webserver cluster containing the original request and response strings plus some other useful/less bits and bobs. What I need to do is analyse the parameters in the original request and compare with the generated responses, looking at ratios, volatility, variability, consistency etc.
This question does not relate to the analysis stage, but only the retrieval of data from database which matches the parameters I'm interested in. So, I could just do this in good old sql queries, manually building the exact queries I need on a case-by-case basis. But that's kinda lame; I reckon we can be a bit smarter than that. Particularly as I can already see large numbers of similar but subtly different queries being useful. And as I'm hoping that I can expose a single search box via a web interface to non-technical end-users, adding sql queries seems like a bad idea... and a recipe for permanent maintenance requests (and can I be the first to say, er no thanks!).
In an ideal world I expose a search form, with the option to write simple queries like
request:"someAttribute=\"someValue\"" AND response="some hoped for result" AND daterange:30
which would then hopefully find all instances of requests which contain someAttribute="someValue" over the last 30 days. The results will then be put through standard statistical analyses on the given response text and printed out on-screen. At least, that's the idea.
Much of the actual logic to determine how to handle custom field definitions or special words I'll need to write myself, and that's ok. And NB, my non-technical end users are familiar enough with xml that they can handle a bit of attr="value" syntax, at least for the first iteration of the tool :D
In summary, I want to:
1) allow users to use google-like search syntax (e.g. via Lucene's QueryAPI) to specify text to match in the logs
2) allow a layer to manipulate the query based on special words or fields (e.g. this layer could be during a Java object phase)
3) convert the final query into an sql query appropriate for my database schema
4) query the database and spit back the resultset for statistical analysis
5) pretty-print on website:)
Am I completely barking up the wrong tree? It looks like it should be possible, but I can't seem to find much on it. I've been googling for a bit on this, for example trying "Lucene SQLQueryBuilder" as a possible start but didn't really find much by way of a lead.
So, my questions are:
Has anyone tried using Lucene's QueryAPI like this before? Did it work? Any gotchas?
Are there better query api libraries out there?
Examples, finished discussions and open-source implementations would be most helpful.
Many thanks.
NB: I don't think I want Lucene's search capabilities as such, as I'm only ever looking for exact matches. I just need a query layer on top of the database.
Lucene and SQL have very little in common as they're using totally different syntax (as HefferWolf mentioned) and different underlying data models. As you said yourself, I'm afraid you're barking the wrong tree.
There are however attempts, such as Hibernate Search to bridge this gap. These are interesting experiments as such, but I would be very careful to use any of that code in production.
You could possibly use Full Text Search features available in some SQL databases, or reindex all data in Lucene and use it without database.
I doubt you can reuse any code from lucene for this. Lucene does an internal rewrite of such queries but into a syntax which wouldn't be of much help for SQL I think.
name: Phil AND lastname: Miller AND NOT age: 26
would be rewritten to
+name Phil +lastname: Miller -age: 26
So I think you would have to write your on transition into a SQL Query syntax.
But maybe you can use Lucene as such for this. Have a look into hibernate-search which is quite handy to easily create a lucene index of a sql table.

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of systems. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding texts a payloads. That means you'll need to store each sentence three times (before, during and after), and you'll have to manually get into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of luncene which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout considers text as knowledge and doing that makes the answering the question better than just text matching. Mahout is a machine learning and data mining f/w which fits this domain better. Just a very high level thought.
--Sai

Math needed for Sql Server

I've been working in Sql server jobs since 2 years now. Although I like it, sometimes I get the feeling that at certain times, I stall too much on some tasks, and I seem to be discouraged easily from things that involve relatively simple logic. It's like, at some point I must repeat a logical condition inside my head more than 2 or 3 times in order to understand it completely.
I have the feeling that this might be of my lack of math knowledge. Can anyone please let me know what area of mathematics I can study, that would improve my Sql server coding skills?
Thank you.
The field of maths most likely to be useful to you is Boolean logic
Set Theory is good for second place however it will often go into more detail that you are likely to need/use in understanding most sql queries.
A quick cheat that you may find useful is if you feed a boolean expression into wolfram alpha it will spit out a truth table for you which some find a much easier way of visualising the expression.
http://www.wolframalpha.com/input/?i=a+or+not+b
I recommend you study symbolic logic.
I'd suggest reading up on Set based Math.
See this link: http://weblogs.sqlteam.com/jeffs/archive/2007/04/30/thinking-set-based-or-not.aspx
Set theory helped me somewhat. Studied it in college years before I got into SQL, but being able to think of a bunch of numbers as a semi-amorphous blob of data and not as an ordered list of items really helps.
Get a copy of this book. It should prove to be most useful: The Art of SQL, by Stephane Faroult.

Fastest way to become a MySQL expert?

I have been using MySQL for years, mainly on smaller projects until the last year or so. I'm not sure if it's the nature of the language or my lack of real tutorials that gives me the feeling of being unsure if what I'm writing is the proper way for optimization purposes and scaling purposes.
While self-taught in PHP I'm very sure of myself and the code I write, easily can compare it to others and so on.
With MySQL, I'm not sure whether (and in what cases) an INNER JOIN or LEFT JOIN should be used, nor am I aware of the large amount of functionality that it has. While I've written code for databases that handled tens of millions of records, I don't know if it's optimum. I often find that a small tweak will make a query take less than 1/10 of the original time... but how do I know that my current query isn't also slow?
I would like to become completely confident in this field in the ability to optimize databases and be scalable. Use is not a problem -- I use it on a daily basis in a number of different ways.
So, the question is, what's the path? Reading a book? Website/tutorials? Recommendations?
EXPLAIN is your friend for one. If you learn to use this tool, you should be able to optimize your queries very effectively.
Scan the the MySQL manual and read Paul DuBois' MySQL book.
Use EXPLAIN SELECT, SHOW VARIABLES, SHOW STATUS and SHOW PROCESSLIST.
Learn how the query optimizer works.
Optimize your table formats.
Maintain your tables (myisamchk, CHECK TABLE, OPTIMIZE TABLE).
Use MySQL extensions to get things done faster.
Write a MySQL UDF function if you notice that you would need some
function in many places.
Don't use GRANT on table level or column level if you don't really need
it.
http://dev.mysql.com/tech-resources/presentations/presentation-oscon2000-20000719/index.html
The only way to become an expert in something is experience and that usually takes time. And a good mentor(s) that are better than you to teach you what you are missing. The problem is you don't know what you don't know.
Research and experience - if you don't have the projects to warrant the research, make them. Make three tables with related data and make up scenarios.
E.g.
Make a table of movies their data
make a table of user
make a table of ratings for users
spend time learning how joins work, how to get movies of a particular rating range in one query, how to search the movies table ( like, regex) - as mentioned, use explain to see how different things affect speed. Make a day of it; I guarantee your
handle on it will be greatly increased.
If you're still struggling for case-scenarios, start looking here on SO for questions and try out those scenarios yourself.
I don't know if MIT open courseware has anything about databases... Well whaddya know? They do: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-830Fall-2005/CourseHome/
I would recommend that as one source based only on MITs reputation. If you can take a formal course from a university you may find that helpful. Also a good understanding of the fundamental discrete mathematics/logic certainly would do no harm.
As others have said, time and practice is the only real approach.
More practically, I found that EXPLAIN worked wonders for me personally. Learning to read the output of that was probably the biggest single leap I made in being able to write efficient queries.
The second thing I found really helpful was SQL Tuning by Dan Tow, which describes a fairly formal methodology for extracting performance. It's a bit involved, but works well in lots of situations. And if nothing else, it will give you a much better understanding of the way joins are processed.
Start with a class like this one: https://www.udemy.com/sql-mysql-databases/
Then use what you've learned to create and manage a number of SQL databases and run queries. Getting to the expert level is really about practice. But of course you need to learn the pieces before you can practice.

How to think in SQL?

How do I stop thinking every query in terms of cursors, procedures and functions and start using SQL as it should be? Do we make the transition to thinking in SQL just by practise or is there any magic to learning the set based query language? What did you do to make the transition?
A few examples of what should come to your mind first if you're real SQL geek:
Bible concordance is a FULLTEXT index to the Bible
Luca Pacioli's Summa de arithmetica which describes double-entry bookkeeping is in fact a normalized database schema
When Xerxes I counted his army by walling an area that 10,000 of his men occupied and then marching the other men through this enclosure, he used HASH AGGREGATE method.
The House That Jack Built should be rewritten using a self-join.
The Twelve Days of Christmas should be rewritten using a self-join and a ROWNUM
There Was An Old Woman Who Swallowed a Fly should be rewritten using CTE's
If the European Union were called European Union All, we would see 27 spellings for the word euro on a Euro banknote, instead of 2.
And finally you can read a lame article in my blog on how I stopped worrying and learned to love SQL (I almost forgot I wrote it):
Click
And one more article just on the subject:
Double-thinking in SQL
The key thing is you're manipulating SETS & elements of sets; and relating different sets (and corresponding elements) together. That's really the heart of it, imho. That's why every table should have a primary key; why you see set operators in the language; and why set operators like UNION won't (by defualt) return duplicate rows.
Of course in practice, the rules of sets are bent or broken but it's not that hard to see when this is necessary (otherwise, SQL would be TOO limited). Imho, just crack open your discrete math book and reacquaint yourself with some set exercises.
Best advice I can give you is that every time you think about processing something row-by-row, that you stop and ask yourself if there is a set-based way to do this.
Joe Celko's Thinking in Sets (book)
Perfectly intelligent programmers
often struggle when forced to work
with SQL. Why? Joe Celko believes the
problem lies with their procedural
programming mindset, which keeps them
from taking full advantage of the
power of declarative languages. The
result is overly complex and
inefficient code, not to mention lost
productivity.
This book will change the way you
think about the problems you solve
with SQL programs.. Focusing on three
key table-based techniques, Celko
reveals their power through detailed
examples and clear explanations. As
you master these techniques, you’ll
find you are able to conceptualize
problems as rooted in sets and
solvable through declarative
programming. Before long, you’ll be
coding more quickly, writing more
efficient code, and applying the full
power of SQL.
When people ask me about joins I send them here it has a great visual representation on what they are!
The way that I learned was by doing a lot of queries, and working at a job that required you to think in terms of result sets.
From your question, it seems like you've been writing lots of front-end code that uses sequential/procedural/iterative data manipulation. If you don't get on any projects that require you to use result set skills, I personally wouldn't worry about it.
One thing you might want to try is by trying to write analytical queries, e.g., generating simplistic reports on your data. In those cases you are trying to summarize large amounts of data by cordoning them off into sets.
Another good way would be to read a book on the theoretical/mathematical foundations to RDBMSes. Those deal strictly with set theory and how parts of the SQL query syntax relate directly with the math behind it. Of course, this requires you to like math. :)
I found that the Art Of SQL was a useful kick in the head for getting into the right mindset.
Part of this, however, comes down to style.
Obviously, you need to start thinking in result sets and not just procedurally.
However, once you've start that, you will often find decisions have to be made.
Do you write the incredibly complex update statement that may be difficult to understand by anyone but yourself, and difficult to maintain, or do you write a less efficient, but easier to manage procedure?
I would HIGHLY suggest that you remember that SQL statements can have comments in them to clarifiy what they are doing, not just stored procedures.
link: The Art Of SQL
One exercise you might want to try is this:
Take some of your existing reporting code from your application layer, preferably something that produces a single, tabular data set. Starting with the most basic elements, port it over to an SQL View.
Take all of the columns pulled from a single table and write the SQL statement to select that data. Then join on one table at a time and start figuring out the appropriate conditions and logic for your output.
You might come up against some particular task that at first seems impossible in SQL, but depending on the implementation you are programming against, there is almost always a way to get the result you're looking for. Check the documentation for your SQL implementation, or try Google.
This exercise has the benefit of giving you an original report to test against, so you know if you're getting the output you expect.
A few things to watch out for:
Recursion and graphs are fairly advanced techniques; you might want to start with something easier. (Joe Celko has a good book on the topic, if you're interested.)
There's often a big difference between a BIT and a C-style bool. At the very least, you may have to explicitly cast your output from INT to BIT.
OUTER JOINs are useful when a portion of the data might be empty, but try not to abuse them.
I think it takes a while to adjust (it was long ago for me, so I don't remember too well). But perhaps the key point is that SQL is declarative - i.e. you specify what you want done, not precisely how it should be done procedurally. So for a simple example:
"Get me the names and salaries of employees in departments located in London"
The relevant SQL is almost natural:
select name, salary
from employees
join departments on departments.deptno = employees.deptno
where departments.location = 'London';
We have "told" SQL how to join departments to employees, but only declaratively (NATURAL JOIN removes the need to do that, but is dangerous so not used in practice). We haven't defined procedurally how it should be done (e.g. "for each department, find all employees...") SQL is free to choose the optimal method to perform the query.
Thinking of rows makes sense when you use SQL to dump a table to your file system and then do whatever has to be done in your favorite programming language.
Not too much leverage of SQL; waste of disk, memory, cpu and human resources.
Think of SQL as of English (or whatever human language you prefer).
Show me all customers who ride bulls and get drunk every day but never visited Indonesia with their mother-in-law whose phone number is the same as my friend Doug's except for the area code.
You can do it (and much more) in one SQL statement, just learn how to. It's very lucrative.