Using Lucene QueryAPI to access SQL - sql

Can you advise on whether I can use just the Query functionality from Lucene to generate SQL queries? Something like an SQLQueryBuilder?
I have a massive SQL database of logs from a webserver cluster containing the original request and response strings plus some other useful/less bits and bobs. What I need to do is analyse the parameters in the original request and compare with the generated responses, looking at ratios, volatility, variability, consistency etc.
This question does not relate to the analysis stage, but only the retrieval of data from database which matches the parameters I'm interested in. So, I could just do this in good old sql queries, manually building the exact queries I need on a case-by-case basis. But that's kinda lame; I reckon we can be a bit smarter than that. Particularly as I can already see large numbers of similar but subtly different queries being useful. And as I'm hoping that I can expose a single search box via a web interface to non-technical end-users, adding sql queries seems like a bad idea... and a recipe for permanent maintenance requests (and can I be the first to say, er no thanks!).
In an ideal world I expose a search form, with the option to write simple queries like
request:"someAttribute=\"someValue\"" AND response="some hoped for result" AND daterange:30
which would then hopefully find all instances of requests which contain someAttribute="someValue" over the last 30 days. The results will then be put through standard statistical analyses on the given response text and printed out on-screen. At least, that's the idea.
Much of the actual logic to determine how to handle custom field definitions or special words I'll need to write myself, and that's ok. And NB, my non-technical end users are familiar enough with xml that they can handle a bit of attr="value" syntax, at least for the first iteration of the tool :D
In summary, I want to:
1) allow users to use google-like search syntax (e.g. via Lucene's QueryAPI) to specify text to match in the logs
2) allow a layer to manipulate the query based on special words or fields (e.g. this layer could be during a Java object phase)
3) convert the final query into an sql query appropriate for my database schema
4) query the database and spit back the resultset for statistical analysis
5) pretty-print on website:)
Am I completely barking up the wrong tree? It looks like it should be possible, but I can't seem to find much on it. I've been googling for a bit on this, for example trying "Lucene SQLQueryBuilder" as a possible start but didn't really find much by way of a lead.
So, my questions are:
Has anyone tried using Lucene's QueryAPI like this before? Did it work? Any gotchas?
Are there better query api libraries out there?
Examples, finished discussions and open-source implementations would be most helpful.
Many thanks.
NB: I don't think I want Lucene's search capabilities as such, as I'm only ever looking for exact matches. I just need a query layer on top of the database.

Lucene and SQL have very little in common as they're using totally different syntax (as HefferWolf mentioned) and different underlying data models. As you said yourself, I'm afraid you're barking the wrong tree.
There are however attempts, such as Hibernate Search to bridge this gap. These are interesting experiments as such, but I would be very careful to use any of that code in production.
You could possibly use Full Text Search features available in some SQL databases, or reindex all data in Lucene and use it without database.

I doubt you can reuse any code from lucene for this. Lucene does an internal rewrite of such queries but into a syntax which wouldn't be of much help for SQL I think.
name: Phil AND lastname: Miller AND NOT age: 26
would be rewritten to
+name Phil +lastname: Miller -age: 26
So I think you would have to write your on transition into a SQL Query syntax.
But maybe you can use Lucene as such for this. Have a look into hibernate-search which is quite handy to easily create a lucene index of a sql table.

Related

Is sorting the database via a custom function inefficient?

I have a table with Id and Text fields. The Text field holds sentences, averaging 50 words. There are >1,000,000 rows.
This is part of a web app where users need to be able to search through these sentences. Here's the twist though - I need to be able to run a custom search function written in C# that uses Machine Learning instead.
From what I understand, this means I'll have to download the entire database of >1,000,000 rows every time a user makes a search! This seems really inefficient to me.
How would you implement this in the most efficient/fast way possible?
If this is relevant, I'm using EF Core with LINQ .Where(my_custom_search_function), with a PostgreSQL database
I think I've found the solution. Postgresql full-text search currently provides two ranking functions. In this case "sorting" in the question and "ranking" here refer to the same thing.
Postgresql docs state:
However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
These functions can any of the four kinds of supported postgresql functions.
Then they answer this exact question:
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
Credits to #Used_By_Already for pointing me to Postgresql full-text search.

Search String parsing algorithm

I am writing a prototype of a new app for an enterprise. I want to include a great search engine, which is something they have never had before. What I am looking for is something that can translate a lucene style query language into SQL statements on a key value pair data model. (three fields, grouping id, key, value)
Ive been looking for a while now and havn't had any luck. Im about to open the source for lucene and see if I can pull the query algorithms out and have them generate sql instead of index search commands. but im not very hopefull.
I can't just run lucene or any other indexing system on this enterprise for political and regulatory reasons so thats not an option.
Does this type of system exist?
see if I can pull the query algorithms out and have them generate sql instead
Don't waste your time. SQL and Lucene queries work in a completely different way; this is because they use different underlying data structures, algorithms, etc.
The best you can do is to write SQL query parser and rewrite those queries into Lucene queries. But you'd have to be naive to think you can write full-blown SQL query parser. You can easily solve simple cases, but what are you going to do when somebody sends you a JOIN? Or a GROUP BY bar HAVING foo>3?
If you can't jump over political hurdles, just use one of the full text indexing algorithms databases can offer; this is better than nothing.

How to design a database for efficient search-ability?

I am trying to design a database with search-ability at its core. My knowledge of database design and SQL is all self-taught and still fairly beginner-level, so my questions may possibly have easy answers.
Suppose I have a single table containing a large number of records. For example, suppose that each record contains details of a different computer application (name, developer, version number, etc). A list of keywords are associated with each record, such as a list of programming languages used to write the applications.
I wish to be able to enter one or more keywords (each separated by a space) into a search box, and I wish to have all associated records returned. How should I design the database to store the keywords, and what SQL query would I need to apply to the search text? (The search should be uppercase/lowercase independent.)
My next challenge would then be to order search results by relevance, and to allow entire key-phrases as well as keywords to be associated with each record. For example, if I type "Visual Basic" into the search field, I want the first results to have exactly the key-phrase "Visual Basic" associated with them. The next results should all have both keywords "Visual" and "Basic" associated with them, and the remaining results should have only one of these keywords. Again, please could anyone advise on how to implement this?
The final challenge I believe would be much harder: how much 'intelligent interpretation' can I design my database and SQL code to handle? For example, if I search for "CSS", can I get the records with the key-phrase "Cascading Style Sheets" to appear? Can I also get SQL to identify and search for similar words, such as plurals of search phrases or, for example, "programmer" or "programming" when "program" is input? Thanks!
Learn relational algebra, normalization rules, and SQL.
Start with entity relationships. Sounds like you could have an APPLICATION table as parent for a FEATURE child table, with a one-to-many relationship between the two. You'll query them by JOINing one to the other:
SELECT A.NAME, F.NAME
FROM APPLICATION AS A
JOIN FEATURE AS F
ON F.APP_ID = A.ID
Your challenges would not suggest SQL and relations to me. I would think more in terms of a parser, an indexer and search engine like Lucene, and a NoSQL document database like MongoDB.
I've come to the conclusion, after a LOT of research, that #duffymo's answer is hinting in the right direction. For the benefit of other n00bs like me, here's the conclusion I've drawn:
Many open source search engine server apps are out there to install for free. Lucene was the first I had ever heard of them, but others do exist and I think my favourite at the moment is Sphinx. As far as I can tell, the 'indexer' that #duffymo mentions is built into it. I have learnt that the indexer is the program that will examine my database for keywords and will automatically keep a record of which results should be returned for different input queries. I have also now learnt that the terminology for the behaviour I was looking for (and which Sphinx has) is 'stemming'. I'm still not sure what role a parser plays in all this...
A more basic approach would be to use SQL itself. Whilst I was already aware of the most basic of these (ie. using the LIKE keyword with 'wildcards'), I also discovered something a little more powerful: natural language / full-text search. For anyone not interested in installing a server app, I recommend you look this up.
Also, I see no reason why I would need to use NoSQL instead of SQL (as #duffymo has suggested), and so I'm going to stick with SQL for the moment (at least until I come across some good entry-level books to learn NoSQL from). Furthermore, I have very little intention to learn relational algebra until I know why I should and how it would be useful. The message here is that other beginners shouldn't be off-put by these things, as I don't think Sphinx requires any knowledge of them.
while I like #duffymo's answer, I will also suggest you research SPARQL and the wordnet project for your semantic equivalence questions.
If you choose Oracle, you can use the spatial option triple store to implement the SPARQL endpoint and do some very nice seaching like your css = Cascading Style Sheet example.

How do you think while formulating Sql Queries. Is it an experience or a concept?

I have been working on sql server and front end coding and have usually faced problem formulating queries.
I do understand most of the concepts of sql that are needed in formulating queries but whenever some new functionality comes into the picture that can be dont using sql query, i do usually fails resolving them.
I am very comfortable with select queries using joins and all such things but when it comes to DML operation i usually fails
For every query that i never done before I usually finds uncomfortable with that while creating them. Whenever I goes for an interview I usually faces this problem.
Is it their some concept behind approaching on formulating sql queries.
Eg.
I need to create an sql query such that
A table contain single column having duplicate record. I need to remove duplicate records.
I know i can find the solution to this query very easily on Googling, but I want to know how everyone comes to the desired result.
Is it something like Practice Makes Man Perfect i.e. once you did it, next time you will be able to formulate or their is some logic or concept behind.
I could have get my answer of solving above problem simply by posting it on stackoverflow and i would have been with an answer within 5 to 10 minutes but I want to know the reason. How do you work on any new kind of query. Is it a major contribution of experience or some an implementation of concepts.
Whenever I learns some new thing in coding section I tries to utilize it wherever I can use it. But here scenario seems to be changed because might be i am lagging in some concepts.
EDIT
How could I test my knowledge and
concepts in Sql and related sql
queries ?
Typically, the first time you need to open a child proof bottle of pills, you have a hard time, but after that you are prepared for what it might/will entail.
So it is with programming (me thinks).
You find problems, research best practices, and beat your head against a couple of rocks, but in the process you will come to have a handy set of tools.
Also, reading what others tried/did, is a good way to avoid major obsticles.
All in all, with a lot of practice/coding, you will see patterns quicker, and learn to notice where to make use of what tool.
I have a somewhat methodical method of constructing queries in general, and it is something I use elsewhere with any problem solving I need to do.
The first step is ALWAYS listing out any bits of information I have in a request. Information is essentially anything that tells me something about something.
A table contain single column having
duplicate record. I need to remove
duplicate
I have a table (I'll call it table1)
I have a
column on table table1 (I'll call it col1)
I have
duplicates in col1 on table table1
I need to remove
duplicates.
The next step of my query construction is identifying the action I'll take from the information I have.
I'll look for certain keywords (e.g. remove, create, edit, show, etc...) along with the standard insert, update, delete to determine the action.
In the example this would be DELETE because of remove.
The next step is isolation.
Asnwer the question "the action determined above should only be valid for ______..?" This part is almost always the most difficult part of constructing any query because it's usually abstract.
In the above example you're listing "duplicate records" as a piece of information, but that's really an abstract concept of something (anything where a specific value is not unique in usage).
Isolation is also where I test my action using a SELECT statement.
Every new query I run gets thrown through a select first!
The next step is execution, or essentially the "how do I get this done" part of a request.
A lot of times you'll figure the how out during the isolation step, but in some instances (yours included) how you isolate something, and how you fix it is not the same thing.
Showing duplicated values is different than removing a specific duplicate.
The last step is implementation. This is just where I take everything and make the query...
Summing it all up... for me to construct a query I'll pick out all information that I have in the request. Using the information I'll figure out what I need to do (the action), and what I need to do it on (isolation). Once I know what I need to do with what I figure out the execution.
Every single time I'm starting a new "query" I'll run it through these general steps to get an idea for what I'm going to do at an abstract level.
For specific implementations of an actual request you'll have to have some knowledge (or access to google) to go further than this.
Kris
I think in the same way I cook dinner. I have some ingredients (tables, columns etc.), some cooking methods (SELECT, UPDATE, INSERT, GROUP BY etc.) then I put them together in the way I know how.
Sometimes I will do something weird and find it tastes horrible, or that it is amazing.
Occasionally I will pick up new recipes from the internet or friends, then use parts of these in my own.
I also save my recipes in handy repositories, broken down into reusable chunks.
On the "Delete a duplicate" example, I'd come to the result by googling it. This scenario is so rare if the DB is designed properly that I wouldn't bother keeping this information in my head. Why bother, when there is a good resource is available for me to look it up when I need it?
For other queries, it really is practice makes perfect.
Over time, you get to remember frequently used patterns just because they ARE frequently used. Rare cases should be kept in a reference material. I've simply got too much other stuff to remember.
Find a good documentation to your software. I am using Mysql a lot and Mysql has excellent documentation site with decent search function so you get many answers just by reading docs. If you do NOT get your answer at least you are learning something.
Than I set up an example database (or use the one I am working on) and gradually build my SQL. I tend to separate the problem into small pieces and solve it step by step - this is very successful if you are building queries including many JOINS - it is best to start with some particular case and "polute" your SQL with many conditions like WHEN id = "123" which you are taking out as you are working towards your solution.
The best and fastest way to learn good SQL is to work with someone else, preferably someone who knows more than you, but it is not necessarry condition. It can be replaced by studying mature code written by others.
Your example is a test of how well you understand the DISTINCT keyword and the GROUP BY clause, which are SQL's ways of dealing with duplicate data.
Examples and experience. You look at other peoples examples and you create your own code and once it groks, you don't need to think about it again.
I would have a look at the Mere Mortals book - I think it's the one by Hernandez. I remember that when I first started seriously with SQL Server 6.5, moving from manual ISAM databases and Access database systems using VB4, that it was difficult to understand the syntax, the joins and the declarative style. And the SQL queries, while powerful, were very intimidating to understand - because typically, I was looking at generated code in Microsoft Access.
However, once I had developed a relatively systematic approach to building queries in a consistent and straightforward fashion, my skills and confidence quickly moved forward.
From seeing your responses you have two options.
Have a copy of the specification for whatever your working on (SQL spec and the documentation for the SQL implementation (SQLite, SQL Server etc..)
Use Google, SO, Books, etc.. as a resource to find answers.
You can't formulate an answer to a problem without doing one of the above. The first option is to become well versed into the capabilities of whatever you are working on.
The second option allows you to find answers that you may not even fully know how to ask. You example is fairly simplistic, so if you read the spec/implementation documentaion you would know the answer right away. But there are times, where even if you read the spec/documentation you don't know the answer. You only know that it IS possible, just not how to do it.
Remember that as far as jobs and supervisors go, being able to resolve a problem is important, but the faster you can do it the better which can often be done with option 2.

Implementing a massive search application

We have an email service that hosts close to 10000 domains such that we store the headers of messages in a SQL Server database.
I need to implement an application that will search the message body for keywords. The messages are stored as files on a NAS storage system.
As a proof of concept, I had implemented a SQL server based search system were I would parse the message and store all the words in a database table along with the memberid and the messageid. The database was on a separate server to the headers database.
The problem with that system was that I ended up with a table with 600 million rows after processing messages on just one domain. Obviously this is not a very scalable solution.
Since the headers are stored in a SQL Server table, I am going to need to join the messageIDs from the search application to the header table to display the messages that contain the searched for keywords.
Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day.
We are a small company with limited resources with respect to servers, maintenance etc.
Thanks
have a look at Hadoop. It's complete "map-reduce" framework for working with huge datasets inspired by Google. It think (but I could be wrong) Rackspace is using it for email search for their clients.
lucene.net will help you a lot, but no matter how you approach this, it's going to be a lot of work.
Consider not using SQL for this. It isn't helping.
GREP and other flat-file techniques for searching the text of the headers is MUCH faster and much simpler.
You can also check out the java lucene stuff which might be useful to you. Both Katta which is a distributed lucene index and Solr which can use rsync for index syncing might be useful. While I don't consider either to be very elegant it is often better to use something that is already built and known to work before embarking on actual development. Without knowing more details its hard to make a more specific recommendation.
If you can break up your 600 million rows, look into database sharding. Any query across all rows is going to be slow. At very least you could break up by language. If they're all English, well, find some way to split the data that makes sense based on common searches. I'm just guessing here but maybe domains could be grouped by TLD (.com, .net, .org, etc).
For fulltext search, compare SQL Server vs Lucene.NET vs cLucene vs MySQL vs PostgreSQL. Note full-text search will be faster if you don't need to rank the results. If a database is still slow look into performance tuning and if that fails look into a Linux-based db.
http://incubator.apache.org/lucene.net/
http://sourceforge.net/projects/clucene/
i wonder if BigTable (http://en.wikipedia.org/wiki/BigTable) does searching.
Look into the SQL Server full text search services/functionality. I haven't used it myself, but I once read that Stack Overflow uses it.
three solutions:
Use an already-existant text search engine (lucene is the most mentioned, there are several more)
Store the whole message in the SQL database, and use included full text search (most DBs have it these days).
Don't create a new record for each word occurrence, just add a new value to a big field in the word record. Even better if you don't use SQL for this table, use a key-value store where the key is the word and the value is the list of occurrences. Check some Inverted Index bibliography for inspiration
but to be honest, i think the only reasonable approach is #1