search using apache lucene - lucene

I want to use Lucene to make searching on my database table fast.
The table query is select x, y, z from tablexyz. The searchable field is x, and the table has 2 million rows. I want to use it in a web application and show the data on a search page. Has anyone used Lucene to store an entire table in a text file?

I think that Apache Solr is what you are looking for.
To get started:
first read the tutorial to understand the basics,
then have a look at DataImportHandler which would probably be the easiest way to index your content.
No matter what technology you are using for your web application, Solr has a lot of connectors.
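As a rough sketch of what querying looks like once the table is indexed: Solr answers plain HTTP GET requests on its /select handler, so any web stack can build the URL itself. The host, port, core name tablexyz and the helper build_solr_query_url are assumptions made for illustration; only the field names x, y and z come from the question.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr core; adjust host, port and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/tablexyz"


def build_solr_query_url(term, rows=10, base_url=SOLR_CORE_URL):
    """Build a /select URL that searches field x and returns x, y, z as JSON."""
    # Escape backslashes and double quotes so the term stays one phrase.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'x:"{escaped}"',  # phrase query on the searchable field x
        "fl": "x,y,z",          # fields to return
        "rows": rows,           # page size for the search page
        "wt": "json",           # response format
    }
    return f"{base_url}/select?{urlencode(params)}"
```

The web application would fetch this URL and render the documents in the JSON response; paging can be added with Solr's start parameter.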

Related

Splunk Database

I understand that Splunk does not need a lot of functionality that a MySQL database would provide, and to index and perform searches on Big Data it might not be a good option to use a relational database.
Does Splunk use Lucene as a search engine, or have they made their on-disk data format?
I am sorry if there are any problems in the way I am asking the question. This is my first question on Stack Overflow.
Splunk uses its own search engine; it is not based on any third-party product.
Its search engine is based on files only, with no database behind it.
It does not store fields, only raw data. Fields are extracted at search time, which makes them very dynamic.
It's also very fast at finding keywords in the data (needle in a haystack).
To be more detailed, Splunk stores data in the following way:
Breaking the data into time-based events and attaching a timestamp to each raw event
Marking every word found in the events and its location across the index
Storing the events in compressed format (tar.gz)
This makes it possible to:
Run very fast searches for keywords inside the events
Look in the original raw data
Create new fields on the raw data and use them with statistics commands
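The idea of splitting data into timestamped events and recording where every word occurs can be illustrated with a toy inverted index. This is only a sketch of the general technique; Splunk's real on-disk format is proprietary, and the function names build_index and search are made up for this example.

```python
from collections import defaultdict


def build_index(events):
    """Map each lowercased word to a list of (event_id, word_position) pairs.

    `events` is a list of (timestamp, raw_text) tuples. The raw text is kept
    untouched, so fields can still be extracted from it at search time.
    """
    index = defaultdict(list)
    for event_id, (_timestamp, raw) in enumerate(events):
        for position, word in enumerate(raw.lower().split()):
            index[word].append((event_id, position))
    return index


def search(index, events, keyword):
    """Return the raw events that contain `keyword`, in event order."""
    hits = index.get(keyword.lower(), [])
    event_ids = sorted({event_id for event_id, _position in hits})
    return [events[i] for i in event_ids]
```

A keyword lookup touches only the index entry for that word and then fetches the matching raw events, which is why this kind of search stays fast as the data grows.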
Source:
http://www.splunk.com/web_assets/pdfs/secure/Splunk_for_BigData.pdf
http://docs.splunk.com/Documentation/Splunk/6.5.1/Indexer/Howindexingworks
+3 Years experience
Splunk architect.
Googling would have helped: http://answers.splunk.com/answers/43533/search-capabilities-of-splunk-how-powerful-is-it-really --> No Lucene
Splunk has a proprietary data format for its indexes. Lucene is not used, and Splunk has its own search language called SPL.

SQL query for a search engine

My project is based on Questions and Answers (stackoverflow style).
I need to allow users to search for previously asked questions.
The Questions table would be like this:
Questions
-------------------------------------------------
id questions
-------------------------------------------------
1 How to cook pasta?
2 How to Drive a car?
3 When did Napoleon die?
Now when I'm going to write something to search for, I would write something like this:
When did Brazil win the world cup?
Let's say I'm gonna split this String on spaces, into an array of Strings.
What is the best SELECT query to fetch all questions containing those strings, ignoring case for each word, and sorting the results by the least frequently mentioned word?
Why: because many questions will contain When, and, will, how, etc., but not many will contain Brazil, so Brazil acts as the keyword.
I'm using SQL Server 2008.
You really don't want to be doing this in raw SQL.
I suggest you look into the full-text search options for your database, this might be a good place to start.
In MySQL you have full-text indexes and the MATCH() function, which allow just this;
in SQL Server you should use the CONTAINS predicate.
Find more info on
http://msdn.microsoft.com/en-us/library/ms142571.aspx
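The question's idea of letting rare words such as Brazil outweigh common ones is usually expressed as inverse document frequency. The sketch below shows that weighting in plain Python on an in-memory dictionary; the names tokenize and rank_questions are hypothetical, and in practice the database's full-text engine should do this work rather than application code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_questions(questions, search_text):
    """Return (id, score) pairs for matching questions, best match first.

    `questions` maps id -> question text. Each matched word contributes
    log(1 + N / document_frequency), so a word found in few questions
    adds more to the score than a word found in most of them.
    """
    tokens = {qid: set(tokenize(text)) for qid, text in questions.items()}
    doc_freq = Counter(word for words in tokens.values() for word in words)
    total = len(questions)
    wanted = set(tokenize(search_text))
    scores = {}
    for qid, words in tokens.items():
        matched = words & wanted
        if matched:
            scores[qid] = sum(math.log(1 + total / doc_freq[w]) for w in matched)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With the sample table plus a question that mentions Brazil, searching for "When did Brazil win the world cup?" ranks the Brazil question above the Napoleon one, because both share "when" and "did" but only one contains the rarer words.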
Your option is not the best one. Take a look at open source Apache Solr project. http://lucene.apache.org/solr/
Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results.
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Server statistics exposed over JMX for monitoring
Linearly scalable, auto index replication, auto failover and recovery
Near Real-time indexing
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture
Take a look at Detailed Features, and especially the Query section. It has all you need for your app.

How to integrate database search with pdf search in a web app?

I have a JSP web application with a custom search engine.
The search engine is basically built on top of a 'documents' table in a SQL Server database.
To exemplify, each document record has three fields:
document id
'description' (text field)
'attachment', a path of a pdf file in the filesystem.
The search engine currently searches for keywords in the description field and returns a result list in an HTML page. Now I also want to search for keywords in the PDF file content.
I'm looking into Lucene, Tika and Solr, but I don't understand how I can use these frameworks for my goal.
One possible solution: use Tika to extract the PDF content and store it in a new field of the documents table, so I can write SQL queries on this field.
Are there better alternatives?
Can I use Solr/Lucene indexing features alongside the SQL-based search engine rather than as a complete substitute for it?
Thanks
I would consider Lucene to be completely independent of an SQL database, i.e. you will not use SQL, JDBC or any other DB interface to query Lucene, but its own API and its own data store.
You could of course use Tika to extract the full text of a PDF, store it, and use whatever full-text search capability your SQL DB provides.
If you are using Hibernate, Hibernate Search is a fantastic product which integrates both an SQL store and Lucene. But you would have to go the Hibernate/JPA way, which might be overkill for your project.
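To make the extract-and-store approach concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, a hypothetical extract_pdf_text placeholder where Tika (or another extractor) would be called, and an assumed new column named pdf_text; none of these names come from the original application.

```python
import sqlite3


def extract_pdf_text(path):
    """Hypothetical placeholder for a real extractor such as Apache Tika."""
    raise NotImplementedError("plug in a PDF text extractor here")


def index_attachments(conn, extract=extract_pdf_text):
    """Fill the assumed pdf_text column for every document that lacks it."""
    rows = conn.execute(
        "SELECT id, attachment FROM documents WHERE pdf_text IS NULL"
    ).fetchall()
    for doc_id, path in rows:
        conn.execute(
            "UPDATE documents SET pdf_text = ? WHERE id = ?",
            (extract(path), doc_id),
        )
    conn.commit()


def search_documents(conn, keyword):
    """Return ids of documents whose description or PDF text has the keyword."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT id FROM documents "
        "WHERE description LIKE ? OR pdf_text LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [doc_id for (doc_id,) in rows]
```

The LIKE scan is only for illustration; on a real table the pdf_text column would be covered by the database's full-text index instead.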

Grails app: own query language for data stored in DB and files + full text search (Hibernate Search, Compass etc)

I have an application which stores short descriptive data in a DB and lots of related textual data in text files.
I would like to add "advanced search" for the DB. I was thinking about adding my own query language like JIRA does (Jira Query Language). Then I thought about having full-text search across those text files (lower priority).
I'm wondering which tool would suit me better to implement that faster and more simply.
Most of all, I want to give users the ability to write their own queries instead of using elements to specify search filters.
Thanks
UPD. I save dates in the DB, and most of the varchar fields contain one-word strings.
UPD2. Apache Derby is used right now.
Take a look at the Searchable plugin for Grails.
http://www.grails.org/plugin/searchable
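If a small JQL-like language over the DB columns is still wanted, one possible shape is a translator from "field operator value" conditions into a parameterized WHERE clause. This is a sketch under assumptions: the field whitelist, the "~" contains operator and the function names are all invented for illustration, and it is written in Python rather than Groovy only to keep it self-contained.

```python
import re

# Hypothetical whitelist of searchable columns; real names depend on the schema.
ALLOWED_FIELDS = {"title", "author", "created"}
# Query-language operator -> SQL operator ("~" means "contains").
SQL_OPERATORS = {"=": "=", "!=": "<>", "<": "<", ">": ">",
                 "<=": "<=", ">=": ">=", "~": "LIKE"}

TOKEN = re.compile(r'\s*(?:"([^"]*)"|(<=|>=|!=|[=<>~])|([^\s=<>!~"]+))')


def tokenize(query):
    """Split a query into (kind, text) tokens: 'value', 'op' or 'word'."""
    tokens, pos = [], 0
    query = query.strip()
    while pos < len(query):
        match = TOKEN.match(query, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        quoted, op, word = match.groups()
        if quoted is not None:
            tokens.append(("value", quoted))
        elif op is not None:
            tokens.append(("op", op))
        else:
            tokens.append(("word", word))
        pos = match.end()
    return tokens


def to_where_clause(query):
    """Translate 'field op value [AND|OR ...]' into (sql, params)."""
    tokens = tokenize(query)
    sql_parts, params = [], []
    i = 0
    while True:
        if len(tokens) - i < 3:
            raise ValueError("expected: field operator value")
        (fkind, field), (okind, op), (vkind, value) = tokens[i:i + 3]
        if fkind != "word" or field not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {field}")
        if okind != "op" or vkind == "op":
            raise ValueError("expected: field operator value")
        # Values never enter the SQL text; they are bound as parameters.
        sql_parts.append(f"{field} {SQL_OPERATORS[op]} ?")
        params.append(f"%{value}%" if op == "~" else value)
        i += 3
        if i == len(tokens):
            return " ".join(sql_parts), params
        kind, joiner = tokens[i]
        if kind != "word" or joiner.upper() not in ("AND", "OR"):
            raise ValueError(f"expected AND or OR, got: {joiner}")
        sql_parts.append(joiner.upper())
        i += 1
```

Only whitelisted column names reach the SQL string and every value is passed as a bound parameter, which keeps user-written queries from turning into SQL injection. Parentheses are not supported, so AND and OR follow normal SQL precedence.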

Sql Server Full Text Search - Getting word occurrences/location in text?

Suppose I have Sql Server (2005/2008) create an index from one of my tables.
I wish to use my own custom search engine (a little more tuned to my needs than Full Text Search).
In order to use it however, I need Sql Server to provide me the word positions and other data required by the search engine.
Is there any way to query the "index" for this data instead of just getting search results?
Thanks
Roey
No. And if you could, what happens if Microsoft decide to change their internal data structures? Your code would break.
What are you trying to achieve?
You shouldn't rely on SQL Server's internal data structures. They are tailored specifically for SQL Server's use and aren't accessible for querying anyway.
If you want a fast indexer then you will probably have more success using a pre-written one rather than trying to write your own. Give Lucene.Net a try.