How to integrate database search with pdf search in a web app?

I have a JSP web application with a custom search engine.
The search engine is basically built on top of a 'documents' table in a SQL Server database.
To exemplify, each document record has three fields:
document id
'description' (a text field)
'attachment', the path of a PDF file in the filesystem.
The search engine currently searches keywords in the description field and returns a result list in an HTML page. Now I want to search keywords in the PDF file content as well.
I'm investigating Lucene, Tika, and Solr, but I don't understand how I can use these frameworks to achieve my goal.
One possible solution: use Tika to extract the PDF content and store it in a new field of the documents table, so I can write SQL queries against that field.
Are there better alternatives?
Can I use Solr/Lucene indexing features to complement the SQL-based search engine rather than to replace it entirely?
Thanks

I would consider Lucene to be completely independent of an SQL database, i.e. you will not use SQL/JDBC/whatever to query Lucene, but its own API and its own data store.
You could of course use Tika to extract the full text of a PDF, store it, and use whatever full-text search capability your SQL database provides.
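For instance, a minimal sketch of that Tika-plus-SQL approach (the table name, column names, connection string, and document id are assumptions based on the question):

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.tika.Tika;

public class PdfTextExtractor {
    public static void main(String[] args) throws Exception {
        // The Tika facade auto-detects the file type and extracts plain text.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("/path/to/attachment.pdf"));

        // Store the extracted text in a hypothetical 'content' column of the
        // documents table, so it can be queried with LIKE or, better, with
        // SQL Server's full-text CONTAINS().
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                "UPDATE documents SET content = ? WHERE document_id = ?")) {
            ps.setString(1, text);
            ps.setInt(2, 42); // id of the record whose attachment was parsed
            ps.executeUpdate();
        }
    }
}
```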
If you are using Hibernate, Hibernate Search is a fantastic product which integrates an SQL store with Lucene. But you would have to go the Hibernate/JPA way, which might be overkill for your project.
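To illustrate what that integration looks like, here is a hedged sketch of an entity mapped with Hibernate Search 5.x-style annotations (the class and field names mirror the documents table from the question):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// The entity stays in SQL Server as usual; Hibernate Search mirrors the
// annotated fields into a Lucene index and keeps it in sync on each commit.
@Entity
@Indexed
public class Document {
    @Id
    private Long id;

    @Field // indexed in Lucene for full-text queries
    private String description;

    private String attachment; // just the PDF path; not indexed here
}
```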

Related

Lucene vs. Azure Search

I am a newbie to search engines and information retrieval. Can someone explain how the Lucene search engine differs from Azure Search?
I read the Azure Search documentation and see that Azure Search supports Lucene queries as well, so is Azure Search built on top of Lucene, or does it inherit certain features from it?
There is no proper documentation on this as such; can someone point me in the right direction?
Thanks in advance.
According to this Microsoft page, full-text search in Azure Search is built on Lucene:
https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search
"The full text search engine in Azure Search is built on Apache Lucene, an industry standard in information retrieval."
Azure Search is not built on top of Apache Lucene as such, but it does support the Lucene query syntax:
https://learn.microsoft.com/en-us/rest/api/searchservice/lucene-query-syntax-in-azure-search
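To make "Lucene query syntax" concrete, here is a small sketch that parses such a query string with Lucene's own QueryParser (the field name and query are made up); a string in roughly this syntax is what Azure Search accepts in its full Lucene query mode:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class LuceneSyntaxDemo {
    public static void main(String[] args) throws Exception {
        // Fielded terms, boolean operators, and fuzzy matching (~) are all
        // part of the classic Lucene query syntax.
        QueryParser parser = new QueryParser("description", new StandardAnalyzer());
        Query q = parser.parse("description:invoice AND (pdf OR report~1)");
        System.out.println(q); // prints the parsed boolean query
    }
}
```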

Migrating a website's knowledgebase to a PostgreSQL database for search querying

I've been tasked with creating a search function for a website's knowledgebase (which is stored in a GitHub repo). I'm only really familiar with building databases with Django, so I'm having trouble understanding how I'm supposed to upload a bunch of HTML files to the database and query them with Postgres. Any pointers on how the database can be structured? I've heard that HTML files can be stored in a text field, but how are the columns structured, does each page get its own row, etc.? And how can I do this with a fairly large knowledgebase without having to manually upload each file?
The DB hosting platform I am using has a migration utility that says:
Uploading will accept data in any of three forms, plain text (SQL), tar archives (uncompressed), or PostgreSQL's own compressed 'custom' format.
That's assuming the database is already structured.
I've heard that HTML files can be stored in a text field, but how are the columns structured, does each page get its own row, etc.?
Storing HTML in a column is perfectly acceptable. If you're storing the HTML in a column, then each new page gets its own row.
and how can I do this with a fairly large knowledgebase without having to manually upload each file?
You just said the hosting provider accepts "PostgreSQL's own compressed 'custom' format". So install PostgreSQL locally, get it all up and working, and insert every page locally. Then you can upload to the hosting provider using pg_dump --format=c, which produces not just a single file, but a compressed one.
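A minimal sketch of that local bulk-insert step, done here with JDBC (the table definition, repo path, and credentials are all assumptions):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Walks a local checkout of the knowledgebase repo and inserts one row per
// HTML page, matching a hypothetical schema:
//   CREATE TABLE kb_page (path text PRIMARY KEY, body text);
public class KbLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/kb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO kb_page (path, body) VALUES (?, ?)")) {
            Files.walk(Paths.get("/path/to/repo"))
                 .filter(p -> p.toString().endsWith(".html"))
                 .forEach(p -> {
                     try {
                         ps.setString(1, p.toString());
                         ps.setString(2, new String(Files.readAllBytes(p)));
                         ps.executeUpdate();
                     } catch (Exception e) {
                         throw new RuntimeException("failed on " + p, e);
                     }
                 });
        }
    }
}
```

Once every page is inserted locally, pg_dump --format=c of this database produces the single compressed file to upload to the hosting provider.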

search using apache lucene

I want to use Lucene to make searching my database table fast.
The table query is select x,y,z from tablexyz. The searchable field is x. It has 2 million rows. I want to use it in a web application and show the data on a search page. Has anyone used Lucene to store an entire table in a text file?
I think that Apache Solr is what you are looking for.
To get started:
first read the tutorial to understand the basics,
then have a look at the DataImportHandler, which would probably be the easiest way to index your content.
No matter what technology you are using for your web application, Solr has a lot of connectors.
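Once the content is indexed, querying Solr from a Java web application might look like this SolrJ sketch (a SolrJ 6+/7 API is assumed; the core name 'tablexyz', the field 'x', and the URL are made up to match the question):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the core holding the indexed table rows.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/tablexyz").build()) {
            SolrQuery query = new SolrQuery("x:keyword"); // search field x
            query.setRows(20); // one page of results for the search page
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("x"));
            }
        }
    }
}
```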

Grails app: own query language for data stored in DB and files + full text search (Hibernate Search, Compass etc)

I have an application which stores short descriptive data in a DB and lots of related textual data in text files.
I would like to add "advanced search" for the DB. I was thinking about adding my own query language, like JIRA does (Jira Query Language). Then I thought about having full-text search across those text files (lower priority).
I'm wondering which tool would suit me better to implement that faster and more simply.
Most of all, I want to provide users with the ability to write their own queries instead of using UI elements to specify search filters.
Thanks
UPD: I save dates in the DB, and most varchar fields contain one-word strings.
UPD2: Apache Derby is used right now.
Take a look at the Searchable plugin for Grails.
http://www.grails.org/plugin/searchable

How best to develop the sql to support Search functionality in a web application?

Like many (business) web applications, the customer wants a form that will search across every data field. The form could have 15-20 different fields where the user can select/enter input to be used in the SQL (a stored procedure).
These are quite typical requests that almost every application has to deal with.
The real issue is how to provide the user with this type of interface/option AND establish fast SQL access. The above fields could span 15 different tables, and the respective SQL statements (usually abstracted into a stored procedure) will have as many joins. The data always has to be brought back to a grid-type view as well as some report format (often Excel).
We are finding these SQL statements slow and hard to optimize, since the user can enter anywhere from 1 to 15 different search criteria.
How should this be done? I'm looking for suggestions/ideas as to how existing large applications deal with these requirements.
Does it really come down to trying to optimize the SQL within the stored procedure?
thx
No, you need to employ real search engine technology to give full-text search good performance; no SQL predicate (e.g. LIKE '%pattern%') is going to be scalable. The sketch after the lists below illustrates the difference.
You don't identify which brand of RDBMS you're using, but every major RDBMS has its own full-text search capability:
Microsoft SQL Server: Full-Text Search
Oracle: Oracle Text (formerly ConText)
MySQL: FULLTEXT index (MyISAM only before MySQL 5.6; InnoDB is also supported since 5.6)
PostgreSQL: Text-Search data types and index types
SQLite: Full-Text Search (FTS)
IBM DB2: Text Search
There are also third-party solutions for indexing text, such as:
Apache Lucene / Solr
Sphinx Search
Xapian
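To make the contrast concrete, here is a hedged JDBC sketch for SQL Server, comparing a leading-wildcard LIKE with full-text CONTAINS() (the table, column, and connection details are hypothetical, and CONTAINS requires a full-text index on the column):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FulltextVsLike {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=app", "user", "pass")) {

            // Leading-wildcard LIKE: cannot use a B-tree index, so every
            // row is scanned; this is the predicate that does not scale.
            try (PreparedStatement slow = conn.prepareStatement(
                    "SELECT id FROM orders WHERE notes LIKE ?")) {
                slow.setString(1, "%pattern%");
                try (ResultSet rs = slow.executeQuery()) { /* render grid */ }
            }

            // Full-text CONTAINS(): resolved against the full-text index
            // instead of scanning the table.
            try (PreparedStatement fast = conn.prepareStatement(
                    "SELECT id FROM orders WHERE CONTAINS(notes, ?)")) {
                fast.setString(1, "pattern");
                try (ResultSet rs = fast.executeQuery()) { /* render grid */ }
            }
        }
    }
}
```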