I am a Solr newbie, and I am trying to use it to set up a faceted search over a denormalized database view (a table with a lot of fields).
At the moment I have created the index in Solr and I can query the data via the Solr URL. I will use Solr facets to generate the search menu: a set of given fields with all possible values and the number of occurrences for each value.
Now the question is: should I use Solr only to create the facets and use plain old SQL to query the database, or is it better to use Solr to run the search as well?
I use facets for search refinement. If you want to suggest to users what to look for, you should look at the Terms component:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
It is generally better to query Solr for your search results as well. If you query the database instead, the number of results can differ from the count shown next to a facet, because changes in the database may not have been indexed in Solr yet.
Another reason is performance: querying fields spread across multiple tables is expensive compared to querying denormalized documents indexed in a search engine.
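To make the "use Solr for both" idea concrete, here is a minimal sketch of one request that returns the matching documents and the facet counts together, so the two cannot drift apart. The host, the core name (products) and the field names (brand, color) are made up for illustration; only the request parameters (q, fq, rows, wt, facet, facet.field, facet.mincount) are standard Solr ones.

```python
from urllib.parse import urlencode

def build_facet_query(base_url, text, facet_fields, filters=None, rows=10):
    """Build a Solr /select URL that returns hits and facet counts in one request."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with zero hits
    ]
    for field in facet_fields:
        params.append(("facet.field", field))
    # Each refinement the user has clicked becomes a filter query (fq).
    # Values containing double quotes would need escaping; omitted in this sketch.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return base_url + "?" + urlencode(params)

# Hypothetical core and fields, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/products/select",
    "*:*",
    ["brand", "color"],
    {"brand": "Acme"},
)
print(url)
```

Because the counts and the result list come from the same index at the same moment, the number next to a facet value matches what the user sees after clicking it.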
I have two tables, old_test and new_test (a Bible database).
The old_test table has 7959 rows.
The new_test table has 23145 rows.
I want to use a LIKE query to search for verses in both tables.
For example:
SELECT *
FROM old_test
where text like "%'+searchword+'%"
union all
SELECT *
FROM new_test
where text like "%'+searchword+'%"
It works, but it takes a long time to show the results.
What is the best way to make this search much faster?
Thanks
Your %searchword% query causes a table scan, so it will get slower as the number of records increases. Use a searchword% query (a prefix match) so that the query can use an index.
What you need is full-text search, which is not available in Web SQL.
I suggest my own open source library, https://github.com/yathit/ydn-db-fulltext, for a full-text search implementation. It works with the newer IndexedDB API as well.
The main problem with your query is that LIKE has to scan every field, segment by segment, to find the string. Building an index that can be queried instead should alleviate the problem.
Looking at Web SQL, it uses the SQLite engine:
User agents must implement the SQL dialect supported by Sqlite 3.6.19.
http://www.w3.org/TR/webdatabase/#parsing-and-processing-sql-statements
Based on that, I would recommend building a full-text index over the table to make these searches run quickly: http://www.sqlite.org/fts3.html
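As a rough sketch of what an FTS3 index looks like, here it is in plain SQLite driven from Python's sqlite3 module rather than from Web SQL. The verses table, its testament and verse columns, and the sample rows are assumptions for illustration (both testaments are merged into one table so no UNION is needed). It also assumes the SQLite build has FTS3 enabled. Whether a given browser's Web SQL exposes FTS3 would have to be checked separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One full-text table for both testaments (hypothetical schema).
conn.execute("CREATE VIRTUAL TABLE verses USING fts3(testament, verse)")
conn.executemany(
    "INSERT INTO verses (testament, verse) VALUES (?, ?)",
    [
        ("old", "In the beginning God created the heaven and the earth."),
        ("new", "In the beginning was the Word, and the Word was with God."),
        ("new", "Jesus wept."),
    ],
)

def search(term):
    """Return (testament, verse) rows whose verse contains the given token."""
    cursor = conn.execute(
        "SELECT testament, verse FROM verses WHERE verse MATCH ?", (term,)
    )
    return cursor.fetchall()

print(search("beginning"))  # whole-word match in two verses
print(search("begin*"))     # prefix match
print(search("eginnin"))    # no rows: FTS matches tokens, not arbitrary substrings
```

Note the trade-off: MATCH works on whole words and word prefixes, so it is not a drop-in replacement for LIKE '%searchword%' when users expect matches in the middle of a word.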
I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
...and a many-to-many relationship between them (about 100K components per assembly), so 10G relationships in total.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields. I'm not positive whether Lucene supports them natively; it's been a while for me. If only one of the entities is searchable (you mentioned components), I would index all components with a field called "assemblies" or "assemblyIds" or something similar, and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 OR 2 OR 3)
to find components in assembly 1, 2 or 3.
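Here is a small sketch of the idea, assuming each component document carries a multi-valued assemblyIds field. The document shape, the sample IDs and both helper names are made up for illustration. The first helper builds the query clause with an explicit OR between the IDs, and the second mimics in plain Python what matching against a multi-valued field means.

```python
def ids_clause(field, ids):
    """Build a clause such as 'assemblyIds:(1 OR 2 OR 3)'."""
    ids = [str(int(i)) for i in ids]  # int() rejects anything that is not a number
    if not ids:
        raise ValueError("at least one id is required")
    return f"{field}:({' OR '.join(ids)})"

def matches(doc, wanted_ids):
    """A document matches if any value of its multi-valued field is wanted."""
    return any(a in wanted_ids for a in doc["assemblyIds"])

# Hypothetical component documents with a multi-valued field.
components = [
    {"id": "c1", "assemblyIds": [1, 7]},
    {"id": "c2", "assemblyIds": [4]},
    {"id": "c3", "assemblyIds": [3, 9]},
]

clause = ids_clause("assemblyIds", [1, 2, 3])
found = [doc["id"] for doc in components if matches(doc, {1, 2, 3})]
print(clause)
print(found)
```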
To be brief, you have to process the data and index it before you can search. There is no way to just "plug in" Lucene to some data or database. Instead, you have to feed the data itself into Lucene (process, parse, analyze, index, and query).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source and add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene, so I don't think it would be a problem for you to index.
You can add multiple field instances with different values ("components") to a document that also has an "assembly" field.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try the following framework, which acts as a bridge between a relational database and a Lucene index.
Hibernate Search: in that tutorial, search for the "@ManyToMany" keyword to find the relevant section and get some idea.
I have a set of 200M documents I need to index. Every document has free text and an additional set of sparse metadata (100+ columns).
It seems that the right tool for free text indexing is Lucene while the right tool for structured sparse metadata is HBase.
I would need to query the data and join the free text search results with the structured data results (e.g. get all books that have the phrase "good morning" in their text and were first published in 1980).
What tools/mechanisms should I look at to join structured and unstructured queries?
Results may include millions of records (before and after the join)
Thanks
Saar
A couple of things come to mind, in addition to Lucene on HBase:
1) Solr/Lucene can store multiple fields, and each field can have different types. So your date range example is plausible wholly within Solr.
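For point 1, a minimal sketch of the "good morning, first published in 1980" example done wholly within Solr: the phrase goes in q and the structured condition goes in a filter query. The field names (text, first_published) are assumptions about the schema; the quoted-phrase and [a TO b] range syntax are standard Solr query syntax.

```python
from urllib.parse import urlencode

def phrase_with_year(text_field, phrase, year_field, first_year, last_year):
    """Build a Solr query string: phrase search plus a year-range filter."""
    params = {
        "q": f'{text_field}:"{phrase}"',                       # free-text part
        "fq": f"{year_field}:[{first_year} TO {last_year}]",   # structured part
        "wt": "json",
    }
    return urlencode(params)

# Hypothetical field names, for illustration only.
qs = phrase_with_year("text", "good morning", "first_published", 1980, 1980)
print(qs)
```

Both conditions are evaluated inside the index, so there is no separate join step between a text result set and a metadata result set.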
2) If you are talking about truly huge data sets that require a cluster, also look at ElasticSearch: http://www.elasticsearch.org/
3) Lily attempts to answer your exact question http://www.lilyproject.org/lily/index.html
Looks like HBase would like some Lucene action as well: https://issues.apache.org/jira/browse/HBASE-3529.
I am currently trying to figure out how I could possibly use the full-text index to allow users to search for Keyword(s) within the DB data. The issue I am having is that not only do I want to know the source table, but the source column as well so I can tell the user where I found the hit.
I can do it by using the INFORMATION_SCHEMA and building a large table, that I can build an index on, but then I have to keep that table in sync with the source tables.
Any other thoughts on how to do something like this?
Thanks,
S
Could you do separate queries?
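A rough sketch of what separate queries could look like: one query per searchable column, with the table and column name attached to every hit. This uses SQLite and LIKE purely as a stand-in so the example runs anywhere; in the real database each query would use that engine's full-text predicate instead. The tables, columns and rows are made up for illustration.

```python
import sqlite3

# Fixed whitelist of searchable columns (hypothetical schema). The names are
# interpolated into SQL below, so they must never come from user input.
SEARCHABLE = {"customers": ["name", "notes"], "orders": ["comment"]}

def find_hits(conn, keyword):
    """Run one query per column and report (table, column, rowid) for each hit."""
    hits = []
    for table, columns in SEARCHABLE.items():
        for column in columns:
            sql = f"SELECT rowid FROM {table} WHERE {column} LIKE ?"
            for (rowid,) in conn.execute(sql, (f"%{keyword}%",)):
                hits.append((table, column, rowid))
    return hits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, notes TEXT)")
conn.execute("CREATE TABLE orders (comment TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ann Blue', 'prefers email')")
conn.execute("INSERT INTO orders VALUES ('blue paint, two cans')")

hits = find_hits(conn, "blue")
print(hits)
```

This avoids maintaining a synced lookup table, at the cost of one query per column.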
We have a SQL Server database with a million-ish records that are indexed by Lucene.Net through NHibernate.Search. When we built the index for our classes, we tried to be extensive, since the cost of indexing/retrieval was really small. The goal was to offer full-text searching to users on a webpage with pagination.
Since SQL Server complains when too many parameters are sent to it (2100 parameters by default), and since we didn't want to change that parameter every time we hit the limit (which can happen easily, as some terms in our documents are very common but must be searchable), we decided to handle everything from sorting to paging in Lucene. It worked like a charm.
However, feature creep has recently been causing us problems, because new queries need to access not only fields that aren't indexed but also fields that shouldn't or can't be indexed: computed fields, recommendation lists, etc.
Since we have put all our paging and sorting in Lucene.Net, and since SQL Server is picky regarding its parameters, how can we manage to have our cake and eat it too?
I'm looking into doing the SQL query computation first, reducing the elements to their doc IDs, and then feeding Lucene a gigantic OR query with all possible IDs to let it choose what's possible, but I worry about the query size.
pseudo code:
listIds = NHibernate.Criteria.ReduceToIds.List(of MyObject)
' prefix every id, including the first one, so the query reads "ID:1 ID:2 ..."
queryIds = "ID:" + String.Join(" ID:", listIds)
return NHibernate.Search(queryIds)
Apparently, it is possible to have Lucene Filters restrict a query to certain document IDs, so it should be possible, but I don't really see a way to do it in NHibernate.Search.
Do you have any idea how I should handle the problem? Is it possible to filter the query by asking SQL for the list of IDs? Is it overkill? Any other solution out there?
Usually you have problems when Lucene.Net returns more than 2100 results, as NHibernate.Search will build a big SELECT * FROM T WHERE ID IN (@p0, @p1, ...)
So, if your Lucene query doesn't return more than 2100 results, you should be fine.
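If the Lucene result can exceed that limit, one possible workaround (not something NHibernate.Search does for you, as far as I know) is to split the IDs into batches that stay under 2100 parameters and load each batch with its own IN query. A minimal sketch in Python, just to show the batching logic; the batch size of 2000 and the helper names are arbitrary choices for illustration:

```python
def chunked(ids, size=2000):
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def in_clause_batches(ids, size=2000):
    """Return (sql, params) pairs, each staying under the parameter limit."""
    batches = []
    for batch in chunked(list(ids), size):
        placeholders = ", ".join(f"@p{i}" for i in range(len(batch)))
        sql = f"SELECT * FROM T WHERE ID IN ({placeholders})"
        batches.append((sql, batch))
    return batches

# 4500 ids returned by the full-text search, in Lucene's sort order.
batches = in_clause_batches(range(1, 4501))
print([len(params) for _, params in batches])
```

The rows loaded this way come back in no particular order, so they still have to be re-ordered to follow the ID order Lucene returned. With paging done in Lucene, a single page rarely needs more than one batch anyway.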