How to retrieve results by date range and sort using SOLR with ColdFusion 9.0.1? - lucene

I'm using ColdFusion 9.0.1 and the integrated SOLR full text search engine.
I have dates stored in my SQL Server database as datetime fields for upcoming events. I took these records and inserted them into a SOLR collection with the custom3 and custom4 fields being the dateStart and dateEnd dates respectively. Users want to query the collection against a date range and sort by closest date to now.
First question: How do we set the datatype for the custom1-4 fields? Or, can we? Based on this post, Optimizing Solr for Sorting, the field should be set to either tdate or date rather than string for best performance. Or does SOLR automatically make the field have the correct datatype based on this post, Sort by date in Solr/Lucene performance problems?
Second question: How would the search criteria be structured to pull records? How about between May 1, 2011 and July 31, 2011, for example?

I don't tell too many people this, but for you, I believe it's time to ditch CFINDEX/CFSEARCH, and start using Solr directly.
CF's implementation is built for indexing a large block of text with some attributes, not a query. If you start using Solr directly, you can create your own schema, and have far more granular control of how your search works. Yes, it's going to take longer to implement, but you will love the results. Filtering by date is just the beginning.
Here's a quick overview of the steps:
Create a new index using the CFAdmin. This is the easy way to create all the files you need.
Modify the schema. The schema is in [cfroot]/solr/multicore/[your index name]/conf/
The top half of the schema is <types>. This defines all the datatypes you could use. The bottom half is the <fields>, and this is where you're going to be making most of your changes. It's pretty straightforward, just like a table. Create a field for each "column" you want to include. "indexed" means that you want to make that field searchable. "stored" means that you want the exact data stored, so that you can use it to display results. Because I'm using CF9's ORM, I don't store much beyond the primary key, and I use loadEntityByPK() on my results page.
After modifying the schema, you need to restart the solr service/daemon.
Use http://cfsolrlib.riaforge.org/ to index your data (the add method is a 'insert or modify' style method), and to perform the search.
To do a search, check out this example. It shows how to sort and filter by date. I didn't test it, so the format of the dates might be wrong, but you'll get the idea. http://pastebin.com/eBBYkvCW
Sorry this is answer is so general, I hope I can get you going down the right path here :)

Related

Is using comma separated field good or not

I have a table named buildings
each building has zero - n images
I have two solutions
the first one (the classic solution) using two tables:
buildings(id, name, address)
building_images(id, building_id, image_url)
and the second solution using olny one table
buildings(id, name, address, image_urls_csv)
Given I won't need to search by image URL obviously,
I think the second solution (using image_urls_csv column) is easier to use, and no need to create another table just to keep the images, also I will avoid the hassle of multiple queries or joining.
the question is, if I don't really want to filter, search or group by the filed value, can I just make it CSV?
On the one hand, by simply having a column of image_urls_list avoids joins or multiple queries, yes. A single round-trip to the db is always a plus.
On the other hand, you then have a string of urls that you need to parse. What happens when a URL has a comma in it? Oh, I know, you quote it. But now you need a parser that is beyond a simple naive split on commas. And then, three months from now, someone will ask you which buildings share a given image, and you'll go through contortions to handle quotes, not-quotes, and entries that are at the beginning or end of the string (and thus don't have commas on either side). You'll start writing some SQL to handle all this and then say to heck with it all and push it up to your higher-level language to parse each entry and tell if a given image is in there, and find that this is slow, although you'll realise that you can at least look for %<url>% to limit it, ... and now you've spent more time trying to hack around your performance improvement of putting everything into a single entry than you saved by avoiding joins.
A year later, someone will give you a building with so many URLs that it overflows the text limit you put in for that field, breaking the whole thing. Or add some extra fields to each for extra metadata ("last updated", "expires", ...).
So, yes, you absolutely can put in a list of URLs here. And if this is postgres or any other db that has arrays as a first-class field type, that may be okay. But do yourself a favour, and keep them separate. It's a moderate amount of up-front pain, and the long-term gain is probably going to make you very happy you did.
Not
"Given I won't need to search by image URL obviously" is an assumption that you cannot make about a database. Even if you never do end up searching by url, you might add other attributes of building images, such as titles, alt tags, width, height, etc, so you would end up having to serialize all this data in that one column, and then you would not be able to index any of it. Plus, if you serialize it with one language, then you or whoever comes after you using a different language will either have to install some 3rd party library to deserialize your stuff or write their own deserialization function.
The only case that I can think of where you should keep serialized data in a database is when you inherit old software that you don't have time to fix yet.

SQL Server Efficient Search for LIKE '%str%'

In Sql Server, I have a table containing 46 million rows.
In "Title" column of table, I want make search. The word may be at any index of field value.
For example:
Value in table: BROTHERS COMPANY
Search string: ROTHER
I want this search to match the given record. This is exactly what LIKE '%ROTHER%' do. However, LIKE '%%' usage should not be used on large tables because of performance issues. How can I achieve it?
Though I don't know your requirements, your best approach may be to challenge them. Middle-of-the-string searches are usually not very practical. If you can get your users to perform prefix searches (broth%) then you can easily use Full Text's wildcard search (CONTAINS(*, '"broth*"')). Full Text can also handle suffix searches (%rothers) with a little extra work.
But when it comes to middle-of-the-string searches with SQL Server, you're stuck using LIKE. However you may be able to improve performance of LIKE by using a binary collation as explained in this article. (I hate to post a link without including its content but it is way too long of an article to post here and I don't understand the approach enough to sum it up.)
If that doesn't help and if middle-of-the-string searches are that important of a requirement then you should consider using a different search solution like Lucene.
Add Full-Text index if you want.
You can search the table using CONTAINS:
SELECT *
FROM YourTable
WHERE CONTAINS(TableColumnName, 'SearchItem')

How could i write this code in a more performant way?

In our app people have 1 or multiple projects. These projects have a start and an end date. People have a limited amount of available days.
Now we have a page that displays the availability of a given person on a week by week basis. It currently shows 18 weeks.
The way we currently calculate the available time for a given week is like this:
def days_available(query_date=Date.today)
days_engaged = projects.current.where("start_date < ? AND finish_date > ?", query_date, query_date).sum(:days_on_project)
available = days_total - hours_engaged
end
This means that to display the page descibed above the app will fire 18(!) queries into the database. We have pages that lists the availability of multiple people in a table. For these pages the amount of queries is quickly becomes staggering.
It is also quite slow.
How could we handle the availability retrieval in a more performant manner?
This is quite a common scenario when working with date ranges in an entity. Easy and fastest way is in SQL:
Join your events to a number generated date table (see generate days from date range) so that you have a row for each day a person or people are occupied. Once you have the data in this form it is simply a matter of grouping by the week date part of the date and counting the rows per grouping.
You can extend this to group by person for multiple person queries.
From a SQL point of view, I'd advise using a stored procedure and pass in your date/range requirement, you can then return a recordset for a user or possibly multiple users. This way your code just has to access db once.
You can then output recordset data in one go, by iterating through.
Hope this helps.
USE Stored procedure to fire your query to SQL to get data.
Pass paramerts in your case it is today's date to the SQl query.
Apply your conditions and Logic in the SQL Stored procedure , Using procedure is the goood and fastest way to retrieve data from the SQL , also it will prevent your code from the SQL injection too.
Call that SP from your Code as i dont know the Ruby on raisl I cant provide you steps about how to Call the Stored procedure from it.
After that the data fdetched as per you stored procedure will be available in Data table or something like that.
After getting the data you can perform all you need
Hope this helps
see what query is executed. further you may make comand explain to your query
explain select * from project where start_date < any_date and end_date> any_date2
you see the plan of query . Use this plan to optimized your query.
for example :
if you have index using field end_date replace a condition(end_date> any_date2 and start_date < any_date) . this step will using index if you have index on this field. But it step is db dependent . example is for nysql. if you want use index in mysql you must have using index condition on left part of where
There's not really enough information in your question to know exactly what you're trying to achieve here, e.g. the code snippet doesn't make use of the returned database query, so you could just remove it to make it faster. Perhaps this is just a bug in the code you posted?
Having said that, there are some techniques you should look into to implement your functionality.
I would take a look at using data warehouse techniques. I would think of your 'availability information' as a Fact table in a star schema, with 'Dates' and 'People' as Dimension tables.
You can then use queries to get stuff like - list of users for this projects for this week, and their availability.
Data warehousing has a whole bunch of resources you can tap into to help make this perform well, but there's also a lot of terminology that can be confusing, but for this type of 'I need to slice and dice my data across several sets of things (people and time)', Data Warehousing techniques can be quite powerful.
As I dont understand ruby on rails,from sql point of view i suggest you to write a stored procedure and return a dataset.And do the necessary table operations on the dataset from front end.It will reduce the unnecessary calls to DB.

Lucene Fuzzy Search for customer names and partial address

I was going thru all the existing questions posts but couldn't get something much relevant.
I have file with millions of records for person first name, last name, address1, address2, country code, date of birth - I would like to check my list of customers with above file on daily basis (my customer list also get updated daily and file also gets updated daily).
For first name and last name I would like fuzzy match (may be lucene fuzzyquery/levenshtein distance 90% match) and for remaining fields country and date of birth I wanted exact match.
I am new to Lucene, but by looking at number of posts, looks like its possible.
My questions are:
How should I index my input file? I need to build index on combination of FN, LN, country, DOB and use the index for search
How I can use Fuzzy query of Lucene here?
Is there any other way I can implement the same?
Rushik, here are a few ideas:
Consider using Solr. It is much easier to start using it, rather than bare Lucene.
Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
I assume your customer list is smaller than the file. A possible policy would be to daily index the changes in the file (Here a unique id is really handy - otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
What you describe is a BooleanQuery, Whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmaticaly or using the query parser.
Consider using soundex for names as described here.
Some academic papers on this subject are well worth reading (google for the free PDFs):
A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
Overview of Record Linkage and Current Research Directions (2006)
A Parallel Open Source Data Linkage System (2004)
You should also consider the following libraries/frameworks:
Duke: https://github.com/larsga/Duke
Febrl: http://sourceforge.net/projects/febrl/
(Answered for future visitors.)

Manipulate data in the DB query or in the code

How do you decide on which side you perform your data manipulation when you can either do it in the code or in the query ?
When you need to display a date in a specific format for example. Do you retrieve the desired format directly in the sql query or you retrieve the date then format it through the code ?
What helps you to decide : performance, best practice, preference in SQL vs the code language, complexity of the task... ?
All things being equal I prefer to do any manipulation in code. I try to return data as raw as possible so its usuable by a larger base of consumers. If its very specialized, maybe a report, then I may do manipulation on the SQL side.
Another instance where I prefer to do manipulation on the SQL side is if it can be done set based.
If its not set based, and looping would be involved, then I would do the manipulation in code.
Basically let the database do what its good at, otherwise do it in code.
Formatting is a UI issue, it is not 'manipulation'.
My answer is the reverse of everyone else's.
If you are going to have to apply the same formatting logic (the same holds true for calculation logic) in more than one place in your application, or in separate applications, I would encapsulate the formatting in a view inside the database and SELECT from the view. You do not need to hide the original data, that can also be available. But by putting the logic into the database view you're making it trivially easy to have consistent formatting across modules and applications.
For instance, a Customer table would have an associated view CustomerEx with a MailingAddress derived column that would format the various parts of the address as required, combining city, state, and zip and compressing out blank lines, etc. My application code SELECTs against the CustomerEx view for addresses. If I extend my data model with, say, an Apt# field or to handle international addresses, I only need to change that single view. I do not need to change, or even recompile, my application.
I would never (ever) specify any formatting in the query itself. That is up to the consumer to decide how to format. All data manipulation should be done at the client side, except for bulk operations.
If it is just formatting and will not always need to be the same formatting, I'd do it in the application which is likely to do this faster.
However the fastest formatting is the one that is done only once, so if it is a standard format that I alawys want to use (say displaying American phone numbers as (###)###-#### ) then I'll store the data in the database in that format (this still may involve the application code, but onthe insert not the select). This is especially true if you might need to reformat a million records for a report. If you have several formats, you might considered calculated columns (we have one for full name and one for lastname, firstname and our raw data is firstname, middlename, lastname, suffix) or triggers to persist the data. In general I say store the data the way you need to see it if you can keep it in the appropriate data type for the real manipulations you need to do such as datemath or regular math for money values.
About the only thing that I do in a query that could probably be done in code also is converting the datetimes to the user's time zone.
MySQL's CONVERT_TZ() function is easy to use and accurate. I store all of my datetimes in UTC, and retrieve them in the user's time zone. Daylight savings rules change. This is especially important for client applications since relying on the native library relies on the fact that the user has updated their OS.
Even for server side code, like a web server, I only have to update a few tables to get the latest time zone data instead of updating the OS on the server.
Other than those types of issues, it's probably best to distribute most functions to the application server or client rather than making your database the bottleneck. Application servers are easier to scale than database servers.
If you can write a stored procedure or something that might start with a large dataset, do some inexpensive calculations or simple iteration to return a single row or value, then it probably makes sense to do it on the server to save from sending large datasets over the wire. So, if the processing is inexpensive, why not have the database return just what you need?
In the case of the date column, I'd save the full date in the DB and when I return it I specify in code how I'd like to show it to the user. This way you can ignore the time part or even change the order of the date parts when you show it in a datagrid for example: mm/dd/yyyy, dd/mm/yyyy or only mm/yyyy.