How to quick match an entry with the beginning of a long string? - sql

I have a table articles (:Rails Model but I think the issue is more SQL related) which have a column name permalink. To instance, some of my permalinks :
title-of-article
great-article
great-article-about-obama
obama-stuff-about-him
I want to match a request like great-article-about-obama-random-stuff to great-article-about-obama. Is it possible to do it, avoiding killing performance ?
Thanks to all,
ps : We use Rails 3 and Postgresql (or Sqlite not decided yet for production)
EDIT
We can do something like this, but the main downside is we have to fetched every single permalinks from the table articles :
permalinks = ['title-of-article','great-article','great-article-about-obama','obama-stuff-about-him']
string_to_match = 'great-article-about-obama-random-stuf'
result = permalinks.inject('') do |matched,permalink|
matched = (string_to_match.include? permalink and permalink.size > matched.size) ? permalink : matched
end
result => 'great-article-about-obama'
I'll love to find a way to do it directly in SQL for obvious performance reason.

Unless using a text-search base technology (w/ postgres : http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html + http://tenderlovemaking.com/2009/10/17/full-text-search-on-heroku/ or solr, indexTank) you can do it with :
request = "chien-qui-aboie"
article = nil
while !article do
article = Article.where("permalink like ?", request+"%").select(:id).first
request.gsub!(/-[^-]*$/) unless article
end
This will first look for chien-qui-aboie%, then chien-qui%, then chien%.
This will also match "chien_qui_mange" if there is an article "chien_qui_mange" but no one about "chien qui aboie"
That's not optimal because of the number of requests, but that's not that heavy if it's just a look up, and not the normal way of accessing a record.

Related

phalcon querybuilder total_items always returns 1

I make a query via createBuilder() and when executing it (getQuery()->execute()->toArray())
I got 10946 elements. I want to paginate it, so I pass it to:
$paginator = new \Phalcon\Paginator\Adapter\QueryBuilder(array(
"builder" => $builder,
"limit" => $limit,
"page" => $current_page
));
$limit is 25 and $current_page is 1, but when doing:
$paginator->getPaginate();
$page->total_items;
returns 1.
Is that a bug or am I missing something?
UPD: it seems like when counting items it uses created sql with limit. There is no difference what limit is, limit divided by items per page always equals 1. I might be mistaken.
UPD2: Colleague helped me to figure this out, the bug was in the query phalcon produces: count() of the group by counts grouped elements. So a workaround looks like:
$dataCount = $builder->getQuery()->execute()->count();
$page->next = $page->current + 1;
$page->before = $page->current - 1 > 0 ? $page->current - 1 : 1;
$page->total_items = $dataCount;
$page->total_pages = ceil($dataCount / 100);
$page->last = $page->total_pages;
I know this isn't much of an answer but this is most likely to be a bug. Great guys at Phalcon took on a massive job that is too big to do it properly in their little free time and things like PHQL, Volt and other big but non-core components do not receive as much attention as we'd like. Also given that most time in the past 6 months was spent on v2 there are nearly 500 bugs about stuff like that and it's counting. I came across considerable issues in ORM, Volt, Validation and Session, which in the end made me stick to other not as cool but more proven solutions. When v2 comes out I'm sure all attention will on the bug list and testing, until then we are mostly on our own. Given that it's all C right now, only a few enthusiast get involved, with v2 this will also change.
If this is the only problem you are hitting, the best approach is to update your query to get the information you need yourself without getPaginate().

read all document by using particular category name using alfresco search.luceneSearch or search.lib.js

Category Name
|
Geograpy (8)
Study Db (18)
i am implement my own advance search in alfresco. i need to read all files which related with particular category.
example:
if there is 20 file under geograpy, lucene query should read particular document under search key word "banana".
Further explanation -
I am using search.lib.js to search. I would like to analyze the result to find out to which category the documents belong to. For example I would like to know how many documents belong to the category under Languages and the subcategories. I experimented with the Classification API but I don't get the result I want. Any Idea how to go through the result to get the category name of each document?
is there any simple method like node.properties["cm:creator"]?
thanks
janaka
I think you should specify more your question:
Are you using cm:content or a customized content?
Are you going to search the keyword inside the content of the file? or are you going to search the keyword in a specific metadata(s)?
Do you want to create a webscript (java or javascript)?
One thing to take in consideration:
if you use +PATH:"cm:generalclassifiable/...." for the categorization in your lucene queries, the performance will be slow (following my experince)
You can use for example the next query to find all nodes at any depth below /cm:Languages:
var results = search.luceneSearch("+PATH:\"cm:generalclassifiable/cm:Languages//*\");
Take a look to this url: https://wiki.alfresco.com/wiki/Search#Path_Queries
Once you have all the elements, you can loop all, and get to which category below. Of course you need to create some counter per each category/subcategory:
for(i = 0; i < results.length; i++){
var node = results[i];
var categoryNodeRef = node.properties["cm:categories"];
var categoryDesc = categoryNodeRef.properties["cm:description"];
var categoryName = categoryNodeRef.properties["cm:name"];
}
This is not exactly the solution, but can be a useful idea to start.
Sorry if it's not what you're asking for, I have just arrived from my holidays.

Update more record in one query with Active Record in Rails

Is there a better way to update more record in one query with different values in Ruby on Rails? I solved using CASE in SQL, but is there any Active Record solution for that?
Basically I save a new sort order when a new list arrive back from a jquery ajax post.
#List of product ids in sorted order. Get from jqueryui sortable plugin.
#product_ids = [3,1,2,4,7,6,5]
# Simple solution which generate a loads of queries. Working but slow.
#product_ids.each_with_index do |id, index|
# Product.where(id: id).update_all(sort_order: index+1)
#end
##CASE syntax example:
##Product.where(id: product_ids).update_all("sort_order = CASE id WHEN 539 THEN 1 WHEN 540 THEN 2 WHEN 542 THEN 3 END")
case_string = "sort_order = CASE id "
product_ids.each_with_index do |id, index|
case_string += "WHEN #{id} THEN #{index+1} "
end
case_string += "END"
Product.where(id: product_ids).update_all(case_string)
This solution works fast and only one query, but I create a query string like in php. :) What would be your suggestion?
You should check out the acts_as_list gem. It does everything you need and it uses 1-3 queries behind the scenes. Its a perfect match to use with jquery sortable plugin. It relies on incrementing/decrementing the position (sort_order) field directly in SQL.
This won't be a good solution for you, if your UI/UX relies on saving the order manually by the user (user sorts out the things and then clicks update/save). However I strongly discourage this kind of interface, unless there is a specific reason (for example you cannot have intermediate state in database between old and new order, because something else depends on that order).
If thats not the case, then by all means just do an asynchronous update after user moves one element (and acts_as_list will be great to help you accomplish that).
Check out:
https://github.com/swanandp/acts_as_list/blob/master/lib/acts_as_list/active_record/acts/list.rb#L324
# This has the effect of moving all the higher items down one.
def increment_positions_on_higher_items
return unless in_list?
acts_as_list_class.unscoped.where(
"#{scope_condition} AND #{position_column} < #{send(position_column).to_i}"
).update_all(
"#{position_column} = (#{position_column} + 1)"
)
end

Apache solr - more like this score

I have a small index with ~1000 documents with only two fields:
- id (string)
- content (text_general)
I noticed that when I do MLT search by id for similar content, the original document(which id is the searched id) have a score 5.241327.
There is 1:1 duplicated document and for the duplicated content it is returning score = 1.5258181. Why? Why it is not 5.241327 when it is 100% duplicate.
Another question is can I in any way to get similarity documents by content by passing some text in the query.
Example:
/mlt/?q=content:Some encoded long text&mlt.fl=content
I am trying to check if there is similar content uploaded and the check must be performed at new content upload time.
It might be worth to try some different parameters. I also use MLT on only one field, I use the following parameters:
'mlt.boost': 'true',
'mlt.fl': 'my_field_name',
'mlt.maxqt': 1000,
'mlt.mindf': '0',
'mlt.mintf': '0',
'qt': 'mlt',
'rows': '10'
See http://wiki.apache.org/solr/MoreLikeThis for an explanation of the parameters. I think with a small index mindf might be important and I see the default mintf (term frequency) is 2, so I assume an ID is only one term, so this is probably ignored!
First, how does Solr More-Like-This works?
A regular Solr query is conducted (e.g. "?q=content:Some encoded long text&.....".
For each document returned by the above query, More-Like-This conduct More like this query...
So, the first result set "response", is just like any Solr query results set.
The More-Like-This appears below and start with something like that (Json format):
"moreLikeThis":{
"57375":{"numFound":18155,"start":0,"docs":["
For an explanation about More Like This algorithm, please read that:
http://blog.brattland.no/node/18
and: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
If you didn't solved the problem yet, please let me know and I will guide you through.

Detailed information in Lucene/Solr results

After having performed a search in Lucene/Solr without having specified a field, how can I know in which fields of a result document the search string was found (and how often)?
You could use Query Highlighting.
Try setting debugQuery=on. See this example.
As mentioned, use debugQuery=true. The response will then include an "explain" section. By default, this will give you some awful formatted text that looks like this:
0.69102794 = (MATCH) weight(body:arrai^1.5 in 6357), product of:
0.46610788 = queryWeight(body:arrai^1.5), product of:
1.5 = boost
5.591044 = idf(docFreq=55709, maxDocs=5492855)
0.055577915 = queryNorm
1.4825494 = (MATCH) fieldWeight(body:arrai in 6357), product of:
2.828427 = tf(termFreq(body:arrai)=8)
5.591044 = idf(docFreq=55709, maxDocs=5492855)
0.09375 = fieldNorm(field=body, doc=6357)
For each match in each field, you will get a block like this that explains how SOLR computed the relevancy of this document to your query. What you're asking about (how many matches in this document's field) SOLR calls term frequency "tf". You can see this on the 7th line of the output i pasted above. In this line, SOLR is telling you that it found 8 matches for arrai in the field called "body".
The other lines stand for things like inverse document frequency-"idf" (how rare the matched term is) and fieldNorm, which relates to how short the document's field is relative to the match. You can learn about these here: http://wiki.apache.org/solr/SolrRelevancyFAQ
FYI if you need this "explain" information in a structured format instead of clumsy text you can pass this parameter with your query: debug.explain.structured=true However, its still pretty hard to use = )