Merge two sets of Lucene search results without duplicates? - lucene

I have two TopDocs objects. They both contain the same results but one is ordered by relevance and the other is weighted by date. I want to alternate between showing a relevant result and showing a recent result.
I can't think of a way to do this which doesn't involve iterating over every single result. Does anyone have any ideas?
Thanks,
Joe

Something like this?
// ScoreDoc does not override equals/hashCode, so de-duplicate on the Lucene doc id instead.
Map<Integer, ScoreDoc> unique = new LinkedHashMap<Integer, ScoreDoc>();
for (ScoreDoc sd : firstScoreDoc) unique.put(sd.doc, sd);
for (ScoreDoc sd : secondScoreDoc) unique.put(sd.doc, sd);
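If the aim is to alternate between the two orderings while skipping duplicates, a rough sketch along these lines may help (the class, method and variable names are assumptions; a hit's identity is taken to be its Lucene doc id):
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class InterleavedResults {
    // Alternate between the relevance-ordered and date-ordered hits,
    // skipping any document that has already been shown.
    public static List<ScoreDoc> interleave(TopDocs byRelevance, TopDocs byDate) {
        ScoreDoc[] relevant = byRelevance.scoreDocs;
        ScoreDoc[] recent = byDate.scoreDocs;
        Set<Integer> seen = new HashSet<Integer>();      // doc ids already emitted
        List<ScoreDoc> merged = new ArrayList<ScoreDoc>();
        int i = 0, j = 0;
        boolean takeRelevant = true;                     // start with a relevant hit
        while (i < relevant.length || j < recent.length) {
            ScoreDoc next = null;
            if (takeRelevant) {
                while (i < relevant.length && next == null) {
                    if (seen.add(relevant[i].doc)) next = relevant[i];
                    i++;
                }
            } else {
                while (j < recent.length && next == null) {
                    if (seen.add(recent[j].doc)) next = recent[j];
                    j++;
                }
            }
            if (next != null) merged.add(next);
            takeRelevant = !takeRelevant;                // switch source for the next slot
        }
        return merged;
    }
}

// Usage: List<ScoreDoc> display = InterleavedResults.interleave(topDocsByRelevance, topDocsByDate);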

Related

Pandas Value Counts With Constraint For More Than One Occurrence

Working with the Wine Review Data from Kaggle here. I am able to return the number of occurrences by variety using value_counts().
However, I am trying to find a quick way to limit the results to varieties and their counts where there is more than one occurrence.
Both df.loc[df['variety'].value_counts()>1].value_counts()
and df['variety'].loc[df['variety'].value_counts()>1].value_counts()
return errors.
The results can be turned into a DataFrame and the constraint added there, but something tells me there is a more elegant way to achieve this.
Wen answered this in the comments:
df['variety'].value_counts().loc[lambda x : x>1]

Filter on Count Aggregation

I have been looking for a solution on the internet for quite a while and I'm still not sure whether it is possible in Kibana or not.
Suppose I apply a filter on a term and it gives me the count of the respective terms, but I want the results to show only those terms where the count equals a specific value.
Being more specific,
I want to find out the tills which are the busiest (the most transactions). Currently, when I apply a filter on term and count, it shows me all the tills with their respective transaction counts. What I want is to show only those tills where the count is equal to, let's say, 10.
In other words, functionality similar to the HAVING clause in a relational DBMS.
I have found a lot of workarounds for this use case, but I'm looking for a proper solution.
I hope I understand what you're asking. I think you can search the field in question with the proper parameters. For example, for the field 'field_name' with more than 10 hits, try the following Lucene query:
field_name:(*) AND count:[10 TO *]
For an exact match where the count equals 10, query:
field_name:(*) AND count:10
Let me know if this was what you were looking for!

Solr: How can I get all documents ordered by score with a list of keywords?

I have a Solr 3.1 index containing emails with two fields:
datetime
text
For the query I have two parameters:
date of today
keyword array("important thing", "important too", "not so important, but more than average")
Is it possible to create a query to
get ALL documents of this day AND
sort them by relevance, ordering them so that the email which contains the most of my keywords (the important things) scores best?
The part with the date is not very complicated:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
I know that you can boost the keywords this way:
q=text:"first keyword"^5 OR text:"second one"^2 OR text:"minus scoring"^0.5 OR text:"*"
But how do I use the keywords only for sorting and still get ALL entries back, instead of doing a real query and getting only a few entries back?
Thanks for help!
You need to specify your terms in the main query and then turn your date restriction into a filter query on those results by adding the following:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
So you should have something like this:
q=<terms go here>&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
Edit: A little more about filter queries (as suggested by rfreak).
From Solr Wiki - FilterQuery Guidance - "Now, what is a filter query? It is simply a part of a query that is factored out for special treatment. This is achieved in Solr by specifying it using the fq (filter query) parameter instead of the q (main query) parameter. The same result could be achieved leaving that query part in the main query. The difference will be in query efficiency. That's because the result of a filter query is cached and then used to filter a primary query result using set intersection."
These should already be sorted by relevance score; that is just the default behavior of Solr. You can see the score by adding that field:
fl=*,score
If you use the Full Interface option for Make A Query on the Admin Interface of your Solr installation at http://<yourserver:port#>/<instancename>/admin/form.jsp, you will see where you can specify the filter query, fields, and other options. You can check out the Solr Wiki for more details on the options and how they are used.
I hope that this helps you.
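If you are querying from Java, a minimal SolrJ sketch of the same request might look like this (the server URL, boosts, row count and date bounds are placeholder assumptions; Solr 3.x era client):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class BoostedDateFilteredQuery {
    public static void main(String[] args) throws Exception {
        // Solr 3.x era SolrJ client; newer versions use HttpSolrClient instead.
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery();
        // Boosted keyword phrases in the main query: this is what drives the relevance score.
        query.setQuery("text:\"important thing\"^5 OR text:\"important too\"^2 "
                + "OR text:\"not so important, but more than average\"^0.5 OR text:*");
        // Date restriction as a filter query: cached, and it does not influence scoring.
        query.addFilterQuery("datetime:[2011-05-10T00:00:00.000Z TO 2011-05-10T23:59:59.999Z]");
        // Return the relevance score alongside the stored fields.
        query.set("fl", "*,score");
        query.setRows(100);

        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("score") + "  " + doc.getFieldValue("text"));
        }
    }
}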
You could do a first query for:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
which gives all documents that match the range. Then, use a CachingWrapperFilter for the second query to find documents in the DocSet from the first query which contain at least one keyword. They will be relevance-ranked per tf-idf. You may want to use a ConstantScoreQuery for the first query to get the list of matching doc ids in the fastest possible way.
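If you drop down to raw Lucene instead of going through Solr's fq, the two-step idea above can be sketched roughly as follows (Lucene 3.x API; the field names, phrase, and date bounds are placeholder assumptions):
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopDocs;

public class TodayKeywordSearch {
    // 'searcher' is assumed to be an open IndexSearcher over the mail index.
    public static TopDocs searchTodayByKeywords(IndexSearcher searcher) throws IOException {
        // Non-scoring, cached restriction to today's date range (bounds are placeholders).
        Filter todayFilter = new CachingWrapperFilter(new QueryWrapperFilter(
                new TermRangeQuery("datetime",
                        "2011-05-10T00:00:00.000Z", "2011-05-10T23:59:59.999Z", true, true)));

        // Keyword query: any phrase may match; tf-idf ranks the surviving documents.
        BooleanQuery keywords = new BooleanQuery();
        PhraseQuery importantThing = new PhraseQuery();
        importantThing.add(new Term("text", "important"));
        importantThing.add(new Term("text", "thing"));
        keywords.add(importantThing, Occur.SHOULD);

        // Only documents passing the filter are scored and returned.
        return searcher.search(keywords, todayFilter, 100);
    }
}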
Sorting by relevance is the default behavior in Solr/Lucene.
If your results are unsatisfactory, try putting the keywords in quotes.
//Edit: Following the answer from Paige Cook, use something like this:
q="important thing"&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
//2nd update: Thinking about this answer some more, quotes are not a good idea, because in that case you will only receive "important thing" mails, but no "important too" ones.
The point is which keywords you use. Searching for -- important thing -- gives the highest scores to "important thing" mails, but Lucene does not know how to score "important too" or "not so important, but more than average" in relation to your keywords.
Another idea would be to search only for "important". But the field values "important thing" and "important too" then get nearly the same score, because in both cases the searched keyword ("important") makes up about half of the field value.
So you probably have to change your keywords. It could work after changing "important too" into "also an important mail", to get the best ratio of the search word "important" to the field value, so that the shortest mail description scores highest.

MySQL: select the closest match?

I want to show the closest related item for a product. So say I am showing a product and the style number is SG-sfs35s. Is there a way to select whatever product's style number is closest to that?
Thanks.
EDIT: to answer your questions. Well, I definitely want to keep the first 2 letters, as that is the manufacturer code, but as for the part after the first dash, just whatever matches closest. So, for example, SG-sfs35s would match SG-shs35s much more closely than SG-sht64s. I hope this makes sense. Whenever I do LIKE product_style_number it only pulls the exact match.
There normally isn't a simple way to match product codes that are roughly similar.
A more SQL friendly solution is to create a new table that maps each product to all the products it is similar to.
This table would either need to be maintained manually, or a more sophisticated script can be executed periodically to update it.
If your product codes follow a consistent pattern (all the letters are the same for similar products, with only the numbers changing), then you should be able to use a regular expression to match the similar items. There are docs on this here...
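For instance (a hypothetical Java-side sketch, not from the original answer), if only the digits vary between similar codes, a pattern derived from the style number could be used to filter candidate codes:
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SimilarCodes {
    public static void main(String[] args) {
        // Build a pattern from the reference code: keep the letters fixed, let the digits vary.
        String reference = "SG-sfs35s";
        Pattern similar = Pattern.compile(reference.replaceAll("[0-9]+", "[0-9]+"));

        // Candidate codes would normally come from a query restricted to the "SG-" prefix.
        List<String> candidates = Arrays.asList("SG-sfs41s", "SG-sht64s", "XX-sfs35s");
        for (String code : candidates) {
            if (similar.matcher(code).matches()) {
                System.out.println(code + " looks similar to " + reference);  // prints SG-sfs41s
            }
        }
    }
}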
It sounds like what you want is the Levenshtein distance.
Unfortunately, there isn't a built-in Levenshtein function in MySQL, but some folks have come up with a user-defined function that does it (dead link).
You will probably want to do it as a stored procedure, as I expect that the algorithm may not be trivial.
For example, you may split the term at the -, so you have two parts. You do a LIKE query on each part and use that to make a decision.
You could just loop through, replacing the last character with "%" until you get at least one result, in your stored procedure.
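If you end up computing the distance in application code rather than via a MySQL user-defined function, a minimal self-contained sketch might look like this (the candidate codes are made-up examples and would normally come from a query limited to the manufacturer prefix):
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ClosestCode {
    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String target = "SG-sfs35s";
        // Candidate codes would normally come from a query restricted to the "SG-" prefix.
        List<String> candidates = Arrays.asList("SG-shs35s", "SG-sht64s", "SG-abc12x");
        String closest = candidates.stream()
                .min(Comparator.comparingInt(c -> levenshtein(target, c)))
                .orElse(null);
        System.out.println("Closest match: " + closest);  // prints SG-shs35s
    }
}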
Sounds like you need something like Lucene, though I'm not sure if that would be overkill for your situation. But it certainly would be able to do text searches and return the most similar ones first.
If you need something more simple I would try to start by searching with the full product code, then if that doesn't work try to use wildcards/remove some characters until you return a result.
JD Isaacks.
This situation of yours is very simple to solve.
It's not like you need to use artificial intelligence like Google does.
http://www.w3schools.com/sql/sql_wildcards.asp
Take a look at this W3Schools page about wildcards to use with your SELECT statement.
But you will also need to create a new table with 3 columns: LeftCode, RightCode and WildCard.
Example:
Rows on Table:
LeftCode = SG | RightCode = 35s | WildCard = SG-s_s35s
LeftCode = SG | RightCode = 64s | WildCard = SG-s_t64s
SQL Code
If the user typed a code that matches row 1 of the table:
SELECT * FROM PRODUCTS WHERE CODE LIKE "$WildCard";
Where $WildCard is the PHP variable containing column 3 (WildCard) of the new table.
I hope I helped, even 4 years late...

Read number of columns and their type from query result table (in C)

I use a PostgreSQL database and C to connect to it. With help from dyntest.pgc I can access the number of columns and their (SQL3) types from the result table of a query.
The problem is that when the result table is empty, I can't fetch a row to get this data. Does anyone have a solution for this?
The query can be SELECT 1,2,3 - so I think I can't use the INFORMATION_SCHEMA for this because there is no base table.
I'm not familiar with ecpg, but with libpq you should be able to call PQnfields to get the number of fields and then call various PQf* routines (like PQftype, PQfname) to get detailed info. Those functions take a PGresult, which you have even if there are no rows.
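For comparison only (this is JDBC, not the ecpg/libpq call itself): the same idea holds there too, since the result-set metadata is available even when the query returns zero rows. The connection details below are placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class ColumnInfo {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
        Statement st = conn.createStatement();
        // A query that returns zero rows but still has two result columns.
        ResultSet rs = st.executeQuery("SELECT 1 AS a, 'x' AS b WHERE false");
        ResultSetMetaData md = rs.getMetaData();   // usable without fetching any row
        for (int i = 1; i <= md.getColumnCount(); i++) {
            System.out.println(md.getColumnName(i) + " : " + md.getColumnTypeName(i));
        }
        conn.close();
    }
}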
Problem is that when the result table is empty, I can't fetch a row to get this data. Does anyone have a solution for this?
I am not sure I really understand what you want, but it seems the answer is in the question. If the table is empty, there are no rows...
The only solution here seems to be to wait for a non-empty result table, and then get the needed information.