Boosting Multi-Value Fields - lucene

I have a set of documents containing scored items that I'd like to index. Our data structure looks like:
Document
    ID
    Text
    List<RelatedScore>
RelatedScore
    ID
    Score
My first thought was to add each RelatedScore as a multi-value field using the Boost property of the Field to modify the value of the particular score when searching.
foreach (var relatedScore in document.RelatedScores) {
    var field = new Field("RelatedScore", relatedScore.ID,
        Field.Store.YES, Field.Index.UN_TOKENIZED);
    field.SetBoost(relatedScore.Score);
    luceneDoc.Add(field);
}
However, it appears that the "norm" that is calculated applies to the entire multi-value field: all the RelatedScore values for a document end up having the same score.
Is there a mechanism in Lucene to allow for this functionality? I would rather not create another index just to account for this; it feels like there should be a way using a single index. If there isn't a means to accomplish this, a few ideas that we have to compensate are:
Insert the multi-value field items in order of descending value. Then somehow add a positional-aware analysis to assign higher boost/score to the first items in the field.
Add a high value score multiple times to the field. So, a RelatedScore with Score==1 might be added three times, while a RelatedScore with Score==.3 would only be added once.
Both of these will result in a loss of search fidelity on these fields, yes, but they may be good enough. Any thoughts on this?

This appears to be a use case for Payloads. I'm not sure if this is available in Lucene.NET, as I've only used the Java version.
Another hacky way to do this, if the absolute values of the scores aren't that important, is to discretize them (place them in buckets based on value) and create a field for each bucket. So if you have scores that range from 1 to 100, create, say, 10 buckets called RelatedScore0_10, RelatedScore10_20, etc., and for any document that has a RelatedScore in that bucket, add a "true" value in that field. Then, for every search that gets executed, tack on an OR query like:
(RelatedScore0_10:true^1 RelatedScore10_20:true^2 ...)
The nice thing about this is that you can tweak the boost values for each one of your buckets on the fly. Otherwise you'd need to reindex to change the field norm (boost) values for each field.
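As a rough illustration of the bucketing idea, here is a small Python sketch that maps a score to its bucket field name and builds the boost clause as a string. The helper names (`bucket_field`, `boost_clause`), the bucket width of 10 and the 0-100 score range are assumptions made for the example; only the field naming scheme and the query shape come from the answer above.

```python
def bucket_field(score, width=10, max_score=100):
    """Return the bucket field name for a score, e.g. 37 -> 'RelatedScore30_40'.

    Assumes scores lie between 0 and max_score.
    """
    # Clamp so that max_score lands in the last bucket rather than a new one.
    low = min(int(score) // width * width, max_score - width)
    return f"RelatedScore{low}_{low + width}"


def boost_clause(boosts):
    """Build the OR clause from a {field_name: boost} mapping."""
    parts = [f"{field}:true^{boost}" for field, boost in boosts.items()]
    return "(" + " ".join(parts) + ")"


# Boosts can be changed at query time without reindexing.
boosts = {"RelatedScore0_10": 1, "RelatedScore10_20": 2}
print(bucket_field(37))
print(boost_clause(boosts))
```

At index time a document would get a "true" value in `bucket_field(score)` for each related score; at query time only the mapping passed to `boost_clause` needs to change.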

If you use Lucene.Net, you might not have payload functionality yet. What you can do instead is convert the 0-100 relevancy score to a bucket from 1 to 10 (integer division by 10, with a minimum of 1), then add each indexed value that many times (but store the value only once). When you search on that field, Lucene's built-in scoring takes the frequency of the indexed value into account (it is indexed 1-10 times depending on relevance), so results can be ranked by the variable relevance.
foreach (var relatedScore in document.RelatedScores) {
    // Get the bucket for this relevance score: 0-100 maps to 1-10.
    int bucket = Math.Max(1, (int)(relatedScore.Score / 10));
    // Store the value only on the first instance of the field.
    var field = new Field("RelatedScore", relatedScore.ID,
        Field.Store.YES, Field.Index.UN_TOKENIZED);
    luceneDoc.Add(field);
    // Add the remaining instances unstored, so the value is indexed "bucket" times in total.
    for (int i = 1; i < bucket; i++)
    {
        luceneDoc.Add(new Field("RelatedScore", relatedScore.ID,
            Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
}
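Repeating a value raises its term frequency, but the effect on the score is not linear. The following Python sketch assumes Lucene's default similarity, where the term-frequency factor is the square root of the frequency, and shows how a score maps to a repetition count and then to that factor. The function names are made up for the example, and other scoring factors (idf, length normalization, query boosts) are deliberately left out.

```python
import math


def repetitions(score, width=10):
    """Number of times to index a value for a 0-100 score (at least once)."""
    return max(1, int(score) // width)


def tf_factor(freq):
    """Term-frequency factor under the assumed default similarity: sqrt(freq)."""
    return math.sqrt(freq)


for score in (5, 30, 100):
    n = repetitions(score)
    print(f"score={score:3d} repetitions={n:2d} tf_factor={tf_factor(n):.2f}")
```

Under this assumption a value indexed 10 times contributes a factor of about 3.16 rather than 10, so the buckets compress the differences between scores.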

Related

SOLR: populate with data from children

I have Products in my SOLR index. I need to create calculated fields for each product. These fields are based on product's children.
Is it possible to create such calculated fields?
For example, I have a Product with id 1, I need to add all the Detail entities, which have "parentId" field value 1. Here is a brief schema: https://www.screencast.com/t/EkNG8NpFp.
I need to have values "v1", "v3" from the example above.
I'm not sure exactly what you mean by "create such calculated fields".
If you mean querying for Products and then, for example, getting the average of the field 'value', then yes, you can do things like that; look at JSON facets and how they can use child docs.
If you mean adding a new field to your Product doc based on the values of its child docs, you can probably do it with Streaming Expressions. You need to use the current collection as a source, compute the new fields, and finally add the new docs (including the new field) to a new collection.
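To make the second reading concrete, here is a plain-Python sketch of the computation itself: collect each parent's child values and attach them to the parent document as a new field. This is not a Streaming Expression and does not talk to Solr; it only illustrates the shape of the result. The field name `detailValues` and the sample documents are hypothetical, while `parentId` and `value` are the field names mentioned in the question and answer.

```python
from collections import defaultdict


def attach_child_values(products, details, target_field="detailValues"):
    """Return copies of the products with their children's values attached."""
    by_parent = defaultdict(list)
    for detail in details:
        by_parent[detail["parentId"]].append(detail["value"])
    # Products without children get an empty list.
    return [{**p, target_field: by_parent.get(p["id"], [])} for p in products]


products = [{"id": 1}, {"id": 2}]
details = [
    {"id": 10, "parentId": 1, "value": "v1"},
    {"id": 11, "parentId": 2, "value": "v2"},
    {"id": 12, "parentId": 1, "value": "v3"},
]
print(attach_child_values(products, details))
```

The enriched documents would then be indexed into the new collection, as the answer suggests.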

How to get text field history in selenium?

Is it possible to get a field's history (if it exists) for a field in an array or something of that sort in selenium? For example, user id field, I can see all IDs that have been used so far.
The purpose I'd like to use this is quickly create new IDs that haven't been used before. For example testID45 is already taken, so I'll use testID46 to create a new one. It's a lazy way to fill out a form without keeping track of the taken IDs.
I don't fully understand why you want to create IDs using Selenium. If you would post more info on what problem you are trying to solve, I could try to provide a better answer.
If you want to pull the IDs from existing elements, you could do something like this. It finds all INPUT elements that have an ID specified and writes out the IDs. You could parse the IDs and then determine which ID to use next. I wouldn't recommend this, because it would be faster to just generate a new ID that is unique, but maybe you need it for some reason.
List<WebElement> ids = driver.findElements(By.cssSelector("input[id]"));
for (WebElement id : ids)
{
    System.out.println(id.getAttribute("id"));
}
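The "parse the IDs and determine which ID to use next" step could look roughly like the following, sketched in Python for brevity. It assumes the IDs follow the `testID<number>` pattern from the question and that the list of strings was collected by a loop like the one above; `next_test_id` is a made-up helper name.

```python
import re


def next_test_id(existing_ids, prefix="testID"):
    """Return prefix + (highest existing number + 1), ignoring unrelated IDs."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    matches = (pattern.fullmatch(value) for value in existing_ids)
    used = [int(m.group(1)) for m in matches if m]
    return f"{prefix}{max(used, default=0) + 1}"


print(next_test_id(["testID45", "username", "testID7"]))
```

IDs that do not match the pattern are skipped, and an empty list yields `testID1`.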
I would recommend generating a new ID of your own format that would be unique on the page. This should be good enough for your purposes.
Random rnd = new Random();
String id = Long.toHexString(rnd.nextLong());
System.out.println("testID-" + id); // e.g. testID-cb8e7bac29ec7c7a
There are many other methods of generating strings in this post that you can reference also.

Lucene.Net: How to grab the next 10 results after a document that has a certain field value?

Suppose I have a query in Lucene.Net 3.0.3 that matches 1,000,000 documents, and each document has a field named ProductID with a unique value. How can I grab the next 10 items immediately after a specific ProductID?
For instance, grab me the next 10 items after ProductID 4264423.
The ProductID could be anywhere within the 1,000,000 matches, and sorted however I wish.
One brute force solution is to loop through all the ScoreDocs, and use the FieldCache to find the correct ProductID, then grab the next 10. However, that seems inefficient, since we'd need to populate a huge ScoreDocs array.
Another idea is to use a custom Collector, along with FieldCache to look for the correct ProductID, but as far as I know, Collectors aren't sorted.
Perhaps a solution is to use a combination of a custom Collector with a PriorityQueue, use the FieldCache to find the correct ProductID, note the Score of that document, then grab the next 10 items based on Score. (Although, if there are similar Score values, how is that handled?)
Please provide code samples, as I'm a Lucene.Net newbie. (Sample code preferably in C#.)
If a custom Collector + PriorityQueue is a viable option, here is some sample code to assist: https://stackoverflow.com/a/7938433/1145177

randomly generating unique number between 1-999 for primary key in table

I have a problem I'm not sure how to solve elegantly.
Background Information
I have a table of widgets. Each widget is assigned an ID from a range of numbers, let's say between 1 and 999. The bounds 1 and 999 are saved in my database as "lower_range" and "upper_range" in a table called "config".
When a user requests to create a new widget using my web app, I need to be able to do the following:
generate a random number between 1 and 999 using lua's math.random function or maybe a random number generator in sqlite (So far, in my tests, lua's math.random always returns the same value...but that's a different issue)
do a select statement to see if there already is a widget with this number assigned...
if not, create the new widget.
otherwise repeat process until you get a number that is not currently in use.
Problem
The problem I see with the above logic is two-fold:
the algorithm can potentially take a long time because I have to keep searching until I find a unique value.
How do I prevent simultaneous requests for new widget numbers generating the same value?
Any suggestions would be appreciated.
Thanks
Generate your random numbers ahead of time and store them in a table; make sure the numbers are unique. Then when you need to get the next number, just check how many have already been assigned and get the next number from your table. So, instead of
Generate a number between 1-999
Check if it's already assigned
Generate a new number, and so on.
do this:
Generate array of 999 elements that have values 1-999 in some random order
Your GetNextId function becomes return ids[currentMaxId+1]
To manage simultaneous requests, you need to have some resource that generates a proper sequence. The easiest is probably to use a key in your widget table as the index in the ids array. So, add a record to the widgets table first, get its key and then generate widget ID using ids[key].
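A minimal sketch of this idea, using Python's built-in `sqlite3` module: the database hands out a sequential key, and that key indexes into a list that was shuffled once. The table layout (`widgets` with columns `seq` and `widget_id`), the function name and the fixed seed are assumptions for the example; in a real application the shuffled list would be generated once and stored somewhere durable.

```python
import random
import sqlite3

LOWER, UPPER = 1, 999

# Shuffle once. The fixed seed stands in for persisting the shuffled list.
ids = list(range(LOWER, UPPER + 1))
random.Random(42).shuffle(ids)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widgets ("
    " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
    " widget_id INTEGER UNIQUE)"
)


def create_widget(conn):
    """Insert a widget row first, then derive its ID from the row's key."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute("INSERT INTO widgets (widget_id) VALUES (NULL)")
        seq = cur.lastrowid  # 1-based key assigned by the database
        if seq > len(ids):
            raise RuntimeError("no widget IDs left")
        widget_id = ids[seq - 1]
        conn.execute(
            "UPDATE widgets SET widget_id = ? WHERE seq = ?", (widget_id, seq)
        )
    return widget_id


print(create_widget(conn), create_widget(conn))
```

Because the database serializes the inserts, two requests cannot receive the same key, and therefore cannot receive the same widget ID.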
Create a table to store the keys and the 'used' property.
CREATE TABLE KEYS
("id" INTEGER, "used" INTEGER)
;
Then use the following to find a new key
select id
from KEYS
where used = 0
order by RANDOM()
limit 1
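The query above only finds a free key; it still has to be flagged as used before another request picks it. One way to do both steps together is sketched below with Python's `sqlite3` module. The `PRIMARY KEY` and `DEFAULT 0` additions to the table, the function name `claim_key` and the use of `BEGIN IMMEDIATE` are choices made for this example, not part of the answer.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so transactions are controlled explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    'CREATE TABLE KEYS ("id" INTEGER PRIMARY KEY, "used" INTEGER NOT NULL DEFAULT 0)'
)
conn.executemany("INSERT INTO KEYS (id) VALUES (?)", [(i,) for i in range(1, 1000)])


def claim_key(conn):
    """Pick a random unused key and mark it used; return None if none are left."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id FROM KEYS WHERE used = 0 ORDER BY RANDOM() LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return None
        conn.execute("UPDATE KEYS SET used = 1 WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


print(claim_key(conn), claim_key(conn))
```

Taking the write lock before the select means a second connection has to wait until the first one has flagged its key.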
Don't generate a random number, just pick the number off a list that's in random order.
For example, make a list of numbers 1 - 999. Shuffle that list using Fisher-Yates or equivalent (see also Randomize a List in C# even if you're not using C#).
Now you can just keep track of the most recently used index into your list. (Shuffling the list should occur exactly once, then you store and reuse the result).
Rough pseudo-code:
If config-file does not contain list of indices
    create a list with numbers 1 - 999
    Use Fisher-Yates to shuffle that list
    // list now looks like 412, 97, 251, 3, ...
    Write the list to the config file
    Set 'last index used' to 0 and write to config file
end if
To use this,
NextPK = myList[last-index-used]
last-index-used = last-index-used + 1
write last-index-used to config file
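The pseudo-code above could be fleshed out roughly as follows in Python. The JSON file layout (`ids`, `next_index`) and the function names are assumptions for the sketch, and it does no locking, so concurrent requests would need to be serialized separately. Once all 999 numbers are handed out, the list lookup raises an `IndexError`.

```python
import json
import os
import random


def fisher_yates_shuffle(items, rng=random):
    """Shuffle items in place by swapping each position with a random earlier one."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends, so j may equal i
        items[i], items[j] = items[j], items[i]


def next_pk(config_path):
    """Return the next unused number, creating the shuffled list on first use."""
    if os.path.exists(config_path):
        with open(config_path) as f:
            state = json.load(f)
    else:
        ids = list(range(1, 1000))
        fisher_yates_shuffle(ids)  # happens exactly once
        state = {"ids": ids, "next_index": 0}
    pk = state["ids"][state["next_index"]]
    state["next_index"] += 1
    with open(config_path, "w") as f:
        json.dump(state, f)
    return pk
```

Calling `next_pk("widget_ids.json")` repeatedly walks through the shuffled list one entry at a time, so no number is returned twice.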
To get and flag an ID as used at same time (expanding on Declan_K's answer):
replace into random_sequence values ((select id from random_sequence where used=0 order by random()), 1);
select id from random_sequence where rowid = last_insert_rowid();
6
When you run out of "unused" sequence table entries the select will return "blank"
I use replace into because update doesn't have a last_insert_rowid() equivalent that I can see.
You can get SQL to create a primary key that increases by one every time you add a row to the database.

How to design a database table structure for storing and retrieving search statistics?

I'm developing a website with a custom search function and I want to collect statistics on what the users search for.
It is not a full text search of the website content, but rather a search for companies with search modes like:
by company name
by area code
by provided services
...
How to design the database for storing statistics about the searches?
What information is most relevant and how should I query for them?
Well, it's dependent on how the different search modes work, but generally I would say that a table with 3 columns would work:
SearchType SearchValue Count
Whenever someone does a search, say they search for "Company Name: Initech", first query to see if there are any rows in the table with SearchType = "Company Name" (or whatever enum/id value you've given this search type) and SearchValue = "Initech". If there is already a row for this, UPDATE the row by incrementing the Count column. If there is not already a row for this search, insert a new one with a Count of 1.
By doing this, you'll have a fair amount of flexibility for querying it later. You can figure out what the most popular searches for each type are:
... WHERE SearchType = 'Some Search Type' ORDER BY Count DESC
You can figure out the most popular search types:
... GROUP BY SearchType ORDER BY SUM(Count) DESC
Etc.
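Put together, the three-column approach might look like this with Python's `sqlite3` module. The table name `SearchStats`, the composite primary key and the function names are assumptions for the sketch; the columns and the update-or-insert logic follow the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE SearchStats (
        SearchType  TEXT NOT NULL,
        SearchValue TEXT NOT NULL,
        Count       INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (SearchType, SearchValue)
    )
    """
)


def record_search(conn, search_type, search_value):
    """Increment the counter for this search, inserting the row if it is new."""
    with conn:  # single transaction
        cur = conn.execute(
            "UPDATE SearchStats SET Count = Count + 1"
            " WHERE SearchType = ? AND SearchValue = ?",
            (search_type, search_value),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO SearchStats (SearchType, SearchValue, Count)"
                " VALUES (?, ?, 1)",
                (search_type, search_value),
            )


def top_searches(conn, search_type, limit=10):
    """Most popular search values for one search type."""
    return conn.execute(
        "SELECT SearchValue, Count FROM SearchStats"
        " WHERE SearchType = ? ORDER BY Count DESC LIMIT ?",
        (search_type, limit),
    ).fetchall()


def top_search_types(conn):
    """Search types ordered by their total number of searches."""
    return conn.execute(
        "SELECT SearchType, SUM(Count) AS Total FROM SearchStats"
        " GROUP BY SearchType ORDER BY Total DESC"
    ).fetchall()
```

Calling `record_search(conn, "Company Name", "Initech")` on every search keeps the counters current, and the two query helpers correspond to the two example queries above.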
This is a pretty general question but here's what I would do:
Option 1
If you want to strictly separate all three search types, then create a table for each. For company name, you could simply store the CompanyID (assuming your website is maintaining a list of companies) and a search count. For area code, store the area code and a search count. If the area code doesn't exist, insert it. Provided services is most dependent on your setup. The most general way would be to store key words and a search count, again inserting if not already there.
Optionally, you could store search date information as well. As an example, you'd have a table with Provided Services Keyword and a unique ID. You'd have another table with an FK to that ID and a SearchDate. That way you could make sense of the data over time while minimizing storage.
Option 2
Treat all searches the same. One table with a Keyword column and a count column, incorporating SearchDate if needed.
You may want to check this:
http://www.microsoft.com/sqlserver/2005/en/us/express-starter-schemas.aspx