Aggregating MoreLikeThis Results in RavenDB - lucene

I have been trying out the MoreLikeThis Bundle to bring back a set of documents ordered by the number of matches in a field called 'backyardigans' compared to a key document. This all works as expected.
But what I would like to do is order by the number of matches of 3 separate fields added together.
An example record would be:
var data = new Data{
backyardigans = "Pablo Tasha Uniqua Tyrone Austin",
engines = "Thomas Percy Henry Toby",
pigs = "Daddy Peppa George Mummy Granny"
};
If another document matched 1 backyardigan 2 engines and 1 pig it would get a score of 4
If another document matched 2 backyardigans 4 engines and 0 pigs it would get a score of 6
These aggregated scores would be the field we would order the results by so they would come back 6,4 and so on.
Is there a way to achieve this with the MoreLikeThis bundle please?

This isn't possible, we use only a single field frequency for this.
This is important because we need to compare the score on a field basis, and it isn't really possible to compare it on a global basis without taking into account the per fields values.
Note that this is also a limitation in the underlying Lucene implementation, so there isn't much we can do about it.

Related

Cognos calculate using a different layout section

I have a Cognos report that has four different sections/layouts, each containing their own list. I would like to use one section to perform calculations using another list from another section. Is that possible?
For example:
Section Two contains a list with Name, Week1 Commissions, Week2 Commissions, etc
I want Section One to calculate the first 11 weeks only so the List in Section One would contain
Name, Weeks 1 through 11 total.
What is the best way to make this happen if it's possible.
Section One would contain these columns which would perform the calculations
Name Week 1-11 totals |
Section Two would have these columns
Name Week 1 Week 2 Week 3 .....
Best,
Kev
Yes it is possible, and you can perform all of the logic in one query
There are many ways to do this
For example, the four different sections/layouts, selecting the list, go to properties you can use the same query for each list

How to sort string data that represents numbers

My client has a set of numeric data stored in a string field in a database. So of course it doesn't sort correctly. These rows sort like this:
105
3
44
When they should sort like this:
3
44
105
This is very much a legacy database and I can't change it at all. I also can't change the software that uses the database. The client doesn't own it or have the source code. It has never worked the way they want. However, there is an unused string field that I could use to sort on (only a small number of fields can be sorted on.)
What I would like to do is take the input data, derive a string from it, and store the new string in the unused field, such that when the data is sorted on this new data, the original data sorts correctly, i.e., numerically.
So, for an overly simplistic example, if the algorithm produced the following new data:
105 -> c
3 -> a
44 -> b
Then when the second column was sorted, the first column would look 'correct'.
The tricky bit is that when new rows are added to the database, they must also sort correctly, without having to regenerate the sort data for all rows. This is the part of the problem that has my brain in a twist. I'm not sure it's actually possible.
You can assume that the number will never be more than 5 'digits'.
I realize this is a total kludge, but since I can't change the system, I have to find a work around, rather than a quality solution. Welcome to the real world.
~~~~~~~~~~~~~~~~~~~~~~ S O L U T I O N ~~~~~~~~~~~~~~~~~~
I don't think this is an uncommon problem, so here are the results of Gordon's solution:
mysql> select * from t order by new;
+------+------------+
| orig | new |
+------+------------+
| 3 | 0000000003 |
| 44 | 0000000044 |
| 105 | 0000000105 |
+------+------------+
In most databases, you can just do:
order by cast(col as int)
This will convert the string representation to a number and use that for ordering. There is no need for an additional column. If you add one, I would recommend adding a numeric column to contain the actual value.
If you really want to store something in the unused field, then you can left pad the number. How to do this depends on the database, but here is one typical method:
update t
set unused = right(concat('0000000000', col), 10);
Not all databases support these two specific functions, but all offer this basic functionality in some method.
Try something like
SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC
(Adjust the column and table name for your environment.)
This is a bit of a hack but works as long as the "numbers" in your string column are natural, non-negative numbers only.
If you are looking for a more sophisticated approach or algorithm, try searching for natural sort together with your DBMS.

Is condensing the number of columns in a database beneficial?

Say you want to record three numbers for every Movie record...let's say, :release_year, :box_office, and :budget.
Conventionally, using Rails, you would just add those three attributes to the Movie model and just call #movie.release_year, #movie.box_office, and #movie.budget.
Would it save any database space or provide any other benefits to condense all three numbers into one umbrella column?
So when adding the three numbers, it would go something like:
def update
...
#movie.umbrella = params[:movie_release_year]
+ "," + params[:movie_box_office] + "," + params[:movie_budget]
end
So the final #movie.umbrella value would be along the lines of "2015,617293,748273".
And then in the controller, to access the three values, it would be something like
#umbrella_array = #movie.umbrella.strip.split(',').map(&:strip)
#release_year = #umbrella_array.first
#box_office = #umbrella_array.second
#budget = #umbrella_array.third
This way, it would be the same amount of data (actually a little more, with the extra commas) but stored only in one column. Would this be better in any way than three columns?
There is no benefit in squeezing such attributes in a single column. In fact, following that path will increase the complexity of your code and will limit your capabilities.
Here's some of the possible issues you'll face:
You will not be able to add indexes to increase the performance of lookup of records with a specific attribute value or sort the filtering
You will not be able to query a specific attribute value
You will not be able to sort by a specific column value
The values will be stored and represented as Strings, rather than Integers
... and I can continue. There are no advantages, only disadvantages.
Agree with comments above, as an example try to use pg_column_size() to compare results:
WITH test(data_txt,data_int,data_date) AS ( VALUES
('9999'::TEXT,9999::INTEGER,'2015-01-01'::DATE),
('99999999'::TEXT,99999999::INTEGER,'2015-02-02'::DATE),
('2015-02-02'::TEXT,99999999::INTEGER,'2015-02-02'::DATE)
)
SELECT pg_column_size(data_txt) AS txt_size,
pg_column_size(data_int) AS int_size,
pg_column_size(data_date) AS date_size
FROM test;
Result is :
txt_size | int_size | date_size
----------+----------+-----------
5 | 4 | 4
9 | 4 | 4
11 | 4 | 4
(3 rows)

Is it Possible to Update a Lucene Document using its Document ID?

A ScoreDoc[] array contains all the document ids from a search. I would like to use these document ids to update a single document. In this particular instance I cannot uniquely identify the row I wish to update, as the given terms will result in matching multiple documents.
Imagine a query where 1:a, 2:b and the following documents are returned
1 2 3 4 5 6
doc 1: a b c d e f
doc 2: a b g h i j
doc 3: a b k l m n
I am basically making an update to fields 3 and 4, but want to leave 5 and 6 intact.
Currently I can grab these rows, make the updates I want, but I can't figure out a way to update them in the index.
An indexWriter.updateDocuments(...) or an indexwriter.DeleteDocuments(...) will result in document 1, 2 3 being deleted.
Since I have the documentId, I assume there is a way for me to update the index with it.
Lucene doesn't allow the updating of fields in a document. It is strictly a delete/add mechanism.
A document's docId can be change during optimization, merging, etc. so relying on that to always be the same isn't something you want to do. You should put your own field into the document that won't change over time and use that instead.
There is a method to delete by docid: IndexWriter.tryDeleteDocument. Having deleted the document, you can then add the new one, which, as stated by others, is how Lucene executes an update.
The documentation linked above provides some interesting information on why it's called tryDeleteDocument

Best way to use hibernate for complex queries like top N per group

I'm working now for a while on a reporting applications where I use hibernate to define my queries. However, more and more I get the feeling that for reporting use cases this is not the best approach.
The queries only result partial columns, and thus not typed objects
(unless you cast all fields in java).
It is hard to express queries without going straight into sql or
hql.
My current problem is that I want to get the top N per group, for example the last 5 days per element in a group, where on each day I display the amount of visitors.
The result should look like:
| RowName | 1-1-2009 | 2-1-2009 | 3-1-2009 | 4-1-2009 | 5-1-2009
| SomeName| 1 | 42 | 34 | 32 | 35
What is the best approach to transform the data which is stored per day per row to an output like this? Is it time to fall back on regular sql and work with untyped data?
I really want to use typed objects for my results but java makes my life pretty hard for that. Any suggestions are welcome!
Using the Criteria API, you can do this:
Session session = ...;
Criteria criteria = session.createCriteria(MyClass.class);
criteria.setFirstResult(1);
criteria.setMaxResults(5);
... any other criteria ...
List topFive = criteria.list();
To do this in vanilla SQL (and to confirm that Hibernate is doing what you expect) check out this SO post: