Edit: Here's a simplified description of the issue:
I have events:
class Event { public int Id { get; set; } public DateTime[] Dates { get; set; } }
I need to query for all events within a date range (for example, August 1 to October 20). The result should list unique events within this range, ordered by date. Like this:
Event one 2012-08-04,2012-09-06,2012-09-10
Event two 2012-10-02
etc.
I need to be able to page this result. That's it.
I have the following issue with my events using RavenDB. I have a document (representing an event) that contains an array of dates, for example 2012-08-20, 2012-08-21, 2012-09-14, 2013-01-05, etc.
class Event { public DateTime[] Dates { get; set; } }
I have a few criteria that must be met:
I need to be able to query these documents on a date range. For example, find all events that have any date between August 1 and September 22, or between October 1 and October 3.
I must be able to sort the query on date
I must be able to page the result.
Seems easy enough, right? Well, I have tried two approaches, but they both fail.
Create an index with multiple from clauses, like so:
from event in docs.Events
from date in event.Dates
select new { Dates = date }
This works and is easy to query. However, it can't be paged because of skipped results (the index will contain a duplicate entry for each of an event's dates), and sorting also fails in combination with paging.
Create a simple index:
from event in docs.Events
select new { Dates = event.Dates }
This also works and is simple to query, and it can be paged. However, it cannot be sorted. I need to sort the documents by the first available date within the queried range.
If I can't solve this it will probably be a deal breaker for us, and I really don't want to start over with a new application. Besides, I really like RavenDB.
I had a similar requirement (recurring events), with the added twist that the number of dates was highly variable (anywhere from one to hundreds) and each date could be associated with a venue.
I ended up storing event dates in a separate EventOccurrence document after coming to a similar impasse:
public class EventOccurrence {
    public string EventId { get; set; }
    public DateTime Start { get; set; }
    public string VenueId { get; set; }
}
It's easily queryable and sortable, and by using Session.Include on the occurrence we still retain query performance.
It seems like a reversion to the relational model, but it was the correct choice given our constraints.
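For illustration, a minimal sketch of the query side (RavenDB client syntax varies by version, and the range and paging variables here are hypothetical):
var occurrences = session.Query<EventOccurrence>()
    .Include(o => o.EventId)                       // preload the referenced Event documents
    .Where(o => o.Start >= rangeStart && o.Start <= rangeEnd)
    .OrderBy(o => o.Start)
    .Skip(page * pageSize)
    .Take(pageSize)
    .ToList();
foreach (var occurrence in occurrences)
{
    // Served from the session cache thanks to Include, no extra server round trip.
    var ev = session.Load<Event>(occurrence.EventId);
}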
You're saying that the "simple index" approach works except for sorting, right?
from event in docs.Events
select new
{
    Dates = event.Dates,
    FirstDate = event.Dates.First()
}
Then you can sort on FirstDate. You can't sort on analyzed or tokenized fields.
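A rough sketch of querying that index (the index name is made up, and Advanced.DocumentQuery is called Advanced.LuceneQuery in older RavenDB clients):
var results = session.Advanced.DocumentQuery<Event>("Events/WithFirstDate")
    .WhereBetween("Dates", rangeStart, rangeEnd)
    .OrderBy("FirstDate")   // works because FirstDate is not analyzed
    .Skip(page * pageSize)
    .Take(pageSize)
    .ToList();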
My final solution was to create both indexes. We needed both in our application anyway, so it wasn't any overhead.
The following index is used for querying and paging. Take() and Skip() work on this one:
from event in docs.Events
from date in event.Dates
select new { Date = date }
However, the above index does NOT return the correct total number of hits, which you need for building a pager. So we create another index:
from event in docs.Events
select new { Date = event.Dates }
Now we can run the exact same query (note that the Date field has the same name in both indexes) against this second index, using Statistics() and Take(0) to get only the number of hits.
The obvious downside is that you have to run two queries, but I haven't found a way around that.
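A sketch of the two queries (index names and variables are illustrative, and Advanced.DocumentQuery is Advanced.LuceneQuery in older clients):
// Query 1: the actual page, against the fanned-out index.
var page = session.Advanced.DocumentQuery<Event>("Events/ByDate")
    .WhereBetween("Date", rangeStart, rangeEnd)
    .OrderBy("Date")
    .Skip(pageNumber * pageSize)
    .Take(pageSize)
    .ToList();
// Query 2: the same filter against the simple index, only to read the total count.
RavenQueryStatistics stats;
session.Advanced.DocumentQuery<Event>("Events/ByDates")
    .Statistics(out stats)
    .WhereBetween("Date", rangeStart, rangeEnd)
    .Take(0)
    .ToList();
var totalHits = stats.TotalResults;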
Related
Is there a way to get the latest event of a specific type from a specific device? So far I have queried all events via myURL/event/events?source=<<ID>>&type=<<type>>.
Or is there a way to get a collection of events ordered by their creationTime? That could solve my problem too.
In the documentation I only found parameters like dateFrom and dateTo. But what if I don't know the time range of the last event?
The syntax is: /event/events?type={type}&source={source}.
For more information on the available endpoints, query:
GET /platform
Answer from the support team:
Currently there is no way to revert the order of events. Besides dateFrom and dateTo you can also use creationFrom and creationTo. These take the creationTime (the server-side timestamp set when the event was created) instead of the time (which is sent within the event), but the order will still be oldest -> newest.
The best approach currently is to use a well-estimated time range (so you don't end up with 0 events in the response) where the dateTo/creationTo lies in the future. If you add the query parameter withTotalPages=true, the result will give you the total pages in the statistics part.
Knowing the total pages, you can run the query again with currentPage=XX instead of withTotalPages=true and take the last element.
We have this functionality on measurements and audits, where you can add the parameter revert=true. I will file an improvement to extend this to the other APIs, but at the moment you are limited to the workaround.
You can also set only the dateFrom parameter and a pageSize of 1, like so: &pageSize=1&dateFrom=1970-01-01. As of September 2017, this returns the most recent event.
Paging with LINQ can be done easily using the Skip() and Take() extension methods.
I have been scratching my head for quite some time now, trying to find a good way to page a dynamic collection of entities - i.e. a collection that can change between two sequential queries.
Assume a query that, without paging, returns 20 objects of type MyEntity.
The following two lines trigger two DB hits that populate results1 and results2 with all of the objects in the dataset.
List<MyEntity> results1 = query.Take(10).ToList();
List<MyEntity> results2 = query.Skip(10).Take(10).ToList();
Now let's assume the data is dynamic, and a new MyEntity is inserted into the DB between the two queries, such that the original query would place the new entity in the first page.
In that case, the results2 list will contain an entity that also exists in results1, causing a duplicate to be sent to the consumer of the query.
Conversely, if a record from the first page is deleted, the second query will be missing a record that should originally have appeared in results2.
I thought about adding a Where() clause to the query to verify that the records were not retrieved by a previous page, but that seems like the wrong way to go, and it won't help with the second scenario.
I also thought about keeping a record of query execution timestamps, attaching a LastUpdatedTimestamp to each entity, and filtering out entities that changed after the previous page request. That approach falls apart by the third page...
How is this normally done?
Is there a way to run a query against an old snapshot of the DB?
Edit:
The background is an ASP.NET MVC Web API service that responds to a simple GET request to retrieve a list of entities. The client retrieves a page of entities and presents them to the user. When the user scrolls down, the client sends another request to the service to retrieve another page of entities, which is appended to the first page already presented to the user.
One way to solve this is to keep a timestamp of the first query and exclude any results that were added after that time. However, this puts the data into a "stale" state: the data is fixed at a certain point in time, which is usually a bad idea unless you have a very good reason for it.
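A minimal sketch of that idea, assuming the entity exposes a CreatedAt column (a hypothetical name), MyDbContext stands in for your data context, and the client echoes back the UTC timestamp of its first request. Note that this hides later inserts but does nothing for the deletion case:
public List<MyEntity> GetPage(MyDbContext db, DateTime firstRequestUtc, int pageIndex, int pageSize)
{
    return db.Entities
        .Where(e => e.CreatedAt <= firstRequestUtc) // exclude rows added after page 1
        .OrderBy(e => e.Id)                         // paging needs a stable order
        .Skip(pageIndex * pageSize)
        .Take(pageSize)
        .ToList();
}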
Another way of doing this (again, depending on your application) is to load the full list of objects into memory and then hand out a fixed number of results at a time:
public class EntityPager
{
    private readonly List<Entity> _list;

    public EntityPager(MyDbContext db)
    {
        // Snapshot the whole set once; every page is served from memory afterwards.
        _list = db.Entities.ToList();
    }

    public List<Entity> GetFirst10
    {
        get { return _list.Take(10).ToList(); }
    }

    public List<Entity> GetSecond10
    {
        get { return _list.Skip(10).Take(10).ToList(); }
    }
}
I'm giving Azure a go with MVC4, and I have the simplest data storage requirement I can think of. Each time my controller is hit, I want to record the datetime and a couple of other details somewhere. There will most likely be only a few thousand hits per month at most. Then I'd like to view a page telling me how many hits (rows appended) there have been.
Writing to a text file in the folder with Server.MapPath... gives permission errors and seems not to be possible due to the distributed nature of the platform. Getting a whole SQL instance is $10 or so a month. Table or blob storage sounds promising, but setting up the service and learning to use it seems nowhere near as simple as a basic file or DB.
Any thoughts would be appreciated.
Use Table Storage. For all intents and purposes it's free at that volume (it'll be pennies per month, a fraction of the cost of your web roles anyway).
As for how complicated you think it is, it's really not. Have a look at this article to get going: http://www.windowsazure.com/en-us/develop/net/how-to-guides/table-services/#create-table
// Create a class to hold your data
public class MyLogEntity : TableEntity
{
    public MyLogEntity(int id, DateTime when)
    {
        // Partition and row keys must be strings; partitioning by month
        // keeps counting queries cheap (see the partitioning notes below).
        this.PartitionKey = when.ToString("yyyy-MM");
        this.RowKey = id.ToString();
    }

    public MyLogEntity() { }

    public string OtherProperty { get; set; }
}

// Connect to table storage
var connstr = CloudConfigurationManager.GetSetting("StorageConnectionString"); // from the config file
var storageAccount = CloudStorageAccount.Parse(connstr);

// Create the table client.
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();

// Create the table if it doesn't exist.
var table = tableClient.GetTableReference("MyLog");
table.CreateIfNotExists();

var e = new MyLogEntity(%SOMEID%, %SOMEDATETIME%);
e.OtherProperty = "Some Other Value";

// Create the TableOperation that inserts the log entity.
var insertOperation = TableOperation.Insert(e);

// Execute the insert operation.
table.Execute(insertOperation);
Augmenting @Eoin's answer a bit: when using table storage, tables are segmented into partitions based on the partition key you specify. Within a partition, you can either look up a specific row (via its row key) or scan the partition for a group of rows. Exact-match lookups are very, very fast; partition scans (or table scans) can take a while, especially with large quantities of data.
In your case, you want a count of rows (entities). Storing your rows is straightforward enough, but how will you tally up a count? By day? By month? By year? It may be worth aligning your partitions to a day or a month to make counting quicker (there's no function that returns the number of rows in a table or partition; you'd end up querying for them).
One trick is to keep an accumulated count in another table, updated each time you write a specific entity. This is very fast:
Write the entity (similar to what Eoin illustrated)
Read the row from a Counts table corresponding to the type of row you wrote
Increment it and write the value back
Now you have a very fast way to retrieve counts at any given time. You could keep counts for individual days, specific months, whatever you choose. And for this, you could use the specific date as the partition key, giving you very fast access to the correct entity holding the accumulated count.
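A rough sketch of that read-increment-write cycle, reusing the tableClient from above (the Counts table, CountEntity type, and key values are invented for illustration; a production version would use the ETag returned by Retrieve for optimistic concurrency):
public class CountEntity : TableEntity
{
    public int Count { get; set; }
}

var countsTable = tableClient.GetTableReference("Counts");
countsTable.CreateIfNotExists();

// Exact-match lookup of this month's accumulated count - very fast.
var retrieve = TableOperation.Retrieve<CountEntity>("2017-09", "hits");
var counter = (CountEntity)countsTable.Execute(retrieve).Result
              ?? new CountEntity { PartitionKey = "2017-09", RowKey = "hits", Count = 0 };

// Increment and write the value back.
counter.Count++;
countsTable.Execute(TableOperation.InsertOrReplace(counter));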
(I tried posting this to the CFWheels Google Group (twice), but for some reason my message never appears. Is that list moderated?)
Here's my problem: I'm working on a social networking app in CF on Wheels, not too dissimilar from the one we're all familiar with in Chris Peters's awesome tutorials. In mine, though, I'm required to display the most recent status message in the user directory. I've got a User model with hasMany("statuses") and a Status model with belongsTo("user"). So here's the code I started with:
users = model("user").findAll(include="userprofile, statuses");
This of course returns one record for every status message in the statuses table. Massive overkill. So next I tried:
users = model("user").findAll(include="userprofile, statuses", group="users.id");
Getting closer, but now we're getting the first status record for each user (the lowest status.id), when I want the most recent status. In straight SQL I would use a subquery to reorder the statuses first, but that's not available to me in the Wheels ORM. So is there another clean way to achieve this, or will I have to drag back a huge query result or object and filter the statuses in my CFML while I loop?
You can grab the most recent status using a calculated property:
// models/User.cfc
function init() {
    property(
        name="mostRecentStatusMessage",
        sql="SELECT message FROM statuses WHERE userid = users.id ORDER BY createdat DESC LIMIT 1"
    );
}
Of course, the syntax of the SELECT statement will depend on your RDBMS, but that should get you started.
The downside is that you'll need to create a calculated property for each column that you need available in your query.
The other option is to create a method in your model and write custom SQL in <cfquery> tags. That way is perfectly valid as well.
I don't know your exact DB schema, but shouldn't your findAll() look more like this:
statuses = model("status").findAll(include="userprofile(user)", where="userid = users.id");
That should get all statuses for a specific user... or do you need it for all users? I'm finding your question a little tricky to work out. What exactly are you trying to get returned?
I have a set of documents containing scored items that I'd like to index. Our data structure looks like:
Document
    ID
    Text
    List<RelatedScore>

RelatedScore
    ID
    Score
My first thought was to add each RelatedScore as a multi-value field using the Boost property of the Field to modify the value of the particular score when searching.
foreach (var relatedScore in document.RelatedScores) {
    var field = new Field("RelatedScore", relatedScore.ID,
        Field.Store.YES, Field.Index.UN_TOKENIZED);
    field.SetBoost(relatedScore.Score);
    luceneDoc.Add(field);
}
However, it appears that the "norm" that is calculated applies to the entire multi-valued field: all the RelatedScore values for a document end up having the same score.
Is there a mechanism in Lucene that allows for this functionality? I would rather not create another index just to account for this; it feels like there should be a way using a single index. If there isn't a means to accomplish this, a few ideas we have to compensate are:
Insert the multi-value field items in order of descending value, then somehow add position-aware analysis that assigns a higher boost/score to the first items in the field.
Add a high-value score to the field multiple times. So a RelatedScore with Score == 1 might be added three times, while a RelatedScore with Score == 0.3 would only be added once.
Both of these will result in a loss of search fidelity on these fields, yes, but they may be good enough. Any thoughts on this?
This appears to be a use case for Payloads. I'm not sure if this is available in Lucene.NET, as I've only used the Java version.
Another hacky way to do this, if the absolute values of the scores aren't that important, is to discretize them (place them in buckets based on value) and create a field for each bucket. So if you have scores that range from 1 to 100, create, say, 10 buckets called RelatedScore0_10, RelatedScore10_20, etc., and for any document that has a RelatedScore in a given bucket, add a "true" value in that field. Then for every search that gets executed, tack on an OR query like:
(RelatedScore0_10:true^1 RelatedScore10_20:true^2 ...)
The nice thing about this is that you can tweak the boost values for each one of your buckets on the fly. Otherwise you'd need to reindex to change the field norm (boost) values for each field.
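A sketch of how that boost clause might be assembled in Lucene.NET (2.x-era API; the bucket field names follow the hypothetical RelatedScore0_10 scheme above, and userQuery stands in for whatever query you already execute):
var boostQuery = new BooleanQuery();
for (int i = 0; i < 10; i++)
{
    var term = new Term(string.Format("RelatedScore{0}_{1}", i * 10, (i + 1) * 10), "true");
    var clause = new TermQuery(term);
    clause.SetBoost(i + 1);                        // higher buckets get bigger boosts
    boostQuery.Add(clause, BooleanClause.Occur.SHOULD);
}

// The bucket clauses are optional, so they only influence scoring.
var finalQuery = new BooleanQuery();
finalQuery.Add(userQuery, BooleanClause.Occur.MUST);
finalQuery.Add(boostQuery, BooleanClause.Occur.SHOULD);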
If you use Lucene.Net you might not have payload functionality yet. What you can do is convert the 0-100 relevancy score to a bucket from 1-10 (integer division by 10), then add each indexed value that many times (but only store the value once). If you then search on that field, Lucene's built-in scoring will take the term frequency of the indexed field into account (it will have been indexed 1-10 times based on relevance), so results can be sorted by variable relevance.
foreach (var relatedScore in document.RelatedScores) {
    // Map the 0-100 score onto a bucket.
    int bucket = (int)(relatedScore.Score / 10);

    // Store the first instance so the value can be retrieved later.
    var field = new Field("RelatedScore", relatedScore.ID,
        Field.Store.YES, Field.Index.UN_TOKENIZED);
    luceneDoc.Add(field);

    // Add further unstored instances so the term frequency reflects the bucket.
    for (int i = 0; i < bucket; i++)
    {
        luceneDoc.Add(new Field("RelatedScore", relatedScore.ID,
            Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
}