Problems loading a series of snapshots by date - rally

I have been running into a consistent problem using the LBAPI, and I suspect it is a common use case given the API's purpose. I am generating a chart that uses LBAPI snapshots of a group of Portfolio Items to calculate the chart series. I know the minimum and maximum snapshot dates, and need to query once a day between these two dates. I have found two main ways to accomplish this, neither of which is ideal:
Use the _ValidFrom and _ValidTo filter properties to limit the results to snapshots within the selected timeframe. This is bad because it will also load snapshots which I don't particularly care about. For instance if a PI is revised several times throughout the day, I'm really only concerned with the last valid snapshot of that day. Because some of the PIs I'm looking for have been revised several thousand times, this method requires pulling mostly data I'm not interested in, which results in unnecessarily long load times.
Use the __At filter property and send a separate request for each query date. This method is not ideal because some charts would require several hundred requests, with many requests returning redundant results. For example if a PI wasn't modified for several days, each request within that time frame would return a separate instance of the same snapshot.
My workaround for this was to simulate the effect of __At, but with several filters per request. To do this, I added this filter to my request:
Rally.data.lookback.QueryFilter.or(_.map(queryDates, function(queryDate) {
    return Rally.data.lookback.QueryFilter.and([{
        property : '_ValidFrom',
        operator : '<=',
        value : queryDate
    },{
        property : '_ValidTo',
        operator : '>=',
        value : queryDate
    }]);
}))
But of course, a new problem arises: adding this filter makes the request too large to send via the LBAPI unless I query for fewer than ~20 dates. Is there a way I can send larger filters to the LBAPI? Or will I need to break this up into several requests, which makes this solution only slightly better than the second approach above?
Any help would be much appreciated. Thanks!

Conner, my recommendation is to download all of the snapshots, even the ones you don't want, and marshal them on the client side. The Lumenize library that's bundled with the App SDK has functionality that makes this relatively easy, and the TimeSeriesCalculator will also accomplish this for you with even more features, like aggregating the data into series.
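As a rough illustration of the client-side marshalling idea, here is a minimal sketch in plain JavaScript. It does not use Lumenize or the TimeSeriesCalculator; snapshotsAt and the sample data are hypothetical. It assumes every snapshot carries _ValidFrom and _ValidTo as ISO-8601 UTC strings in the same format, that an item's validity ranges do not overlap, and that a snapshot is valid at a date when _ValidFrom <= date < _ValidTo.

```javascript
// Hypothetical helper: for each query date, keep the snapshots that were
// valid at that instant (_ValidFrom <= date < _ValidTo). Assuming an item's
// validity ranges do not overlap, that is at most one snapshot per item.
// Timestamps are assumed to be ISO-8601 UTC strings in the same format,
// so plain string comparison orders them correctly.
function snapshotsAt(snapshots, queryDates) {
  const result = {};
  for (const date of queryDates) {
    result[date] = snapshots.filter(
      (s) => s._ValidFrom <= date && date < s._ValidTo
    );
  }
  return result;
}

// Made-up sample data: item 1 is revised twice on Jan 2.
const sample = [
  { ObjectID: 1, State: 'A', _ValidFrom: '2014-01-01T08:00:00.000Z', _ValidTo: '2014-01-02T09:00:00.000Z' },
  { ObjectID: 1, State: 'B', _ValidFrom: '2014-01-02T09:00:00.000Z', _ValidTo: '2014-01-02T15:00:00.000Z' },
  { ObjectID: 1, State: 'C', _ValidFrom: '2014-01-02T15:00:00.000Z', _ValidTo: '9999-01-01T00:00:00.000Z' },
];

const byDate = snapshotsAt(sample, [
  '2014-01-02T00:00:00.000Z',
  '2014-01-03T00:00:00.000Z',
  '2014-01-04T00:00:00.000Z',
]);
```

The intermediate revision (State 'B') never shows up in byDate, because it was not the valid snapshot at any of the query instants.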

Related

Is there a way to concatenate the results of multiple mongodb queries together in one statement?

I have a mongodb database that contains a large amount of data without a highly consistent schema. It is used for doing Google Analytics-style interaction tracking with our applications. I need to gather some output covering a whole month, but I'm struggling with the performance of the query, and I don't really know MongoDB very well at all.
The only way I can get results out is by restricting the timespan I am querying within to one day at a time, using the _timestamp field which I believe is indexed by default (I might be wrong).
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-01T00:00:00.000Z"),$lte:ISODate("2019-09-02T00:00:00.000Z")}}); // Day 1..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-03T00:00:00.000Z"),$lte:ISODate("2019-09-04T00:00:00.000Z")}}); // Day 2..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-05T00:00:00.000Z"),$lte:ISODate("2019-09-06T00:00:00.000Z")}}); // Day 3..
This works 'fine', but I'd rather be able to SQL-union those separate queries together - but then I guess I'd still end up timing out.
Ideally I'd end up with each of those queries executing separately, with the result set being appended to each time and returned at the end.
I might be better off writing a simple application to do this.
Help me Obi-Wan Kenobi, you're my only hope.
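One way the "run each day separately and append the results" idea could look is sketched below. This is only an illustration under stated assumptions: buildDailyFilters and collectAll are hypothetical helper names, and runFind is a hypothetical callback standing in for the actual query (in the mongo shell it could be something like function (f) { return db.myCollection.find(f).toArray(); }). The sketch uses $lt rather than $lte for the upper bound so that a document sitting exactly on a day boundary is not returned twice.

```javascript
// Build one filter per day covering [start, start + 1 day).
// $lt on the upper bound avoids returning a boundary document twice.
function buildDailyFilters(base, firstDay, dayCount) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const filters = [];
  for (let i = 0; i < dayCount; i++) {
    const from = new Date(firstDay.getTime() + i * DAY_MS);
    const to = new Date(from.getTime() + DAY_MS);
    filters.push(Object.assign({}, base, { _timestamp: { $gte: from, $lt: to } }));
  }
  return filters;
}

// Run each filter separately and append the results to one array.
// `runFind` is a hypothetical callback that executes one query and
// returns its documents as an array.
function collectAll(filters, runFind) {
  let all = [];
  for (const f of filters) {
    all = all.concat(runFind(f));
  }
  return all;
}

// 30 daily filters covering September 2019.
const filters = buildDailyFilters(
  { internalId: 'XYZ', username: 'Demo' },
  new Date('2019-09-01T00:00:00.000Z'),
  30
);
```

Each query stays as small as the single-day queries that already work, and the concatenation happens on the client.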

Calculating a proxy bit size in a BigQuery table

How does one go about calculating the bit size of each record in BigQuery sharded tables across a range of time?
Objective: how much has it grown over time
Nuances: Of the 70 some fields, some records would have nulls for most, some records would have long string text grabbed directly from the raw logs, and some of them could be float/integer/date types.
Wondering if there's an easy way to do a proxy count of the bit size for one day and then I can expand that to a range of time.
Example from my experience:
One of my tables is a daily sharded table with a daily size of 4-5TB. The schema has around 780 fields. I wanted to understand the cost of each data point (bit size); it was then used for calculating ROI based on cost/usage.
So, let me give you an idea of how the cost (bit size) side of it was approached.
The main piece here is the use of the dryRun property of the Jobs: query API.
Setting dryRun to true makes BigQuery return statistics about the job, such as how many bytes would be processed, instead of actually running it. And that's exactly what is needed here!
So, for example, the request below is designed to get the cost of trafficSource.referralPath in the ga_sessions table for 2017-01-05:
POST https://www.googleapis.com/bigquery/v2/projects/yourBillingProject/queries?key={YOUR_API_KEY}
{
  "query": "SELECT trafficSource.referralPath FROM `yourProject.yourDataset.ga_sessions_20170105`",
  "dryRun": true,
  "useLegacySql": false
}
You can get this value by parsing totalBytesProcessed out of the response. See an example of such a response below:
{
  "kind": "bigquery#queryResponse",
  "jobReference": {
    "projectId": "yourBillingProject"
  },
  "totalBytesProcessed": "371385",
  "jobComplete": true,
  "cacheHit": false
}
So, you can write a relatively simple script in the client of your choice that:
reads the schema of your table – you can use the Tables: get API for this, or if the schema is known and readily available you can simply hardcode it
loops through each and every field in the schema
inside the loop – calls the query API and extracts the size of the respective field (as outlined above), and of course logs it (or just collects it in memory)
As a result of the above, you will have a list of all fields with their respective sizes.
If you now need to analyze how those sizes change over time, you can wrap the above in yet another loop that iterates through as many days as you need and collects stats for each day.
If you are not interested in day-by-day analysis, you can just make sure your query covers the range you are interested in. This can be done with a Wildcard Table.
I consider this a relatively easy way to go.
Personally, I remember doing this in Go, but it doesn't matter - you can use any client that you are most comfortable with.
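For illustration, the per-field loop could be sketched in JavaScript roughly as follows. This is not the original Go script. buildDryRunBody, bytesFromResponse and fieldSizes are hypothetical names, and postQuery is a hypothetical function that would send the body to the Jobs: query endpoint and return the parsed JSON response (authentication and HTTP details are left out).

```javascript
// Build the request body for a dry-run query that selects a single field.
function buildDryRunBody(field, table) {
  return {
    query: 'SELECT ' + field + ' FROM `' + table + '`',
    dryRun: true,
    useLegacySql: false,
  };
}

// totalBytesProcessed arrives as a string in the response.
function bytesFromResponse(response) {
  return Number(response.totalBytesProcessed);
}

// Collect { fieldName: bytes } for every field in the list.
// `postQuery` is a hypothetical async function: body in, parsed JSON out.
async function fieldSizes(fields, table, postQuery) {
  const sizes = {};
  for (const field of fields) {
    const response = await postQuery(buildDryRunBody(field, table));
    sizes[field] = bytesFromResponse(response);
  }
  return sizes;
}
```

For the day-by-day analysis, the same fieldSizes call could be repeated once per daily shard name, collecting one result object per day.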
Hope this will help you!

What is a best way to organise the complex couchdb view (sql-like query)?

In my application I need a SQL-like query of the documents. The big picture is that there is a page with a paginated table showing the couchdb documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the us state, different numeric fields, etc. All of these columns are orderable, also there is a filter form allowing the user to filter by each of the fields.
To make this more concrete, below is a typical query that results from a customer setting some of the filter options and moving to the second page. It's written in pseudo-SQL, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any sql-like database, but couchdb is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above, I calculate the intersection of the sets of document IDs and apply the sorting using the result of the timestamp query. Then finally I can apply the slice, resolving the document IDs of rows 15-30, and download their content using a bulk get operation.
Needless to say, it's not the fastest operation. Currently the dataset I'm working with is roughly 10K documents. I can already see that calculating the intersection of the sets can take around 4 seconds, so obviously I need to optimize it further. I'm afraid to think how slow it's going to get in a few months when my dataset doubles, triples, etc.
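The intersect, sort and slice steps just described could be sketched as below. This is an illustration only, with hypothetical function names; it assumes each query result has already been reduced to an array of document IDs and that the timestamp query's IDs arrive in the desired sort order.

```javascript
// Intersect several arrays of document IDs. Starting from the smallest
// array keeps the working set as small as possible.
function intersectIds(idLists) {
  const sorted = idLists.slice().sort((a, b) => a.length - b.length);
  let current = new Set(sorted[0]);
  for (const list of sorted.slice(1)) {
    const next = new Set();
    for (const id of list) {
      if (current.has(id)) next.add(id);
    }
    current = next;
  }
  return current;
}

// Keep the order of the timestamp query, drop IDs that failed any other
// filter, then take the requested page.
function pageOfIds(orderedIds, otherLists, skip, limit) {
  if (otherLists.length === 0) {
    return orderedIds.slice(skip, skip + limit);
  }
  const allowed = intersectIds(otherLists);
  return orderedIds.filter((id) => allowed.has(id)).slice(skip, skip + limit);
}
```

Using Set lookups keeps each intersection linear in the size of the lists, but the whole ID set for every filter still has to be fetched before the page can be cut.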
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without losing the flexibility of the tool?
Is the view structure I've used optimal? At some point I was considering using a separate map() function for each field. This would result in smaller b-trees but more work for the view server to generate the indexes. Would I benefit from this?
The part of the algorithm where I have to calculate intersections of big sets just to take a slice of the result later bothers me. It's not a scalable approach. Does anyone know a better algorithm for this?
With a map function like this:
function(doc){
    if(doc.marked_as_test) return;
    emit([doc.dataspace, doc.timestamp, doc.fico], null);
}
You can make a similar request:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on database side, right?)
If you don't care about dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in query parameters.
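A minimal sketch of building such a request URL on the client side follows. buildViewUrl is a hypothetical helper and the epoch values are made up; the one thing it relies on is that CouchDB expects startkey and endkey as JSON values, so they are JSON-encoded and then URL-encoded.

```javascript
// Build a view query URL. startkey/endkey are JSON-encoded first, then
// every value is URL-encoded.
function buildViewUrl(baseUrl, params) {
  const parts = Object.keys(params).map((name) => {
    const value =
      name === 'startkey' || name === 'endkey'
        ? JSON.stringify(params[name])
        : String(params[name]);
    return name + '=' + encodeURIComponent(value);
  });
  return baseUrl + '?' + parts.join('&');
}

// Hypothetical epoch values, computed on the client.
const lastWeeksMonday = 1388966400;
const thisWeeksMonday = 1389571200;

const url = buildViewUrl('http://localhost:5984/db/_design/ddoc/_view/view', {
  startkey: ['production', thisWeeksMonday],
  endkey: ['production', lastWeeksMonday, 650],
  descending: true,
  limit: 15,
  skip: 15,
});
```

Because descending=true reverses the traversal, the later timestamp goes into startkey and the earlier one into endkey, as in the request above.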
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways to work around the mismatch.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.
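To illustrate the first option (one index per column), a design document could be generated instead of written by hand. The sketch below is only an assumption about how that might look: buildDesignDoc, the design document name and the by_<field> view names are all hypothetical, and each generated map function simply emits the field value as the key.

```javascript
// Generate a design document with one view per searchable field.
// Each view emits the field value as its key, so it can be range-queried
// and sorted independently of the other fields.
function buildDesignDoc(name, fields) {
  const views = {};
  for (const field of fields) {
    const accessor = 'doc[' + JSON.stringify(field) + ']';
    views['by_' + field] = {
      map:
        'function (doc) {\n' +
        '  if (' + accessor + ' !== undefined) {\n' +
        '    emit(' + accessor + ', null);\n' +
        '  }\n' +
        '}',
    };
  }
  return { _id: '_design/' + name, views: views };
}

const ddoc = buildDesignDoc('search', ['timestamp', 'dataspace', 'fico']);
```

For the second option, the same generator could be fed only the handful of fields (or field combinations) that users actually filter and sort on.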

Rally: getting the total story points, task hours, etc

I am using the Rally 2.0p4 API and attempting to aggregate the data to get a list of iterations with the sum of the story points per iteration. The only way I have found so far is to query the HierarchicalRequirement model, loop over all the data and populate an array. This seems less than ideal. Is there a way to get the totals back from the server directly?
If you want this data summarized by Iteration and/or Release, check out these objects in the Webservices API documentation (https://rally1.rallydev.com/slm/doc/webservice/):
IterationCumulativeFlowData
ReleaseCumulativeFlowData
These objects will provide a daily summary of:
CardCount (# Stories/Defects)
TaskEstimateTotal
CardEstimateTotal
CardToDoTotal
By State, within each Iteration or Release as specified by OID.
In case anyone else comes looking, this is one way that api call can appear:
https://rally1.rallydev.com/slm/webservice/1.30/iterationcumulativeflowdata.js?query=(%20IterationObjectID%20=%20%2211203475854%22%20)&fetch=CardCount,CardToDoTotal,CardEstimateTotal,IterationObjectID
That call syntax is delicate (requires a space before the closing paren on 'query', for example).
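Since the syntax is easy to get wrong by hand, the URL can be assembled in code. The sketch below uses a hypothetical helper, buildCfdUrl, and reproduces the endpoint and query shape of the call shown above. Note that encodeURIComponent also encodes the "=" inside the query as %3D, which differs textually from the URL above but decodes to the same query.

```javascript
// Build an IterationCumulativeFlowData request URL for one iteration.
function buildCfdUrl(iterationObjectId, fetchFields) {
  // The spaces just inside the parentheses are deliberate: the call above
  // is noted to require a space before the closing paren.
  const query = '( IterationObjectID = "' + iterationObjectId + '" )';
  return (
    'https://rally1.rallydev.com/slm/webservice/1.30/iterationcumulativeflowdata.js' +
    '?query=' + encodeURIComponent(query) +
    '&fetch=' + fetchFields.join(',')
  );
}

const cfdUrl = buildCfdUrl('11203475854', [
  'CardCount',
  'CardToDoTotal',
  'CardEstimateTotal',
  'IterationObjectID',
]);
```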

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
Check out the QuerySet method only(). When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields of your object until you try to access them.
If you have to deal with ForeignKeys that must also be pre-fetched, then also check out select_related.
The two links above to the Django documentation have good examples that should clarify their use.
Take a look at Django Debug Toolbar. It comes with a debugsqlshell management command that lets you see the SQL queries being generated, along with the time taken, as you play around with your models in a django/python shell.