I've been tasked with plotting some defect charts using Rally historical data. Right now I'm using a simple REST client to pull data at certain points in time and plot the count on a spreadsheet. What I'm doing right now is:
{
find : {
"_ProjectHierarchy": <projectId>,
"_TypeHierarchy": -51006,
"FoundInBuild" : {$regex: "3\\.3\\."},
"State" : {$in : ["Submitted","Open"] },
$or: [
{"Severity" : { $in : ["Catastrophic","Severe"] }},
{"Priority" : "showstopper"}
],
"__At" : "<date>"
},
pagesize : 1,false
}
I just run this once for every date I need the data for. That's a lot of queries! What I'm looking for is a way to run a single query using _ValidFrom and _ValidTo to enclose a time range, then pass it on to a SnapshotStore, then plot that on a Chart? I'm certain there's a way to do it, but I can't figure it out from the docs. Any help much appreciated.
Unfortunately, the example space for AppSDK2 and Lookback API is presently a bit thin. There are some cool apps out there, for instance, you may wish to check out David Thomas' cool Hackathon app:
Defect Re-work Trend
As a starting point. It queries LBAPI for Defects and stores the resulting data in a SnapshotStore. The App itself measures Defect "thrash" or the trending around how many times a Defect is re-opened during a particular development cycle.
In reviewing Hackathon apps, just be aware that certain methods and syntax for the SnapshotStore may change slightly in a future release of AppSDK2.
Related
In our raven based application we are starting to experience major performance issues when the master document starts increasing in size, as it holds a lot of collections that keep growing. As such I am now planning a major data redesign that is likely to take months and I want to be sure I'm on the right track before I do so.
The current design looks like this:
Community
{
id:,
name,
//other properties
members[
{
id:,
name:,
date of birth:
//etc
},
{
//another member, this list could potentially grow to hundreds of thousands
}
],
league
[
{
id,
name,
seasons[
{...},
{
id:,
divisions
[
{
id:,
name:
matches[
{
id:,
//match details
},
{
//another match. there could be hundreds here in a big league
},
{}
As we started hitting performance issues, we started using transformers to only load what is needed, but that didn't solve the problem fully as some of our leagues are a couple mb's just on their own. The other issue is we always need to be doing member checks to check for admin/membership rights so the members list is always needed.
I understand I could omit the member list completely using a transformer and use an index for membership checks, but the problem remains about what to do, when a member is added , that list will need to be loaded and with an upcoming project there is a potential for it to grow to half a million people or more.
So my plan is separate each entity into it's own document, so in the case of leagues, I will have a league document, and a match document, with matches containing {leagueId, season number, division number, other match details}.
Each member will have their own document with a list of community document Id's they're a member of.
I'm just a bit worried, that using this design, is missing the whole point of a document db and we may as well have used sql, or do you think I'm on the right track with approach?
According to this page in the docs, only one snapshot can be active for a given object. However, I seem to have a Defect with 2 active snapshots. All snapshots are shown in the screenshot below:
As you can, see I have connected the snapshots with arrows and they do not all link together. Is this a bug with Rally or is it in fact possible to have 2 defects with _ValidTo dates in the year 9999?
My query is taken from the example in the docs:
URI: https://rally1.rallydev.com/analytics/v2.0/service/rally/workspace/12345/artifact/snapshot/query.js
POST data:
{
"find": {
"ObjectID": my funky object
},
"fields": ["State", "_ValidFrom", "_ValidTo", "ObjectID", "FormattedID"],
"hydrate": ["State"],
"compress": true
}
The object should not have two current snapshots with _ValidTo set to 9999-01-01. Please contact CA Agile Central (Rally) support, and they will raise the issue with the LookbackAPI team, which I believe has a way of fixing the data on their end.
I have recently been tasked with a small project of setting up a periodic (somewhere between daily & weekly) data dump from an internal database to a 3rd party product. This project dovetails nicely with my company's desire (one which I share) to start standing up a formal service layer/API over the top of our data.
My personal preference is that those APIs should take on the form of RESTful endpoints - however, now I have what I think is a big design question - let me explain...
Looking at the data pull in question, it's hardly complicated. If I were just going to construct a one-off query, it would conceptually look a little something like:
select o.order_num, o.order_date, p.product_description, sr.sales_rep_name
from order o, line_item li, product p, sales_rep sr
where li.order_num = o.order_num
and li.product_id = p.product_id
and sr.sales_rep_id = o.sales_rep_id
and o.order_date >= [some arbitrary date]
Flipping my brain into "Resource Mode", I can think about how to convert this basic data model into URI's/payloads without too much trouble:
GET /orders/123
{
"order_num": 59324,
"order_date": "2014-07-07",
"sales_rep_uri": "/salesRep/34",
"line_items_uri": "/order/123/lineItems"
}
Getting more information about the sales rep:
GET /salesRep/34
{
"sales_rep_name": "Jane Doe",
"open_orders_uri": "/salesRep/34/orders"
}
Getting more information about line items:
GET /orders/123/lineItems
{
"line_items": [
{"order_uri": "/order/123", "product_uri": "/products/68"},
{"order_uri": "/order/123", "product_uri": "/products/99"}
]
}
And so on. I'm not saying it's a perfect API, I'm just trying to demonstrate it's not exactly rocket science to go about thinking how you might express the data model in a nicely normalized, resource-oriented type of way via RESTful URIs. But that is exactly where the design question comes into play...
On one hand, I can crank out a query to solve the problem very easily, but the very nature of queries requires the various domain concepts to be tightly coupled (in other words, utilizing joins to bring all of the normalized data together into one nice, custom-purpose de-normalized structure).
On the other hand, going through the mental process of thinking out a RESTful API leads me right back down that road of keeping things nicely compartmentalized - e.g. asking for "Order 123" shouldn't send me back this huge graph where I can see the full product description, the sales rep's phone number, etc, etc. The concept of a full blown HATEOAS-level RESTful API dictates consumers should be making subsequent GETs to drill down for that kind of detail only as-needed.
My question boils down to this: solving this use case seems really easy to do with a direct query and really difficult to do against a nice & tidy RESTful API (I'm picturing the literally 1000's of individual GETs it would take for me to assemble a weeks worth of data vs the few seconds it would take for the query to run). Is there some elegant subtlety of good RESTful design that I don't understand that would prevent me from seeing a good solution, or am I trying to fit a round peg into a square hole (i.e. REST is not good at pulling big data batches across multiple resources)?
I'm just going to throw this out there as a potential solution:
Conceptually, I treat the results of this query as a resource unto itself - like "orderReport".
Treating this as it's own resource, the API could behave something like:
GET /orderReport/[some arbitrary date]
You could then send back either a 201 Created (if the query is relatively quick running) with a location header like Location: /orderReport/[GUID]. Alternatively, if the query takes a while to run (I honestly don't know if it does or not off the top of my head), you could send back a 202 Accepted with a location header of Location: /orderReport/[GUID]/status.
You could then do follow up GETs against those URLs to get either the report status (200 OK if still processing without error, 201 Created with location header pointing to the report URL if its done) or the report itself.
There's nothing to say the report data couldn't also incorporate HATEOAS in addition to the data strictly needed to fulfill the use case requirements, like:
{
[
{
"order_num": 123,
"order_uri": "/orders/123",
"order_date": 2014-07-03,
"product_description": "widget",
"sales_rep_name": "Jane Doe",
"sales_rep_uri": "/salesRep/34"
},
{
"order_num": 456,
"order_uri": "/orders/456",
"order_date": 2014-07-04,
"product_description": "gadget",
"sales_rep_name": "Frank Smith",
"sales_rep_uri": "/salesRep/53"
}
]
}
The stats.grok.se tool provides the pageview statistics of a particular page in wikipedia. Is there a method to use the wikipedia api to get the same information? What does the page views counter property actually mean?
The Pageview API was released a few days ago: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}
https://wikimedia.org/api/rest_v1/?doc#/
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageview_API
For example https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Foo/daily/20151010/20151012 will give you
{
"items": [
{
"project": "en.wikipedia",
"article": "Foo",
"granularity": "daily",
"timestamp": "2015101000",
"access": "all-access",
"agent": "all-agents",
"views": 79
},
{
"project": "en.wikipedia",
"article": "Foo",
"granularity": "daily",
"timestamp": "2015101100",
"access": "all-access",
"agent": "all-agents",
"views": 81
}
]
}
No, there is not.
The counter property returned from prop=info would tell you how many times the page was viewed from the server. It is disabled on Wikipedia and other Wikimedia wikis because the aggressive squid/varnish caching means only a tiny fraction of page views would make it to the actual server in order to affect that counter, and even then the increased database write load for updating that counter would probably be prohibitive.
The stats.grok.se tool uses anonymized logs from the cache servers to calculate page views; the raw log files are available from http://dammit.lt/wikistats. If you need an API to access the data from stats.grok.se, you should contact the operator of stats.grok.se to request one be created.
Note this was written 4 years ago, and an API has since been created (see this answer). There's not yet a way to access that via api.php, though.
get the daily JSON for the last 30 days like this
http://stats.grok.se/json/en/latest30/Britney_Spears
You can look into the stats here.
Have anyone experienced some API to get the Pageview Stats?
Furthermore, I have also looked into the available Raw Data but could not find the solution to extract the Pageview Count.
There doesn't seem to be any API; however, you can make HTTP requests to stats.grok.se and parse the HTML or JSON result to extract the page view counts.
I created a website http://wikipediaviews.org that does exactly that in order to facilitate easier comparison for multiple pages across multiple months and years. To speed things up, and minimize the number of requests to stats.grok.se, I keep all past query results stored locally.
The code I used is available at http://github.com/vipulnaik/wikipediaviews.
The file with the actual retrieval code is in https://github.com/vipulnaik/wikipediaviews/blob/master/backend/pageviewqueries.inc
function getpageviewsonline($page, $month, $language)
{
$url = getpageviewsurl($page,$month,$language);
$html = file_get_contents($url);
preg_match('/(?<=\bhas been viewed)\s+\K[^\s]+/',$html,$numberofpageviews);
return $numberofpageviews[0];
}
The code for getpageviewsurl is in https://github.com/vipulnaik/wikipediaviews/blob/master/backend/stringfunctions.inc:
function getpageviewsurl($page,$month,$language)
{
$page = str_replace(" ","_",$page);
$page = str_replace("'","%27",$page);
return "http://stats.grok.se/" . $language . "/" . $month . "/" . $page;
}
PS: In case the link to wikipediaviews.org doesn't work, it's because I registered the domain quite recently. Try http://wikipediaviews.subwiki.org instead in the interim.
em.. this question was asked 6 years ago. There's no such an API in official site in the past.
It changed.
A simple example:
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageviews&titles=Buckingham+Palace%7CBank+of+England%7CBritish+Museum
See document:
prop=pageviews
Shows per-page pageview data (the number of daily pageviews for each of the last pvipdays days). The result format is page title (with underscores) => date (Ymd) => count.
I'm considering using MongoDB or CouchDB on a project that needs to maintain historical records. But I'm not sure how difficult it will be to store historical data in these databases.
For example, in his book "Developing Time-Oriented Database Applications in SQL," Richard Snodgrass points out tools for retrieving the state of data as of a particular instant, and he points out how to create schemas that allow for robust data manipulation (i.e. data manipulation that makes invalid data entry difficult).
Are there tools or libraries out there that make it easier to query, manipulate, or define temporal/historical structures for key-value stores?
edit:
Note that from what I hear, the 'version' data that CouchDB stores is erased during normal use, and since I would need to maintain historical data, I don't think that's a viable solution.
P.S. Here's a similar question that was never answered: key-value-store-for-time-series-data
There are a couple options if you wanted to store the data in MongoDB. You could just store each version as a separate document, as then you can query to get the object at a certain time, the object at all times, objects over ranges of time, etc. Each document would look something like:
{
object : whatever,
date : new Date()
}
You could store all the versions of a document in the document itself, as mikeal suggested, using updates to push the object itself into a history array. In Mongo, this would look like:
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
// make changes to obj
...
db.foo.update({object: obj._id}, {$push : {history : {date : new Date(), object : obj}}})
A cooler (I think) and more space-efficient way, although less time-efficient, might be to store a history in the object itself about what changed in the object at each time. Then you could replay the history to build the object at a certain time. For instance, you could have:
{
object : startingObj,
history : [
{ date : d1, addField : { x : 3 } },
{ date : d2, changeField : { z : 7 } },
{ date : d3, removeField : "x" },
...
]
}
Then, if you wanted to see what the object looked like between time d2 and d3, you could take the startingObj, add the field x with the value 3, set the field z to the value of 7, and that would be the object at that time.
Whenever the object changed, you could atomically push actions to the history array:
db.foo.update({object : startingObj}, {$push : {history : {date : new Date(), removeField : "x"}}})
Yes, in CouchDB the revisions of a document are there for replication and are usually lost during compaction. I think UbuntuOne did something to keep them around longer but I'm not sure exactly what they did.
I have a document that I need the historical data on and this is what I do.
In CouchDB I have an _update function. The document has a "history" attribute which is an array. Each time I call the _update function to update the document I append to the history array the current document (minus the history attribute) then I update the document with the changes in the request body. This way I have the entire revision history of the document.
This is a little heavy for large documents, there are some javascript diff tools I was investigating and thinking about only storing the diff between the documents but haven't done it yet.
http://wiki.apache.org/couchdb/How_to_intercept_document_updates_and_perform_additional_server-side_processing
Hope that helps.
I can't speak for mongodb but for couchdb it all really hinges on how you write your views.
I don't know the specifics of what you need but if you have a unique id for a document throughout its lifetime and store a timestamp in that document then you have everything you need for robust querying of that document.
For instance:
document structure:
{ "docid" : "doc1", "ts" : <unix epoch> ...<set of key value pairs> }
map function:
function (doc) {
if (doc.docid && doc.ts)
emit([doc.docid, doc.ts], doc);
}
}
The view will now output each doc and its revisions in historical order like so:
["doc1", 1234567], ["doc1", 1234568], ["doc2", 1234567], ["doc2", 1234568]
You can use view collation and start_key or end_key to restrict the returned documents.
start_key=["doc1", 1] end_key=["doc1", 9999999999999]
will return all historical copies of doc1
start_key=["doc2", 1234567] end_key=["doc2", 123456715]
will return all historical copies of doc2 between 1234567 and 123456715 unix epoch times.
see ViewCollation for more details