Is there a way to force Open Refine to extract all history changes into JSON - openrefine

I have been using OpenRefine for a couple of days and noticed that I can only extract the most recent operations, probably only the ones made in the last session of the project. The earlier operations are greyed out and cannot be selected. Is there any way to get the JSON associated with those operations?

Related

Creating a test-data container in Azure blob storage

I'm adding some testing to my current project which uses Azure blob storage to store telemetry data coming from a stream analytics job. I want to do testing of the routines that get the telemetry data, so I created a separate container for test data. I downloaded a sample set of data, modified the data to serve my needs and re-uploaded (using Azure storage explorer) everything back into the new container.
The tests were immediately failing and I quickly found out that this is because the LastModified date of the files changed into the date/time of upload. This is fine, but the sequence of the upload was also different. My code uses the modified date of the file to find out which one is the most recent, which would now return a different file based on the new dates.
I found that you cannot modify this property, although you can change another property to have it update. So I know the solution: I could write a quick script which gets the sequence of files from my production instance and then touches every file in the test instance in the same sequence.
But... I was wondering whether this is the best option. I also read that it's 'best practice' to store a custom datetime in a separate property, but I don't think I can do that straight from Stream Analytics (which is writing the blobs). I also considered using an Azure Function to do this (new blob => update property), but then I'm adding complexity and something that might fail for whatever reason.
So I'm looking for the best way to solve this problem. Anyone?
Update: this one probably deserves a bit more explanation. Apart from using the LastModified date to sort on, I also use it to filter blobs. The blobs themselves are CSV files containing ASA output data, i.e. telemetry records. Each record has a timestamp, but that information is IN the file. When retrieving data, I don't want to have to dive into each file to find out what the timestamps of those records are. So I use a prefilter to select the blobs within a certain timespan, and then only download / open those files to read the records inside.
This works perfectly as long as you do not touch any of the blobs, but obviously it stops working as soon as any blob gets modified for whatever reason. So I'm now convinced that I need a different / better way to solve this issue; but how?
It seems to me that you have two separate things: the data that you want to store in blob storage, and metadata about the blob such as the timestamp. I would create a separate (Azure) database for the metadata or, even simpler, just add metadata to the (block) blob:
// Assumes blockBlob is a CloudBlockBlob; "o" gives a culture-independent, round-trippable timestamp.
blockBlob.Metadata.Add("from", dateTime.ToString("o"));
blockBlob.Metadata.Add("to", dateTime.ToString("o"));
blockBlob.Metadata.Add("order", "1");
blockBlob.SetMetadata(); // persist the metadata to the storage service
For sorting I would just add a simple order property.
The comment by @Vignesh deserves the credit here, but in order to get this one marked as answered I'll provide the answer myself.
With ASA, you can set the output to be structured by date/time. That means in this case, data is written to the blob store with a directory structure such as:
2016 / 06 / 27 / 15 / 23 (= 27-06-2016 15:23)
2016 / 06 / 28 / 11 / 02 (= 28-06-2016 11:02)
The ASA output allows you to specify how granular you want the structure to be; in my case I chose to store it by day (so not including a time path). The ASA runtime will now ensure that data from a certain point in time is stored within a blob that resides in the correct path.
I then changed my logic to no longer use the datetime stamp of the individual blob files, but simply to read the files from the folders that fall within the time range I'm interested in. That ensures we only get data that was produced within that time range. And if there's more than one file in a folder, I need to load them all, since they are in the same time range anyway. As long as minutes are enough granularity for you, this works excellently, even though it might feel a bit strange to use a folder structure for such a thing.
Having a separate 'index' for blobs which tracks their datetime would work too of course, but it adds complexity which in this case I don't really need.
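As a rough illustration of the folder-based filtering described in this answer, here is a minimal Python sketch. It assumes the by-day path pattern `YYYY/MM/DD/` and works on a plain list of blob names rather than a real storage client; the function names `day_prefixes` and `blobs_in_range` are made up for this example.

```python
from datetime import date, timedelta
from typing import List


def day_prefixes(start: date, end: date) -> List[str]:
    """Return one 'YYYY/MM/DD/' prefix per day from start to end, inclusive."""
    if end < start:
        return []
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y/%m/%d/") for i in range(days + 1)]


def blobs_in_range(blob_names: List[str], start: date, end: date) -> List[str]:
    """Keep only the blob names that sit under one of the day folders in the range."""
    prefixes = tuple(day_prefixes(start, end))
    return [name for name in blob_names if name.startswith(prefixes)]
```

With a real storage client you would more likely pass each prefix to the client's list operation instead of filtering names locally, so that only the relevant folders are enumerated.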

Logging When Files Are Saved, Modified or Deleted Using VBA

I work with VBA in MS Access databases. I'd like to be able to log when files are saved, modified or deleted without having to update the existing code to do the logging when the pertinent events take place. I want the time, location and the name of the file.
I found a good example here: when file modified
However, it only allows for monitoring a particular location (path). I want to be able to log regardless of where the save, modify or delete takes place. I'm only allowed to program in the MS Office environment in this situation. It seems as though using the Windows API is going to be how this task will be achieved. However, I don't have much experience working with the API. Is there an easier way to achieve what I want that doesn't involve using the API?
Have you worked with After_Update or After_Insert macros? Also, is your application split, meaning there's a front end and a back end of the database? You can create a separate table that mirrors the table that you need to track changes for. Every time the table is updated, run a macro that inserts a row into that tracking table.
I'm assuming you're saving files to the database. If that's the case, add an after_update or after_insert macro that keeps track of when the files are modified or added to the table.

How should I deal with copies of data in a database?

What should I do if a user has a few hundred records in the database, and would like to make a draft where they can take all the current data and make some changes and save this as a draft potentially for good, keeping the two copies?
Should I duplicate all the data in the same table and mark it as a draft?
Or should I duplicate only the changes, and then use the "non-draft" data where no changes exist?
The user should be able to make their changes and then still go back to the live data and make changes there, without affecting the draft.
Just simply introduce a version field in the tables that would be affected.
Content management systems (CMS) do this already. You can create a blog post for example, and it has version 1. Then a change is made and that gets version 2 and on and on.
You will obviously end up storing quite a bit more data. A nice benefit though is that you can easily write queries to load a version (or a snapshot) of data.
As a convention you could always make the highest version number the "active" version.
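To make the version-field idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `record`, its columns, and the helper functions are assumptions invented for this example, not part of the original answer; the "highest version is active" convention follows the paragraph above.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record (record_id INTEGER, version INTEGER, body TEXT, "
    "PRIMARY KEY (record_id, version))"
)


def save_version(record_id: int, body: str) -> int:
    """Insert a new row with the next version number and return that number."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM record WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO record (record_id, version, body) VALUES (?, ?, ?)",
        (record_id, latest + 1, body),
    )
    return latest + 1


def load(record_id: int, version: Optional[int] = None) -> Optional[str]:
    """Load a specific version, or the highest ("active") one when version is None."""
    if version is None:
        row = conn.execute(
            "SELECT body FROM record WHERE record_id = ? ORDER BY version DESC LIMIT 1",
            (record_id,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT body FROM record WHERE record_id = ? AND version = ?",
            (record_id, version),
        ).fetchone()
    return row[0] if row else None
```

Nothing is ever updated in place here, so an older version stays loadable as a snapshot at the cost of the extra storage mentioned above.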
You can either use BEGIN TRANSACTION, COMMIT and ROLLBACK statements, or you can create a stored procedure / piece of code so that any amendments the user makes are put into temporary tables until they are ready to be put into production.
If you are making a raft of changes it is best to use temporary tables, as holding a transaction open until COMMIT can result in locks on the live data for other users.
This article might help if the above means nothing to you: http://www.sqlteam.com/article/temporary-tables
EDIT - You could create new tables (i.e. NOT temporary, but full-fledged SQL tables) "on the fly" and name them something meaningful, for instance the user's initials, followed by the original table name, followed by a timestamp.
You can then programmatically create, amend and delete these tables over long periods of time, as well as compare them against the live tables. You would need to keep track of how many tables are being created in case your database grows to a vast size.
The only major headache then is putting the changes back into the live data. For instance, if someone takes a cut of data into a new table and then 3 weeks later decides to send it into live after making changes. In this instance there is a likelihood of the live data having changed anyway and possibly superseding the changes the user will submit.
You can get around this with some creative coding though. There are many ways to tackle this, so if you get stuck at the next step you might want to start a new question. Hopefully this at least gives you some inspiration though.

How do I use the version one api to get project and sprint burndown charts?

I am trying to use the Version One api to get the project and sprint burndown charts.
I am reading this page but I am just getting confused.
Has anybody done something similar and have any tips for how to hit the api to get what I want?
The VersionOne API does not serve images or chart-specific data. You can use the query language and the REST endpoint to produce the data that is needed for a burndown. You would need to read/parse the data and produce a graph yourself.
With that being said, a burndown graph compares how much estimate is closed versus how much is open, and graphs that over time. So you need to know three pieces of data: open estimate, closed estimate, and time. You'll also want to limit it to a certain project (and its children).
This should get you close to the data you need for a project burndown:
http://<host>/VersionOne/rest-1.v1/Data/Timebox?where=Schedule.ScheduledScopes='Scope:1055'&sel=Name,BeginDate,EndDate,Workitems:Story[AssetState!='Closed'].Estimate.@Sum,Workitems:Story[AssetState='Closed'].Estimate.@Sum&sort=+EndDate
Be sure to change Scope:1055 to the project oid that you're interested in.
This is how I got there. First I was thinking "well you need to sum up a bunch of story estimates" so I thought I'd do a historical query of stories:
http://<host>/VersionOne.Web/rest-1.v1/Hist/Story?where=Scope.ParentMeAndUp='Scope:1055'
But quickly found that you cannot aggregate on your root. What that means is if I want to sum up estimate, I need to use something else like Project (scope) to get at the data:
http://<host>/VersionOne.Web/rest-1.v1/Hist/Scope/1055?sel=Workitems:Story[AssetState!='Closed'].Estimate.@Sum,Workitems:Story[AssetState='Closed'].Estimate.@Sum,ChangeDate
The problem with this query is that it gives you what the closed versus open estimate looked like at irregular intervals, namely whenever the project changed. So it wouldn't make a very nice-looking graph.
But as you know VersionOne has a concept of Iterations and Schedules that are associated to a project, and stories are associated to iterations. So I used that as a root to query for and aggregate story estimates, and limit the data to projects that use that schedule.
The data that is produced is more regular (grouped by iteration) and contains correctly aggregated estimate data.
So what is left? You'll have to aggregate the aggregation of estimate data to get a total estimate number for your project. Then you'll need to produce a graph (maybe bar or line) where each data point is at the end of an iteration. You'll keep a running total of closed estimate and add that to the iteration's total to produce the data point.
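The remaining steps could be sketched roughly as follows. This is one reading of the description above, not VersionOne's own burndown logic: it assumes you have already parsed the query response into one row per iteration, and the `IterationRow` type and `burndown_points` function are names invented for this example.

```python
from typing import List, NamedTuple, Tuple


class IterationRow(NamedTuple):
    end_date: str            # EndDate of the iteration in ISO form, e.g. "2013-08-09"
    open_estimate: float     # summed Estimate of stories that are not closed
    closed_estimate: float   # summed Estimate of closed stories


def burndown_points(rows: List[IterationRow]) -> List[Tuple[str, float]]:
    """Return (iteration end date, remaining estimate) pairs in date order."""
    ordered = sorted(rows, key=lambda r: r.end_date)
    # Aggregate the per-iteration aggregates into one project total.
    total = sum(r.open_estimate + r.closed_estimate for r in ordered)
    points = []
    closed_so_far = 0.0
    for row in ordered:
        # Running total of closed estimate up to the end of this iteration.
        closed_so_far += row.closed_estimate
        points.append((row.end_date, total - closed_so_far))
    return points
```

Each returned pair is one data point at the end of an iteration, ready to be handed to whatever charting library you prefer.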
You need to do multiple queries to produce a burndown. First find the date range for the burndown:
/Data/Timebox?sel=BeginDate,EndDate&where=Name='X'
Now for every day in the date range, sum up the ToDo hours as of that point in history:
/Hist/Timebox?asof=2013-08-09T23:59:59&where=Name='X'&sel=Workitems[Team.Name='Y';AssetState!='Dead'].ToDo.@Sum
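A small sketch of how the per-day queries might be generated in Python. It only builds the URLs and sends nothing; the base URL is a placeholder, authentication is left out, the aggregate is written as `@Sum`, and `daily_todo_queries` is a name made up for this example.

```python
from datetime import date, timedelta
from typing import List
from urllib.parse import quote


def daily_todo_queries(base_url: str, timebox_name: str, team_name: str,
                       begin: date, end: date) -> List[str]:
    """Build one /Hist/Timebox query URL per day from begin to end, inclusive."""
    # Percent-encode spaces and similar characters, but keep the query syntax readable.
    where = quote(f"Name='{timebox_name}'", safe="='")
    sel = quote(
        f"Workitems[Team.Name='{team_name}';AssetState!='Dead'].ToDo.@Sum",
        safe="[]'=!;.@",
    )
    urls = []
    day = begin
    while day <= end:
        asof = f"{day.isoformat()}T23:59:59"  # end of that day
        urls.append(f"{base_url}/Hist/Timebox?asof={asof}&where={where}&sel={sel}")
        day += timedelta(days=1)
    return urls
```

Feeding in the BeginDate and EndDate returned by the first query gives one request per day, and the summed ToDo value from each response becomes one point on the burndown.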
The API and documentation are excellent. If you are interested in seeing the code for some custom reports, check out https://github.com/timothypratley/vone/blob/master/src/vone/models/queries.clj (the code is in Clojure). There is a burndown, cumulative flow, and more :)
There is now a "recipe" to query for burndown data that works with the query.v1 API endpoint.

How to find a document's visitor count?

I need to count the visitors to a particular document.
I could do it by adding a field and incrementing its value.
But the problem is the following:
I have 10 replica copies in different locations, replicated on a schedule. Replication conflicts occur because updating the count edits the same document in different locations.
I would use an external solution for this. Just search for "visitor count" in your favorite search engine and choose a third party tool. You can then display the count on the page if that is important.
If you need to store the value in the database for some reason, perhaps you could store it as a new doc type that gets added each time (and cleaned up later) to avoid the replication issues.
Otherwise if storing it isn't required consider Google Analytics too.
I also faced this problem, and I cannot say that it has an easy solution. Document locking is the only solution that I found, but with it the visitor count is not possible.
It is possible, but not by updating the document. Instead have an AJAX call to an agent or form with parameters on the URL identifying the document being read. This call writes a document into a tracking DB with one or two views and then determines from those views how many reads you have had. The number of reads is the return value of the AJAX form.
This can be written in LS, Java or @Formulas. I would try to do it 100% in @Formulas to make it as efficient as possible.
You can also add logic to exclude reads from the same user or same source IP address.
The tracking database then replicates using the same schedule as the other database.
Daily or hourly agents can run to create summary documents and delete the detail documents so that you do not exceed the limits for @DbLookup.
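The reason this avoids replication conflicts is that every read adds a new document instead of editing a shared one. The idea can be modelled in a few lines of Python; this is a toy simulation with in-memory sets standing in for replicas, not Domino code, and all names in it are invented for the illustration.

```python
import uuid
from collections import Counter
from typing import Dict, Set, Tuple

# A read record is (unique id, document key, reader); records are only ever added.
ReadRecord = Tuple[str, str, str]


def record_read(replica: Set[ReadRecord], doc_key: str, reader: str) -> None:
    """Write a new tracking record; no existing record is modified."""
    replica.add((uuid.uuid4().hex, doc_key, reader))


def replicate(a: Set[ReadRecord], b: Set[ReadRecord]) -> None:
    """Two-way replication: each side receives the records it is missing."""
    a.update(b)
    b.update(a)


def read_counts(replica: Set[ReadRecord], unique_readers: bool = False) -> Dict[str, int]:
    """Count reads per document, optionally counting each reader only once."""
    if unique_readers:
        pairs = {(doc, reader) for _, doc, reader in replica}
        return dict(Counter(doc for doc, _ in pairs))
    return dict(Counter(doc for _, doc, _ in replica))
```

Because replication here is just a union of add-only records, two sites can log reads of the same document at the same time and still converge on the same count.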
If you do not need very nearly real-time counts (and that is the best you can get with a replicated system like this), you could use the web logs that Domino generates, finding the reads in the logs and building the counts in one document per server.
/Newbs
Back in the 90s, we had a client that needed to know that each person had read a document without them clicking to sign or anything.
The initial solution was to add each name to a text field on a separate tracking document. This quickly ran into problems once the field grew past 32k. Then one of my colleagues realized you could just have it create a document for each user to record that they'd read it.
Heck, you could have one database used to track all reads for all users of all documents, since one user can only open one document at a time -- each time they open a new document, either add that value to a field or create a field named after the document they've read on their own "reader tracker" document.
Or you could make that a mail-in database, so there are no worries about replication. Each time they open a document for which you want to track reads, it creates a tiny document that has only their name and which document they read, and that gets mailed into the "read counter database". If you don't care who read it, you can have an agent that runs on a schedule, updates the count and deletes the mailed-in documents.
There really are a lot of ways to skin this cat.