Adding custom data to GapMinder - data-visualization

Does anyone have any experience adding their own data to GapMinder, the really cool software that Hans Rosling uses in his TED talks? I have an array of objects in JSON that would be easy to show as moving bubbles. This would be really cool.
I can see that my Ubuntu box has what looks like data in /opt/Gapminder Desktop/share/assets/graphs/world, but I would need to figure out:
How to add a measure to a graph
How to add a data series
How to set the time range of the data
Identify the measures to follow at each time step
and so on.

Just for the record: if you want to use Gapminder with your own dataset, you have to convert your data into a format suitable for Gapminder. More specifically, looking in assets/graphs/world, you will have to:
Edit the file overview.xml, which contains the tree structure of all the indicators (just copy/paste an entry and specify your own data);
Convert your data by copying the structure of the XML files in that directory (this is the tricky part): you can specify some metadata in the preamble, and then specify your own data series with something like:
<t1 m="i20,50.0,99.0,1992" d="90.0, ... ,50.0, ..."/> where i20 is the country id, followed by the minimum and maximum of the series and the year it refers to.
In my humble opinion, Gapminder is a great app, but it definitely needs more work on integration with other datasets. It may be better to use Google Motion Chart, as you did, or MooGraph (site and doc), which is unfortunately not as great as Gapminder.

@Stefano
The information you provided is very valuable. Is a detailed specification of the XML files containing the data available somewhere?
Anyway, just to enrich your response, I also found that:
The overview.xml file
The link between nations and their IDs is in this file.
The structure of the menus for selecting the indicators is also in the same file (at the bottom), under the <indicatorCategorization> section.
The structure of the datafile XML
In each line, the year represents the first year of the series, and then the values follow, one per year, comma separated.
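Putting the two notes together, reading one of those rows back is straightforward (a minimal sketch, assuming the layout described above; the sample values are invented):

m = "i20,50.0,99.0,1992"              # country id, series min, series max, first year
d = "90.0,85.5,70.2,50.0"             # one value per year
country_id, series_min, series_max, first_year = m.split(",")
values = [float(v) for v in d.split(",")]
series = {int(first_year) + i: v for i, v in enumerate(values)}
# series -> {1992: 90.0, 1993: 85.5, 1994: 70.2, 1995: 50.0}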
Thanks,
Max

I ended up using the Google Motion Chart API; this is what I came up with.


How to partition by in pandas and output to a word doc?

I have a table I have filtered from API data; it is my highlights from across the web. Ultimately, I want to output these to a doc file, grouped by the page they came from.
I have the API data filtered down to two columns:
url|quote
How do I, for each url, output its quotes to a doc file? Or, just for starters, how do I iterate through the set of quotes for each url?
In SQL it would be something like this:
SELECT quote OVER (PARTITION BY url) AS sub_header
FROM table
url quote
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg I actually think that the bigger problem is not necessarily having the ideas. I think everyone has lots of interesting ideas. I think the bigger problem is not killing the bad ideas fast enough. I have the most respect for the Codecademy founders in this respect. I think they tried 12 ideas in seven weeks or something like that, in the summer of YC.
https://jotengine.com/transcriptions/WIUL8HBabqxffIDOkUA9Dg We were like what the heck is going on here so we went and visited five of our largest customers in New York, this was about three years ago and we said okay, you're using the S3 integration but what the heck are you using it for? For five out of five customers in a row, they said well we have a data engineering team that's taking data from the S3 bucket, converting it into CS view files and managing all the schema-translations and now they're uploading it into a data warehouse like Redshift. The first time I heard that from a customer, I was like okay, that's interesting
I want to output a url header followed by all the quotes I've highlighted on that page. Ideally, my final product will be a .docx file.
It would be great if you could provide some source code to help explain your problem. From looking at your question, I would say all you need to do is put your columns into a DataFrame, then export this to Excel.
import pandas as pd

df = pd.DataFrame({"url": url, "quote": quote})  # url and quote: equal-length lists/Series
df["quote"].to_excel("filename.xlsx")            # write just the quotes to a spreadsheet
Hope this helps.
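If you do need the grouped-by-url layout in a .docx file, a rough sketch using groupby together with the python-docx package (an assumption on my part, it is not part of pandas) could look like this:

import pandas as pd
from docx import Document   # pip install python-docx

df = pd.DataFrame({"url": url, "quote": quote})  # url and quote: the two columns from your API data

doc = Document()
for page_url, group in df.groupby("url"):   # "partition by url"
    doc.add_heading(page_url, level=2)      # one sub-header per url
    for q in group["quote"]:
        doc.add_paragraph(q)                # all quotes highlighted on that page
doc.save("highlights.docx")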

What is the best method to extract recurring blob data and put it in another table? - SQL

I'm developing a new webpage (in the .NET framework, if that helps) for the scenario below. Every single day, we get a cab drivers' report.
Date | Blob
-------------------------------------------------------------
15/07 | {"DriverName1":"100kms", "DriverName2":"10kms", "Hash":"Value"...}
16/07 | {"DriverName1":"50kms", "DriverName3":"100kms", "Hash":"Value"}
Notice that 'Blob' is the actual data received, in JSON format; it contains information about the distance covered by each driver on that particular day.
I have written a service which reads the above table, breaks it down further, and puts it into a new table like the one below:
Date  | DriverName  | KmsDriven
15/07 | DriverName1 | 100
15/07 | DriverName2 | 10
16/07 | DriverName3 | 100
16/07 | DriverName1 | 50
By populating this, I can easily do the following queries:
How many drivers drove on that particular day.
How 'DriverName1' did for that particular week, etc.
My questions here are:
Is there anything in the .NET / SQL world that specifically addresses this, or am I reinventing the wheel here?
Is this the right way to use the Blob data?
Are there any design patterns to adhere to here?
Is there anything in the .NET / SQL world that specifically addresses this, or am I reinventing the wheel here?
Well, there are JSON parsers available, for example Newtonsoft's Json.NET. Or you can use SQL Server's own JSON functions (such as OPENJSON, available in SQL Server 2016 and later). Once you have extracted the individual values from the JSON, you can write them into the corresponding columns in your new table.
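As a language-neutral illustration of that shredding step (this is only a sketch of the logic; in your case it would be C# with Json.NET or T-SQL with OPENJSON, and skipping the "Hash" key is an assumption based on your sample blob):

import json

date = "15/07"
blob = '{"DriverName1":"100kms", "DriverName2":"10kms", "Hash":"Value"}'

rows = []
for driver, distance in json.loads(blob).items():
    if driver == "Hash":                      # skip the non-driver entries
        continue
    kms = float(distance.rstrip("kms"))       # "100kms" -> 100.0
    rows.append((date, driver, kms))
# rows -> [("15/07", "DriverName1", 100.0), ("15/07", "DriverName2", 10.0)]
# each tuple becomes one (Date, DriverName, KmsDriven) row in the new table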
Is this the right way to use the Blob data?
No. It violates the principle of atomicity, and therefore the first normal form.
Are there any design patterns to adhere to here?
I'm not sure about "patterns", but I don't see why you would need a BLOB in this case.
Assuming the data is uniform (i.e. it always has the same fields), you can just declare the columns you need and write directly to them (as you already proposed).
Otherwise, you may consider using SQL Server's XML data type, which will enable you to extract some of the sections within an XML document, or insert a new section without replacing your whole document.

UniData - record count of all files / tables

Looking for a shortcut here. I am pretty adept with SQL database engines and ERPs. I should clarify... I mean databases like MS SQL, MySQL, PostgreSQL, etc.
One of the things that I like to do when I am working on a new project is to get a feel for what is being utilized and what isn't. In T-SQL this is pretty easy. I just query the information schema and get a row count of all the tables and filter out the ones having rowcount = 0. I know this isn't truly a precise row count, but it does give me an idea of what is in use.
So I recently started at a new company and one of their systems is running on UniData. This is a pretty radical shift from mainstream databases and there isn't a lot of help out there. I was wondering if anybody knew of a command to do the same thing listed above in UniBasic/UniQuery/whatever else.
Which tables (files) are heavily populated and which ones are not?
You can start with a special "table" (or file in Unidata terminology) named VOC - it will have a list of all the other files that are in your current "database" (aka account), as well as a bunch of other things.
To get a list of files in (or pointed to) the current account:
:SORT VOC WITH F1 = "F]" "L]" "DIR" F1 F2
Try HELP CREATE.FILE if you're curious about the difference between F and LF and DIR.
Once you have a list of files, weed out the ones named *TEMP* or *WORK* and start digging into the ones that seem important. There are other ways to get at what's important (e.g., using triggers or timestamps), but browsing isn't a bad idea to see what conventions are used.
Once you have a file that looks interesting (let's say CUSTOMERS), you can look at the dictionary of that file to see what fields it contains:
:SORT DICT CUSTOMERS F1 F2 BY F1 BY F2 USING DICT VOC
It can help to create something like F2.LONG in DICT VOC to increase the display size up from 15 characters.
Now that you have a list of "columns" (aka fields or attributes), you're looking for the D-type attributes, which tell you what columns are actually stored in the file; V- and I-types are calculations.
The PIVOT program at https://github.com/ianmcgowan/SCI.BP/blob/master/PIVOT is helpful for profiling when you see an attribute that looks interesting and you want to see what the data looks like.
http://docs.rocketsoftware.com/nxt/gateway.dll/RKBnew20/unidata/previous%20versions/v8.1.0/unidata_userguide_v810.pdf has some generally good information on the concepts and there are many other online manuals available there. It can take a lot of reading to get to the right thing if you don't know the terminology.

NetLogo: how to read values from a data set, assigning values at each tick?

I'm modelling salmon population dynamics, and I have a real data set of temperature and flow. I would like to assign a daily value to these two parameters at each tick, treating the first tick as the first day in the dataset and then keep reading through the file.
How can I do that?
Jacopo
NetLogo has fairly extensive IO capabilities for text files (and thus for CSV). You apparently have your data in a simple CSV file, so you will need to use these capabilities. For simple IO examples, see https://subversion.american.edu/aisaac/notes/netlogo-intro.xhtml#file-based-io There are also lots of examples of reading CSV files on the web (e.g., http://netlogoabm.blogspot.com/2014/01/reading-from-csv-file.html). Unfortunately, NetLogo does not provide a CSV reader.
You suggest you would like to repeatedly read from the file. You will then have to leave the file open for the entire simulation. Each tick you can read in one line from each open file.
Unless it is a very large dataset, I would rather read all the data into two global lists (e.g., temperatures and flows) at the very beginning. Since you say you want to update the values each tick, use the current tick value to index into these lists, e.g., set temp item ticks temperatures. (Here I assume you only use tick to advance the tick counter, so that you get successive integers. Also, if you tick before you start reading data, you'll need to use ticks - 1.)
hth

customizing output from database and formatting it

Say you have an average-looking database, and you want to generate a variety of text files, each with its own specific formatting, so the files may have rudimentary tables and spacing. You'd be taking the data from the database, transforming it into a specified format (while doing some basic logic), and saving it as a text file (you could store it in XML as an intermediate step).
If you had to create 10 of these unique files, what would be the ideal approach? I suppose you could create a class for each type of transformation, but then you'd need quite a few classes, and what if you needed to create another 10 files a year down the road?
What do you think is a good approach to this problem: one that keeps the output files customizable without creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other:
a) A query processor which can run a query on a given database and output the results in a well-known XML format.
b) An XSL stylesheet which can interpret the well-known XML format from (a) and transform it into the desired output format.
c) An XML-to-text transformer which can read the output of (a) and the stylesheet from (b) and put out the result.
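Here is a minimal sketch of step (c) in Python with lxml (the file names are hypothetical, and it assumes (a) and (b) have already produced their files; with <xsl:output method="text"/> in the stylesheet, the result serializes straight to plain text):

from lxml import etree   # pip install lxml

xml = etree.parse("report-data.xml")      # output of (a)
xslt = etree.parse("invoice.xsl")         # stylesheet from (b)
result = etree.XSLT(xslt)(xml)            # apply the transform

with open("invoice.txt", "w") as out:
    out.write(str(result))                # the formatted text file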