dimple.js aggregating non-numerical values - data-visualization

Say I have a CSV containing exactly one column, Name. Here are the first few values:
Name
A
B
B
C
C
C
D
Now I wish to create a bar plot using dimple.js that displays the distribution of Name (i.e. the count of each distinct Name), but I can't find a way to do that in dimple without having the actual counts of each name. I looked up the documentation of chart.addMeasureAxis (https://github.com/PMSI-AlignAlytics/dimple/wiki/dimple.chart#addMeasureAxis) and found this for the measure argument:
measure (Required): This should be a field name from the data. If the field contains numerical values they will be aggregated, if it contains any non-numeric values the distinct count of values will be used.
So basically that says it doesn't matter how many times each category occurs; dimple is just going to treat each of them as if it occurred once?
I tried chart.addMeasureAxis('y', 'Name'), and the result was a bar plot with every bar at a value of 1 (i.e. each name counted only once).
Is there a way to achieve this without changing the data set?

You do need something else in your data which distinguishes the rows:
e.g.
Type Name
X A
Y A
Z A
X B
Y B
A C
You could do it with:
chart.addCategoryAxis("x", "Name");
chart.addMeasureAxis("y", "Type");
or add a count in:
Name Count
A 1
A 1
A 1
B 1
B 1
C 1
Then it's just
chart.addCategoryAxis("x", "Name");
chart.addMeasureAxis("y", "Count");
But there isn't a way to do it with just the name.
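If the file itself should stay untouched, the Count field could be added in memory after the CSV is parsed. This is a minimal sketch in plain JavaScript, assuming the parsed rows are an array of objects with a Name field; `rows` and `withCount` are hypothetical names used only for illustration.

```javascript
// Hypothetical sample rows, shaped like a parsed CSV (one object per row).
const rows = [
  { Name: "A" }, { Name: "B" }, { Name: "B" },
  { Name: "C" }, { Name: "C" }, { Name: "C" }, { Name: "D" }
];

// Return a copy of every row with a constant Count of 1 added.
// The original row objects are left unchanged.
function withCount(data) {
  return data.map(row => Object.assign({}, row, { Count: 1 }));
}

const counted = withCount(rows);
```

The `counted` array would then be handed to the chart in place of the raw data, with "Count" as the measure, so that the numeric values are aggregated per Name.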

Hope this helps. First use d3.nest() to group the rows by their distinct values, then add the count of each distinct value to the newly created objects.
var countData = d3.nest()
    .key(function (d) {
        return d["Name"]; // group rows by the Name column
    })
    .entries(yourData);
countData.forEach(function (c) {
c.count = c.values.length;
});
After this, create the dimple chart from the new countData object:
var myChart = new dimple.chart(svg, countData);
Then, for the x and y axes:
myChart.addCategoryAxis("x", "key");
myChart.addMeasureAxis("y", "count");
Please look into d3.nest() to understand the structure of the countData object.
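To make the shape of countData concrete, here is a rough plain-JavaScript stand-in for the grouping step. It is only a sketch: `countByKey` and `sample` are hypothetical names, and the code imitates the key/values objects that the nesting produces rather than calling d3 itself.

```javascript
// Hypothetical sample rows, matching the single-column CSV from the question.
const sample = [
  { Name: "A" }, { Name: "B" }, { Name: "B" },
  { Name: "C" }, { Name: "C" }, { Name: "C" }, { Name: "D" }
];

// Group rows by one field and attach the group size as `count`.
// Each result object looks like { key, values, count }.
function countByKey(data, field) {
  const groups = new Map(); // a Map keeps keys in first-seen order
  for (const row of data) {
    const key = String(row[field]);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  }
  return Array.from(groups, ([key, values]) => ({ key, values, count: values.length }));
}

const countData = countByKey(sample, "Name");
console.log(countData.map(d => d.key + ": " + d.count).join(", "));
// prints "A: 1, B: 2, C: 3, D: 1"
```

Each element carries the distinct name in `key` and the number of matching rows in `count`, which is why the axes above use "key" and "count".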

Related

Calling for a column in a dataframe

I'm creating a data frame (df) with two columns, PEA_0 and PEA_180. To input values in the cells of each column I could do this:
df$PEA_0 <-1
df$PEA_180 <-1
But since I'm expanding my df, I would like an easy way to change the number after "PEA_". I would like to assign the number to a variable so that I can change it on the fly later:
a<-0
b<-180
and run the code like this:
paste("df$PEA_", a, sep="") <- 1
I want R to understand that the code above means df$PEA_0 <- 1, but R only sees "df$PEA_0" <- 1 and throws the error:
Error in paste("tilst$Death_", a) <- 1 :
target of assignment expands to non-language object
Any thoughts on how to get around this?

how to sum rows in my dataframe Pandas with specific condition?

Could anyone help me?
I want to sum the values in the format:
print (...+....+)
for example:
a b
France 2
Italie 15
Croatie 7
I want to compute the sum of France and Croatie.
Thank you for your help!
One possible solution:
set column a as the index,
using loc select rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note the double square brackets. The outer pair is the indexing operator of loc.
The inner pair, together with what is inside it, is a list of the index values to select.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie', 'USA']].sum() - wrk.loc[['France', 'Croatie']].sum()

AttributeError: 'int' object has no attribute 'count' while using itertuples() method with dataframes

I am trying to iterate over rows in a pandas DataFrame using the itertuples() method, which works quite well for my case. Now I want to check whether a specific value ('x') is in a specific tuple. I used the count() method for that, as I need the number of occurrences of x later.
The weird part is that for some tuples this works just fine (e.g. in my case (namedtuple[7].count('x')) + (namedtuple[8].count('x'))), but for others (e.g. namedtuple[9].count('x')) I get an AttributeError: 'int' object has no attribute 'count'
I would appreciate your help very much!
Apparently, some columns of your DataFrame are of object type (actually strings)
and some of them are of int type (more generally, numbers).
To count occurrences of x in each row, apply a function to each element of the row which:
checks whether the type of the current element is str,
if it is, returns count('x'),
if not, returns 0 (don't attempt to look for x in a number).
Applied to a row, this function returns a Series with the number of x in each column
(separately), so to compute the total for the whole row, this Series should
be summed.
Example of working code:
Test DataFrame:
C1 C2 C3
0 axxv bxy 10
1 vx cy 20
2 vv vx 30
Code:
for ind, row in df.iterrows():
    # Count 'x' only in string cells; any non-string cell contributes 0
    print(ind, row.apply(lambda it: it.count('x') if isinstance(it, str) else 0).sum())
(In my opinion, iterrows is more convenient here.)
The result is:
0 3
1 1
2 1
So as you can see, it is possible to count occurrences of x,
even when some columns are not strings.

Conditional formatting: max value, comparing rows with specific data

I am making an exercise tracking sheet with Google Sheets, and ran into a problem. The sheet has a table for raw data such as day, exercise type chosen from a validated list, and sets, reps, weight, you name it. To find the useful information for analysis, I have set up a pivot table. I want to find the max values for each type of value per exercise.
For example, comparing all three instances of "DL-m BB" in column D, the table should highlight the highest values among them: H9 would be the record weight, F5 the record volume and so on, and for "SQ-lb BB box" H12 would be the max weight and F3 the max volume. Eventually the table will have several hundred rows per year, and finding the max values per exercise per attribute by hand is going to be too much of a task; that time is better spent elsewhere.
The conditional formatting can be set as follows for the two examples you give above. A separate rule is set for each. Both are set from the same cell (H1), the second one as an additional rule.
Apply to Range
H1:H1000
Custom Formula is
=$H1=max(filter($D:$H,$D:$D="DL-m BB"))
Add another rule
Apply to Range
H1:H1000
Custom Formula is
=$H1=max(filter($D:$H,$D:$D="SQ-lb BB box"))
Place this on your pivot table page (try M1; it must be outside of the pivot table).
It lists the max values correctly.
=UNIQUE(query($D2:$K,"SELECT D,Max(F),Max(G),Max(H),Max(I),Max(J), Max(K) Where D !='' Group By D Label Max(F) 'Max F', Max(G) 'Max G', Max(H) 'Max H', Max(I) 'Max I',Max(J) 'Max J',Max(K) 'Max K'"))
The below query lists the max for F
=query($D2:$F,"SELECT Max(F) Where D !='' Group By D label Max(F)''")
I have been trying this with conditional formatting and it almost works. Maybe you will see something I don't. Still trying.
This works.
function onOpen() {
  keepUnique();
}

// Build an array of the unique, non-blank values to find maxima for.
function keepUnique() {
  var col = 4; // source column; getRange is 1-indexed, so 4 is column D
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sh = ss.getSheets()[1]; // second sheet of the spreadsheet
  var data = sh.getRange(2, col, sh.getLastRow() - 2).getValues();
  var newdata = [];
  for (var nn = 0; nn < data.length; nn++) {
    var duplicate = data[nn][0] === ""; // treat blanks as already seen
    for (var j = 0; j < newdata.length; j++) {
      if (data[nn][0] == newdata[j][0]) {
        duplicate = true;
      }
    }
    if (!duplicate) {
      newdata.push([data[nn][0]]);
    }
  }
  colorMax(newdata);
}

// For each unique value and each value column, colour the largest cell red.
function colorMax(newdata) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sh = ss.getSheets()[1];
  var lc = sh.getLastColumn();
  var data = sh.getRange(2, 4, sh.getLastRow() - 2, lc).getValues(); // column 4 onwards
  for (var k = 2; k < lc; k++) {
    for (var i = 0; i < newdata.length; i++) {
      var maxVal = 0;
      var row = -1; // sheet row of the current maximum; -1 means none found
      for (var j = 0; j < data.length; j++) {
        if (data[j][0] == newdata[i][0] && data[j][k] > maxVal) {
          maxVal = data[j][k];
          row = j + 2; // data starts at sheet row 2
        }
      }
      if (row > 0) {
        sh.getRange(row, k + 4, 1, 1).setFontColor("red"); // mark the maximum
      }
    }
  }
}
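The core idea of the script, finding for each exercise the row that holds the largest value in a given column, can be tried outside of Apps Script. The following is only a sketch with made-up rows; `table` and `maxRowPerGroup` are hypothetical names, and the weights are invented for illustration.

```javascript
// Hypothetical rows: [exercise, weight]. The numbers are made up.
const table = [
  ["DL-m BB", 100],
  ["SQ-lb BB box", 80],
  ["DL-m BB", 120],
  ["SQ-lb BB box", 90],
  ["DL-m BB", 110]
];

// For each distinct value in keyCol, return the index of the row
// that holds the largest value in valueCol. Blank keys are skipped.
function maxRowPerGroup(rows, keyCol, valueCol) {
  const best = new Map();
  rows.forEach((row, index) => {
    const key = row[keyCol];
    if (key === "") return;
    const current = best.get(key);
    if (current === undefined || row[valueCol] > rows[current][valueCol]) {
      best.set(key, index);
    }
  });
  return best;
}

const winners = maxRowPerGroup(table, 0, 1);
console.log(winners.get("DL-m BB"));      // prints 2
console.log(winners.get("SQ-lb BB box")); // prints 3
```

In the sheet, the returned index would be shifted by the row at which the data starts before the matching cell is formatted.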

Mark accumulated values on a QlikView column if condition is fulfilled

I have a table in Qlikview with 2 columns:
A B
a 10
b 45
c 30
d 15
Based on this table, I have a formula with full accumulation defined as:
SUM(B)/SUM(TOTAL B)
As a result,
A B D
b 45 45/100=0.45
c 30 75/100=0.75
d 15 90/100=0.90
a 10 100/100=1
My question is: how do I mark in colour the values in column A that have a value <= 0.8 in column D?
The challenge is that D is defined with full accumulation, but if I reference D in a formula, it doesn't consider the full accumulation!
I tried defining a formula E=if(D>0.8,'Y','N'), but unfortunately this formula doesn't take the visible (accumulated) value of D; instead it takes D with no accumulation. If this had worked, I would have tried to hide (not disable) E and reference it from the Text colour option of the table's dimension column. Any ideas, please? Thanks
You can't get an expression column's value from within a dimension or its properties, because the expression columns rely on the dimensions provided. It would create an endless loop. Your options are:
1. Apply your background colour to the expression columns, not the dimensions. This would actually make more sense, as the accumulated values would have the colour, not the dimension.
2. When loading this specific table, have QlikView create a new column that contains the accumulated values of B. This would mean, however, that the order of your chart table would need to be fixed for the accumulations to make any sense.
3. Use aggregation to create a temporary table and accumulate the values using RangeSum(). Note that this will only accumulate properly if the table is sorted in ascending order of column A:
=IF(Aggr(RangeSum(Above(Sum(B),0,10)),A)/100>0.8,
rgb(0,0,0),
rgb(255,0,0)
)