Count the number of null values per column with Pentaho

I've got a CSV file that contains more than 60 columns and 2,000,000 lines. I'm trying to count the number of null values per variable (per column), and then sum that new row to get the total number of null values in the entire CSV. For example, if we get this file as input:
We expect this other file as output:
I know how to count the number of null values per line, but I haven't figured out how to count the number of null values per column.

There has to be a better way to do this, but I made a really nasty JavaScript which does the job.
It has some problems for different column types, as it doesn't set the column type. (It should set all columns to integer, but I don't know if that is possible from JavaScript.)
You have to run Identify last row in a stream first, and save it to the column last (or change the script).
var nulls;
var seen;
if (!seen) {
    // Initialize array
    seen = 1;
    nulls = [];
    for (var i = 0; i < getInputRowMeta().size(); i++) {
        nulls[i] = 0;
    }
}
for (var i = 0; i < getInputRowMeta().size(); i++) {
    if (row[i] == null) {
        nulls[i] += 1;
    }
    // Hack to find empty strings
    else if (getInputRowMeta().getValueMeta(i).getType() == 2 && row[i].length() == 0) {
        nulls[i] += 1;
    }
}
// Don't store any values
trans_Status = SKIP_TRANSFORMATION;
// Only store the nulls at the last row
if (last == true) {
    putRow(nulls);
}
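Outside of Pentaho, the counting logic in the script above can be sketched in plain JavaScript. This is only an illustration: the in-memory rows array, the countNullsPerColumn name, and the choice to treat empty strings as null are assumptions, not part of the Pentaho API.

```javascript
// Hypothetical helper: counts null (or empty-string) cells per column.
// `rows` is an assumed in-memory array of equally sized row arrays.
function countNullsPerColumn(rows) {
  if (rows.length === 0) return [];
  const nulls = new Array(rows[0].length).fill(0);
  for (const row of rows) {
    for (let i = 0; i < row.length; i++) {
      if (row[i] === null || row[i] === undefined || row[i] === "") {
        nulls[i] += 1;
      }
    }
  }
  return nulls;
}

// Made-up sample data, three columns and three rows.
const sample = [
  [1, null, "a"],
  [null, null, ""],
  [3, 4, "c"],
];
const perColumn = countNullsPerColumn(sample);
// Summing the per-column counts gives the total for the whole file.
const total = perColumn.reduce((a, b) => a + b, 0);
console.log(perColumn, total); // → [ 1, 2, 1 ] 4
```

Keeping one counter per column and emitting it only once, after the last row, is the same idea as the putRow(nulls) call guarded by the last flag.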

Drag and drop the steps below onto the canvas.
Step 1: Add constants: create one field called constant with value = 1.
Step 2: Filter rows: filter for null values in all columns.
Step 3: Group by: group by the constant field.
In the Aggregates section, specify the remaining columns (such as ct_inc) with the type Number of Values (N).

Related

Need Assistance in find Min And Max from user input Array

On the Android Studio emulator, the user is required to enter a maximum of 10 numbers. When I enter the number 1, the output shows 0 instead of 1 (this is for the min number; the max works perfectly fine). Can anyone please assist me with this problem? I tried using minOf() and max(), but nothing worked. Below is a snippet of my source code:
val arrX = Array(10) { 0 }
.
.
.
.
findMinAndMaxButton.setOnClickListener {
fun getMin(arrX: Array<Int>): Int {
var min = Int.MAX_VALUE
for (i in arrX) {
min = min.coerceAtMost(i)
}
return min
}
fun getMax(arrX: Array<Int>): Int {
var max = Int.MIN_VALUE
for (i in arrX) {
max = max.coerceAtLeast(i)
}
return max
}
output.text = "The Min is "+ getMin(arrX) + " and the Max is " + getMax(arrX)
}
}
}
Is there anything that can be done to get this to work?
You're initialising arrX to a bunch of zeroes, and 0.coerceAtMost(someLargerNumber) will always stick at 0.
Without seeing how you set the user's numbers it's hard to say what you need to do - but since you said the user enters a maximum of 10 numbers, at a guess there are some gaps in your array, i.e. indices that are still set to 0. If so, they're going to be counted in your min calculation.
You should probably use null as your default value instead - that way you can just ignore those in your calculations:
val items = arrayOfNulls<Int?>(10)
// this results in null, because there are no values - handle that however you like
println(items.filterNotNull().minOrNull())
>> null
// set values on some of the indices
(3..5).forEach { items[it] = it }
// now this prints 3, because that's the smallest of the numbers that -do- exist
println(items.filterNotNull().minOrNull())
>> 3

Least Common Multiple with while loop, Javascript

I'm trying to find the least common multiple of an array of integers, e.g. if there are 2 numbers given (7, 3) then my task is to find the LCM of the numbers 3 through 7 (3,4,5,6,7 in that case).
My solution would be to add the maximum number to a new variable (var common) until the remainders of all of the numbers in the array (common % numBetween[i]) equal 0. There are more efficient ways of doing this, for example applying the Euclidean Algorithm, but I wanted to solve this my way.
The code:
function smallestCommons(arr) {
var numBetween = [];
var max = Math.max.apply(Math, arr);
var min = Math.min.apply(Math, arr);
while (max - min !== -1) {
numBetween.push(min);
min += 1;
} //this loop creates the array of integers, 1 through 13 in this case
var common = max;
var modulus = [1]; //I start with 1, so that the first loop could begin
var modSum = modulus.reduce(function (a, b) {
return a + b;
}, 0);
while (modSum !== 0) {
modulus = [];
for (var i = 0; i < numBetween.length; i++) {
modulus.push(common % numBetween[i]);
}
if (modSum !== 0) {
common += max;
break; //without this, the loop is infinite
}
}
return common;
}
smallestCommons([1,13]);
Now, the loop is infinite without the break in the if statement, so I guess modSum never equals 0 because the modulus variable always contains integers other than 0. I wanted to solve this by "resetting" modulus to an empty array right after the loop starts, with
modulus = [];
If I include the break, the loop stops after 1 iteration (common = 26). I can't quite grasp why my code isn't working. All comments are appreciated.
Thanks in advance!
I may be wrong, but you never actually change modSum within the while loop, do you? If so, that is your problem. You computed it once with .reduce(), but that call does not bind modSum to the modulus array; it only stores the result at that moment, so you have to call .reduce() again on each pass through the loop.
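A minimal sketch of that fix, keeping the asker's "keep adding max" approach but recomputing the remainder sum on every pass. The remainderSum helper name is an assumption introduced for illustration.

```javascript
// Sketch of the asker's approach with the remainder sum recomputed
// each time the loop condition is checked.
function smallestCommons(arr) {
  const max = Math.max(...arr);
  const min = Math.min(...arr);
  const numBetween = [];
  for (let n = min; n <= max; n++) numBetween.push(n);

  // Hypothetical helper: the sum of remainders is 0 only when every
  // number in the range divides `candidate` evenly.
  const remainderSum = (candidate) =>
    numBetween.reduce((sum, n) => sum + (candidate % n), 0);

  let common = max;
  while (remainderSum(common) !== 0) {
    common += max; // try the next multiple of the largest number
  }
  return common;
}

console.log(smallestCommons([7, 3])); // → 420
```

Because the condition calls remainderSum(common) each time, no break is needed: the loop ends on its own once a common multiple is reached.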

How do I generate a ranking of columns in Google Sheets? Do I use ArrayFormula?

Here's a picture as an example:
The left side is the raw data I currently have, consisting of columns of teams A, B, C, D, E, F. Each row represents 1 game, and many games are played (this spreadsheet has several thousand rows).
How do I generate a ranking of the teams, like what I manually did on the right side? I want to sort the teams by the number of points they have.
In the first row, Team A scores the most points, so they are ranked 1st; Team C scores the second most points, and thus are in the second column, etc.
I'm having trouble figuring out how to sort the columns within each row, rather than sorting rows by a column, which is what the default sort() function does. I've tried multiple methods, including the Google SQL query function (https://developers.google.com/chart/interactive/docs/querylanguage), but couldn't figure out how to use that either.
Is using ArrayFormula() a good idea in this situation? How would I implement that? This sheet does have several thousand rows, so it would be much better if I can just put an ArrayFormula() in the first row, rather than manually autofilling every row.
You may use a custom formula. For cell H2:
=ArraySort(A1:F1,A2:F2001)
here's the code to paste into script editor:
function ArraySort(Arr1, Arr2) {
var w = Arr1[0].length;
var column = [];
var row = [];
var result = [];
for (var i = 0; i < Arr2.length; i++) {
row = [];
for (var k = 0; k < w; k++) {
column[k] = [Arr2[i][k],Arr1[0][k]];
}
column.sort(sortFunction);
for (var j = 0; j < w; j++) {
row[j] = column[j][1];
}
result.push(row);
}
return result;
}
function sortFunction(a, b) {
if (a[0] === b[0]) {
return 0;
}
else {
return (a[0] > b[0]) ? -1 : 1;
}
}
The following formula creates the ranking in the second row:
=array_constrain(transpose(sort(transpose({A$1:F$1; A2:F2}), 2, False)), 1, 6)
It can be copied down to rank the rest of rows. I do not think there is an arrayformula approach that would sort an array of arrays.
Explanation of the formula: it creates a two-row array (team names and scores), transposes it so the scores become the second column, sorts by that column, transposes again, and keeps only the names (the first row).
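The same steps (pair names with scores, sort by score, keep the names) can be sketched in plain JavaScript for a single row. The rankRow name and the sample scores are made up for illustration.

```javascript
// Hypothetical helper mirroring the formula for one row of scores.
function rankRow(names, scores) {
  return names
    .map((name, i) => [name, scores[i]]) // pair each team with its score
    .sort((a, b) => b[1] - a[1])         // sort descending by score
    .map((pair) => pair[0]);             // keep only the team names
}

console.log(rankRow(["A", "B", "C"], [10, 3, 7])); // → [ 'A', 'C', 'B' ]
```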

How to filter records with a null value in PIG?

I am trying to drop records that contain at least one null in any of the fields. For example, if the data has 3 fields, then:
filtered = FILTER data by ($0 is not null) AND ($1 is not null) AND ($2 is not null)
Is there any cleaner way to do this, without having to write out 3 boolean expressions?
If all of the fields are of numeric types, you could simply do something like
filtered = FILTER data BY $0*$1*$2 is not null;
In Pig, if any terms in an arithmetic expression are null, the result is null.
You could also write a UDF that takes an arbitrary number of arguments and reports whether any of them are null (returning null, 0, false, or a count of the nulls, whatever you find most convenient).
filtered = FILTER data BY NUMBER_OF_NULLS($0, $1, $2) == 0;
where NUMBER_OF_NULLS is defined elsewhere, e.g.
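import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Counts how many fields of the input tuple are null.
public class NUMBER_OF_NULLS extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null) { return 0; }
        int c = 0;
        for (int i = 0; i < input.size(); i++) {
            if (input.get(i) == null) c++;
        }
        return c;
    }
}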
Note: I have not tested the above UDF, and I don't claim it adheres to any best practices for writing clear, robust UDFs. You should add exception-handling code, for example.
I was thinking there is a better way of doing this without using the UDF, i.e, using SPLIT in Pig.
emp = load '/Batch1/pig/emp' using PigStorage(',') as (id:chararray, name:chararray, salary:int, dept:chararray);
SPLIT emp INTO emptyDept IF dept == '', nonemptyDept IF dept != '';
DUMP nonemptyDept;
The resulting relation nonemptyDept contains the rows of the emp relation whose dept value is non-empty.

Generate combinations ordered by an attribute

I'm looking for a way to generate combinations of objects ordered by a single attribute. I don't think lexicographical order is what I'm looking for... I'll try to give an example. Let's say I have a list of objects A,B,C,D with the attribute values I want to order by being 3,3,2,1. This gives A3, B3, C2, D1 objects. Now I want to generate combinations of 2 objects, but they need to be ordered in a descending way:
A3 B3
A3 C2
B3 C2
A3 D1
B3 D1
C2 D1
Generating all combinations and sorting them is not acceptable because the real-world scenario involves large sets and millions of combinations (set of 40, order of 8), and I need only the combinations above a certain threshold.
Actually I need count of combinations above a threshold grouped by a sum of a given attribute, but I think it is far more difficult to do - so I'd settle for developing all combinations above a threshold and counting them. If that's possible at all.
EDIT - My original question wasn't very precise... I don't actually need these combinations ordered; I just thought it would help to isolate the combinations above a threshold. To be more precise, in the above example, given a threshold of 5, I'm looking for the information that the given set produces 1 combination with a sum of 6 (A3 B3) and 2 with a sum of 5 (A3 C2, B3 C2). I don't actually need the combinations themselves.
I was looking into the subset-sum problem, but if I understood correctly, the given dynamic programming solution only tells you whether a given sum exists or not, not how many combinations produce it.
Thanks
Actually, I think you do want lexicographic order, but descending rather than ascending. In addition:
It's not clear to me from your description that A, B, ... D play any role in your answer (except possibly as the container for the values).
I think your question example is simply "For each integer at least 5, up to the maximum possible total of two values, how many distinct pairs from the set {3, 3, 2, 1} have sums of that integer?"
The interesting part is the early bailout, once no possible solution can be reached (remaining achievable sums are too small).
I'll post sample code later.
Here's the sample code I promised, with a few remarks following:
public class Combos {
/* permanent state for instance */
private int values[];
private int length;
/* transient state during single "count" computation */
private int n;
private int limit;
private Tally<Integer> tally;
private int best[][]; // used for early-bail-out
private void initializeForCount(int n, int limit) {
this.n = n;
this.limit = limit;
best = new int[n+1][length+1];
for (int i = 1; i <= n; ++i) {
for (int j = 0; j <= length - i; ++j) {
best[i][j] = values[j] + best[i-1][j+1];
}
}
}
private void countAt(int left, int start, int sum) {
if (left == 0) {
tally.inc(sum);
} else {
for (
int i = start;
i <= length - left
&& limit <= sum + best[left][i]; // bail-out-check
++i
) {
countAt(left - 1, i + 1, sum + values[i]);
}
}
}
public Tally<Integer> count(int n, int limit) {
tally = new Tally<Integer>();
if (n <= length) {
initializeForCount(n, limit);
countAt(n, 0, 0);
}
return tally;
}
public Combos(int[] values) {
this.values = values;
this.length = values.length;
}
}
Preface remarks:
This uses a little helper class called Tally, that just isolates the tabulation (including initialization for never-before-seen keys). I'll put it at the end.
To keep this concise, I've taken some shortcuts that aren't good practice for "real" code:
This doesn't check for a null value array, etc.
I assume that the value array is already sorted into descending order, required for the early-bail-out technique. (Good production code would include the sorting.)
I put transient data into instance variables instead of passing them as arguments among the private methods that support count. That makes this class non-thread-safe.
Explanation:
An instance of Combos is created with the (descending ordered) array of integers to combine. The value array is set up once per instance, but multiple calls to count can be made with varying population sizes and limits.
The count method triggers a (mostly) standard recursive traversal of unique combinations of n integers from values. The limit argument gives the lower bound on sums of interest.
The countAt method examines combinations of integers from values. The left argument is how many integers remain to make up n integers in a sum, start is the position in values from which to search, and sum is the partial sum.
The early-bail-out mechanism is based on computing best, a two-dimensional array that specifies the "best" sum reachable from a given state. The value in best[n][p] is the largest sum of n values beginning in position p of the original values.
The recursion of countAt bottoms out when the correct population has been accumulated; this adds the current sum (of n values) to the tally. If countAt has not bottomed out, it sweeps the values from the starting position to increase the current partial sum, as long as:
enough positions remain in values to achieve the specified population, and
the best (largest) subtotal remaining is big enough to make the limit.
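To make the bail-out table concrete, here is a small JavaScript sketch (a port for illustration, not the answer's Java code) that builds best for the question's data. The buildBest name is an assumption; the recurrence is the one used in initializeForCount.

```javascript
// best[i][j] = largest sum of i values taken from position j onward,
// assuming `values` is already sorted in descending order.
function buildBest(values, n) {
  const length = values.length;
  const best = [];
  for (let i = 0; i <= n; i++) best.push(new Array(length + 1).fill(0));
  for (let i = 1; i <= n; i++) {
    for (let j = 0; j <= length - i; j++) {
      best[i][j] = values[j] + best[i - 1][j + 1];
    }
  }
  return best;
}

const best = buildBest([3, 3, 2, 1], 2);
console.log(best[1]); // → [ 3, 3, 2, 1, 0 ]
console.log(best[2]); // → [ 6, 5, 3, 0, 0 ]
```

With a limit of 5 and two values still to pick, the sweep stops at position 2, because the best reachable sum from there is 3, which is below the limit.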
A sample run with your question's data:
int[] values = {3, 3, 2, 1};
Combos mine = new Combos(values);
Tally<Integer> tally = mine.count(2, 5);
for (int i = 5; i < 9; ++i) {
int n = tally.get(i);
if (0 < n) {
System.out.println("found " + tally.get(i) + " sums of " + i);
}
}
produces the results you specified:
found 2 sums of 5
found 1 sums of 6
Here's the Tally code:
public static class Tally<T> {
private Map<T,Integer> tally = new HashMap<T,Integer>();
public Tally() {/* nothing */}
public void inc(T key) {
Integer value = tally.get(key);
if (value == null) {
value = Integer.valueOf(0);
}
tally.put(key, (value + 1));
}
public int get(T key) {
Integer result = tally.get(key);
return result == null ? 0 : result;
}
public Collection<T> keys() {
return tally.keySet();
}
}
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
Check out this question in stackoverflow: Algorithm to return all combinations
I also just used the Java code below to generate all permutations, but it could easily be adapted to generate unique combinations given an index.
public static <E> E[] permutation(E[] s, int num) {//s is the input elements array and num is the number which represents the permutation
int factorial = 1;
for(int i = 2; i < s.length; i++)
factorial *= i;//calculates the factorial of (s.length - 1)
if (num/s.length >= factorial)// Optional. if the number is not in the range of [0, s.length! - 1]
return null;
for(int i = 0; i < s.length - 1; i++){//go over the array
int tempi = (num / factorial) % (s.length - i);//calculates the next cell from the cells left (the cells in the range [i, s.length - 1])
E temp = s[i + tempi];//Temporarily saves the value of the cell needed to add to the permutation this time
for(int j = i + tempi; j > i; j--)//shift all elements to "cover" the "missing" cell
s[j] = s[j-1];
s[i] = temp;//put the chosen cell in the correct spot
factorial /= (s.length - (i + 1));//updates the factorial
}
return s;
}
I am extremely sorry (after all those clarifications in the comments) to say that I could not find an efficient solution to this problem. I tried for the past hour with no results.
The reason (I think) is that this problem is very similar to problems like the traveling salesman problem. Unless you try all the combinations, there is no way to know which attributes will add up to the threshold.
There seems to be no clever trick that can solve this class of problems.
Still there are many optimizations that you can do to the actual code.
Try sorting the data according to the attributes. You may be able to avoid processing some values from the list when you find that a higher value cannot satisfy the threshold (so all lower values can be eliminated).
If you're using C# there is a fairly good generics library here. Note though that the generation of some permutations is not in lexicographic order
Here's a recursive approach to count the number of these subsets: We define a function count(minIndex,numElements,minSum) that returns the number of subsets of size numElements whose sum is at least minSum, containing elements with indices minIndex or greater.
As in the problem statement, we sort our elements in descending order, e.g. [3,3,2,1], and call the first index zero, and the total number of elements N. We assume all elements are nonnegative. To find all 2-subsets whose sum is at least 5, we call count(0,2,5).
Sample Code (Java):
int count(int minIndex, int numElements, int minSum)
{
    int total = 0;
    if (numElements == 1)
    {
        // just count number of elements >= minSum
        for (int i = minIndex; i <= N-1; i++)
            if (a[i] >= minSum) total++; else break;
    }
    else
    {
        if (minSum <= 0)
        {
            // any subset will do (n-choose-k of them)
            if (numElements <= (N-minIndex))
                total = nchoosek(N-minIndex, numElements);
        }
        else
        {
            // add element a[i] to the set, and then consider the count
            // for all elements to its right
            for (int i = minIndex; i <= (N-numElements); i++)
                total += count(i+1, numElements-1, minSum-a[i]);
        }
    }
    return total;
}
Btw, I've run the above with an array of 40 elements, and size-8 subsets and consistently got back results in less than a second.
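The Java sample relies on a, N, and nchoosek being defined elsewhere. A self-contained JavaScript port, as a sketch only, could look like the following; makeCounter and this particular nchoosek implementation are assumptions added for illustration, and the array is assumed to be sorted in descending order with nonnegative elements.

```javascript
// Assumed helper: binomial coefficient via the multiplicative formula.
// After step i the running value equals C(n - k + i, i), so every
// intermediate result is a whole number.
function nchoosek(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Hypothetical wrapper that closes over the array instead of using globals.
function makeCounter(a) {
  const N = a.length;
  function count(minIndex, numElements, minSum) {
    let total = 0;
    if (numElements === 1) {
      // Count elements >= minSum; stop at the first one that is too small.
      for (let i = minIndex; i < N; i++) {
        if (a[i] >= minSum) total++;
        else break;
      }
    } else if (minSum <= 0) {
      // Any subset will do (n-choose-k of them).
      if (numElements <= N - minIndex) {
        total = nchoosek(N - minIndex, numElements);
      }
    } else {
      // Pick a[i], then count completions using elements to its right.
      for (let i = minIndex; i <= N - numElements; i++) {
        total += count(i + 1, numElements - 1, minSum - a[i]);
      }
    }
    return total;
  }
  return count;
}

const count = makeCounter([3, 3, 2, 1]);
console.log(count(0, 2, 5)); // → 3
```

For the question's data this reports 3 pairs with a sum of at least 5, which matches A3 B3, A3 C2, and B3 C2.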