Count number of missing values with Pentaho

I'm new to Pentaho and I'm trying to do a really simple task (I suppose), but I haven't succeeded yet. I have a CSV file which contains multiple columns and rows, and I want to count the number of missing values in each row.
I tried using a "Group by" step, but I don't really know if it's appropriate.
Could you give me a hint or point me to the appropriate step for my problem?
The first picture is a sample of some lines from the file (which contains 69 columns and 2,500,000 rows) and the second picture is the expected result (the number of null values per row).

There are probably other ways to do this, but it is possible with a Modified Java Script step. Something like this will count the number of nulls:
// Get the names of all incoming fields, then count how many
// of the current row's values are null.
var fields = getInputRowMeta().getFieldNames();
var nulls = 0;
for (var i = 0; i < fields.length; i++) {
    if (row[i] == null) {
        nulls += 1;
    }
}
Then output the nulls value from the step as a new field, so the count is appended to each row.

Related

Results DataSet from DynamoDB Query using GSI is not returning correct results

I have a DynamoDB table where I am currently storing all the events happening in my system for every product. The main table's primary key is a hash combination of productid, eventtype, and eventcategory, with CreationTime as the sort key. The table was created and data was added to it.
Later I added a new GSI on the table whose hash key is SecondaryHash (the combination of eventcategory and eventtype, excluding productid) and whose sort key is CreationTime. This was added so that I can query across multiple products at once.
The GSI seems to work fine; however, only later did I realize the data being returned is incorrect.
Here is the scenario (I am running all these queries against the newly created index):
I queried for products within the last 30 days and the query returned 312 records. However, when I ran the same query for the last 90 days, it returned only 128 records (which is wrong; it should be at least as many as the 30-day result).
I already have pagination logic embedded in my code, so that LastEvaluatedKey is checked every time in order to loop and fetch the next set of records; after the loop, all the results are combined.
Not sure if I am missing something.
Any suggestions would be appreciated.
var limitPtr *int64
if limit > 0 {
    limit64 := int64(limit)
    limitPtr = &limit64
}
input := dynamodb.QueryInput{
    ExpressionAttributeNames: map[string]*string{
        "#sch": aws.String("SecondaryHash"),
        "#pkr": aws.String("CreationTime"),
    },
    ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
        ":sch": {
            S: aws.String(eventHash),
        },
        ":pkr1": {
            N: aws.String(strconv.FormatInt(startTime, 10)),
        },
        ":pkr2": {
            N: aws.String(strconv.FormatInt(endTime, 10)),
        },
    },
    KeyConditionExpression: aws.String("#sch = :sch AND #pkr BETWEEN :pkr1 AND :pkr2"),
    ScanIndexForward:       &scanForward,
    Limit:                  limitPtr,
    TableName:              aws.String(ddbTableName),
    IndexName:              aws.String(ddbIndexName),
}
You have reached the maximum amount of data a single Query call will evaluate (not necessarily the number of matching items); a single response is limited to 1 MB.
In that case the response contains a LastEvaluatedKey parameter, which is the key of the last item evaluated. You have to perform a new query with an extra ExclusiveStartKey parameter, set equal to the LastEvaluatedKey value from the previous response.
When LastEvaluatedKey is empty, you have reached the end of the table (or index).
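For illustration, that loop can look like the sketch below. It uses the AWS SDK for Java v1 rather than the Go SDK from the question (the pattern is the same in both), and the table name, index name, and key values are placeholder assumptions:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

public class GsiPaginationSketch {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Name mappings mirroring the question's query.
        Map<String, String> names = new HashMap<>();
        names.put("#sch", "SecondaryHash");
        names.put("#pkr", "CreationTime");

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":sch", new AttributeValue().withS("eventcategory#eventtype")); // placeholder hash value
        values.put(":pkr1", new AttributeValue().withN("1546300800"));             // placeholder start time
        values.put(":pkr2", new AttributeValue().withN("1554076800"));             // placeholder end time

        List<Map<String, AttributeValue>> allItems = new ArrayList<>();
        Map<String, AttributeValue> lastKey = null;

        do {
            QueryRequest request = new QueryRequest()
                    .withTableName("events")                     // placeholder table name
                    .withIndexName("SecondaryHash-CreationTime") // placeholder index name
                    .withKeyConditionExpression("#sch = :sch AND #pkr BETWEEN :pkr1 AND :pkr2")
                    .withExpressionAttributeNames(names)
                    .withExpressionAttributeValues(values);
            if (lastKey != null) {
                request.setExclusiveStartKey(lastKey); // resume where the previous page stopped
            }
            QueryResult result = client.query(request);
            allItems.addAll(result.getItems());
            lastKey = result.getLastEvaluatedKey();    // null once the last page has been read
        } while (lastKey != null);

        System.out.println("Total matching items: " + allItems.size());
    }
}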

Maintaining auto ranking as a column in MongoDB

I am using MongoDB as my database.
I have data which contains rank and name as columns. A new row can be inserted with a rank that is different from the ranks already existing, or the same as one of them.
If it is the same, the ranks of the other rows must be adjusted.
Rows ranked at or below the inserted one must have their rank incremented by one, and the rows ranked above it can remain as they are.
The feature is something like a numbered list in MS Word-type applications, where inserting a row in between adjusts the numbering of the rows below it.
Rank 1 is the highest rank.
For example, there are 3 rows:
Name Rank
A 1
B 2
C 3
Now I want to insert a row with D as the name and 2 as the rank. After the insert, the DB should look like below:
Name Rank
A 1
B 3
C 4
D 2
Probably using database triggers I could achieve this by updating the other rows.
I have a couple of questions:
(a) Is there any better way than using a database trigger for this kind of scenario? Updating all the rows might be a time-consuming job.
(b) Does MongoDB support database triggers natively?
Best Regards,
Saurav
No, MongoDB does not provide triggers (yet). Also, I don't think a trigger is really a great way to achieve this.
So I would just like to throw out some ideas; see if they make sense.
Approach 1
Maybe instead of disturbing that many documents, you can create a collection with only one document (let's call the collection ranking). In that document, have an array field called ranks. Since it's an array, it already maintains a sequence.
{
    _id : "RANK",
    "ranks" : ["A","B","C"]
}
Now if you want to add D to this ranking at the 2nd position:
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["D"],$position:1}}});
it would add D at index 1, which is the 2nd position considering the index starts at 0.
{
    _id : "RANK",
    "ranks" : ["A","D","B","C"]
}
But there is a catch: what if you want to move C from the 4th position to the 1st? You need to remove it from the end and put it at the beginning. I am fairly sure both operations can't be achieved in a single update (I didn't dig into the options much), so we can run two queries:
db.ranking.update({_id:"RANK"},{$pull : {"ranks": "C"}});
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["C"],$position:0}}});
Then it would look like:
{
    _id : "RANK",
    "ranks" : ["C","A","D","B"]
}
maintaining the rest of the sequence.
Now you would probably want to store ids instead of A, B, C, etc. One document can be 16 MB, so this ranks array can store more than 1.3 million id entries if each id is a 12-byte MongoDB ObjectId (16 MB / 12 bytes ≈ 1.4 million). If that is not enough, there is still the option of follow-up document(s) carrying further ranking.
Approach 2
You can also, instead of having rank as a number, just have two fields like followedBy and precededBy.
So your user documents would look like:
{
    _id : "A",
    "followedBy" : "B"
}
{
    _id : "B",
    "followedBy" : "C",
    "precededBy" : "A"
}
{
    _id : "C",
    "precededBy" : "B"
}
If you want to add D at the second position, you need to insert the new document and re-point its two neighbours: A's followedBy and B's precededBy both become D. So only two existing documents change:
{
    _id : "A",
    "followedBy" : "D" // changed from B to D
}
{
    _id : "B",
    "followedBy" : "C",
    "precededBy" : "D" // changed from A to D
}
{
    _id : "C",
    "precededBy" : "B"
}
{
    _id : "D",
    "followedBy" : "B",
    "precededBy" : "A"
}
The downside of this approach is that you cannot sort by ranking in a query; you have to fetch all the documents into the application and rebuild the linked-list structure.
This approach just preserves the ranking with minimal DB changes; a sketch of the writes is below.
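As a minimal sketch of those writes with the MongoDB Java driver, assuming a users collection in a test database (the collection and database names are assumptions, not from the question):
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class InsertAtRankSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create()) {
            MongoCollection<Document> users =
                    client.getDatabase("test").getCollection("users");

            // Insert D between A and B: one insert plus two pointer updates.
            users.insertOne(new Document("_id", "D")
                    .append("precededBy", "A")
                    .append("followedBy", "B"));
            users.updateOne(eq("_id", "A"), set("followedBy", "D")); // A now points forward to D
            users.updateOne(eq("_id", "B"), set("precededBy", "D")); // B now points back to D
        }
    }
}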

WPF Comparing two datatables to find matching values

I have two DataTables; one of them connects to SQL Server and the other connects to Oracle.
I am running query statements on both and those work perfectly fine.
Now I need to write something that will compare the "UNIT_NO" in Oracle to the "VehicleName" in SQL; yes, they are the same number.
Right now the Oracle table brings in 6 columns and the SQL table brings in 4 columns.
An example would be:
VehicleName, VehicleGroupName, UserDefinedColumn2, UserDefinedColumn3
Unit_No, Unit_ID, Using_Dept, Status, Using_Dept_Desc,
I want my code to find the matching number between Unit_NO and VehicleName and display all the above information in one row. I was thinking LINQ, but I can't get it to display correctly.
This code combines the columns from both tables but does not add any data to the rows. Any suggestions or fixes?
private void GetSQLOraclelinqData()
{
    var TstarData = GetTrackstarTruckData();
    var M5Data = GetM5Data();
    DataTable ComTable = new DataTable();
    foreach (DataColumn OraColumn in M5Data.Columns)
    {
        ComTable.Columns.Add(OraColumn.ColumnName, OraColumn.DataType);
    }
    foreach (DataColumn SQLColumn in TstarData.Columns)
    {
        if (SQLColumn.ColumnName == "VehicleName")
            ComTable.Columns.Add(SQLColumn.ColumnName + 2, SQLColumn.DataType);
        else
            ComTable.Columns.Add(SQLColumn.ColumnName, SQLColumn.DataType);
    }
    var results = TstarData.AsEnumerable().Join(M5Data.AsEnumerable(),
        a => a.Field<String>("VehicleName"),
        b => b.Field<String>("Unit_NO"),
        (a, b) =>
        {
            DataRow row = ComTable.NewRow();
            row.ItemArray = a.ItemArray.Concat(b.ItemArray).ToArray();
            ComTable.Rows.Add(row);
            return row;
        });
    SQLDataTable.ItemsSource = ComTable.DefaultView;
}
I would do it using two nested for loops.
The outer for loop would iterate through each row in the SQL DataTable.
The inner loop would iterate through each row in the Oracle DataTable, and if there is a match, it would store the match somewhere (perhaps in a list).
Optional Hints
Assuming that each number plate only occurs once, we could optimise this code by breaking out of the inner loop as soon as we get a match.
We cannot rely on the rows coming back in the same order, so we cannot naively compare row 1 from SQL against row 1 from Oracle. A sketch of the loop is below.
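Purely as an illustration of the nested loop (shown in Java over plain row maps so it stays self-contained; the column names VehicleName and Unit_NO come from the question, everything else is an assumption):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowMatcherSketch {
    // Each row is a column-name -> value map standing in for a DataRow.
    static List<Map<String, Object>> matchRows(List<Map<String, Object>> sqlRows,
                                               List<Map<String, Object>> oracleRows) {
        List<Map<String, Object>> combinedRows = new ArrayList<>();
        for (Map<String, Object> sqlRow : sqlRows) {           // outer loop: SQL rows
            for (Map<String, Object> oraRow : oracleRows) {    // inner loop: Oracle rows
                if (sqlRow.get("VehicleName").equals(oraRow.get("Unit_NO"))) {
                    // Merge both rows into one combined row for display.
                    Map<String, Object> combined = new HashMap<>(oraRow);
                    combined.putAll(sqlRow);
                    combinedRows.add(combined);
                    break; // assuming each unit number occurs only once
                }
            }
        }
        return combinedRows;
    }
}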
My code is based on grabbing columns that are not yet populated in the datagrid. I call two DataTables in the code-behind to populate the table, and I also use Excel sheets. If anyone needs this information, I can help.
I connect to SQL Server and Oracle and load Excel sheets to make a comparison of the data.

Pig: summing column b of rows with the same column a

I'm trying to count the number of tweets with a certain hashtag over a period of time but I'm getting an error when trying to use the built-in SUM function.
Example:
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto by count;
X = FOREACH NBLNabilVoto GENERATE group, SUM(data.count);
Error:
<line 22, column 47> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First load the data, then filter for the time interval you want to process. Group the records by hashtag. Use the COUNT() function to count the number of tweets for the corresponding hashtag.
I am not sure that the code is doing what you think or want it to do, but the error you are getting is because you are doing a SUM on the wrong relation. You need to do this:
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);
NBLNabilVoto_count is the name of the bag of tuples inside the grouped relation.
I think you are using the wrong relation in your SUM; you should SUM over NBLNabilVoto_count, not the data relation. I also have a question: why are you grouping by count?
If you want to count all your tweets with the hashtag NBLNabilaVoto,
I think the code should be:
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float, hashtag:chararray, count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto_count ALL;
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);

ScrollableResults size gives repeated value

I am working on an application using Hibernate and Spring. I am trying to get the count of results returned by a query using ScrollableResults, but as the query contains lots of joins (inner joins), the result contains each id repeated many times. This creates a problem when I use ScrollableResults to find the total number of unique rows (or unique ids) returned from the database. Please help. Some of the code is below:
StringBuffer queryBuf = new StringBuffer("Some SQL query with lots of Joins");
Query query = getSession().createSQLQuery(queryBuf.toString());
query.setReadOnly(true);
ScrollableResults results = query.scroll();
if (results.isLast() == false)
    results.last();
int total = results.getRowNumber() + 1;
logger.debug(">>>>>>TOTAL COUNT<<<<<< = {}", total);
It gives a total count of 1440, but the actual number of unique rows in the database is 504.
Thanks in Advance.
You can try:
Integer count = ((Long) query.uniqueResult()).intValue();
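That assumes the statement is rewritten as a count query. Since the joins duplicate ids, the count has to collapse them; here is a minimal sketch, assuming the original SQL can be wrapped in COUNT(DISTINCT ...) (the table and column names are placeholders, not from the question):
// Count distinct ids directly in SQL instead of scrolling through
// duplicated join results. Native SQL counts can come back as
// BigInteger or BigDecimal depending on the database, so cast to Number.
String countSql = "SELECT COUNT(DISTINCT t.id) FROM some_table t"; // plus the same joins
Query countQuery = getSession().createSQLQuery(countSql);
countQuery.setReadOnly(true);
int total = ((Number) countQuery.uniqueResult()).intValue();
logger.debug(">>>>>>TOTAL COUNT<<<<<< = {}", total);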
Unfortunately, getRowNumber does not give you the size, or the number of results, but the current position in the results. ScrollableResults does not provide a way to get the number of results out-of-the-box.
I am referring to ScrollableResults Hibernate Version 5.4.
As a workaround, you can try
Long l_resultsCount = 0L;
while (results.next()) {
    l_resultsCount++;
}
getRowNumber() gives the number of the current row.
Call last() and afterwards getRowNumber()+1 will give the total number of results.