Pentaho Data Integration (PDI) lookup for latest record

Say my lookup table looks something like this:
Table_1
Key  Id  incremental_count  date
1    1   1                  2015-05-20
2    1   2                  2015-05-20
3    1   4                  2015-05-22
4    2   1                  2015-05-22
5    1   6                  2015-05-22
For each Id how do I limit PDI lookup to return only the most recent record?
OUTPUT
Key  Id  incremental_count  date
4    2   1                  2015-05-22
5    1   6                  2015-05-22

It should work with this transformation setup (a SQL sketch of the same logic follows below):
1) Sort Rows step:
Sort by (ascending): Id -> incremental_count -> date
2) First Group By step:
Group fields:
date
Id
Aggregates:
Subject: Key - Type: Last value
Subject: incremental_count - Type: Last value
3) Second Group By step:
Group field:
Id
Aggregates:
Subject: date - Type: Last value
Subject: Key - Type: Last value
Subject: incremental_count - Type: Last value
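For reference, the same "latest record per Id" selection can also be written as a single SQL query. This is only a sketch of the logic the steps above implement, not something PDI generates; the table name Table_1 comes from the question:
-- Sketch: for each Id, keep the row with the latest date; ties on date
-- are broken by the highest incremental_count (mirroring the sort above).
SELECT t.*
FROM Table_1 t
WHERE NOT EXISTS (
    SELECT 1
    FROM Table_1 newer
    WHERE newer.Id = t.Id
      AND (newer.date > t.date
           OR (newer.date = t.date
               AND newer.incremental_count > t.incremental_count))
);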
Hope this will help.
With best regards,
S.R.

Did a similar thing - Group By over Group By with the aggregate as Last Value.
Thanks S.R.

Related

Find entries with array of dates if any is currently available within a certain range in postgres

I have a postgres table with columns:
id: text
availabilities: integer[]
A certain ID can have multiple availabilities (different, non-continuous days within a range of up to a few years). Each availability is a Unix timestamp (in seconds) for a certain day.
Hours, minutes, seconds and ms are set to 0, i.e. a timestamp represents the start of a day.
Question:
How can I quickly find all IDs which contain at least one availability within a certain from-to range (also timestamps)?
I can also store them differently in the array, e.g. "days since epoch", if needed (to get 1 (day) steps instead of 86400 (second) steps).
However, if possible (and the speed is roughly the same), I want to use an array and one row per entry.
Example:
Data (0 = day-1, 86400 = day-2, ...)
| id | availabilities             |
| 1  | [0, 86400, 172800, 259200] |
| 2  | [86400, 259200]            |
| 3  | [ , 345600]                |
| 4  | [ , 172800, ]              |
| 5  | [0, ]                      |
Now I want to get a list of IDs which contain at least one availability that:
is between 86400 AND 259200 --> ID 1, 2, 4
is between 172800 AND 172800 --> ID 1, 4
is between 259200 AND (max-int) --> ID 1,2,3
In PostgreSQL, the unnest function is the best way to convert array elements into rows, and it performs well. Sample query:
with mytable as (
    select 1 as id, '{12,2500,6000,200}'::int[] as pint
    union all
    select 2 as id, '{0,200,3500,150}'::int[]
    union all
    select 4 as id, '{20,10,8500,1100,9000,25000}'::int[]
)
select id, unnest(pint) as pt from mytable;
-- Returns:
id | pt
 1 | 12
 1 | 2500
 1 | 6000
 1 | 200
 2 | 0
 2 | 200
 2 | 3500
 2 | 150
 4 | 20
 4 | 10
 4 | 8500
 4 | 1100
 4 | 9000
 4 | 25000
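Building on that sample, here is a sketch of the range filter from the question (the table name mytable and the bounds 86400/259200 are placeholders; the real table would have the id and availabilities columns described above):
-- Sketch: return IDs with at least one availability in the from-to range.
select distinct id
from (
    select id, unnest(availabilities) as a
    from mytable
) t
where a between 86400 and 259200;
The same check can also be written as a correlated EXISTS over unnest, which stops scanning a row's array at the first matching element.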

How to remove duplicate entries using the latest time in Pandas

Here is the snippet:
import pandas as pd
from datetime import datetime
test = pd.DataFrame({'uid': [1, 1, 2, 2, 3, 3],
                     'start_time': [datetime(2017,7,20), datetime(2017,6,20), datetime(2017,5,20),
                                    datetime(2017,4,20), datetime(2017,3,20), datetime(2017,2,20)],
                     'amount': [10, 11, 12, 13, 14, 15]})
Output:
amount start_time uid
0 10 2017-07-20 1
1 11 2017-06-20 1
2 12 2017-05-20 2
3 13 2017-04-20 2
4 14 2017-03-20 3
5 15 2017-02-20 3
Desired Output:
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
I want to group by uid and find the row with the latest start_time. Basically, I want to remove duplicate uids by only keeping, for each uid, the row with the latest start_time.
I tried test.groupby(['uid'])['start_time'].max() but it doesn't work, as it only returns the uid and start_time columns. I need the amount column as well.
Update: Thanks to #jezrael & #EdChum, you guys always help me out on this forum, thank you so much!
I tested both solutions in terms of execution time on a dataset of 1136 rows and 30 columns:
Method A: test.sort_values('start_time', ascending=False).drop_duplicates('uid')
Total execution time: 3.21 ms
Method B: test.loc[test.groupby('uid')['start_time'].idxmax()]
Total execution time: 65.1 ms
I guess groupby requires more time to compute.
Use idxmax to return the index of the latest time and use this to index the original df:
In[35]:
test.loc[test.groupby('uid')['start_time'].idxmax()]
Out[35]:
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
Use sort_values by column start_time with drop_duplicates by uid:
df = test.sort_values('start_time', ascending=False).drop_duplicates('uid')
print (df)
amount start_time uid
0 10 2017-07-20 1
2 12 2017-05-20 2
4 14 2017-03-20 3
If you need the output ordered by uid:
print (test.sort_values('start_time', ascending=False)
.drop_duplicates('uid')
.sort_values('uid'))

DB Query matching ids and sum data on columns

Here is the info I have on my tables. What I need is to create a report based on certain dates and sum every stock movement of the same id.
Table One                Table Two
Items                    Stocks
----------               -------------------------------------
ID - NAME                items_id - altas - bajas - created_at
1    White               4          5       0       8/10/2016
2    Black               2          1       5       8/10/2016
3    Red                 3          3       2       8/11/2016
4    Blue                4          1       4       8/11/2016
                         2          10      2       8/12/2016
So based on a customer's choice of dates (in this case let's say it selects all the data available in the table), I need to group them by items_id and then SUM all altas and all bajas for that items_id, ending up with the following:
items_id  altas  bajas
1         0      0
2         11     7
3         3      2
4         6      4
Any help solving this?
Hope this will help:
Stock.select("items_id, sum(altas) as altas, sum(bajas) as bajas").group("items_id")
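If you prefer plain SQL, here is a sketch of the equivalent query. The table names items and stocks and the :from/:to bounds are assumptions based on the question; the LEFT JOIN keeps items with no movements in the range (like items_id 1) with sums of 0:
-- Sketch: sum stock movements per item within a chosen date range.
select i.id as items_id,
       coalesce(sum(s.altas), 0) as altas,
       coalesce(sum(s.bajas), 0) as bajas
from items i
left join stocks s
       on s.items_id = i.id
      and s.created_at between :from and :to
group by i.id
order by i.id;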

SQL Server table index columns order

Is there any difference when I create a table index on several columns if I list the columns in a different order?
What exactly is the difference between the ID, isValid, Created and ID, Created, isValid indexes?
And is there any difference in the order of the predicates in the query?
where ID = 123
and isValid = 1
and Created < getdate()
vs.
where ID = 123
and Created < getdate()
and isValid = 1
Column types: ID [int], isValid [bit], Created [datetime]
What exactly is the difference between the ID, isValid, Created and ID, Created, isValid indexes?
If you always use all three columns in your WHERE clause - there's no difference.
(As Martin Smith points out in his comment - since one of the criteria is not an equality check, the sequence of the columns in the index does matter.)
However: an index can only ever be used if the n left-most columns (here: n between 1 and 3) are used.
So if you have a query that might only use ID and isValid for querying, then the first index can be used - but the second one will never be used.
And if you have queries that use ID and Created as their WHERE parameters, then your second index might be used, but the first one can never be used.
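For illustration, a sketch of the two indexes being discussed (the table and index names here are placeholders):
-- Sketch: only the column order differs between the two indexes.
CREATE INDEX IX_Demo_ID_IsValid_Created
    ON dbo.Demo (ID, isValid, Created);

CREATE INDEX IX_Demo_ID_Created_IsValid
    ON dbo.Demo (ID, Created, isValid);

-- A query filtering only on ID and isValid can seek on the first index,
-- but on the second index isValid is not a usable prefix (Created sits
-- between ID and isValid), so isValid can only be applied as a residual filter.
SELECT *
FROM dbo.Demo
WHERE ID = 123
  AND isValid = 1;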
AND is commutative, so the order of ANDed expressions in WHERE doesn't matter.
Order of columns in an index does matter, it should match your queries.
If ID is your table's clustered primary key and your queries ask for specific ID, don't bother creating an index. That would be like giving an index to a book saying "page 123 is on page 123" etc.
The order in the query makes no difference. The order in the index makes a difference. I'm not sure how good this will look in text but here goes:
where ID = 123 and isValid = 1 and Created < Date 'Jan 3'
Here are a couple of possible indexes:
ID IsValid Created
=== ======= =========
122 0 Jan 4
122 0 Jan 3
... ... ...
123 0 Jan 4
123 0 Jan 3
123 0 Jan 2
123 0 Jan 1
123 1 Jan 4
123 1 Jan 3
123 1 Jan 2 <-- Your data is here...
123 1 Jan 1 <-- ... and here
... ... ...
ID Created IsValid
=== ======= ========
122 Jan 4 0
122 Jan 4 1
... ... ...
123 Jan 4 0
123 Jan 4 1
123 Jan 3 0
123 Jan 3 1
123 Jan 2 0
123 Jan 2 1 <-- Your data is here...
123 Jan 1 0
123 Jan 1 1 <-- ... and here
... ... ...
As you can probably tell, creating an index on (IsValid, Created, ID), or in any other order, will scatter your data even more. In general, you want to design the indexes to make your data as "clumpy" as possible for the queries executed most often.

Row aggregation of count-distinct measure

I have a fairly simple project set up to demonstrate what I want here. Here's the data:
Group
ID Name
1 Group 1
2 Group 2
3 Group 3
Person
ID GroupID Age Name
1 1 18 John
2 1 21 Stephen
3 1 18 Kate
4 2 18 Mary
5 2 19 Joseph
6 2 19 Michael
7 3 21 David
8 3 22 Kevin
9 3 21 Julian
I have 1 measure in my cube called Person Count which is a Distinct count on Person ID
I have set up each non-ID column in the dimensions as attributes (Age, Person Name, Group).
When I process and browse the cube in Business Intelligence Development Studio, I get the following result set:
But what I actually want here is for the rows for each Age to aggregate the Person Count together, so here it should show 2 and only one row for 18.
Is this possible (and how)?
Turns out this was a problem with the way I set up the Age attribute for the dimension.
I had:
KeyColumns = Person.ID
ValueColumn = Person.Age.
I don't know why I did this, but the solution is to delete the content of ValueColumn and set the KeyColumns to Person.Age again.
I now get the following result:
Everything else is the same for the project; this was the only change and is exactly what I wanted. If I get any issues with it I will keep this post updated for anyone else who may run into this in the future.