Kettle Datedif month issue - pentaho

I need to reproduce the Kettle DATEDIF function in the R programming language. I need the 'datedif month' option. I thought reproducing it would be fairly easy, but I ran into some 'weird behaviour' in Pentaho. As an example:
ID date_1 date_2 monthly_difference_kettle daydiff_mysql
15943 31/12/2013 28/07/2014 7 209
15943 31/12/2011 27/07/2012 6 209
So in Pentaho Kettle I used the Formula step with the function DATEDIF(date2,date1,"m"). As you can see, when I calculate the difference in days in MySQL I get the same number of days for both records (209); however, when the monthly difference is calculated via the Formula step in Pentaho Kettle I get different results in months (7 and 6, respectively). I don't understand how this is calculated...
Can anyone point me to the source code of the 'DATEDIF months' function in Pentaho? I would like to reproduce it in R so that I get exactly the same results.
Thanks in advance,
best regards,

Not sure about MySQL, but I think it behaves the same: in PostgreSQL, the difference between two dates gives an integer value (in days). That is why both rows match exactly in days.
Calculating a month difference is non-trivial. What is a month (28, 30, or 31 days)? Should a month be counted if it is not complete?
The documentation states: "If there is not a complete month between the dates, 0 will be returned".
But from the source code it is easy to understand how DATEDIF is calculated.
The source code is available on GitHub: https://github.com/pentaho/pentaho-reporting/blob/f7defbcfc0e8f48ad2b139fe9820445f052e0e78/libraries/libformula/src/main/java/org/pentaho/reporting/libraries/formula/function/datetime/DateDifFunction.java
private int addFieldLoop( final GregorianCalendar c, final GregorianCalendar target, final int field ) {
  // Truncate both dates to midnight so only the date part matters
  c.set( Calendar.MILLISECOND, 0 );
  c.set( Calendar.SECOND, 0 );
  c.set( Calendar.MINUTE, 0 );
  c.set( Calendar.HOUR_OF_DAY, 0 );
  target.set( Calendar.MILLISECOND, 0 );
  target.set( Calendar.SECOND, 0 );
  target.set( Calendar.MINUTE, 0 );
  target.set( Calendar.HOUR_OF_DAY, 0 );
  if ( c.getTimeInMillis() == target.getTimeInMillis() ) {
    return 0;
  }
  int count = 0;
  while ( true ) {
    // Add one unit of the requested field (for DATEDIF "m": Calendar.MONTH)
    c.add( field, 1 );
    if ( c.getTimeInMillis() > target.getTimeInMillis() ) {
      return count;
    }
    count += 1;
  }
}
So it appends one month to the start date until the result becomes bigger than the end date, counting the steps. Note that GregorianCalendar clamps the day of month on overflow (31 Dec + 1 month = 31 Jan; + 1 month = 28 Feb, or 29 Feb in a leap year), so your two rows walk through different days of the month, which is why they yield 7 and 6 despite the same difference in days.
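A minimal R sketch of the same loop, assuming the lubridate package (its %m+% operator clamps the day of month on overflow just like GregorianCalendar); the function name datedif_months is my own:
library(lubridate)

datedif_months <- function(date1, date2) {
  # Work at day precision, as the Java code zeroes out the time fields
  start <- as.Date(date1)
  end   <- as.Date(date2)
  if (start == end) return(0L)
  count <- 0L
  repeat {
    start <- start %m+% months(1)  # add one month, clamping e.g. 31 Jan -> 28/29 Feb
    if (start > end) return(count)
    count <- count + 1L
  }
}

datedif_months("2013-12-31", "2014-07-28")  # 7, as in Kettle
datedif_months("2011-12-31", "2012-07-27")  # 6, as in Kettle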

Related

SSAS MDX Problem with IIF and TAIL() function

Experts,
I am using SQL Server SSAS Standard 2017 and ultimately want to create a calculated member that returns either the last day of the previous month from my data, or the current day if the most recent data is from today.
(=> If today is Aug-31 I want to retrieve Aug-31 from my data; otherwise, if today is e.g. Aug-30, retrieve Jul-31.)
To develop this member I am currently in the process of creating an MDX query in SQL Server. I am having difficulty understanding what a "tuple set expression" actually is (the TAIL() function is supposed to return a subset, ergo a set, according to MSDN), but in fact I am receiving errors when playing around with .Item(0) on its result. In MSDN I cannot find information about "tuple sets" and how to make them do what I want.
My Date Dimension has a Hierarchy JMT (Year | Month | Day | Date Key of type DATE).
To retrieve the most current date member of the cross product I am using the TAIL(NONEMPTY(Date...Members, { (DimX.&.. , DimY.&.. , DimZ.&..) })) expression, which works fine.
But how do I choose between today's or the previous month's date?
My MDX for testing purposes on February (2) is as follows:
SELECT {
  IIF(
    TAIL(NONEMPTY([DateDim].[JMT].[T].Members, { ([DimX].[X].&[200], [DimY].[Company].&[4499166], [DateDim].[JMT].[M].&[2020]&[2]) })).Item(0) --.Properties('Date Key', TYPED)
    > NOW()
    , TAIL(NONEMPTY([DateDim].[JMT].[T].Members, { ([DimX].[X].&[200], [DimY].[Company].&[4499166], [DateDim].[JMT].[M].&[2020]&[1]) }))
    , TAIL(NONEMPTY([DateDim].[JMT].[T].Members, { ([DimX].[X].&[200], [DimY].[Company].&[4499166], [DateDim].[JMT].[M].&[2020]&[2]) }))
  )
  -- ,
  -- TAIL(NONEMPTY([DateDim].[JMT].[T].Members, { ([DimX].[X].&[200], [DimY].[Company].&[4499166], [DateDim].[JMT].[M].&[2020]&[2]) }))
} ON COLUMNS
, { [Measures].[Turnover] } ON ROWS
FROM [Finance]
Result: the IIF function does not do what I want. It treats .Item(0) as greater than NOW() and therefore returns the "31" member of January (1). Expected: the "29" member of February.
I guess it might be a problem with the data types and the actual value returned by .Item(0). But when I try to use .Properties('Date Key', TYPED), it states "The Date Key Dimension Attribute could not be found".
Do you have any suggestions?
Thank you, Cordt
If you switch this:
TAIL(NONEMPTY([DateDim].[JMT].[T].Members, { ([DimX].[X].&[200], [DimY].[Company].&[4499166], [DateDim].[JMT].[M].&[2020]&[2]) })).Item(0)
to the following does it help?
Tail
(
NonEmpty
(
[DateDim].[JMT].[T].MEMBERS
,{
(
[DimX].[X].&[200]
,[DimY].[Company].&[4499166]
,[DateDim].[JMT].[M].&[2020]&[2]
)
}
)
).Item(0).Item(0).MemberValue

SQL using the results from the previous row

I'm trying to do an Excel-style calculation in SQL. This involves taking the closing rate (ClRate) from the previous row and using it to calculate values in the next row. The table starts on 1 Jan and has 1000 rows; each row has known data and data which needs to be calculated (shown in [ ]).
Date RecQty RecAmt IssQty IssAmt ClQty ClAmt ClRate
1 Jan - - - - 100 $20,000 200
2 Jan +10 +$2100 -5 [ ] [ ] [ ] [ ]
The calculations to generate the desired result are in the table below
Date RecQty RecAmt IssQty IssAmt ClQty ClAmt ClRate
1 Jan 100 $20,000 200
2 Jan +10 +$2100 -5 -[200*5] [100+10-5] [20,000+2100-200*5] [ClAmt/ClQty]
The IssAmt for each day is calculated by multiplying the IssQty by the previous day's ClRate. The ClQty is the previous day's ClQty + the current day's RecQty - the current day's IssQty. The ClAmt is the previous day's ClAmt + the current day's RecAmt - the current day's IssAmt. Finally, the ClRate for each day is ClAmt / ClQty.
The only ClRate known is the opening inventory row of the table (1 Jan)- thereafter the ClRate for each subsequent row needs to be computed.
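Worked through for 2 Jan with the sample data: IssAmt = 5 × 200 = $1,000; ClQty = 100 + 10 - 5 = 105; ClAmt = $20,000 + $2,100 - $1,000 = $21,100; ClRate = 21,100 / 105 ≈ 200.95, which then feeds into 3 Jan's IssAmt.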
In Excel, you would simply do this calculation by linking the required cells of the previous row and copying/pasting the formula to all the rows below.
How would you do this in SQL? I have tried self-joining CTEs, loops and LAG - none of these seems to work. The reason is that the ClRate for each row from 2 Jan onwards is not known; while Excel can compute results on the fly and use them in the following row, SQL seems unable to do this.
Seeking help to solve this problem. I'm using SQL Server 2017 and SSMS. If required I can provide the code
Table DDL
CREATE TABLE [Auto].[IronOreTbl](
[Id] [int] NULL,
[Counter] [int] NULL,
[TDate] [date] NOT NULL,
[RecQty] [decimal](16, 2) NULL,
[RecAmt] [decimal](16, 2) NULL,
[IssQty] [decimal](16, 2) NULL,
[IssAmt] [decimal](16, 2) NULL,
[ClQty] [decimal](16, 2) NULL,
[ClAmt] [decimal](16, 2) NULL,
[ClRate] [decimal](16, 2) NULL
) ON [PRIMARY]
GO
INSERT INTO [Auto].[IronOreTbl]
([Id],[Counter],[TDate],[RecQty],[RecAmt],[IssQty],[IssAmt],[ClQty],[ClAmt],[ClRate])
VALUES
(1,0,'2019-01-01',NULL,NULL,NULL,NULL,100,20000,200),
(2,1,'2019-01-02',10,2100,5,NULL,105,NULL,NULL),
(3,2,'2019-01-03',8,1600,2,NULL,111,NULL,NULL),
(4,3,'2019-01-04',20,2400,10,NULL,121,NULL,NULL)
CTE attempts
;WITH ClAmtCTE AS
(
SELECT
Id,RecQty,RecAmt,IssQty,ClQty,ClAmt,ClRate
,EffRate = ClRate
,CumHoldVal= ClAmt
--CAST(ClAmt AS decimal(16,2))
,CumClRt=CAST(ClRate AS decimal(16,2))
,[Counter]
FROM
[Auto].IronOreTbl
WHERE
Id=1
UNION ALL
SELECT
C2.Id,C2.RecQty,c2.RecAmt,C2.IssQty,C2.ClQty,C2.ClAmt,c2.ClRate,
EffRate = (SELECT CumClRt WHERE C2.ID=C2.[Counter]+1),
CumRN =
CAST(
(
CumHoldVal+ISNULL(C2.RecAmt,0)-
(EffRate)*ISNULL(C2.IssQty,0)
)
AS decimal(16,2)
),
CumClRt=CAST(CumHoldVal/C2.ClQty AS decimal(16,2)),
C2.[Counter]
FROM
[Auto].IronOreTbl C2
INNER JOIN ClAmtCTE C1 ON C1.Id = C2.[Counter]
The following code achieves the desired result. While I had got close several days ago, it still took all this time to tie up the last bit: matching the ClRate with the correct row. There was an additional issue where, on days with no issues but only receipts, the rate picked up belonged to the wrong row (I still have no clue why or how this was happening; all I know is that the previous code needed revision to resolve the discrepancy, and the modified code attends to that).
;WITH ClAmtCTE AS
(
SELECT
Id,RecQty,RecAmt,IssQty,ClQty,ClAmt,ClRate
,EffRate = ClRate
,CumHoldVal= ClAmt
--CAST(ClAmt AS decimal(16,2))
,CumClRt=CAST(ClRate AS decimal(16,2))
,[Counter]
FROM
[Auto].IronOreTbl
WHERE
Id=1
UNION ALL
SELECT
C2.Id,C2.RecQty,c2.RecAmt,C2.IssQty,C2.ClQty,C2.ClAmt,c2.ClRate,
EffRate = (SELECT CumClRt WHERE C2.ID=C2.[Counter]+1),
CumRN =
CAST(
(
CumHoldVal+ISNULL(C2.RecAmt,0)-
((SELECT CumClRt WHERE C2.ID=C2.[Counter]+1))*ISNULL(C2.IssQty,0)
)
AS decimal(16,2)
),
CumClRt=CAST((CumHoldVal+ISNULL(C2.RecAmt,0)-
((SELECT CumClRt WHERE C2.ID=C2.[Counter]+1))*ISNULL(C2.IssQty,0))/C2.ClQty AS decimal(16,2)),
C2.[Counter]
FROM
[Auto].IronOreTbl C2
INNER JOIN ClAmtCTE C1 ON C1.Id = C2.[Counter]
)
UPDATE T2
SET ClRate = T1.CumClRt,
ClAmt = T1.CumHoldVal
FROM ClAmtCTE T1
INNER JOIN [Auto].IronOreTbl T2
ON T1.Id = T2.Id
OPTION (MAXRECURSION 0)
Attempting and, as it appears to me, solving it has taught me several things. Primarily:
Calculations which you can set up and run in a heartbeat in a spreadsheet are anything but in SQL.
Excel can both calculate on the fly and use those on-the-fly results in dependent calculations; SQL cannot. Having this ability in Excel means you do not have to iterate over the data twice - once to compute the results and again to apply them where they are needed.
For all its use and utility as a data-housing medium, SQL has a long way to go in handling standard real-world calculations such as this example: multiple loans, changing interest rates and the related interest calculations, etc.

LEFT JOIN include data

I have an application which handles school vacations. Unfortunately there are three different kinds of school vacations: country-wide, federal-state-wide and city-wide. I store all the information in a table days, a table vacation_periods and a connection table slots:
days {
id:integer
date_value:date
}
slots {
id:integer
day_id:integer
vacation_period_id:integer
}
vacation_periods {
id:integer
starts_on:date
ends_on:date
name:string
country_id:integer
federal_state_id:integer
city_id:integer
}
I want to select all days within a specific time frame. Let's say Jan 1st of 2017 to Jan 31st of 2017. I can get those days with:
SELECT * FROM days WHERE date_value >= '2017-01-01' AND
date_value <= '2017-01-31';
But for my vacation calendar I don't just need the days but also the information which vacation_periods fall within them. Assume I search for all vacation_periods which are in that time frame and which have
country_id == 1 or federal_state_id == 5 or city_id == 30
I've read about JOINs and LEFT JOINs, which seem to be the solution to the problem, but I can't put everything together.
Is it possible to send one SQL request which returns all days within the requested time frame plus, for each day, the information whether a vacation_period that fits the country_id == 1 or federal_state_id == 5 or city_id == 30 rule is connected to it via slots, including the name of that vacation_period?
If one request is not possible: Which is the quickest way to solve this within the database? How many requests? What kind of requests?
If possible I'd like to get a result in some kind of this form:
- date_value: "2017-01-01"
- date_value: "2017-01-02"
- date_value: "2017-01-03"
* vacation_period.id: 15
* vacation_period.name: "foobar"
- date_value: "2017-01-04"
* vacation_period.id: 15
* vacation_period.name: "foobar"
- date_value: "2017-01-05"
* vacation_period.id: 15
* vacation_period.name: "foobar"
- date_value: "2017-01-06"
- date_value: "2017-01-07"
...
The following query might give you the answer you are looking for:
SELECT * FROM days
INNER JOIN slots ON days.id = slots.day_id
INNER JOIN vacation_periods ON vacation_periods.id = slots.vacation_period_id
WHERE days.date_value >= '2017-01-01' AND days.date_value <= '2017-01-31'
I think you can get an unformatted version of what you want (that could be processed into a hierarchical output) with
CREATE TYPE vacation_authority AS ENUM
('COUNTRY', 'FED-STATE', 'CITY');
/* not necessary, but cleans up the vacation_period table */
Change vacation_periods to have only one id plus a new field authority of type vacation_authority, as sketched below. You can now make a primary key out of either the id field or (id, authority), depending on how the vacation data comes into the system.
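A minimal sketch of that restructure (an assumption on my part: the remaining columns are carried over unchanged from the original vacation_periods definition):
CREATE TABLE vacation_periods (
    id integer NOT NULL,
    authority vacation_authority NOT NULL, -- replaces country_id/federal_state_id/city_id
    starts_on date,
    ends_on date,
    name text,
    PRIMARY KEY (id, authority) -- or PRIMARY KEY (id), depending on the data feed
);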
SELECT date_value, vp.name, vp.id /* is the ID meaningful or arbitrary? */
FROM days LEFT JOIN vacation_periods vp
ON date_value BETWEEN vp.starts_on AND vp.ends_on; -- inclusive range
Now if there are multiple holidays spanning a given date, this will produce multiple records in the output. It's not clear what you want in that case.
None of the other answers was able to solve my problem outright, but they led me to the solution, so I'm grateful for them. Here's the solution:
SELECT days.date_value, slots.vacation_period_id, vacation_periods.name FROM days
LEFT OUTER JOIN slots ON (days.id = slots.day_id)
LEFT OUTER JOIN vacation_periods ON (slots.vacation_period_id = vacation_periods.id)
WHERE days.date_value >= '2017-01-05'
AND days.date_value <='2017-01-15'
AND (vacation_periods.id IS NULL
OR vacation_periods.country_id = 1
OR vacation_periods.federal_state_id = 5)
ORDER BY days.date_value;

Mean time to Failure calculation in DAX

I am trying to calculate the mean time to failure for each asset in a job table. At the moment I calculate it as follows:
Previous ID = CALCULATE(MAX('JobTrackDB Job'[JobId]),FILTER('JobTrackDB Job','JobTrackDB Job'[AssetDescriptionID]=EARLIER('JobTrackDB Job'[AssetDescriptionID]) && 'JobTrackDB Job'[JobId]<EARLIER('JobTrackDB Job'[JobId])))
Then I bring back the last finish time for the current job when the JobStatus is 7 (closed):
Finish Time = CALCULATE(MAX('JobTrackDB JobDetail'[FinishTime]),'JobTrackDB JobDetail'[JobId],'JobTrackDB JobDetail'[JobStatus]=7)
Then I bring back the previous job's finish time where the JobType is 1 (a response call, rather than comparing it to maintenance calls):
Previous Finish = CALCULATE(MAX('JobTrackDB Job'[Finish Time]),FILTER('JobTrackDB Job','JobTrackDB Job'[AssetDescriptionID]=EARLIER('JobTrackDB Job'[AssetDescriptionID]) && 'JobTrackDB Job'[Finish Time]<EARLIER('JobTrackDB Job'[Finish Time]) && EARLIER('JobTrackDB Job'[JobTypeID])=1))
Then I calculate the time between failures, where I also disregard erroneous values:
Time between failure = IF([Previous Finish]=BLANK(),BLANK(),IF('JobTrackDB Job'[Date Logged]-[Previous Finish]<0,BLANK(),'JobTrackDB Job'[Date Logged]-[Previous Finish]))
The issue is that sometimes the calculation uses previous maintenance jobs even though I specified JobTypeID = 1 in the filter. Also, the current calculation does not take into account the time from the start of records to the first job for that asset, nor from the last job until today. I am scratching my head trying to figure it out.
Any ideas???
Thanks,
Brent
Some base measures:
MaxJobID := MAX( Job[JobID] )
MaxLogDate := MAX ( Job[Date Logged] )
MaxFinishTime := MAX (JobDetail[Finish Time])
Intermediate calculations:
ClosedFinishTime := CALCULATE ( [MaxFinishTime], Job[Status] = 7 )
AssetPreviousJobID := CALCULATE (
    [MaxJobID],
    FILTER(
        ALLEXCEPT( Job, Job[AssetDescriptionID] ),
        Job[JobId] < MAX( Job[JobID] )
    )
)
PreviousFinishTime := CALCULATE ( [ClosedFinishTime],
FILTER(
ALLEXCEPT(Job, Job[AssetDescriptionID]),
Job[JobId] < MAX(Job[JobID])
&& Job[JobType] = 1
)
)
FailureTime := IF (
ISBLANK([PreviousFinishTime]),
0,
( [MaxLogDate]-[PreviousFinishTime] )
)
This should at least get you started. If you want to set some sort of "first day", you can replace the 0 in the FailureTime with a calc like MaxLogDate - [OverallFirstDate], which could be a calculated measure or a constant.
For something that hasn't failed, you'd want an entirely different measure, since this one is based on lookback only. Something like [Days Since Last Failure], which would just be (basically) TODAY() - [ClosedFinishTime].
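A minimal sketch of that measure (assumptions on my part: [ClosedFinishTime] returns a date, and blank means no closed failure job exists for the asset):
Days Since Last Failure :=
IF (
    ISBLANK ( [ClosedFinishTime] ),
    BLANK (), -- no closed failure on record for this asset
    INT ( TODAY () - [ClosedFinishTime] ) -- whole days since the last closed job
)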

RankOver Partition by with minutes and seconds

I am trying to sequence data, and as it occurs there are instances where I have to order the sequence using hours, minutes and seconds. However, when I use the RANK/PARTITION BY function, it's almost as if it does not recognize this as chronological data at all. An example of the data I am trying to sequence is below:
Mod_Order Last_Activity ACTIVITY_DATE_DTTM hdm_modif_dttm
1 NULL 15/08/2007 00:00:00 59:47.3
2 NULL 27/09/2007 14:30:02 59:22.9
3 NULL 27/11/2007 15:30:02 59:10.5
3 NULL 27/11/2007 15:30:02 58:38.9
As you can see, the last two ACTIVITY_DATE_DTTM date times are exactly the same, so I need to go a step further. I removed the date from the hdm_modif_dttm field to see if it made any difference, but it does not (I left it as a time, though, as I figured it makes no difference anyhow). So my code was as follows:
UPDATE q
SET q.Mod_Order = b.Mod_Order
FROM [#Temp_last_act_2] q
Left join
(
select
RANK () over
(partition by pathway_id
order by pathway_id, ACTIVITY_DATE_DTTM,hdm_modif_dttm) as Mod_Order,
PATHWAY_ID,
MODIF_DTTM,
ACTIVITY_DATE_DTTM
from #temp_Last_act_2
) as b on b.PATHWAY_ID = q.PATHWAY_ID
and b.MODIF_DTTM = q.MODIF_DTTM
and b.ACTIVITY_DATE_DTTM = q.ACTIVITY_DATE_DTTM
Is anyone aware of any limitations of this function that I'm unaware of, or is there a function that may handle this better (or am I being really daft)?