Pentaho Data Integration issue with launching a Kettle transformation based on a condition

I have a Pentaho Data Integration job which has the following steps:
A Generate Rows step with an initial date (e.g. 2010-01-01) and a limit of 10*366 = 3660 rows for 10 years.
The next step is an incrementer that increments the number of days.
The next step uses this information, viz. the initial date, the limit, and the increment, to generate a date for each day of the 10 years starting 2010-01-01, using JavaScript functions.
The final step loads a table with the generated dates.
All this works fine.
Now, I have a requirement where I do not want this table to be static with dates for 10 years only. If the max date in the date table is within 2 years of today, I want to load dates for 10 more years into the table.
For the above example, with the 1st load loading dates for 10 years from 2010, I should be able to load 10 more years in 2018, the next 10 years in 2028, and so on and so forth.
What will be the best way to achieve this?
How can I:
1) Read the max date from my date table? I know how to do this.
2) Use that date to compare against today, and if the max date is within 2 years of today, populate the table with the next 10 years?
I don't know how to do step 2 above in Pentaho Data Integration. I would really appreciate any pointers on a way to resolve this issue.

You need to read the current date (today) into a field, for example with a Get System Info step.
Then you can compare the two fields, max date and today, with a Filter Rows step.
As the previous step may give you more than one row, you need to use either a Unique Rows step (no field to provide) or a Group By step (no group-by field).
If any row gets through, then you launch your generate-10-years process. As you cannot have a hop from a step into that second Generate Rows step, you must use a Transformation Executor to launch your currently existing transformation.
Now, if your requirement gets even a tiny bit more complex than that, I strongly suggest using jobs to orchestrate your transformations.
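If you prefer to do the comparison in the database and only feed a flag into the Filter Rows step, here is a minimal sketch of a query you could put in a Table Input step. The table name date_dim and column dt are assumptions, and the syntax is MySQL-flavoured:

    -- Hypothetical names: date_dim is the date table, dt is its date column.
    -- Returns one row with needs_reload = 1 when the max date is within
    -- 2 years of today, 0 otherwise; a Filter Rows step can test this flag.
    SELECT CASE
             WHEN MAX(dt) <= DATE_ADD(CURRENT_DATE, INTERVAL 2 YEAR) THEN 1
             ELSE 0
           END AS needs_reload
    FROM date_dim;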

Related

Creating a 7 day live dashboard on Tableau

As the title says, I'm trying to create a live dashboard in Tableau that updates every day, showing the data for the last 7 days. I'm querying through SQL and then importing it into Tableau. Do I have to specify this requirement in my query, or would there be some way to do it in Tableau itself? Thank you so much. I would really appreciate the help.
Disclaimer: I'm pretty novice in tableau and SQL.
If you have a date field in your table, you can use it as a filter, select Relative Date as the filter option, and in the dialog that appears enter the number of days in the Days field. Since you want the live data for the last 7 days, you can enter 7 and you'll get the updated data each time.
If you are querying through SQL, put the filter for the date/timestamp in the WHERE condition itself, like so:
DATE(date_column_filter) >= (DATE(NOW()) - INTERVAL 7 DAY)
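For context, that predicate would sit in a full query roughly like this (MySQL syntax; my_table is a placeholder name):

    SELECT *
    FROM my_table
    WHERE DATE(date_column_filter) >= (DATE(NOW()) - INTERVAL 7 DAY);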

Bigquery - Table decorators changed weirdly

I used to have a number of queries running on the past 40 days of data using a decorator with [dataset.table#-4123456789-].
However, since September 15 all the decorators return a maximum of 10 days of data.
By the way, [dataset.table#0] returns the whole table and not the past 7 days as stated in the documentation.
Does anyone know what is going on? Do I have to move my table to partitions in order to get data for a limited period of time, but longer than a week?
Thanks

Analysis to be carried out in SQL

I have a SQL database which captures, every day, each column's data as binary 0 or 1. Now I want to calculate the average number of days a user takes to complete that column. How can I write the query for that?
Please note the timeline for every user is different, and I want the data in 3-day slabs (for example 1-3, 4-6, and so on).
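To frame the question, here is a minimal sketch of one possible approach, assuming a hypothetical table daily_status(user_id, status_date, completed) where completed is the 0/1 column (MySQL syntax):

    -- Days from each user's first record to their first completed day,
    -- bucketed into 3-day slabs (1-3 -> slab 1, 4-6 -> slab 2, ...).
    -- Users who never completed come back with NULLs.
    SELECT user_id,
           DATEDIFF(MIN(CASE WHEN completed = 1 THEN status_date END),
                    MIN(status_date)) + 1 AS days_to_complete,
           FLOOR(DATEDIFF(MIN(CASE WHEN completed = 1 THEN status_date END),
                          MIN(status_date)) / 3) + 1 AS slab
    FROM daily_status
    GROUP BY user_id;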

Schedule algorithm for nightly SQL extract of data

I am looking for an algorithm to extract data from one system into another, but on a sliding scale. Here are the details:
Every two weeks, 80 weeks of data needs to be extracted.
Extracts take a long time and are resource intensive so we would like to distribute the load of the extract over time.
The first 8-12 weeks are the most important and need to be updated more often over the two-week window. Data further out can be updated less frequently, to the point where the last 40+ weeks could even just be extracted once every two weeks.
Every two weeks, the start date shifts two weeks ahead and so two new weeks are extracted.
Extract procedure takes a start and end date (this is already made and should be treated like a black box). The procedure could be run for multiple date spans in a day if required but contiguous dates are faster than multiple blocks of dates.
Extract blocks should be no smaller than 2 weeks and probably no greater than 16 weeks. Longer blocks are possible, but at 16 weeks they are already a significant load on the system.
4 contiguous weeks of data takes about 1 hour approximately. It takes a long time because the data needs to be generated/calculated.
Data that is newly extracted replaces the old data for the timespan. No need to merge or diff the data, it is just replaced.
This algorithm needs to be built into a SQL job which will handle the daily process (triggered once a day only).
My initial thought was pretty much to create a sliding schedule: rotate the first 4-week block every second day, then the second 4-week block every 3 to 4 days. The rest of the data would be extracted in smaller blocks spread over the two-week period.
What I am going to do will work, but I wanted to spend some time seeing if there might be a better way to approach the problem. I am mainly looking for an algorithm to produce the start/end date schedule for the daily extract.
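One way to make that sliding schedule explicit is a control table that the daily job reads. Below is a minimal sketch under assumed names (extract_schedule and extract_params are hypothetical tables, MySQL-flavoured syntax); the existing extract procedure is still treated as a black box that receives each span:

    -- Each row describes one block of weeks and how often it should refresh.
    CREATE TABLE extract_schedule (
        block_id          INT PRIMARY KEY,
        start_week_offset INT,   -- weeks after the sliding start date
        length_weeks      INT,   -- block size (2 to 16 weeks)
        refresh_days      INT,   -- refresh cadence in days
        last_extracted    DATE
    );

    -- One-row table holding the sliding start date, shifted two weeks
    -- ahead every two weeks.
    CREATE TABLE extract_params (start_date DATE);

    -- Daily job: find the blocks that are due and compute their date spans,
    -- then pass each span to the existing extract procedure.
    SELECT s.block_id,
           DATE_ADD(p.start_date, INTERVAL s.start_week_offset WEEK) AS span_start,
           DATE_ADD(p.start_date, INTERVAL s.start_week_offset + s.length_weeks WEEK) AS span_end
    FROM extract_schedule s
    CROSS JOIN extract_params p
    WHERE s.last_extracted IS NULL
       OR DATEDIFF(CURRENT_DATE, s.last_extracted) >= s.refresh_days;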

BigQuery: Why does Table Range Decorators return wrong result sometimes?

I've been using the Table Range Decorators feature daily since May in order to only query the data from the last 7 days in some of my tables.
For the past 2 weeks, I've noticed that sometimes some data is missing when I use that feature. For example, if I do a query to get the results for the last 7 days (by adding "#-604800000--1" to the table), some data will be missing, as opposed to when I query the whole table (without a table decorator).
I wonder what could explain this and if there is a fix coming soon to address this?
In case this can help the BigQuery team: I've noticed that when using table decorators, some data was missing for us for October 16th between around 16:00 and 20:00 UTC.
For the BigQuery team, here are 2 job ids where some data is missing: job_-xtL4PlIYhNjQ5weMnssvqDmd6U , job_9ASNxqq_swjCd1eMmiQ6SmPpxlQ
and 1 job id where data is correct (without decorators): job_QbcRwYGbQv0BZdHreQEvRlYh-mM
This is a known issue with table decorators containing a time range. Due to a bug in BigQuery, it is possible for certain time ranges to omit data that should be included within the time range.
We're working on a fix and plan to have it released next week. After this fix is deployed, time range decorators should again work as expected.