I've been trying to build a loop using Pentaho Data Integration (9.1).
I have a table with a JULIAN DATE column that runs from 2023001 through 2023010, but I want the loop to run only until 2023006.
My KJB file looks like this:
Start -> Set_Parameter.ktr -> Data_Select.ktr -> Success
In Set_Parameter.ktr:
Table Input -> Copy Rows To Result
In Data_Select.ktr:
Table Input -> Select Values -> Table Output
The PostgreSQL query in the Table Input of Set_Parameter.ktr:
SELECT generate_series(2023001, 2023006) AS JULIANMISDATE
The query in Data_Select.ktr:
SELECT * FROM DATA_LOOPING where JULIAN_MIS_DATE=${JULIANMISDATE}
The error says that my query in Data_Select.ktr is wrong because it contains "$"; it seems Pentaho did not recognise that I was trying to reference JULIANMISDATE from Set_Parameter.ktr.
LINK FOR KTR AND KJB FILES
I am trying to run multiple MySQL queries to get monthly data from a table, using a for/while loop over a set of dates.
For example, I have a list of dates:
a = [
    ("2022-01-01", "2022-01-31"),
    ("2022-02-01", "2022-02-28"),
    ("2022-03-01", "2022-03-31"),
    ...,
    ("2022-12-01", "2022-12-31")
]
I would like to loop through the dates so that I don't have to run the query 12 times manually, but I get "SQL compilation error: invalid identifier 'A'":
for i in range(12):
    sql_query = select *
                from table
                where date between a[i][0] and a[i][1];
and pass this query to the Snowflake DB.
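For reference, a hypothetical sketch of the intended loop (the table name my_table and column date are placeholders, not a real schema). The key point is that the Python values must be interpolated into, or bound to, the SQL text; leaving a[i][0] inside the SQL string sends the literal text "a" to Snowflake, which parses it as an identifier and fails:

```python
# Placeholder date ranges, as in the question.
a = [
    ("2022-01-01", "2022-01-31"),
    ("2022-02-01", "2022-02-28"),
    ("2022-03-01", "2022-03-31"),
]

queries = []
for start, end in a:
    # f-string interpolation puts the actual date values into the SQL text,
    # instead of the literal characters "a[i][0]".
    queries.append(
        f"select * from my_table where date between '{start}' and '{end}'"
    )

for q in queries:
    print(q)
```

With a real connector it is generally safer to bind the values as query parameters rather than format them into the string; the exact placeholder style depends on the connector's paramstyle setting.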
I have written this EXCEPT query to get the difference in records between two Hive tables from a Databricks notebook. (I am trying to get the same result as in MSSQL, i.e. only the difference in the result set.)
select PreqinContactID,PreqinContactName,PreqinPersonTitle,EMail,City
from preqin_7dec.PreqinContact where filename='InvestorContactPD.csv'
except
select CONTACT_ID,NAME,JOB_TITLE,EMAIL,CITY
from preqinct.InvestorContactPD where contact_id in (
select PreqinContactID from preqin_7dec.PreqinContact
where filename='InvestorContactPD.csv')
But the result set returned also contains matching records. The record shown above comes back in the result set, but when I checked it separately based on contact_id it is the same, so I am not sure why EXCEPT is returning the matching record as well.
I just want to know how to use EXCEPT, or any other difference-finding command, in a Databricks notebook using SQL.
I want to see nothing in the result set if the source and target data are the same.
EXCEPT works perfectly well in Databricks as this simple test will show:
val df = Seq((3445256, "Avinash Singh", "Chief Manager", "asingh@gmail.com", "Mumbai"))
.toDF("contact_id", "name", "job_title", "email", "city")
// Save the dataframe to a temp view
df.createOrReplaceTempView("tmp")
df.show
The SQL test:
%sql
SELECT *
FROM tmp
EXCEPT
SELECT *
FROM tmp;
This query will yield no results. Is it possible you have some leading or trailing spaces, for example? Spark is also case-sensitive, so that could be causing your issue. Try a case-insensitive test by applying the LOWER function to all columns on both sides of the EXCEPT, e.g. SELECT LOWER(name), LOWER(city) ...
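To make the point concrete, here is the same idea sketched in Python with SQLite, whose EXCEPT has the same set-difference semantics; the two rows below differ only in case and a trailing space, so a plain EXCEPT still reports a difference:

```python
import sqlite3

# Two one-row tables whose values look the same but differ in case
# and trailing whitespace.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (name TEXT, city TEXT)")
cur.execute("CREATE TABLE b (name TEXT, city TEXT)")
cur.execute("INSERT INTO a VALUES ('Avinash Singh', 'Mumbai')")
cur.execute("INSERT INTO b VALUES ('avinash singh', 'Mumbai ')")

# Byte-for-byte comparison: EXCEPT keeps the row from a.
plain = cur.execute("SELECT * FROM a EXCEPT SELECT * FROM b").fetchall()

# Normalized comparison: lowercase and trim both sides first.
normalized = cur.execute(
    "SELECT LOWER(TRIM(name)), LOWER(TRIM(city)) FROM a "
    "EXCEPT "
    "SELECT LOWER(TRIM(name)), LOWER(TRIM(city)) FROM b"
).fetchall()

print(len(plain))       # the un-normalized rows still differ
print(len(normalized))  # after normalization the difference disappears
```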
I am new to Pig scripting but good with SQL. I wanted the Pig equivalent of this SQL line:
SELECT * FROM Orders WHERE Date='2008-11-11';
Basically, I want to load data for one id or date. How do I do that?
I did this and it worked: I used FILTER in Pig and got the desired results.
ivr_src = LOAD '/raw/prod/...';
info = FOREACH ivr_src GENERATE timeEpochMillisUTC AS time, cSId AS id;
Filter_table = FILTER info BY id == '700000';
sorted_filter_table = ORDER Filter_table BY $1;
STORE sorted_filter_table INTO 'sorted_filter_table1' USING PigStorage('\t', '-schema');
I ended up with a table storing a network topology as follows:
create table topology.network_graph(
    node_from_id varchar(50) not null,
    node_to_id varchar(50) not null,
    PRIMARY KEY (node_from_id, node_to_id)
);
The expected output is something like this, where all the sub-graphs starting from node "A" are listed:
Now I try to find the paths between the nodes, starting at a specific node using this query:
WITH RECURSIVE network_nodes AS (
select
node_from_id,
node_to_id,
0 as hop_count,
ARRAY[node_from_id::varchar(50), node_to_id::varchar(50)] AS "path"
from topology.network_graph
where node_from_id = '_9EB23E6C4C824441BB5F75616DEB8DA7' --Set this node as the starting element
union
select
nn.node_from_id,
ng.node_to_id,
nn.hop_count + 1 as hop_count,
(nn."path" || ARRAY[ng.node_to_id])::varchar(50)[] AS "path"
from topology.network_graph as ng
inner join network_nodes as nn
on ng.node_from_id = nn.node_to_id
and ng.node_to_id != ALL(nn."path")
)
select node_from_id, node_to_id, hop_count, "path"
from network_nodes
order by node_from_id, node_to_id, hop_count ;
The query runs several minutes before throwing the error:
could not write to tuplestore temporary file: No space left on device
The topology.network_graph table has 2148 records, and during query execution the base/pgsql_tmp directory grows to several GBs. It seems I have an infinite loop.
Can someone find what could be wrong?
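As a sanity check outside the database, the same path enumeration with the cycle guard can be sketched in Python (the edge list here is made up for illustration). Note that even with a cycle guard, exhaustively enumerating simple paths can explode combinatorially in a dense graph, which would be consistent with the temp-file growth:

```python
def all_paths(edges, start):
    """Enumerate every simple path starting at `start` (no node revisited)."""
    adjacency = {}
    for frm, to in edges:
        adjacency.setdefault(frm, []).append(to)

    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt in path:  # cycle guard, like node_to_id != ALL(path)
                continue
            paths.append(path + [nxt])
            stack.append((nxt, path + [nxt]))
    return paths

# A four-edge graph that contains the cycle A -> B -> C -> A.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D")]
print(all_paths(edges, "A"))
```

The guard terminates each individual path at the first repeated node, but the number of distinct simple paths (and hence tuplestore rows) can still grow exponentially with graph density.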
I am very new to Pig, so I am facing some issues while trying to perform very basic processing in Pig.
1- Load the file using Pig.
2- Write processing logic to filter records based on date. For example, the lines have 2 columns, col_1 and col_2 (assume the columns are chararray), and I need to get only the records which have a 1-day difference between col_1 and col_2.
3- Finally, store the filtered records in a Hive table.
Input file ( tab separated ) :-
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I am getting looks like below:
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
I am not sure why.
Can someone please help me with how to parse a tab-separated file, how to convert the chararray to a date, and how to filter based on the day difference?
Thanks
Convert the columns to datetime objects using ToDate and use DaysBetween. This gives the difference; filter where the difference == 1, and finally store the result in Hive. Note that the input timestamps are ISO 8601, so the single-argument ToDate can parse them directly, and DaysBetween(later, earlier) returns a positive count:
A = LOAD '/user/inp.txt' USING PigStorage('\t') AS (col_1:chararray, col_2:chararray);
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2), ToDate(col_1)) AS day_diff;
C = FILTER B BY day_diff == 1;
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();
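The same day-difference filter, sketched in Python for comparison (sample rows adapted from the question; the second pair is changed to be two days apart so the filter has something to drop):

```python
from datetime import datetime

# Tab-separated input rows from the question, as (col_1, col_2) pairs.
rows = [
    ("2016-01-01T16:31:40.000+01:00", "2016-01-02T16:31:40.000+01:00"),
    ("2017-01-01T16:31:40.000+01:00", "2017-01-03T16:31:40.000+01:00"),
]

def day_diff(col_1, col_2):
    # fromisoformat handles the ISO 8601 timestamps, including the
    # millisecond fraction and the +01:00 offset.
    t1 = datetime.fromisoformat(col_1)
    t2 = datetime.fromisoformat(col_2)
    return abs((t2 - t1).days)

# Keep only the rows whose columns are exactly one day apart.
filtered = [r for r in rows if day_diff(*r) == 1]
print(filtered)
```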