I'm trying to merge two datasets, say A and B. Dataset A has a variable "Flag" which takes two values. Rather than just merging both datasets together, I was trying to merge the two datasets based on the "Flag" variable.
The merging code is the following:
create table new_data as
select a.*, b.y
from A as a left join B as b
on a.x = b.x;
Since I'm running the Hive code through the CLI, I'm calling it with the following command:
hive -f new_data.hql
The looping part of the code that merges the data based on the "Flag" variable is the following:
for flag in 1 2;
do
hive -hivevar flag=$flag -f new_data.hql
done
I put the above loop in another ".hql" file and called it with:
hive -f loop_data.hql
But it's throwing an error:
cannot recognize input near 'for' 'flag' 'in'
Can anybody please tell me where I'm making a mistake?
Thanks!
You should add the loop logic to a shell script instead; the Hive CLI cannot parse shell constructs like a for loop inside an .hql file, which is why it fails on 'for'.
File Name: loop_data.sh
for flag in 1 2;
do
hive -hivevar flag=$flag -f new_data.hql
done
And execute the script like:
sh loop_data.sh
In your new_data.hql script you are creating a table. Since the table only needs to be created once (not once per flag value), you should split the DDL and DML into two separate scripts, like this:
DDL: create_new_data.hql
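-- "where 1 = 0" returns no rows, so this just creates an empty table
-- with the column structure of the join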
create table new_data as
select
a.*,
b.y
from
A as a left join
B as b on
a.x = b.x
where
1 = 0;
DML: insert_new_data.hql
-- ${hiveconf:flag} is replaced by the value passed with "-hiveconf flag=..."
insert into table new_data
select
a.*,
b.y
from
A as a left join
B as b on
a.x = b.x
where
a.flag = ${hiveconf:flag};
And update your shell script like:
File Name: loop_new_data.sh
# Create table
hive -f create_new_data.hql
# Insert data
for flag in 1 2;
do
hive -hiveconf flag=$flag -f insert_new_data.hql
done
And execute it like:
sh loop_new_data.sh
Let me know if you want more info.
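One thing to watch: the variable namespace must match the command-line option. A value passed with -hiveconf is referenced as ${hiveconf:flag} (as above), while a value passed with -hivevar, as in your original loop, is referenced as ${hivevar:flag}. In that case the where clause of insert_new_data.hql would instead read:
where
a.flag = ${hivevar:flag};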
Related
With Dataiku, I am trying to compute multiple joins against the same table in BigQuery. For example, my query would be (in simple terms):
For i = 1 to 24 :
CREATE TABLE table0 as
SELECT
A.*,
B.column as column_i
FROM
table0 AS A
LEFT JOIN table_i AS B
ON A.id=B.id
How can I do this in a simple way? I tried with a SQL script or notebook, but it seems that Dataiku doesn't support the DECLARE statement for my variable i.
Is it possible to do this? (I'm usually a SAS programmer, so I am used to building code in SAS macros like this.) I have a table (let's call it TableCode) that holds lines of code (dynamically built by previous queries from metadata etc.), e.g.:
code pos
---- ---
a.id as id_a, 1
b.id as id_b, 2
a.var1 as var1_a, 3
b.var1 as var_b 4
from tablea a, 991
join tableb b 992
on a.id=b.id; 993
It would be bigger than that, but you get the idea.
So, I'd like to be able to do something like:
execute 'select '||code||' from TableCode order by pos';
meaning that the code stored in TableCode would run. Is such a thing possible with Redshift?
Run a query that generates the code you want as its result. Copy the generated text, paste it back into your SQL Workbench editor window, and submit it.
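One way to get that text back as a single value to copy (a sketch, assuming Redshift's LISTAGG function and the TableCode table from the question):
select 'select ' || listagg(code, ' ') within group (order by pos) as generated_sql
from TableCode;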
I am new to SSC. My scenario is that I have created tables A, B, and C which are related to one another.
Whenever I need data from these three tables I always need to join them to get results. It's a bit time consuming to do this all the time.
Because of this I created a table 'R' and a procedure to update its contents. In this procedure I am joining all the tables (A, B, and C) and storing the result in table R.
To get the results into this table I created a SQL job which runs once daily. However, there is a problem: sometimes I want results from tables A, B, and C that include records inserted recently (before R has been updated).
Is there any solution to get up-to-date results from the R table every time, without constantly running the SQL job to update it?
Additional Information
My desired solution is that any time I need data, table R is queried, not the joined tables A, B, and C. Your solution must take this into account.
Thank you.
Instead of running a procedure to constantly update table 'R', create a database view. This view would join A, B, and C together.
Then, any time you need to query A, B, and C, instead of risking getting stale data by querying table R, you would query the view.
I don't know your database schema, so I don't know what fields to join tables A, B, and C on, but it might look something like this:
CREATE VIEW V1
AS
SELECT * FROM A INNER JOIN B ON A.X = B.X INNER JOIN C ON B.Y = C.Y;
To query the view, you would use a SELECT statement just as you would with a table:
SELECT * FROM V1;
Add a timex (timestamp) column to your R table, so that at any time you can tell which rows belong to the latest result set.
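A rough sketch of what that could look like (assuming SQL Server; timex gets stamped whenever rows are written to R, and the names are only illustrative):
ALTER TABLE R ADD timex DATETIME NOT NULL DEFAULT GETUTCDATE();
-- later, read only the rows from the most recent refresh
SELECT *
FROM R
WHERE timex = (SELECT MAX(timex) FROM R);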
Based on feedback from the OP that table 'R' must always be the table queried (is this homework?), I suppose the only solution would be to place an update trigger on each of the tables 'A', 'B', and 'C', so that whenever any of these tables is updated its refreshed contents are automatically placed in table 'R'.
Though inefficient, at least this is better than running a stored procedure on some time basis, for example every 5 minutes.
CREATE PROCEDURE [usp_SyncR]
AS
BEGIN
SET NOCOUNT ON;
-- rebuild R from the joined source tables
DELETE FROM [R];
INSERT INTO [R]
SELECT *, GETUTCDATE() AS [UpdatedOn]
FROM A INNER JOIN B ON A.X = B.X INNER JOIN C ON B.Y = C.Y;
END
CREATE TRIGGER [trg_A_Sync_R]
ON [A]
AFTER Update
AS
BEGIN
EXEC [usp_SyncR];
END
CREATE TRIGGER [trg_B_Sync_R]
ON [B]
AFTER Update
AS
BEGIN
EXEC [usp_SyncR];
END
CREATE TRIGGER [trg_C_Sync_R]
ON [C]
AFTER Update
AS
BEGIN
EXEC [usp_SyncR];
END
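Since the question specifically mentions recently inserted records, the triggers would probably need to fire on inserts (and deletes) as well; for example, a hypothetical variant for table 'A':
CREATE TRIGGER [trg_A_Sync_R]
ON [A]
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
EXEC [usp_SyncR];
END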
I want to create a new variable based on certain conditions in a table and merge it with another table using the newly created variable, all in a single DATA step or PROC SQL step.
E.g.:
table 1
var   new_var
x     3x
y     4y
z     5z

table 2
new_var   additional_var
3x        a
3x        a
4y        z
and then merging both tables using new_var, in a single step.
Thanks
You can accomplish this using a join with an inline view. Inline views can save you coding time and reduce I/O, which is the biggest cause of slowdowns, particularly on non-SSD hard drives.
proc sql noprint;
create table want as
select var, t1.new_var, additional_var
from table1 as t1
LEFT JOIN
(select new_var,
CASE
when(<conditions>) then 'a'
else 'z'
END as additional_var
from table2) as t2
ON t1.new_var = t2.new_var;
quit;
I am trying to merge two large (million+ observation) datasets in SAS. I'm pretty new to SAS and this is my first Stack Exchange question, so hopefully the following makes sense...
SETUP:
All observations in the "Master" dataset have a unique identifier var1, and some also have a unique identifier var2. Some observations in the "Addition" dataset have the identifier var1 and some have the identifier var2; some observations have var2 but not var1.
I want to merge in all matches from the Addition dataset on EITHER var1 or var2 into the Master dataset.
METHODS I HAVE EXPLORED:
Option A: proc sql left join on var1 OR var2. Unfortunately, because there are many missing values of var2 in both Master and Addition, this runs into a Cartesian product problem: it works, but it is impractically slow with my large datasets.
proc sql;
create table match as
select a.id1, a.id2, varmast, b.varadd
from master a
left join addition b
on (a.id1=b.id1 and a.id2=b.id2) or (a.id2=b.id2 and b.id2 is not null);
quit;
Option B: I'm thinking maybe merge on the first identifier and then use proc sql update to update from the Addition variables using the second identifier? I'm not sure of the syntax.
Option C: I could probably do this with a few regular merges and then appending and deduping, but since this would take 5+ steps and each step takes a few minutes to run (on a good day), I am hoping for something shorter.
I suspect that two left joins are what you want, and they should have better performance. The result is something like this:
proc sql;
create table match as
select m.id1, a.id2, varmast, coalesce(a.varadd, a2.varadd) as varadd
from master m left join
addition a
on m.id1 = a.id1 and m.id2 = a.id2 left join
addition a2
on m.id1 = a2.id1 and m.id2 is null and a.id1 is null;
quit;