Hive to Hive ETL - sql

I have two large Hive tables, say TableA and TableB (which get loaded from different sources).
These two tables have almost identical table structure / columns with same partition column, a date stored as string.
I need to filter records from each table based on certain (identical) filter criteria.
These tables have some columns containing "codes", which need to be looked up to get their corresponding "values".
There are eight to ten such lookup tables, say LookupA, LookupB, LookupC, etc.
Now, I need to:
1. Do a union of the filtered records from TableA and TableB.
2. Do a lookup into the lookup tables and replace the "codes" in the filtered records with their respective "values". If a "code" is missing from the filtered records, or its "value" is missing from the lookup table, I need to substitute zero or an empty string.
3. Transform the dates in the filtered records from one format to another.
I am a beginner in Hive. Please let me know how I can do it. Thanks.
Note: I can manage till union of the tables. Need some guidance on lookup and transformation.

To do a lookup, follow the steps below.
Create a custom user-defined function (UDF) that does the lookup work: write a Java program that performs the lookup, package it as a jar, and add it to Hive, something like this:
ADD JAR /home/ubuntu/lookup.jar;
Then add the lookup file containing the key-value pairs:
ADD FILE /home/ubuntu/lookupA;
Then create a temporary lookup function, such as:
CREATE TEMPORARY FUNCTION getLookupValueA AS 'com.LookupA';
Finally, call this lookup function in the SELECT query, which will populate the lookup value for the given lookup key.
The same thing can be achieved using a JOIN, but that may take a hit on performance.
With the join approach, you join the source and lookup tables on the lookup code, something like:
select a.key, b.lookupvalue
from sourcetable a join lookuptable b
on a.key = b.lookupKey
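For the date transformation, you can use Hive's built-in date functions, for example from_unixtime(unix_timestamp(datecol, 'yyyyMMdd'), 'yyyy-MM-dd') (the column name and the two formats here are placeholders).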
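The join approach, with the "substitute an empty string when there is no match" requirement from the question, can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's sqlite3 module as a stand-in so the query can be run anywhere, the table and column names (filtered, lookup_a, code_a, event_dt) are hypothetical, and the date is assumed to be stored as a yyyyMMdd string. The LEFT OUTER JOIN plus COALESCE shape is standard SQL; the date expression is engine-specific and would be written with the engine's own date functions.

```python
import sqlite3

# In-memory database standing in for the filtered/unioned Hive data (hypothetical names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE filtered (id INTEGER, code_a TEXT, event_dt TEXT);
    CREATE TABLE lookup_a (lookup_key TEXT, lookup_value TEXT);
    INSERT INTO filtered VALUES (1, 'X', '20140131'),
                                (2, 'Y', '20140201'),
                                (3, NULL, '20140215');
    INSERT INTO lookup_a VALUES ('X', 'ten');
""")

QUERY = """
    SELECT f.id,
           -- unmatched or missing codes fall back to an empty string
           COALESCE(l.lookup_value, '') AS value_a,
           -- reformat yyyyMMdd to yyyy-MM-dd with plain string functions
           substr(f.event_dt, 1, 4) || '-' ||
           substr(f.event_dt, 5, 2) || '-' ||
           substr(f.event_dt, 7, 2) AS event_dt
    FROM filtered f
    LEFT OUTER JOIN lookup_a l ON f.code_a = l.lookup_key
    ORDER BY f.id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With eight to ten lookup tables, the same pattern would presumably be repeated with one LEFT OUTER JOIN per lookup table, each under its own alias.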

For the above problem follow the following steps:
Use UNION ONSCHEMA to union the two tables (the schemas should be the same).
For the above scenario you can try a Pig script.
The script would look like this (join TableA and TableB with the lookup tables and generate the appropriate columns):
a = join TableA by codesA left outer, lookupA by codesA;
b = join a by codesB left outer, lookupB by codesB;
Similarly for TableB.
Suppose some value of codesA does not have a value in the lookup table, then:
z = foreach b generate codesA as codesA, (valueA is null ? '0' : valueA) as valuesA;
(This replaces all null values of valueA with 0.)
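If you are using Pig 0.12 or later, you can use the built-in date functions, e.g. ToString(CurrentTime(),'yyyy-MM-dd'). To convert an existing date string, the same pattern applies with ToDate, e.g. ToString(ToDate(dateCol, 'yyyyMMdd'), 'yyyy-MM-dd'), where dateCol and the formats are placeholders.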
I hope it will solve your problem. Let me know in case of any concern.

Related

How to get the differences between two - kind of - duplicated tables (sql)

Prolog:
I have two tables in two different databases, one is an updated version of the other. For example we could imagine that one year ago I duplicated table 1 in the new db (say, table 2), and from then I started working on table 2 never updating table 1.
I would like to compare the two tables to get the differences that have accumulated over this period (the tables have kept the same structure, so the comparison is meaningful).
My approach was to create a third table, copy both table 1 and table 2 into it, and then count the number of repetitions of every entry.
In my opinion this, together with a new attribute that records which table each entry comes from, would do the job.
Problem:
When copying the two tables into the third table I get the (obvious) error about duplicate key values violating a unique or primary key constraint.
How could I bypass the error, or how could I do the same job better? Any idea is appreciated.
Something like this should do what you want if A and B have the same structure; otherwise just select and rename the columns you want to compare:
SELECT
*
FROM
B
WHERE NOT EXISTS (SELECT * FROM A WHERE A.col = B.col and ....)
If NOT EXISTS doesn't work in your DBMS, you could also use a left outer join that compares the rows' column values and keeps only the rows with no match:
SELECT
A.*
from
A left outer join B
on A.col = B.col and ....
where B.col is null
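Both approaches can be tried on a toy pair of tables. The sketch below is an assumption-laden illustration, not a drop-in solution: it uses Python's sqlite3 module, two hypothetical columns (id, name), and a subquery that is correlated to the outer row so that each row is checked against its own counterpart. The left join version keeps only the rows for which no matching row was found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE B (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO A VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    INSERT INTO B VALUES (1, 'alpha'), (2, 'BETA'), (4, 'delta');
""")

# Rows of B with no identical row in A (changed or newly added rows).
only_in_b = conn.execute("""
    SELECT B.* FROM B
    WHERE NOT EXISTS (
        SELECT 1 FROM A WHERE A.id = B.id AND A.name = B.name
    )
    ORDER BY B.id
""").fetchall()

# Rows of A with no identical row in B (changed or deleted rows),
# written as a left outer join filtered on the unmatched side.
only_in_a = conn.execute("""
    SELECT A.* FROM A
    LEFT OUTER JOIN B ON A.id = B.id AND A.name = B.name
    WHERE B.id IS NULL
    ORDER BY A.id
""").fetchall()

print("only in B:", only_in_b)
print("only in A:", only_in_a)
```

Because each query reads the two tables separately, nothing is copied into a shared third table and the duplicate-key error does not arise.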

A possible way to remove BigQuery column

I'm looking around for an approach to update an existing BigQuery table.
With the CLI I'm able to copy the table to a new one. Now I'm looking for an effective way to remove/rename a column.
It's said that it is not possible to remove a column. So is it possible to exclude some columns when copying table1 to table2?
Thanks,
You can do this by running a query that copies the old table to the new one. You should specify allowLargeResults:true and flattenResults:false. The former allows you to have query results larger than 128MB; the latter prevents repeated fields from being flattened in the result.
You can write the results to the same table as the source table, but use the writeDisposition:WRITE_TRUNCATE. This will atomically overwrite the table with the results. However, if you'd like to test out the query first, you always could write the results to a temporary table first, then copy the temporary table over the old table when you're happy with it (using WRITE_TRUNCATE to atomically replace the table).
(Note: the flags I'm describing here use their names in the underlying API, but they have analogues in both the query options in the Web UI and the bq CLI.)
For example, if you have a table t1 with schema {a, b, c, d} and you want to drop field c, and rename b to b2 you can run
SELECT a, b as b2, d FROM t1

Spotfire - Getting data from one table that falls between two dates in another table and adding to a calculated column

What would be the expression to create a calculated column in Table Example 2 called "SZODMAXCALC", that would contain the SZODMAXCALC from Table Example 1 given that the data from Table Example 1 falls between the dates (DTTMSTART and DTTMEND) within Table Example 2?
Maybe this is easier done on the SQL side that loads the data?
There is no way to create a calculated column that references a column in another table.
You will need to do a join either in Spotfire (via Insert...Columns)* or on the SQL side of things (either via a view on your database or by creating a new information link in Spotfire).
The best method depends on your data structure, implementation, and desired results, so I'm not able to recommend one here. Take a look at both options and evaluate which one works best.
* NOTE that Spotfire cannot join based on a Calculated Column as a common key. That is, using your example, if [WELLNAME] is a calculated column, you cannot tell Spotfire the equivalent of SELECT wellname, ... FROM table_a LEFT JOIN table_b ON table_a.wellname = table_b.wellname.
The alternative is to Insert...Transformation and choose Insert New Calculated Column, and to join on that instead.
The reason for this is that calculated columns are very mutable; they could change frequently based on a user action, and it would be inefficient to re-execute the join each time the column's contents changed. Conversely, a "Transformation Calculated Column" is only updated when the data table is loaded.

Replace certain values by the means of a table with replacement values

I have a table with data in it. Generally the values are correct but sometimes we need to modify them. The modifications are saved in a second table.
I wanted to create a query that dynamically replaces the values if they exist in the replacement table.
This is what my query design looks like but it doesn't work:
This is my query code:
SELECT
b.Pos,
b.Posten,
IsNull(c.Wert_Neu, b.Bez1) AS Bez1,
IsNull(c.Wert_Neu, b.Bez2) AS Bez2,
IsNull(c.Wert_Neu, b.Bez3) AS Bez3,
b.Wert,
b.Einheit
FROM
Table_Values b LEFT JOIN
Table_Replacements c ON b.Bez1 = c.Wert_Alt AND b.Bez2 = c.Wert_Alt AND b.Bez3 = c.Wert_Alt
Where is my logical error? It doesn't replace the values. I assume it has something to do with the joins all going there without OR, but OR would be too costly for performance.
Anyone with a better idea?
Looks like what you want to do is to replace each of the values with the one that appears in the replacement table, but you have three separate columns, and each of those three values will have a different corresponding entry in the replacement table. So you will have to link to that table three different times, once for each value, to link to its replacement, something like:
SELECT
b.Pos,
b.Posten,
IsNull(c.Wert_Neu, b.Bez1) AS Bez1,
IsNull(d.Wert_Neu, b.Bez2) AS Bez2,
IsNull(e.Wert_Neu, b.Bez3) AS Bez3,
b.Wert,
b.Einheit
FROM
Table_Values b
LEFT JOIN Table_Replacements c on b.bez1=c.wert_alt
LEFT JOIN Table_Replacements d on b.bez2=d.wert_alt
LEFT JOIN Table_Replacements e on b.bez3=e.wert_alt
It will be important that your replacement table have an index on wert_alt so that those links can be done efficiently.
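A runnable sketch of the "join the replacement table once per column" idea, for experimenting with small data. It rests on a few assumptions: it uses Python's sqlite3 module rather than SQL Server, so COALESCE takes the place of the two-argument IsNull, the sample rows are made up, and only the columns needed to show the effect are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_Values (Pos INTEGER, Bez1 TEXT, Bez2 TEXT, Bez3 TEXT);
    CREATE TABLE Table_Replacements (Wert_Alt TEXT, Wert_Neu TEXT);
    -- index on the lookup column, as recommended in the answer
    CREATE INDEX idx_wert_alt ON Table_Replacements (Wert_Alt);
    INSERT INTO Table_Values VALUES (1, 'a', 'b', 'c'), (2, 'x', 'b', 'z');
    INSERT INTO Table_Replacements VALUES ('a', 'A2'), ('z', 'Z2');
""")

rows = conn.execute("""
    SELECT b.Pos,
           COALESCE(c.Wert_Neu, b.Bez1) AS Bez1,
           COALESCE(d.Wert_Neu, b.Bez2) AS Bez2,
           COALESCE(e.Wert_Neu, b.Bez3) AS Bez3
    FROM Table_Values b
    LEFT JOIN Table_Replacements c ON b.Bez1 = c.Wert_Alt
    LEFT JOIN Table_Replacements d ON b.Bez2 = d.Wert_Alt
    LEFT JOIN Table_Replacements e ON b.Bez3 = e.Wert_Alt
    ORDER BY b.Pos
""").fetchall()

for row in rows:
    print(row)
```

Each alias (c, d, e) matches independently, so a row is changed only in the columns that actually have a replacement. A single join with all three conditions combined by AND matches only when Bez1, Bez2 and Bez3 all equal the same Wert_Alt, which is why the original query replaced nothing.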
Another possibility is to actually store the replacement values in your main data table. So the fields in it would be:
bez1
bez1Replacement
bez2
bez2Replacement
...
Maybe have a trigger on the table so that on any insert or update, the trigger looks up each of the three replacement values from the replacement table and adds them to the main data record. That would not be exactly normalized, but it would speed up your query. But, you may not need to do that at all. The above query is probably efficient enough if you do have that index.

SQL Query for filtering columns returned

I want to return columns based on some metadata in another table, i.e. I have my table which contains 10 columns, and another table which contains those columns denormalised, with metadata about them.
i.e.
Table - Car:
columns - Make,Model,Colour
and another table called "Flags" which has a row for each of the above columns and each row has a column for "IsSearchable" and "ShowOnGrid" - that sort of thing.
The query I want is one which will return all columns from the Car table that are flagged in the "Flags" table as "ShowOnGrid".
----EDIT
Apologies, I should have stated that this is on SQL Server 2008.
Also, I don't want to have to explicitly list the columns I would like to return, i.e. if I add a column to the Car table, then add it to the Flags table and declare it to be searchable, I don't want to have to name that column in the SQL query; I want it to be pulled through automatically.
You need to use dynamic SQL; this can easily be done with a stored procedure.
Something like this might work:
Select
D.CarID,
Case When D.ShowMake = 1 Then D.Make Else NULL End AS Make,
...
From
(Select
C.CarID, C.Make, C.Model, C.Colour, F.IsSearchable, F.ShowOnGrid, F.ShowMake
From
Cars C
Inner Join
Flags F
On C.CarID = F.CarID) D
I didn't write out all the case statements and don't know how many flags you're working with, but you can give it a try. It would require filtering out the null values in your application. If you actually want the columns omitted based on the flag values, the other answer and comment are both right: either use dynamic SQL or build your query outside in another language first.
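The "build your query outside in another language first" option could look roughly like the sketch below. It is an illustration under assumptions rather than the asker's actual schema: it runs against SQLite through Python's sqlite3 module, and it assumes the Flags table stores the column name in a hypothetical ColumnName column. Names read from Flags are checked against the real columns of Car before being placed into the SQL text, since identifiers cannot be passed as bound parameters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Car (Make TEXT, Model TEXT, Colour TEXT);
    -- ColumnName is an assumed column holding the name of a Car column
    CREATE TABLE Flags (ColumnName TEXT, IsSearchable INTEGER, ShowOnGrid INTEGER);
    INSERT INTO Car VALUES ('Ford', 'Focus', 'Blue');
    INSERT INTO Flags VALUES ('Make', 1, 1), ('Model', 1, 1), ('Colour', 0, 0);
""")


def grid_columns(connection):
    """Return the Car columns flagged ShowOnGrid, keeping only real column names."""
    flagged = [r[0] for r in connection.execute(
        "SELECT ColumnName FROM Flags WHERE ShowOnGrid = 1 ORDER BY rowid")]
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    actual = {r[1] for r in connection.execute("PRAGMA table_info(Car)")}
    return [name for name in flagged if name in actual]


columns = grid_columns(conn)
if not columns:
    raise ValueError("no columns are flagged ShowOnGrid")

# Build the statement text from the validated, quoted column names.
sql = "SELECT " + ", ".join('"' + name + '"' for name in columns) + " FROM Car"
rows = conn.execute(sql).fetchall()

print(sql)
print(rows)
```

Adding a column to Car and a matching row to Flags changes the generated statement without touching the code. On SQL Server the same string-building step would typically live in a stored procedure and be executed with sp_executesql.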