Hive - How to combine many tables having the same suffix?

I want to combine many tables selected according to the year. For the current year (2019), I have tab_h_2016, tab_h_2017 & tab_h_2018. When we reach 2020, we will add tab_h_2019. How can I combine (using UNION) all the tables having the same suffix, in such a way that if a new table is added to the database, it is automatically included?

Calculate table names in the shell and parametrize your script.
Shell:
table1=$(date +"tab_h_%Y" --date " -3 year");
table2=$(date +"tab_h_%Y" --date " -2 year");
table3=$(date +"tab_h_%Y" --date " -1 year");
hive --hiveconf table1="$table1" --hiveconf table2="$table2" --hiveconf table3="$table3" -f your_script.hql
Script your_script.hql:
select * from ${hiveconf:table1}
union all
select * from ${hiveconf:table2}
union all
select * from ${hiveconf:table3};
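If you would rather not hard-code the number of years, the union can be generated from whatever tables currently exist. A rough sketch, assuming the tables all share one schema and that the wildcard pattern works on your Hive version:
tables=$(hive -S -e "show tables like 'tab_h_*'")
query=""
for t in $tables; do
    # chain each table onto the union; the first pass skips the separator
    query="${query:+$query union all }select * from $t"
done
hive -e "$query"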

Related

SQLite query WHERE with OUTER JOIN

I am a bit rusty with my SQL and am running into a little issue with a query. In our application we have two tables relevant to this problem. There are entries, and for each entry there are N steps.
We are trying to optimize our querying, so instead of asking for all entries all the time, we just ask for entries that were updated after we last checked. There can be a lot of steps, so this query is just supposed to return the entries and some step summary data, and we can separately query for steps if needed.
The entry start time and updated time are calculated from the first and most recent process step time respectively. We also have to group together entry statuses.
Here's the query as we build it in python, since it seems easier to read:
statement = 'SELECT e.serial_number, ' + \
            'e.description, ' + \
            'min(p.start_time) begin_time, ' + \
            'group_concat(p.status) status, ' + \
            'max(p.last_updated) last_updated ' + \
            'FROM entries e ' + \
            'LEFT OUTER JOIN process_steps p ON e.serial_number = p.serial_number'
# if the user provides a "since" date, only return entries updated after
# that date
if since is not None:
    statement += ' WHERE last_updated > "{0}"'.format(since)
statement += ' GROUP BY e.serial_number'
The issue we are having is that if we apply that WHERE clause, it filters the process steps too. So for example if we have this situation with two entries:
Entry: 123 foo
Steps:
1. start time 10:00, updated 10:30, status completed
2. start time 11:00, updated 11:30, status completed
3. start time 12:00, updated 12:30, status failed
4. start time 13:00, updated 13:30, status in_progress
Entry: 321 bar
Steps:
1. start time 01:00, updated 01:30, status completed
2. start time 02:00, updated 02:30, status completed
If we query without the where, we would get all entries. So for this case it would return:
321, bar, 01:00, "completed,completed", 02:30
123, foo, 10:00, "completed,completed,failed,in_progress", 13:30
If I used a since time of 12:15, then it would only return this:
123, foo, 12:00, "failed,in_progress", 13:30
In that result, the start time comes from step 3, and the statuses are only from steps 3 and 4. What I'm looking for is the whole entry:
123, foo, 10:00, "completed,completed,failed,in_progress", 13:30
So basically, I want to filter the final results based on that last_updated value, but it is currently filtering the join results as well, which throws off the begin_time, last_updated and status values since they are calculated with a partial set of steps. Any ideas how to modify the query to get what I want here?
Edit:
It seems like there might be some naming issues here too. The names I used in the example code are equal or similar to what we actually have in our code. If we change max(p.last_updated) last_updated to max(p.last_updated) max_last_updated, and change the WHERE clause to use max_last_updated as well, we get OperationalError: misuse of aggregate: max(). We have also tried adding AS keywords in there, with no difference.
Create a subquery that selects updated processes first:
SELECT whatever you need FROM entries e
LEFT OUTER JOIN process_steps p ON e.serial_number = p.serial_number
WHERE e.serial_number in (SELECT distinct serial_number from process_steps
WHERE last_updated > "date here")
GROUP BY e.serial_number
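Applied to the Python above, the filter moves into the subquery; ideally, bind since as a parameter instead of formatting it into the string. A sketch, assuming a standard sqlite3 connection object conn (a name not in the original code):
if since is not None:
    statement += ' WHERE e.serial_number IN ' + \
                 '(SELECT DISTINCT serial_number FROM process_steps ' + \
                 'WHERE last_updated > ?)'
statement += ' GROUP BY e.serial_number'
params = (since,) if since is not None else ()
rows = conn.execute(statement, params).fetchall()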
You can do this with a having clause:
SELECT . . .
FROM entries e LEFT JOIN
process_steps ps
ON e.serial_number = ps.serial_number
GROUP BY e.serial_number
HAVING MAX(ps.last_updated) > <your value here>;
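Combined with the column list from the question, the full query might look like this (the 12:15 literal stands in for the since value):
SELECT e.serial_number,
       e.description,
       MIN(p.start_time) AS begin_time,
       GROUP_CONCAT(p.status) AS status,
       MAX(p.last_updated) AS last_updated
FROM entries e
LEFT JOIN process_steps p ON e.serial_number = p.serial_number
GROUP BY e.serial_number
HAVING MAX(p.last_updated) > '12:15';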

SQL / Access Query - How do I append two records for every one master record? Time Cards

I have a list of records in TBL_WheelHours with the following Schema:
GUID - Operator 1 - Operator 2 - Data1 - Data2 - Data3 - Data4 etc.
I have a set of queries that append all new entries from this table to another table called TBL_CostLog.
What I want to do is create two entries in the cost log that look like this:
TableID - GUID - Operator 1 - Data1 etc.
TableID - GUID - Operator 2 - Data1 etc.
And then I want to be able to run an update query using TBL_WheelHours as the master, so that if any information in that table changes, it propagates to the cost log.
I have many other tables and queries doing this exact same thing, and it's working beautifully. The difference here, though, is that there are two operators on this machine, and only 1 record with both names on it.
Any advice or direction I should pursue to do this?
EDIT:
Here is what I have for the other tables where this is not an issue:
APPEND QUERY
INSERT INTO TBL_TimeLog ( Customer, RefNumber, StartTime, StopTime, Multiplier, FromTable, WorkType, [TableID], ProductID, QtySprayed, CoatDesc, Operator_1, Operator_2 )
SELECT TBL_BlastHours.Customer, TBL_BlastHours.[WO #], TBL_BlastHours.[Start Time], TBL_BlastHours.[Stop Time], "1" AS Expr1, "Blast" AS Expr2, "Blast" AS Expr3, TBL_BlastHours.IDLoc, "NA" AS Expr4, 0 AS Expr5, TBL_BlastHours.Booth, TBL_BlastHours.Blaster, "NA" AS Expr6
FROM TBL_BlastHours
LEFT JOIN TBL_TimeLog ON TBL_BlastHours.IDLoc = TBL_TimeLog.TableID
WHERE (((TBL_TimeLog.TableID) Is Null));
UPDATE QUERY
UPDATE TBL_BlastHours
INNER JOIN TBL_TimeLog
ON TBL_BlastHours.IDLoc = TBL_TimeLog.TableID
SET TBL_TimeLog.Customer = [TBL_BlastHours].[Customer], TBL_TimeLog.RefNumber = [TBL_BlastHours].[WO #], TBL_TimeLog.StartTime = [TBL_BlastHours].[Start Time], TBL_TimeLog.StopTime = [TBL_BlastHours].[Stop Time], TBL_TimeLog.CoatDesc = [TBL_BlastHours].[Booth], TBL_TimeLog.Operator_1 = [TBL_BlastHours].[Blaster], TBL_TimeLog.Operator_2 = "NA"
WHERE (((TBL_TimeLog.FromTable)="Blast"));
I think you want union all:
select guid, operator1, data1
from tbl_wheelhours
union all
select guid, operator2, data1
from tbl_wheelhours;
From your description, you may also need a trigger. However, you say you have similar code working for a single record, so union all might be the missing piece.
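Folded into an append query for the cost log, that looks roughly like the following sketch. The Data columns are abbreviated and QRY_WheelOps is a made-up name; in Access it is usually easiest to save the UNION as its own query and append from that.
QRY_WheelOps:
SELECT [GUID], [Operator 1] AS Operator, Data1, Data2
FROM TBL_WheelHours
UNION ALL
SELECT [GUID], [Operator 2], Data1, Data2
FROM TBL_WheelHours;
APPEND QUERY:
INSERT INTO TBL_CostLog ( [GUID], Operator, Data1, Data2 )
SELECT [GUID], Operator, Data1, Data2
FROM QRY_WheelOps;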
I found a way to get the results I want. I added two columns, OperatorKey and OperatorFlag.
I append new entries as I had been doing, with all new entries having "1" in the operator flag. I then append a second set of entries with the second operator for all entries that have "1" in the operator flag.
I then run an update query that changes all operator flags to "0".
I create a unique OperatorKey for each entry, and then I can run two update queries with the operator key and the GUID key and update the entries from the master list.
Seems to be working great for now.
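A rough sketch of the append-and-flag flow, using assumed column names (the join on [GUID] and the flag values are illustrative). Step 2, appending an Operator 2 row for every entry added in step 1:
INSERT INTO TBL_CostLog ( [GUID], Operator, OperatorFlag )
SELECT c.[GUID], w.[Operator 2], "1"
FROM TBL_CostLog AS c INNER JOIN TBL_WheelHours AS w ON c.[GUID] = w.[GUID]
WHERE c.OperatorFlag = "1";
Step 3, clearing the flags so the next run only picks up genuinely new entries:
UPDATE TBL_CostLog SET OperatorFlag = "0";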

Batch processing versus Single row transactions for atomicity

I have two tables; one to hold records of reports generated, and the other to update a flag that the reports have been generated. This script will be scheduled, and the SQLs have been implemented. However, there are two implementations of the script:
Case 1:
- Insert all the records, then
- Update all the flags,
- Commit if all is well
Case 2:
While (there are records)
- Insert a record,
- Update the flag
- Commit if all is well
Which should be preferred and why?
A transaction in Case 1 covers all the inserts, then all the updates. It's all or nothing. I'm led to believe this is faster, though perhaps not if the connection to the remote database keeps getting interrupted. It requires very little client-side processing, but if the inserts fail midway, we'll have to rerun from the top.
A transaction in Case 2 is one insert plus one update. This requires keeping track of the inserted records and updating those specific records. I'll have to use placeholders, and while the database may cache the SQL and reuse the query execution plan, I suspect this would be slower than Case 1 because of the additional client-side processing. However, on an unreliable connection, which we can assume, this looks like the better choice.
EDIT 5/11/2015 11:31AM
CASE 1 snippet:
my $sql = "INSERT INTO eval_rep_track_dup\#prod \
select ert.* \
from eval_rep_track ert \
inner join \
(
select erd.evaluation_fk, erd.report_type, LTRIM(erd.assign_group_id, '/site/') course_name \
from eval_report_dup\#prod erd \
inner join eval_report er \
on er.id = erd.id \
where erd.status='queue' \
and er.status='done' \
) cat \
on ert.eval_id = cat.evaluation_fk \
and ert.report_type = cat.report_type \
and ert.course_name = cat.course_name";
my $sth = $dbh->prepare($sql) or die "Error with sql statement : $DBI::errstr\n";
my $noterror = $sth->execute() or die "Error in sql statement : " . $sth->errstr . "\n";
...
# update the status from queue to done
$sql = "UPDATE eval_report_dup\#prod \
SET status='done' \
WHERE id IN \
( \
select erd.id \
from eval_report_dup\#prod erd \
inner join eval_report er \
on er.id = erd.id \
where erd.status='queue' \
and er.status='done' \
)";
$sth = $dbh->prepare($sql);
$sth->execute();
eval_rep_track_dup has 3 NUMBER, 8 VARCHAR2 and 1 TIMESTAMP column.
eval_report_dup has 10 NUMBER, 8 VARCHAR2 and 3 TIMESTAMP columns.
If it were up to me, I would use the latter method (Case 2). The principal reason is that if the server or program went down in the middle of processing, you could easily restart the job.
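A minimal sketch of the Case 2 loop with DBI, assuming the ids to process were first collected into @ids and that both statements can be restated per id; the per-id SQL below is illustrative, not the original join:
$dbh->{AutoCommit} = 0;    # take explicit control: one transaction per row
my $ins = $dbh->prepare("INSERT INTO eval_rep_track_dup\#prod SELECT * FROM eval_rep_track WHERE eval_id = ?");
my $upd = $dbh->prepare("UPDATE eval_report_dup\#prod SET status = 'done' WHERE id = ?");
for my $id (@ids) {
    eval {
        $ins->execute($id);
        $upd->execute($id);
        $dbh->commit;      # this row is now durable
        1;
    } or do {
        $dbh->rollback;    # only the current row is undone; a rerun picks it up
        warn "row $id failed: $@";
    };
}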

Optimize this query that exceeds the resource limit

SELECT DISTINCT
A.IDPRE
,A.IDARTB
,A.TIREGDAT
,B.IDDATE
,B.IDINFO
,C.TIINTRO
FROM
GLHAZQ A
,PRTINFO B
,PRTCON C
WHERE
B.IDARTB = A.IDARTB
AND B.IDPRE = A.IDPRE
AND C.IDPRE = A.IDPRE
AND C.IDARTB = A.IDARTB
AND C.TIINTRO = (
SELECT MIN(TIINTRO)
FROM
PRTCON D
WHERE D.IDPRE = A.IDPRE
AND D.IDARTB = A.IDARTB)
ORDER BY C.TIINTRO
I get the error below when I run this query (DB2):
SQL0495N Estimated processor cost of "000000012093" processor seconds
("000575872000" service units) in cost category "A" exceeds a resource limit error
threshold of "000007000005" service units. SQLSTATE=57051
Please help me to fix this problem
Apparently, the workload manager is doing its job in preventing you from using too many resources. You'll need to tune your query so that its estimated cost is lower than the threshold set by your DBA. You would start by examining the query explain plan as produced by db2exfmt. If you want help, publish the plan here, along with the table and index definitions.
To produce the explain plan, perform the following 3 steps:
1. Create the explain tables by executing db2 -tf $INSTANCE_HOME/sqllib/misc/EXPLAIN.DDL
2. Generate the plan by executing the explain statement: db2 explain plan for select ...<the rest of your query>
3. Format the plan: db2exfmt -d <your db name> -1 (note that the second parameter is the digit "1", not the letter "l").
To generate the table DDL statements use the db2look utility:
db2look -d <your db name> -o tables.sql -e -t GLHAZQ PRTINFO PRTCON
Although I am not a DB2 person, I would suspect the query syntax is the same. In your query, you are doing a sub-select based on C.TIINTRO, which can kill performance. You are also querying for all records.
I would start the query by pre-querying the MIN() value, and since you are not using any other field from the "C" alias, leave it out.
SELECT DISTINCT
A.IDPRE,
A.IDARTB,
A.TIREGDAT,
B.IDDATE,
B.IDINFO,
PreQuery.TIINTRO
FROM
( SELECT D.IDPRE,
D.IDARTB,
MIN(D.TIINTRO) TIINTRO
from
PRTCON D
group by
D.IDPRE,
D.IDARTB ) PreQuery
JOIN GLHAZQ A
ON PreQuery.IDPre = A.IDPRE
AND PreQuery.IDArtB = A.IDArtB
JOIN PRTINFO B
ON PreQuery.IDPre = B.IDPRE
AND PreQuery.IDArtB = B.IDArtB
ORDER BY
PreQuery.TIINTRO
I would ensure you have indexes on:
PRTCON  (IDPRE, IDARTB, TIINTRO)
GLHAZQ  (IDPRE, IDARTB)
PRTINFO (IDPRE, IDARTB)
If you really DO need your "C" table, you can just add it as another JOIN, such as:
JOIN PRTCON C
ON PreQuery.IDPre = C.IDPRE
AND PreQuery.IDArtB = C.IDARTB
AND PreQuery.TIIntro = C.TIINTRO
At that point, you might do better with "covering" indexes:
GLHAZQ  (IDPRE, IDARTB, TIREGDAT)
PRTINFO (IDPRE, IDARTB, IDDATE, IDINFO)
This way the index holds every column the query returns, so the values can come straight from the index instead of requiring lookups of the underlying data pages.
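In DDL form, those suggestions would look something like this (the index names are made up):
CREATE INDEX PRTCON_IX1 ON PRTCON (IDPRE, IDARTB, TIINTRO);
CREATE INDEX GLHAZQ_IX1 ON GLHAZQ (IDPRE, IDARTB, TIREGDAT);
CREATE INDEX PRTINFO_IX1 ON PRTINFO (IDPRE, IDARTB, IDDATE, IDINFO);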

Compare current and next record using Hive SQL Query

My RFID tag file has a huge amount of data, grouped by the Date & Time value (every group has multiple tags). I would like to know which Tag # is missing between the 1st set and the 2nd set of the data.
Please help me here…
Sample file:
field names: Tag # | Date & Time
1st line -> 00045512|05-01-2013 12:02:03
2nd line -> 00052450|05-01-2013 12:02:03
The same file continues with a different time, but the 1st line from the set above is missing (sorted by time):
00052450|05-01-2013 13:02:03
Basically, I would like to find the missing tag when my "Date & Time" field changes.
A similar problem is solved in SQL here (link attached):
http://www.milesdennis.com/2011/06/comparing-current-and-previous-records.html
Use left outer join:
select s1.tag, case when s2.tag is null then 1 else 0 end as missing_flag
from
set1 s1
left outer join set2 s2 on (s1.tag=s2.tag)
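Since both sets live in the same file, they can be carved out of one table by the timestamp. A sketch, assuming the file is loaded as a table rfid_tags(tag, read_time) — both names made up:
select s1.tag,
       case when s2.tag is null then 1 else 0 end as missing_flag
from (select tag from rfid_tags where read_time = '05-01-2013 12:02:03') s1
left outer join
     (select tag from rfid_tags where read_time = '05-01-2013 13:02:03') s2
on (s1.tag = s2.tag);
A missing_flag of 1 marks a tag present in the first set but absent from the second.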