How to query many tables in one shot in Hive? - sql

I have to query and then "union" many tables. I did it manually in Hive, but I'm wondering if there's a more optimal (shorter) way to do it.
We have tables for each month, so instead of doing this for a whole year:
create table t_2019 as
select * from (
  select * from t_jan where ...
  union all
  select * from t_feb where ...
  union all
  select * from t_mar where ...
) t;
Does Hive (or any SQL dialect) allow looping through tables? I've seen FOR and WHILE loop examples in T-SQL, but those run individual queries. In this case I want to union the tables.
#t_list = ('t_jan', 't_feb', 't_mar'...etc)
Then, how do I query each table in #t_list and "union all" the results? Each month has about 800k rows, so it's big, but Hive can handle it.

You can solve this problem with a partitioned Hive table instead of multiple tables.
Ex: table_whole pointing to the HDFS path hdfs://path/to/whole/ with partitions on year and month.
Now you can get data for all months of 2019 using:
select * from table_whole where year = '2019'
If you need data from just one month, say Jan 2019, you can filter by that partition:
select * from table_whole where year = '2019' and month='JAN'
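If it helps, here is a minimal sketch of what that could look like; the columns (id, amount) are made up and would need to be replaced with the real schema of the monthly tables:
-- Hypothetical schema: one table partitioned by year and month instead of t_jan, t_feb, ...
create table table_whole (
  id     bigint,
  amount double
)
partitioned by (year string, month string)
stored as orc
location 'hdfs://path/to/whole/';

-- Copy each existing monthly table into its own partition (repeat per month):
insert into table table_whole partition (year = '2019', month = 'JAN')
select id, amount from t_jan;

insert into table table_whole partition (year = '2019', month = 'FEB')
select id, amount from t_feb;
After the loads, the original "create table t_2019 as select ..." becomes a single query filtered on the year partition.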

Related

Querying all partitioned tables

I have around 600 partitioned tables called table.ga_session. Each table covers one day and has its own unique name; for example, the table for 30/12/2021 is named table.ga_session_20211230. The same goes for the other tables: the naming format is table.ga_session_YYYYMMDD.
Now, when I try to query all of the partitioned tables, I cannot use a command like this (the error says that _PARTITIONTIME is unrecognized):
SELECT
*,
_PARTITIONTIME pt
FROM `table.ga_sessions_20211228`
where _PARTITIONTIME
BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP('2020-01-02')
I also tried this and it does not work:
select *
from between `table.ga_sessions_20211228`
and
`table.ga_sessions_20211229`
I also cannot use FROM 'table.ga_sessions' and apply a WHERE clause to take out a range of time, as that table does not exist. How do I query all of these partitioned tables? Thank you in advance!
You can query using wildcard tables. For example:
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX = '1929'
This will specifically query the gsod1929 table, but the _TABLE_SUFFIX clause can be omitted if desired.
In your scenario you could do:
select *
from `table.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' and '20200102'
For more information see the documentation here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference
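If you also want a column showing which daily table each row came from (similar to the _PARTITIONTIME column in your attempt), _TABLE_SUFFIX can be selected like a regular column. A small sketch, reusing the table names from your question:
select
  *,
  _TABLE_SUFFIX as table_date
from `table.ga_sessions_*`
where _TABLE_SUFFIX between '20190101' and '20200102'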

How to get all columns in SQL after using DISTINCT and UNION functions?

I am trying to write a Query that will combine historical appointment data with live-updating appointment data.
The Live-Updating Data and the Historical Data share all of the same column headers and data types.
The Historical Data set is a static snapshot of 100k-150k rows of data, which I am trying to UNION with the Live-Updating Data to create a Full Data Set.
Since there is some overlap between the Live-Updating Data and the Historical Data, I want to keep only distinct appointment IDs.
Here is the query that I've written:
SELECT
DISTINCT(n.appointment_id)
FROM (
SELECT
* FROM note_data
UNION
SELECT * FROM note_data_historical) as n
FULL OUTER JOIN note_data_historical as historical
on historical.appointment_id = n.appointment_id
FULL OUTER JOIN note_data as live
on live.appointment_id = n.appointment_id
What I am trying to do is to avoid having to write out the couple of dozen column headers, but also not have duplicate rows.
So to summarize, I would like to:
Join Two Data Sets with Overlapping Rows to Get a Complete Data Set
Filter out Overlapping Rows
Get all of the columns to appear (like a SELECT * grouped by or joined on one column)
It sounds like you want something like the following
SELECT *
FROM note_data
UNION ALL
SELECT *
FROM note_data_historical
WHERE note_data_historical.appointment_id NOT IN
(
SELECT appointment_id FROM note_data
)
This gets all of your note_data and note_data_historical rows, unless the note_data_historical.appointment_id exists in note_data. And you don't need to list the columns in your query.
Note that I used a UNION ALL instead of a UNION, but since I don't know your data, I don't know if that's actually reasonable.
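One caveat: if note_data.appointment_id can ever be NULL, the NOT IN subquery will filter out every historical row. If that's a possibility, a NOT EXISTS version of the same idea (same tables and column as above) avoids the problem:
SELECT *
FROM note_data
UNION ALL
SELECT *
FROM note_data_historical AS historical
WHERE NOT EXISTS
(
    SELECT 1 FROM note_data AS live
    WHERE live.appointment_id = historical.appointment_id
)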

Data reconciliation between 2 datasets on SQL

image_table
I currently need to find all the differences between a new_master dataset and a previous one using Oracle SQL. The datasets have the same structure, consist of both integers and strings, and do not have a unique key id unless I select several columns together. You can see an image at the beginning as image_table. I found this code online and I wanted to ask if you have any advice.
SELECT n.*
FROM new_master n
LEFT JOIN old_master o
ON (n.postcode = o.postcode)
WHERE o.postcode IS NULL
ORDER BY n.postcode
In doing so I should get back all the entries from the new_master that are not in the old one.
Thanks
If you are in an Oracle database, there are a couple of queries that can help you find any differences.
Find any records in OLD that are not in NEW.
SELECT * FROM old_master
MINUS
SELECT * FROM new_master;
Find any records in NEW that are not in OLD.
SELECT * FROM new_master
MINUS
SELECT * FROM old_master;
Count number of items in OLD
SELECT COUNT (*) FROM old_master;
Count number of items in NEW
SELECT COUNT (*) FROM new_master;
The COUNT queries are needed in addition to the MINUS queries in case there are duplicate rows with the same column data.
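If it's useful to see both directions in a single result set, the two MINUS queries can be combined and tagged with their origin; this is just a sketch, and the source label column is made up:
-- Rows only in NEW, then rows only in OLD, each tagged with where they came from
SELECT 'NEW_ONLY' AS source, n.* FROM (
  SELECT * FROM new_master
  MINUS
  SELECT * FROM old_master
) n
UNION ALL
SELECT 'OLD_ONLY' AS source, o.* FROM (
  SELECT * FROM old_master
  MINUS
  SELECT * FROM new_master
) o;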

BigQuery: Querying multiple datasets and tables using Standard SQL

I have Google Analytics data that's spread across multiple BigQuery datasets, all using the same schema. I would like to query multiple tables each across these datasets at the same time using BigQuery's new Standard SQL dialect. I know I can query multiple tables within a single database like so:
FROM `12345678`.`ga_sessions_2016*` s
WHERE s._TABLE_SUFFIX BETWEEN '0501' AND '0720'
What I can't figure out is how to query against not just 12345678 but also against 23456789 at the same time.
How about using a simple UNION, with a SELECT wrapping around it (I tested this using the new standard SQL option and it worked as expected):
SELECT
SUM(foo)
FROM (
SELECT
COUNT(*) AS foo
FROM
<YOUR_DATASET_1>.<YOUR_TABLE_1>
UNION ALL
SELECT
COUNT(*) AS foo
FROM
<YOUR_DATASET_2>.<YOUR_TABLE_2>)
I believe that using a table wildcard & union (in BigQuery legacy SQL, a comma achieves the union) will get what you need very quickly, if the tables have the same schema.
select *
from
  (select * from TABLE_DATE_RANGE([dataset1], date1, date2)),
  (select * from TABLE_DATE_RANGE([dataset2], date3, date4)),
  ......
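Since the question asks about Standard SQL, a rough equivalent there is one wildcard query per dataset joined with UNION ALL (this sketch reuses the dataset IDs and suffix range from the question, and assumes both datasets really do share the same schema):
SELECT *
FROM `12345678.ga_sessions_2016*`
WHERE _TABLE_SUFFIX BETWEEN '0501' AND '0720'
UNION ALL
SELECT *
FROM `23456789.ga_sessions_2016*`
WHERE _TABLE_SUFFIX BETWEEN '0501' AND '0720'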

Optimize data retrieval from different servers with more than 10M records

SELECT party_code , max(date) AS date FROM
server1.table1 WITH (nolock) GROUP BY party_code
UNION
SELECT party_code , max(date) AS date FROM
server2.table1 WITH (nolock) GROUP BY party_code
UNION
SELECT party_code , max(date) AS date FROM
server3.table1 WITH (nolock) GROUP BY party_code
As shown above, I have 17 similar tables on different servers, so I union them to get the records. The total data sums up to more than 36 crores (360 million) rows, which affects the execution time and the ability to retrieve records. Can someone help me optimize this, or suggest any other solution?
First, you need a covering index on your tables. So if you do not have it already, create this index on all your tables:
CREATE NONCLUSTERED INDEX IX_Table1_party_code__date
ON server1.table1 (party_code) INCLUDE (date)
Second, replace the UNION operators with UNION ALL. UNION sorts and compares the datasets to remove duplicates, which you don't need if you want to keep the records from each server separately.
If that doesn't help enough, maybe you can look in some of the other options:
Maybe you can first UNION ALL all the records (adding a ServerID column in the process) and then do one GROUP BY on the combined dataset (on party_code and ServerID), but I can't tell for sure whether this will be better or worse (you'll have to test; see the sketch after this list).
Try using indexed views.
Staging tables that will be calculated and filled during the night?
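For the first option above, a sketch of the UNION ALL plus single GROUP BY idea could look like this (the ServerID values are just made-up labels):
SELECT party_code, ServerID, MAX(date) AS date
FROM
(
    SELECT party_code, date, 1 AS ServerID FROM server1.table1 WITH (nolock)
    UNION ALL
    SELECT party_code, date, 2 AS ServerID FROM server2.table1 WITH (nolock)
    UNION ALL
    SELECT party_code, date, 3 AS ServerID FROM server3.table1 WITH (nolock)
    -- ... and so on for the remaining servers ...
) AS t
GROUP BY party_code, ServerID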
Do not use UNION; instead you can use UNION ALL and finally delete duplicate records, if any.
Insert all records into a staging (temp) table and finally delete duplicate records, if any.
If the number of records is huge, you can use SSIS for a faster process.