There is a table1 partitioned by date, and I created a table2 to store data from table1; table2 is also partitioned by date. I want to loop over table1's partitions and select the data into table2 partition by partition. Or is there some other way?
I have a logging table consisting of raw data that requires processing, which sometimes requires setting a destination table to avoid resource errors.
Currently I am using a BigQuery view to process the data and persist the result in another BigQuery table, with a scheduled query set to overwrite that table.
As the volume of data grows, I find that the cost is getting more expensive. How do I restructure this to be more efficient / better practice, in order to save cost?
My current BigQuery View script logic is like this:
with latest_timestamp as (
select max(timestamp) latest from persist_table
)
select col1, col2, col3 from logging_table where timestamp >= (select latest from latest_timestamp)
union all
select col1, col2, col3 from persist_table where timestamp < (select latest from latest_timestamp)
I have to use the timestamp because timestamp is the partition column, and to avoid duplicate/missing data in the result.
Not sure if there is a better way to do this, so I am open to any suggestions.
The following steps should let you insert only the new rows, avoiding reading and reinserting the entire table every time. Keep in mind that BigQuery charges you based on the bytes read, so by using partitioning and not having to read the entire table to reinsert it every time, you save costs.
Ensure all tables (logging_table and persist_table) are partitioned by the timestamp, if they are not already: this greatly reduces the amount of data that needs to be read.
Change your scheduled query to the following:
with latest_timestamp as (
select max(timestamp) latest from persist_table
)
select col1, col2, col3 from logging_table where timestamp > (select latest from latest_timestamp)
union all
(select t1.col1, t1.col2, t1.col3
from (select col1, col2, col3 from logging_table where timestamp = (select latest from latest_timestamp)) t1
left join (select * from persist_table where timestamp = (select latest from latest_timestamp)) t2
on (t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3)
where t2.col1 is null)
And change the scheduled query's write preference from "Overwrite table" to "Append to table".
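As an alternative sketch (not part of the original answer, and assuming persist_table carries the same col1–col3 plus the timestamp partition column), a BigQuery MERGE statement can express the same "append only rows not yet persisted" logic in one step:

```sql
-- Hypothetical sketch: insert rows from logging_table that are not yet in
-- persist_table, considering only timestamps at or after the latest persisted one.
MERGE persist_table T
USING (
  SELECT col1, col2, col3, timestamp
  FROM logging_table
  WHERE timestamp >= (SELECT MAX(timestamp) FROM persist_table)
) S
ON T.timestamp = S.timestamp
   AND T.col1 = S.col1 AND T.col2 = S.col2 AND T.col3 = S.col3
WHEN NOT MATCHED THEN
  INSERT (col1, col2, col3, timestamp)
  VALUES (S.col1, S.col2, S.col3, S.timestamp)
```

Note that MERGE still bills for whatever target partitions it has to read to evaluate the match, so compare bytes billed against the append approach before adopting it.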
I want to do a query that inserts into table3 the result of a union of table1 and table2. But when I try it on beeline:
insert into table table3
select * from (
select * from table1 t1
where
h_time > '2019-05-01 00:00:00'
and t1.id in (select id from table4)
union all
select * from table2 t2
where
h_time > '2019-05-01 00:00:00'
and t2.id not in (select id from table4)
);
Consider that both tables 1 and 2 have the same number of columns and that the datatypes have already been fixed previously.
The result in table3 contains only the rows of table1. And when I swap the positions of tables 1 and 2, I get only the rows of table2.
Anyone have a guess what's happening?
Thanks in advance!
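One thing worth checking (my own assumption; the thread does not confirm it): if table4.id contains any NULL, the predicate `t2.id not in (select id from table4)` never evaluates to TRUE for any row, so the second branch of the union contributes nothing, which matches the symptom described. A null-safe rewrite of that branch:

```sql
-- NOT EXISTS is null-safe, unlike NOT IN against a subquery that can yield NULLs
select * from table2 t2
where h_time > '2019-05-01 00:00:00'
  and not exists (select 1 from table4 t4 where t4.id = t2.id)
```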
It may be a problem with the table's statistics too; the reported count may not match the actual count sometimes. Use `select * from tablename` to check. If the count is as expected, run an ANALYZE statement: it will re-compute and repair the statistics.
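In Hive, the statistics repair suggested above can be sketched as follows (table name assumed from the question):

```sql
-- Recompute table-level statistics so the metastore's row count is accurate
ANALYZE TABLE table3 COMPUTE STATISTICS;
-- Optionally recompute column-level statistics as well
ANALYZE TABLE table3 COMPUTE STATISTICS FOR COLUMNS;
```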
I have a table containing the following columns:
seq_no | detail_no |user_id |guid_key
and another table with the following:
header| guid_key| date_entered| login_details| summary_transaction| trailer
Now I wish to map the two tables together such that the final answer is this way:
The first row should be value from header and subsequent rows should be values from seq_no, detail_no and user_id. There will be multiple rows of seq_no, detail_no and user_id. The last row should be the trailer.
The first table contains multiple rows that I need to reference against multiple rows in the second table. I'm new to SQL programming. I've looked up many-to-many relationships but am unable to find an efficient way to do this. I am using a GUID generator to write unique keys to both tables. However, the key is not unique per row but rather per set of rows, i.e. for a set of data.
SQL is not a presentation tool, which is what you're trying to use it as here. You're also trying to make two result sets present as one (you want columns from table 1 on some rows and columns from table 2 on others). You could do something like this:
create table #temp (ID int identity(1,1), col1 varchar(200), col2 varchar(200), etc.)
insert into #temp (col1, col2, etc.) select seq_no, guid_key, detail_no, user_id from table1
insert into #temp (col1, col2, etc.) select seq_no, guid_key, header, date_entered, login_details, summary_transaction from table1 inner join table2 on table1.guid_key = table2.guid_key where trailer is null
insert into #temp (col1, col2, etc.) select seq_no, guid_key, trailer, date_entered, login_details, summary_transaction from table1 inner join table2 on table1.guid_key = table2.guid_key where trailer is not null
select * from #temp order by col1, ID
So you enter all the headers first, followed by the detail rows, followed by the trailer. Thus, when you order by seq_no and ID, they come out in the desired order.
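The same presentation can also be sketched without a temp table, by tagging each branch of a UNION ALL with an explicit sort key (a variation of my own, not from the original answer; column names are assumed from the question, and the casts keep the union branches type-compatible):

```sql
-- row_type forces header rows first, detail rows next, trailer rows last
select guid_key, col1, col2 from (
  select t2.guid_key, 1 as row_type, t2.header as col1, t2.date_entered as col2
  from table2 t2 where t2.trailer is null
  union all
  select t1.guid_key, 2 as row_type, cast(t1.seq_no as varchar(200)) as col1,
         cast(t1.detail_no as varchar(200)) as col2
  from table1 t1
  union all
  select t2.guid_key, 3 as row_type, t2.trailer as col1, t2.date_entered as col2
  from table2 t2 where t2.trailer is not null
) x
order by guid_key, row_type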
I have two tables, A and B. Both have the same column names. I want to merge these two tables and load the result into table C. Table C also has the same column names as A and B, plus one more column of timestamp type (for capturing the merge time). I don't want duplicates in table C. I tried UNION but am getting duplicate values, because one of the columns in table C is of timestamp data type.
For Example, below is my sample query
insert overwrite table TableC
select field1,field2, unix_timestamp() as field3 from table_A
UNION
select field1,field2, unix_timestamp() as field3 from table_B
The two unix_timestamp() calls return different timestamps (just a millisecond difference), and I am getting duplicate data because of the timestamp.
Is there any other way to get the same timestamp from both calls in the union?
From the Hive documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF):
unix_timestamp() gets the current Unix timestamp in seconds. This function is non-deterministic and prevents proper optimization of queries; it has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP.
All calls of current_timestamp within the same query return the same value (it is evaluated at the start of query execution), so passing it to unix_timestamp() gives both branches the same timestamp:
insert overwrite table TableC
select field1,field2, unix_timestamp(current_timestamp) as field3 from table_A
UNION
select field1,field2, unix_timestamp(current_timestamp) as field3 from table_B
Additional work-arounds
insert overwrite table TableC
select field1,field2,unix_timestamp() as field3
from ( select field1,field2 from table_A
union all select field1,field2 from table_B
) t
group by field1,field2
or
insert overwrite table TableC
select field1,field2,unix_timestamp() as field3
from ( select field1,field2 from table_A
union select field1,field2 from table_B
) t
I have a query that selects from multiple partitions
select * from table partition (P1), table2, table3, table4
union all
select * from table partition (P2), table2, table3, table4
....
select * from table partition (P12), table2, table3, table4
The table has over 30,331,246 entries across the 12 partitions. I need to find a faster way to pull up the result (it now takes approx. 35 min). If I select from the table rather than partition by partition, the time goes over 135 min, but that's not how a query should look.
Could you please assist me in finding another way to do this?