I use a scheduled query in BigQuery which appends data from the previous day to a BigQuery table. The data from the previous day is not always available when my query runs, so, to make sure that I have all the data, I need to calculate the last date available in my BigQuery table.
My first attempt was to write the following query:
SELECT *
FROM sourceTable
WHERE date >= (SELECT Max(date) from destinationTable)
When I run this query, only the rows with date >= max(date) are correctly exported. However, the query processes the entire sourceTable, not only the rows from max(date) onward. Therefore, the cost is higher than expected.
I also tried to declare a variable using "DECLARE" & "SET" (https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting). This solution works fine and only the rows from max(date) onward are processed. However, BQ interprets a query with "DECLARE" as a script, so the results can't be exported automatically to a BQ table using scheduled queries.
DECLARE maxDate date;
SET maxDate = (SELECT Max(date) from destinationTable);
SELECT *
FROM sourceTable
WHERE date >= maxDate
Is there another way of doing what I would like? Or a way to declare a variable using "DECLARE" & "SET" in a scheduled query with a destination table?
Thanks!
A scripting query, when scheduled, doesn't currently support setting a destination table. You need to use DDL/DML to make changes to the existing table instead.
DECLARE maxDate date;
SET maxDate = (SELECT Max(date) from destinationTable);
CREATE OR REPLACE TABLE destinationTable AS
SELECT *
FROM sourceTable
WHERE date >= maxDate;
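Note that CREATE OR REPLACE TABLE rebuilds destinationTable and keeps only the filtered rows. If the goal is to append instead, plain DML works inside a script too; a minimal sketch, reusing the table names from the question (INSERT ... SELECT * assumes the two tables have matching column lists):
DECLARE maxDate DATE;
SET maxDate = (SELECT MAX(date) FROM destinationTable);
INSERT INTO destinationTable -- append, keeping existing rows
SELECT *
FROM sourceTable
WHERE date >= maxDate;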
Is destinationTable partitioned? If not, can you recreate it as a partitioned table? If it is a partitioned table, and partitioned on the destinationTable.date column, you could do something like:
SELECT *
FROM sourceTable
WHERE date >= (SELECT MAX(_PARTITIONTIME) from destinationTable)
Since _PARTITIONTIME is a pseudo-column, there is no cost in running the subquery.
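For reference, a minimal sketch of rebuilding the destination as a table partitioned on its date column (the new table name is a placeholder):
CREATE TABLE destinationTable_partitioned
PARTITION BY date
AS
SELECT * FROM destinationTable;
One caveat: _PARTITIONTIME is only defined for ingestion-time partitioned tables; a table partitioned on its own date column has no such pseudo-column, and the subquery would read MAX(date) instead.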
We still can't use scheduled scripting in BigQuery along with a destination table.
But if your source table is date sharded, there is a workaround that achieves the desired result (scanning only the required data, based on a value read from another table):
SELECT *
FROM sourceTable*
WHERE _TABLE_SUFFIX >= (
  SELECT IFNULL(MAX(date), '<default_date>')
  FROM destinationTable)
This will scan only the shards whose suffix is greater than or equal to the maximum date in the destination table. Keep in mind that _TABLE_SUFFIX is a STRING, so the subquery must return a value in the same format as the shard suffix.
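For example, if destinationTable.date is a DATE column and the shards use a 'YYYYMMDD' suffix, a sketch of the conversion (the format string is an assumption):
SELECT *
FROM sourceTable*
WHERE _TABLE_SUFFIX >= (
  SELECT IFNULL(FORMAT_DATE('%Y%m%d', MAX(date)), '<default_date>')
  FROM destinationTable)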
I have a table partitioned on insert_datetime.
The below query results in 250GB processed.
select *
from [table]
If I do the below, it is reduced to 500MB. This is the desired behavior, but ideally I would not need to hard code a timestamp.
select *
from [table]
where insert_datetime > '2019-01-01 00:00:00 UTC'
However, if I use a query to dynamically get the timestamp, it goes back up to 250GB processed:
select *
from [table]
where insert_datetime > (select max(insert_datetime) from [other_table])
How can I dynamically access a slice of the partitioned table?
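One option, following the scripting approach shown earlier, is to compute the timestamp into a variable first, since filtering on a scripting variable does prune the partitions; a minimal sketch with the placeholder table names from the question:
DECLARE maxTs TIMESTAMP;
SET maxTs = (SELECT MAX(insert_datetime) FROM [other_table]);
SELECT *
FROM [table]
WHERE insert_datetime > maxTs;
As noted earlier, such a script can't currently be scheduled with a destination table.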
I am running a query in a job that updates a user list every day in SQL Server.
The query below runs every day to update the data:
if object_id('report.dbo.data') is not null
drop table report.dbo.data
SELECT
UserID,
date
into report.dbo.data
from data a
where date >= '2019-01-01'
and date < getdate()
The objective of this query is to update the user list every day. The problem is that running it every day takes a long time.
For example, I might already have data up to 04/20/2019. Since I run it every day, the query rebuilds everything from 01/01/2019 to 04/25/2019 rather than just adding the new UserIDs from 04/20/2019 - 04/25/2019.
Can you help me with sample code that updates report.dbo.data with only the new data, rather than rerunning the entire query to refresh it all?
Your code drops and recreates the whole table, not only the data (contents) in it. Instead, let's create an empty table report.dbo.data if it doesn't exist, and append only the new data:
if object_id('report.dbo.data') is null
SELECT UserID, date into report.dbo.data from data a where 1=0 -- create empty table if needed

insert into report.dbo.data(UserID, date) -- append new data
(SELECT UserID, date from data a
 where date > (select isnull(max(date), '2018-12-31') from report.dbo.data) -- watermark from the destination table, not the source; the fallback makes the first run load everything from 2019-01-01 on
 and date < getdate())
Instead of hard coding the date as '2019-01-01', pass in max(date) from report.dbo.data (using > rather than >= so the boundary date isn't re-inserted):
Insert into report.dbo.data
SELECT UserID, date from data a where date > (select max(date) from report.dbo.data) and date < getdate()
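If rows for the boundary date can arrive late, a sketch that re-checks that date without creating duplicates (this assumes UserID plus date identifies a row, which the question doesn't state):
insert into report.dbo.data(UserID, date)
select a.UserID, a.date
from data a
where a.date >= (select max(d.date) from report.dbo.data d)
and a.date < getdate()
and not exists (select 1 from report.dbo.data r
                where r.UserID = a.UserID and r.date = a.date) -- skip rows already loaded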
I want to be able to have today's date and time in a table column.
If my table is, say, Table1, it should basically display the current date and time when
SELECT * FROM Table1 is run.
I've tried the following, but they just show the time from the moment I assign the value to the column:
ALTER TABLE Table1
ADD TodaysDate DateTime NOT NULL DEFAULT GETDATE()
and
ALTER TABLE Table1
ADD TodaysDate DateTime
UPDATE Table1
SET TodaysDate = GETDATE()
Hope this is clear. Any help is appreciated.
Thanks
In SQL Server you can use a computed column:
alter table table1 add TodaysDate as (cast(getdate() as date));
(use just getdate() for the date and time)
This adds a "virtual" column that gets calculated every time it is referenced. The use of such a thing is unclear. Well, I could imagine that if you are exporting the data to a file or another application, then it could have some use for this to be built-in.
I hope this clarifies your requirement: SQL Server columns with default values store the value inside the table, so when you select from the table, the stored date and time are displayed.
There are 2 options I see without adding the column to the table itself.
You can use SELECT *, GETDATE() as TodaysDate FROM Table1
You can create a view on top of Table1 with the additional column, like:
CREATE VIEW vw_Table1
AS
SELECT *, GETDATE() as TodaysDate FROM dbo.Table1
Then you can query the view as you mentioned (without a column list):
SELECT * FROM vw_Table1
This will give you the date and time from the moment of the execution of the query.
I have the following query
SELECT q.pol_id
FROM quot q
,fgn_clm_hist fch
WHERE q.quot_id = fch.quot_id
UNION
SELECT q.pol_id
FROM tdb2wccu.quot q
WHERE q.nr_prr_ls_yr_cov IS NOT NULL
For every row in that result set, I want to create a new row in another table (call it table1) and update pol_id in the quot table (from the above result set) with the generated primary key from the inserted row in table1.
table1 has two columns. id and timestamp.
I'm using DB2 10.1.
I've tried numerous things and have been unsuccessful for quite a while. Thanks!
Simple solution: create a new table for the result set of your query with an identity column in it. Then, after running your query, update the pol_id field with the newly generated ID from your result table.
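A minimal sketch of such a table in DB2 (the timestamp column is named ts here, since TIMESTAMP is a type keyword; adjust the names to your schema):
CREATE TABLE table1 (
  id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY, -- auto-generated key
  ts TIMESTAMP NOT NULL DEFAULT CURRENT TIMESTAMP,  -- insert time
  PRIMARY KEY (id)
);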
Alternatively, you can do it more manually by using the ROW_NUMBER() OLAP function, which I often find convenient for creating IDs. For this it is convenient to use a stored procedure which does the following:
get the maximum old id from Table1 and write it into a variable old_max_id (a sketch of this step follows below);
after generating the result set, write the row numbers into table1, maybe by something like
INSERT INTO TABLE1
SELECT ROW_NUMBER() OVER (ORDER BY <whatever-you-want>) -- no PARTITION BY: partitioning by a unique key would make every row number 1
       + OLD_MAX_ID
     , CURRENT TIMESTAMP
FROM (<here comes your SQL query>) AS T
Either write the result set into a table or return a cursor to it. Here you should either use the same ROW_NUMBER statement as above or directly use the ID from Table1.
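For the first step, a minimal sketch of initializing the variable inside the stored procedure (COALESCE covers the case where Table1 is still empty):
DECLARE old_max_id INTEGER;
SET old_max_id = (SELECT COALESCE(MAX(id), 0) FROM table1);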
I am trying to insert into a Hive table from another table that does not have a column for today's date. The partition I am trying to create is at the date level. What I am trying to do is something like this:
INSERT OVERWRITE TABLE table_2_partition
PARTITION (p_date = from_unixtime(unix_timestamp() - (86400*2) , 'yyyy-MM-dd'))
SELECT * FROM table_1;
But when I run this I get the following error:
"cannot recognize input near 'from_unixtime' '(' 'unix_timestamp' in constant"
If I query a table and use one of its columns as the partition value, it works just fine. Any idea how to set the partition date to the current system date in HiveQL?
Thanks in advance,
Craig
What you want here is Hive dynamic partitioning. This allows the partition into which each record is inserted to be determined dynamically as the record is selected. In your case, that decision is based on the date when you run the query.
To use dynamic partitioning your partition clause has the partition field(s) but not the value. The value(s) that maps to the partition field(s) is the value(s) at the end of the SELECT, and in the same order.
When you use dynamic partitions for all partition fields you need to ensure that you are using nonstrict for your dynamic partition mode (hive.exec.dynamic.partition.mode).
In your case, your query would look something like:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_2_partition
PARTITION (p_date)
SELECT
*
, from_unixtime(unix_timestamp() - (86400*2) , 'yyyy-MM-dd')
FROM table_1;
Instead of using the unix_timestamp() and from_unixtime() functions, current_date() can be used to get the current date in 'yyyy-MM-dd' format.
current_date() was added in Hive 1.2.0 (see the official documentation).
The revised query will be:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_2_partition
PARTITION (p_date)
SELECT
*
, current_date()
FROM table_1;
Alternatively, if you are running this from a shell script, you can store the current date in a shell variable, create the partitioned table in Hive using beeline with the partition column, and then pass that variable as the partition value when inserting the data.
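A minimal sketch of that approach, assuming the date is passed in from the shell as a hivevar (for example, beeline --hivevar run_date=2019-04-25 -f insert.hql):
-- insert.hql: static partition whose value is supplied by the shell
INSERT OVERWRITE TABLE table_2_partition
PARTITION (p_date = '${hivevar:run_date}')
SELECT * FROM table_1;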