BigQuery View to Update Table - sql

I have a logging table of raw data that requires processing, and the processing sometimes needs a destination table set to avoid resource errors.
Currently I am using a BigQuery view to process the data and persist the result in another BigQuery table, with a scheduled query set to overwrite that table.
As the volume of data grows, the cost keeps increasing. How can I restructure this to be more efficient, or follow better practice, in order to save cost?
My current BigQuery view logic looks like this:
with latest_timestamp as (
  select max(timestamp) as latest from persist_table
)
select col1, col2, col3 from logging_table where timestamp >= (select latest from latest_timestamp)
union all
select * from persist_table where timestamp < (select latest from latest_timestamp)
I have to use the timestamp because it is the partition column, and to avoid duplicate or missing data in the result.
I am not sure whether there is a better way to do this, so I am open to any suggestions.

The following steps make the scheduled query insert only the new rows, so you avoid reading and rewriting the entire table every time. Keep in mind that BigQuery charges based on the bytes read, so partitioning the tables and not scanning the full table on every run is where the cost savings come from.
Ensure both tables (logging_table and persist_table) are partitioned by timestamp if they are not already; this greatly reduces the amount of data that has to be read (see the DDL sketch after these steps).
Change your scheduled query to the following:
with latest_timestamp as (
  select max(timestamp) as latest from persist_table
)
select col1, col2, col3
from logging_table
where timestamp > (select latest from latest_timestamp)
union all
(
  select t1.col1, t1.col2, t1.col3
  from (select col1, col2, col3
        from logging_table
        where timestamp = (select latest from latest_timestamp)) t1
  left join (select *
             from persist_table
             where timestamp = (select latest from latest_timestamp)) t2
    on t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3
  where t2.col1 is null
)
Then change the scheduled query's write preference from Overwrite table to Append to table.
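For the partitioning step, BigQuery cannot add partitioning to an existing table in place, so one option is to create a partitioned copy and swap it in. A minimal sketch, assuming a dataset named mydataset (the dataset name and the copy-and-swap approach are illustrative, not part of the original setup):
-- Create a timestamp-partitioned copy of the existing table;
-- repeat the same for logging_table.
create table mydataset.persist_table_partitioned
partition by date(timestamp)
as select * from mydataset.persist_table;
Once the copy is verified, drop the old table and recreate persist_table from the partitioned copy (or point the view and scheduled query at the new table name).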

Related

Access DB: Add new column of row number within group

I want to add a new column in an Access DB to hold an ordering.
I want a query that creates a column called Num, which increases (or decreases) sequentially based on the key1 column, as shown in the picture below.
I would not recommend actually storing this derived information. Instead, you can compute it on the fly whenever needed. In MS Access, where window functions are not available, the simplest approach is probably a correlated subquery.
If you are going to use this on a regular basis, you could create a view:
create view myview as
select
    key,
    value1,
    1 + (select count(*) from mytable t1 where t1.key = t.key and t1.value1 < t.value1) as num
from mytable t
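If this will be used regularly, the view can then be queried like an ordinary table; a trivial usage sketch, reusing the names from the view above:
select key, value1, num from myview order by key, num;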
Try the query below:
SELECT t.*,
       (SELECT Count(*)
        FROM Table1 AS t2
        WHERE t2.value1 <= t.value1 AND t2.key1 = t.key1) AS Num
FROM Table1 AS t;

Most efficient way to find distinct records, retaining unique ID

I have a large dataset stored in a SQL Server table, with one unique ID and many attributes. I need to select the distinct attribute records, along with one of the unique IDs associated with each unique combination.
Example dataset:
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
3|big|blue|ball
4|small|red|ball
Example goal (2, 3, 4 would also have been acceptable):
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
4|small|red|ball
I have tried a few different methods, but all of them seem to be taking very long (hours), so I was wondering if there was a more efficient approach. Failing this, my next idea is to partition the table.
I have tried:
Using Where exists, e.g.
SELECT * from Table as T1
where exists (select *
from table as T2
where
ISNULL(T1.ID,'') <> ISNULL(T2.ID,'')
AND ISNULL([T1].[Col1],'') = ISNULL([T2].[Col1],'')
AND ISNULL([T1].[Col2],'') = ISNULL([T2].[Col2],'')
)
MAX(ID) and Group By Attributes.
GROUP BY Attributes, having count > 1.
How about just using group by?
select min(id), col1, col2, col3
from t
group by col1, col2, col3;
This will probably take a while. This might be more efficient:
select t.*
from t
where t.id = (select min(t2.id)
from t t2
where t.col1 = t2.col1 and t.col2 = t2.col2 and . . .
);
This requires an index on t(col1, col2, col3, . . ., id). Given your request, that is on all columns.
In addition, this will not work for columns that are NULL. Some databases support the ANSI-standard IS NOT DISTINCT FROM for null-safe comparisons. If yours does, then it should use the index for this construct as well.
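As a sketch of the suggested index, assuming SQL Server, a table named t, and only the three attribute columns shown in the example (a real table would list all attribute columns before id):
-- Index covering the attribute columns plus id, so the correlated MIN(t2.id)
-- lookup can be satisfied by an index seek instead of a table scan.
CREATE INDEX IX_t_attributes ON t (col1, col2, col3, id);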
SELECT Id, Col1, Col2, Col3
FROM (
    SELECT Id, Col1, Col2, Col3,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY Id) AS valid
    FROM Table AS T1
) t
WHERE valid = 1
Hope this helps...

How to keep records in an Oracle table based on the nearest date

I have an Oracle table T in which there are multiple records with different start dates. I would like to delete all but the one with the greatest date among rows sharing the same combination of col1, col2, col3. In this example, I want to keep the row dated 31-May-17 and delete the other two. What would be the best way to achieve this in a single query, without creating another staging table?
Test scripts -
create table t
(col1 number(10)
,col2 number(10)
,col3 number(10)
,col4 number(10)
,col5 date
);
insert into t values (15731,467,4087,14427,to_date('09-Apr-17','DD-Mon-RR'));
insert into t values (15731,467,4087,17828,to_date('31-May-17','DD-Mon-RR'));
insert into t values (15731,467,4087,15499,to_date('16-Apr-17','DD-Mon-RR'));
commit;
select * from t;
Based on the data above, I would like to keep only the record where the date is 31-May-17, since that is the greatest of the dates among rows with the same combination of col1, col2, col3, and delete the remaining two from the table. Note that there are millions of other records like this in the table.
Apologies if this is too naive a question for Oracle experts, but I am very new to working with an Oracle DB here.
You can order the rows by the absolute value of the difference between the date and sysdate. You can then use the rowid pseudocolumn to correlate between the query and the delete statement:
DELETE FROM t
WHERE rowid NOT IN (SELECT r
                    FROM (SELECT rowid AS r,
                                 ROW_NUMBER() OVER
                                   (PARTITION BY col1, col2, col3
                                    ORDER BY ABS(SYSDATE - col5) ASC) AS rn
                          FROM t)
                    WHERE rn = 1)
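If you literally want to keep the greatest col5 per (col1, col2, col3) group, as in the sample data, rather than the date nearest to today, the same pattern applies with the ordering changed; a sketch under that assumption:
-- keep only the row with the latest col5 in each group; delete everything else
DELETE FROM t
WHERE rowid NOT IN (SELECT r
                    FROM (SELECT rowid AS r,
                                 ROW_NUMBER() OVER
                                   (PARTITION BY col1, col2, col3
                                    ORDER BY col5 DESC) AS rn
                          FROM t)
                    WHERE rn = 1)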
Since this is tagged oracle12c, you might as well take advantage of its features. For example, using MATCH_RECOGNIZE:
delete from t
where rowid not in (
select rowid
from t
match_recognize (
partition by col1, col2, col3
order by col5 desc
all rows per match
pattern ( ^ a x* )
define x as x.col5 = a.col5
)
)
;
This assumes you want to keep all the rows "tied" for latest start-date for a given combination of COL1, COL2, COL3. The solution can be adapted for variations of the requirement.

Hive: Select all rows with a range from the max of a column

So I am trying to write a query in Hive that will then be automated. The idea is that I have a table of requests with a timestamp field called updated, so there are a lot of rows with the date and time at which each request was made. Regardless of when the query is run, I want to get the requests from the last 7 days.
I tried:
SELECT col1, col2, col3, count(*) cnt
FROM table
WHERE updated BETWEEN date_sub(SELECT MAX(updated) AS maxdate FROM table, 7)
AND SELECT MAX(updated) AS maxdate FROM table
GROUP BY col1, col2, col3
HAVING cnt > 10
I have looked this over and it seems like it should do what I am looking for; however, I get:
ParseException line 4:79 cannot recognize input near 'select' 'max' '(' in function specification
Any help on this error, or a suggested different approach, would be great.
Can you try this query, if the data type of the column "updated" is datetime in all tables:
SELECT col1, col2, col3, count(*) cnt
FROM table
WHERE updated BETWEEN (SELECT MAX(updated)-7 AS maxdate FROM table)
AND (SELECT MAX(updated) AS maxdate FROM table)
GROUP BY col1, col2, col3
HAVING count(*) > 10
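If Hive still complains about the scalar subqueries in the WHERE clause (older Hive versions do not support them there), one workaround is to compute the max once and cross join it in. A sketch, reusing the table and column names from the question and assuming updated is a type that date_sub accepts:
SELECT t.col1, t.col2, t.col3, count(*) AS cnt
FROM table t
-- single-row subquery with the latest timestamp, joined to every row
CROSS JOIN (SELECT max(updated) AS maxdate FROM table) m
WHERE t.updated BETWEEN date_sub(m.maxdate, 7) AND m.maxdate
GROUP BY t.col1, t.col2, t.col3
HAVING count(*) > 10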

Using SQL Merge or UPDATE / INSERT

I have a table (Customer_Master_File) that needs to be updated from flat files dumped into a folder. I have an SSIS package that picks up the flat files and imports them into a temp table (temp_Customer_Master_File).
What I have been unable to do is this:
for each record in the temp table, if the Customer_Number exists in the master table, update it; if not, insert it.
I'm updating all fields of the record, not looking for individual field changes.
I tried the SQL Merge function but it errors when there is more than one record in the source data.
The flat files contain changes to the customer record, and there could be more than one change at a time. I just want to process each record with inserting or updating as necessary.
I also tried doing an INSERT INTO MASTER_FILE FROM TEMP_TABLE WHERE CUSTOMER_NUMBER NOT IN MASTER_FILE but this also fails with a PK error when it hits a duplicate source row.
UPDATE m SET
col2 = t.col2,
col3 = t.col3 -- etc. - all columns except Customer_Number
FROM dbo.Master_File AS m
INNER JOIN
(
SELECT
Customer_Number, rn = ROW_NUMBER() OVER
(
PARTITION BY Customer_Number ORDER BY [timestamp_column] DESC
), col2, col3, ... etc ...
FROM dbo.Temp_Table
) AS t
ON m.Customer_Number = t.Customer_Number
WHERE t.rn = 1;
INSERT dbo.Master_File(Customer_Number, col2, col3, ...etc...)
SELECT Customer_Number, col2, col3, ...etc...
FROM
(
SELECT
Customer_Number, rn = ROW_NUMBER() OVER
(
PARTITION BY Customer_Number ORDER BY [timestamp_column] DESC
),
col2, col3, ...etc...
FROM dbo.Temp_Table AS t
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.Master_File AS m
WHERE m.Customer_Number = t.Customer_Number
)
) AS x WHERE rn = 1;
This takes care of multiple rows in the source table that don't already exist in the destination. I've made an assumption about column names which you'll have to adjust.
MERGE may be tempting, however there are a few reasons I shy away from it:
the syntax is daunting and hard to memorize...
you don't get any more concurrency than the above approach unless you intentionally add specific locking hints...
there are many unresolved bugs with MERGE and probably many more that have yet to be uncovered...
I recently published a cautionary tip here as well and have collected some other opinions here.
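On the concurrency point: a minimal sketch of running both statements as one atomic operation, assuming SQL Server and the same placeholder table and column names as above. The UPDLOCK/HOLDLOCK hints keep another session from slipping in an insert for the same Customer_Number between the UPDATE and the INSERT; this is an illustration of the locking-hint idea, not part of the original answer:
BEGIN TRANSACTION;

-- Update existing customers; the hints hold key-range locks until commit.
UPDATE m SET
    col2 = t.col2,
    col3 = t.col3
FROM dbo.Master_File AS m WITH (UPDLOCK, HOLDLOCK)
INNER JOIN
(
    SELECT Customer_Number, col2, col3,
           rn = ROW_NUMBER() OVER (PARTITION BY Customer_Number ORDER BY [timestamp_column] DESC)
    FROM dbo.Temp_Table
) AS t
ON m.Customer_Number = t.Customer_Number
WHERE t.rn = 1;

-- Insert the customers that were not matched by the update.
INSERT dbo.Master_File (Customer_Number, col2, col3)
SELECT Customer_Number, col2, col3
FROM
(
    SELECT Customer_Number, col2, col3,
           rn = ROW_NUMBER() OVER (PARTITION BY Customer_Number ORDER BY [timestamp_column] DESC)
    FROM dbo.Temp_Table AS t
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Master_File AS m WHERE m.Customer_Number = t.Customer_Number)
) AS x
WHERE rn = 1;

COMMIT TRANSACTION;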