SQL for appending rows based on max date

This is more of a logic question, as I am having a hard time wrapping my head around it.
Say I have table 1, which is truncated and repopulated every day, with a timestamp column added to it. Every day new records are added to the table.
Table 1 is copied to table 2 initially; however, on subsequent runs I only want to add the new records from table 1 into table 2.
I know this will be a mixture of matching the columns and only importing rows newer than the max date, but I am confused about the actual logic of the query.
So, in short, I want to append only the latest rows from table 1 to table 2, based on the max date.

If you want to sync the tables daily, you may just look for timestamp_column > current_date.
If you want to get the max dates, you can write something like this:
INSERT INTO table2 (x, y, z, timestamp_column)
SELECT x, y, z, current_timestamp() FROM table1
WHERE timestamp_column >
  (SELECT IFNULL(MAX(timestamp_column), '0001-01-01') FROM table2);
On the other hand, I think Snowflake streams are a very good fit for this task:
https://docs.snowflake.com/en/user-guide/streams-intro.html
You can create an "Append-only" stream on table1, and use it as a source when synchronizing to table2.
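For example, a minimal sketch of the stream approach (Snowflake SQL; the column names x, y, z and timestamp_column are carried over from the example above, not from the question):

-- Append-only stream: records only rows inserted into table1 since the stream
-- was last consumed.
CREATE OR REPLACE STREAM table1_append_stream ON TABLE table1 APPEND_ONLY = TRUE;

-- On each run, copy whatever the stream has accumulated. Consuming the stream
-- in a DML statement advances its offset, so the next run only sees newer rows.
INSERT INTO table2 (x, y, z, timestamp_column)
SELECT x, y, z, timestamp_column
FROM table1_append_stream;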

Related

How to stop partition column appearing last in a SELECT * output, in Hive?

In Apache Hive, I'm trying to copy specific rows from one table to a second table that's identical apart from an additional string column (which I'm calling "report-type") at the end of the second table. Both tables are partitioned by a string field called 'dt' which has a date e.g. "2022-08-04". When I try and copy a row from table 1 to table 2, the data is inserted into table 2 with report-type and dt swapped, because the partition column seems to be forcibly listed last.
E.g. INSERT INTO table2 SELECT *, 'some_report_type' FROM table1 WHERE <some criteria>;
gives all the data in table2 in the correct columns, apart from report-type being e.g. "2022-08-04" and dt being e.g. "some_report_type".
Is there any way around this?
The two solutions I can see are: recreate the table without the partitioning (which I'd ideally like to avoid) and keep dt as a regular non-partition column; or specify each of the columns in a column list in the query, though I'm not sure whether that would stop "dt" being forced to the last position, and the main issue with that is I have 830 columns to specify individually.
Thanks
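One way to make the column-list approach work is Hive dynamic partitioning, where the partition column is named in the PARTITION clause and selected last; a rough sketch (the non-partition column names are placeholders, and the columns do still have to be listed):

-- Allow dynamic partition values to come from the SELECT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE table2 PARTITION (dt)
SELECT
  col1,
  col2,              -- ...the remaining non-partition columns from table1...
  'some_report_type',
  dt                 -- the partition column must come last in the SELECT
FROM table1
WHERE <some criteria>;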

Automatically add date for each day in SQL

I'm working on BigQuery and have created a view using multiple tables. Each day the data needs to be synced with multiple platforms. I need to insert a date, or some other field, via SQL through which I can identify which rows were added to the view each day or which rows got updated, so that I only carry that data forward each day instead of syncing everything every day. The best way I can think of is to somehow add the current date wherever a row is updated, but that date needs to stay constant until a further update happens for that record.
Example (sample data):
Say we get the view as T1 on 1st September and as T2 on the 2nd. I need to spot only ID 2 for 1st September and IDs 3, 4, 5 for 2nd September. Note: no such date column exists. I need help creating such a column, or finding any other approach, to identify which rows are updated or added daily.
You can create a BigQuery scheduled query with a daily (24-hour) frequency using the INSERT statement below:
INSERT INTO dataset.T1
SELECT *
FROM dataset.T2
WHERE date > (SELECT MAX(date) FROM dataset.T1);
The table the data is getting streamed into (in your case: the sample data table) needs to be configured as a partitioned table. Use "Partition by ingestion time" so that you don't have to handle the date yourself.
(Screenshot: partitioning configuration in the BigQuery console.)
After you have recreated the table, append your existing data to the new table using the write options in BigQuery (append) and RUN.
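If you prefer SQL over the console, a minimal sketch of creating an ingestion-time partitioned table (the dataset and table names and the amount column are assumptions; invoice_id is taken from the view below):

-- Ingestion-time partitioning: BigQuery fills _PARTITIONTIME automatically on each load.
CREATE TABLE `your_dataset.your_sample_data_table`
(
  invoice_id STRING,
  amount NUMERIC
)
PARTITION BY _PARTITIONDATE;

-- Append the existing data; each insert lands in the partition for its ingestion date.
INSERT INTO `your_dataset.your_sample_data_table` (invoice_id, amount)
SELECT invoice_id, amount
FROM `your_dataset.existing_sample_data`;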
Then you create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS rank
  FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
From then on, always use the view.

Optimal SQL query for querying 31 tables (containing datestamp in tablename)

Fairly new to SQL, and I was stumped by this question I received in an interview recently.
The question was along the lines of: how would you count the total occurrences of 'True' for Column B in July?
The problem was that there was no date or timestamp column in the table. Instead, the table naming convention was defined as "ProductX_YYYYMMDD", the assumption being that a new table is created for each day's data dump.
Is there an efficient query I can write to obtain the 'True' counts of Column B for each table (one which doesn't involve ~30 JOIN or UNION statements to get the answer)?
Use STRING_SPLIT(tableName, '_') to split the name into the product prefix and the date part.
Then take the month from the date part, e.g.:
SELECT RIGHT(LEFT(datePart, 6), 2)   -- 'YYYYMMDD' -> 'YYYYMM' -> 'MM'
Now you have the month (MM) for each table, and for every daily table whose month is '07' you can run
SELECT COUNT(*) FROM dailyTable WHERE ColumnB = 'True'
and add each daily count to a variable.
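A more concrete way to avoid hand-writing ~30 UNIONs is to generate the query from the system catalog. A hedged sketch in T-SQL (SQL Server 2017+ for STRING_AGG; the column name ColumnB and the 'ProductX_' prefix follow the question's wording and may need adjusting):

-- Build one UNION ALL query over every July table, then execute it.
DECLARE @sql NVARCHAR(MAX);

SELECT @sql = STRING_AGG(
    CAST('SELECT COUNT(*) AS cnt FROM ' + QUOTENAME(name)
         + ' WHERE ColumnB = ''True''' AS NVARCHAR(MAX)),
    ' UNION ALL ')
FROM sys.tables
WHERE name LIKE 'ProductX[_]202207%';   -- ProductX_YYYYMMDD tables for July 2022

SET @sql = 'SELECT SUM(cnt) AS total_true FROM (' + @sql + ') AS per_table;';
EXEC sp_executesql @sql;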

Best way to compare two tables in SQL by matching string?

I have a program where the goal is to take data from an API and capture the differences in the data from minute to minute. It involves three tables: Table 1 (for the new data), Table 2 (for the previous minute's data), and a Results table (for the results).
The sequence of the program is like this:
Update table 1 -> Calculate the differences from table 2 and update a "Results" table with the differences -> Copy table 1 to table 2.
Then it repeats! It's simple and it works.
Here is my SQL query:
INSERT INTO Results (symbol, bid, ask, description, Vol_Dif, Price_Dif, Time)
SELECT * FROM (
  SELECT symbol, bid, ask, description, Vol_Dif, Price_Dif, '$now' AS Time FROM (
    SELECT t1.symbol, t1.bid, t1.ask, t1.description,
           (t1.volume - t2.volume) AS Vol_Dif,
           (t1.totalPrice - t2.totalPrice) AS Price_Dif
    FROM `Table_1` t1
    INNER JOIN (
      SELECT id, volume, ask, totalPrice FROM Table_2
    ) t2 ON t2.id = t1.id
  ) AS test
) AS final
The tables are identical in structure, obviously. The primary key is the 'id' field that auto-increments. And as you can see, I am comparing both tables on the basis of these 'id' fields being equal.
The PROBLEM is that the API seems to be inconsistent. One API call will have 50,000 entries. The next one will have 51,000 entries. And the entries are not just added to the end or added to the beginning, they are mixed into the middle.
So, comparing on equal IDs means I am comparing entries for DIFFERENT data if the API calls return a different number of entries.
The data I am trying to get the differences of is the 'bid', 'ask', 'Vol_Dif', and 'Price_Dif' from minute to minute. There are many instances of the same 'symbol', so I couldn't compare on that. The ONLY other way to compare entries from table to table, besides matching IDs, would be matching the "description" fields.
I have tried this. The script is almost the same as above except the end of the query is
ON t2.description = t1.description
The problem is that looking for matching description fields takes 3 minutes for 50,000 entries, whereas looking for matching ID's takes 1 second.
Is there a better, faster way to do what I'm trying to do? Thanks in advance. Any help is appreciated.
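One straightforward thing to try, assuming MySQL (which the backticks and '$now' suggest), is to index the description column on both tables so the string join can use index lookups instead of scanning; a sketch, where the 64-character prefix length is a guess that depends on how long the descriptions are:

-- Prefix indexes on the string join key used by ON t2.description = t1.description.
ALTER TABLE Table_1 ADD INDEX idx_t1_description (description(64));
ALTER TABLE Table_2 ADD INDEX idx_t2_description (description(64));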

How can I block bad row from a delete query

I have a query that moves year-old rows from one table to an identical "archive" table.
Sometimes, invalid dates get entered into a dateprocessed column (used to evaluate whether the row is more than a year old), and the query errors out. I want to essentially "screen" the bad rows, i.e. those where isdate(dateprocessed) does not equal 1, so that the query does not try to archive them.
I have a few ideas about how to do this, but want to do it in the absolute simplest way possible. If I select the good data into a temp table in my stored procedure, then inner join it with the live table, and then run the delete from live with output to archive, will it delete from the underlying live table or from the new joined table?
Is there a better way to do this? Thanks for the help. I am a .NET programmer playing DBA, but really want to do this properly.
Here is the query that errors when some of the dateprocessed column values are invalid:
delete from live
output deleted.* into archive
where isdate(dateprocessed) = 1
and cast (dateprocessed as datetime) < dateadd(year, -1, getdate())
and not exists (select * from archive where live.id = archive.id)
The simplest thing to do is:
1. Select the correct records into a temp table. One of the fields you copy into the temp table should be a unique identifier, like an "ID" column.
2. Do any additional processing in the temp table.
3. Archive from the temp table to the archive table.
4. Delete from the live table with a join to the temp table on the "ID" column, as sketched below. This will ensure no mistakes are made.
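A rough sketch of those steps in T-SQL (only the id and dateprocessed columns are taken from the question; everything else is assumed):

-- 1. Screen the good rows into a temp table. The CASE guards the CAST so it only
--    runs when ISDATE passes; a plain AND can be reordered by the optimizer and
--    still raise a conversion error on the bad rows.
SELECT *
INTO #to_archive
FROM live
WHERE CASE WHEN ISDATE(dateprocessed) = 1
           THEN CAST(dateprocessed AS datetime) END
      < DATEADD(year, -1, GETDATE());

-- 2./3. Do any additional processing, then copy the screened rows into the
--       archive, skipping ids that are already there.
INSERT INTO archive
SELECT t.*
FROM #to_archive t
WHERE NOT EXISTS (SELECT * FROM archive a WHERE a.id = t.id);

-- 4. Delete from the live table only the rows in the temp table, joined on id.
DELETE l
FROM live l
INNER JOIN #to_archive t ON t.id = l.id;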
If you are a .NET guy you could bring all the data down and do a DateTime.TryParse. Better yet, just do it once to populate a real DateTime column. For the dates that don't parse, you could assign a fixed date or NULL. And there are some date strings that .NET will parse that SQL will not (e.g. "November 2010").