BigQuery update running median on second table

BigQuery update running median on second table - google-bigquery

I have several tables that contain sensordata that consist of 2 colums:
timestamp | value
I want a second table that adds a new line when a new value comes into one of the above tables, that takes the last xx values, calculates the median, and adds this line.
How is this best done in BQ / Scheduled query? I know how to calculate a median, just hardest is to do the append 'on schedule'.

For append data into a table in a scheduled query, you require to set the Append to table option in the Destination table write preference when the scheduled query is configured in the by BigQuery UI:

Related

how to add column after dividing working hours in SQL

My bigquery table has a start time and an finish time. I want to add number 1 to the start time, but I want to do it as much as the end time. How can I do that?
this is my bigquery table and I want to show this table to below table.
For example
MY bigquery
Results
----> Result

Automatically add date for each day in SQL

I'm working on BigQuery and have created a view using multiple tables. Each day data needs to be synced with multiple platforms. I need to insert a date or some other field via SQL through which I can identify which rows were added into the view each day or which rows got updated so only that data I can take forward each day instead of syncing all every day. Best way I can think is to somehow add the the current date wherever an update to a row happens but that date needs to be constant until a further update happens for that record.
Ex:
Sample data
Say we get the view T1 on 1st September and T2 on 2nd. I need to to only spot ID:2 for 1st September and ID:3,4,5 on 2nd September. Note: no such date column is there.I need help in creating such column or any other approach to verify which rows are getting updated/added daily

You can create a BigQuery schedule queries with frequency as daily (24 hours) using below INSERT statement:
INSERT INTO dataset.T1
SELECT
*
FROM
dataset.T2
WHERE
date > (SELECT MAX(date) FROM dataset.T1);

Your table where the data is getting streamed to (in your case: sample data) needs to be configured as a partitioned table. Therefor you use "Partition by ingestion time" so that you don't need to handle the date yourself.
Configuration in BQ
After you recreated that table append your existing data to that new table with the help of the format options in BQ (append) and RUN.
Then you create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
SELECT
*,
ROW_NUMBER() OVER (GROUP BY invoice_id ORDER BY _PARTITIONTIME desc) AS rank
FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
Always use the view from that on.

How do I get the last update time of a sequence of tables in BigQuery?

A BigQuery best practice is to split timeseries in daily tables (as "NAME_yyyyMMdd") and then use Table Wildcards to query one or more of these tables.
Sometimes it is useful to get the last update time on a certain set of data (i.e. to check correctness of the ingestion procedure). How do I get the last update time over a set of tables organized like that?

A good way to achieve that is to use the __TABLES__ meta-table. Here is a generic query I use in several projects:
SELECT
MAX(last_modified_time) LAST_MODIFIED_TIME,
IF(REGEXP_MATCH(RIGHT(table_id,8),"[0-9]{8}"),LEFT(table_id,LENGTH(table_id) - 8),table_id) AS TABLE_ID
FROM
[my_dataset.__TABLES__]
GROUP BY
TABLE_ID
It will return the last update time of every table in my_dataset. For tables organized with a daily-split structure, it will return a single value (the update time of the latest table), with the initial part of their name as TABLE_ID.

SELECT *
FROM project_name.data_set_name.INFORMATION_SCHEMA.PARTITIONS
where table_name='my_table';
Solution for Google

How to add a new column with existing and new rows to a table?

I have a table that I created with a unique key and each other column representing one day of December 2014 (eg named D20141226 for data from 26/12/2014). So the table consists of 32 columns (key + 31 days). These daily columns are indicating that a customer had a transaction on that specific day or no transaction is indicated by a 0.
Now I want to execute the same query on a daily basis, producing a list of unique keys that had a transaction on that specific day. I used this easy script:
CREATE TABLE C01012015 AS
SELECT DISTINCT CALLING_ISDN AS A_PARTY
FROM CDRICC_012015
WHERE CALL_STA_TIME ::date = '2015-01-01'
Now my question is, how can I add the content of the new daily table to the existing table with the 31 days, making it effectively a table with 32 days of data (and then continue to do so on a daily basis to store up to 360 days of data)?
Please note that new customer are doing transactions every day hence there will unique keys in the daily table that aren't in the big table holding all the previous days.
It would be ideal if those new rows would automatically get a 0 instead of a NULL but I can work around it if it gets a NULL value (not sure how to make sure it gets a 0 instead).
I thought that a FULL OUTER JOIN would be the solution but that would mean that I have to list all variables in the select statement, which becomes quite large as I add one more column each day. Is there a more elegant way to do this?
Or is SQL just not suited to this and a programming language like eg R would be much better at this?

If you have the option to change your schema completely, you should unpivot your table so that your columns are something like CUSTOMER_ID INTEGER, D DATE, DID_TRANSACTION BOOLEAN. There's a post on the Enzee Community website that suggests using a user-defined table function (UDTF) to do this. If you change your schema in this way, a simple insert will work just fine and there will be no need to add columns dynamically.
If you can't change your schema that much but you're still able to add columns, you could add a column for every day of the year up front with a default value of FALSE (assuming it's a boolean column representing whether the customer had a transaction or not on that day). You probably want to script this.
ALTER TABLE table_with_daily_columns MODIFY COLUMN (D20140101 BOOLEAN DEFAULT FALSE);
ALTER TABLE table_with_daily_columns MODIFY COLUMN (D20140102 BOOLEAN DEFAULT FALSE);
-- etc
ALTER TABLE table_with_daily_columns ADD COLUMN (D20150101 BOOLEAN DEFAULT FALSE);
GROOM TABLE table_with_daily_columns;
When you alter a table like this, Netezza creates a new table and an internal view that does a UNION of the new table and the old. You need to GROOM the table to merge the tables back into a single one for improved performance.
If you really must keep one column per day, then you'll have to use the method you described to pivot the data from your daily transaction table. Set the default value for each of your columns to 0 or FALSE as described above, then:
INSERT INTO table_with_daily_columns
SELECT
cust_id,
TRUE as D20150101
FROM C01012015;

Merge Statement VS Lookup Transformation

I am stuck with a problem with different views.
Present Scenario:
I am using SSIS packages to get data from Server A to Server B every 15 minutes.Created 10 packages for 10 different tables and also created 10 staging table for the same. In the DataFlow Task it is selecting data from server A with ID greater last imported ID and dumping them onto a Staging table.(Each table has its own stagin table).After the DataFlow task I am using a MERGE statement to merge records from Staging table to Destination table where ID is NO Matched.
Problem:
This will take care all new records inserted but if once a record is picked by SSIS job and is update at the source I am not able to pick it up again and not able to grab the updated data.
Questions:
How will I be able to achieve the Update with impacting the source database server too much.
Do I use MERGE statement and select 10,000 records every single run?(every 15 minutes)
Do I use LookUp transformation to do the updates
Some tables have more than 2 million records and growing, so what is the best approach for them.
NOTE:
I can truncate tables in destination and reinsert complete data for the first run.
Edit:
The Source has a column 'LAST_UPDATE_DATE' which I can Use in my query.

If I'm understanding your statements correctly it sounds like you're pretty close to your solution. If you currently have a merge statement that includes the insert (where source does not match destination) you should be able to easily include the update statement for the (where source matches destination).
example:
MERGE target_table as destination_table_alias
USING (
SELECT <column_name(s)>
FROM source_table
) AS source_alias
ON
[source_table].[table_identifier] = [destination_table_alias].[table_identifier]
WHEN MATCHED THEN UPDATE
SET [destination_table_alias.column_name1] = mySource.column_name1,
[destination_table_alias.column_name2] = mySource.column_name2
WHEN NOT MATCHED THEN
INSERT
([column_name1],[column_name2])
VALUES([source_alias].[column_name1],mySource.[column_name2])
So, to your points:
Update can be achieved via the 'WHEN MATCHED' logic within the merge statement
If you have the last ID of the table that you're loading, you can include this as a filter on your select statement so that the dataset is incremental.
No lookup is needed with the 'WHEN MATCHED' is utilized.
utilizing a select filter in the select portion of the merge statement.
Hope this helps

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

BigQuery update running median on second table - google-bigquery

For append data into a table in a scheduled query, you require to set the Append to table option in the Destination table write preference when the scheduled query is configured in the by BigQuery UI:

Related

how to add column after dividing working hours in SQL

Automatically add date for each day in SQL

How do I get the last update time of a sequence of tables in BigQuery?

How to add a new column with existing and new rows to a table?

Merge Statement VS Lookup Transformation

Categories

Resources