Updating partitioned and clustered table in BigQuery - sql

I've created a partitioned and clustered BigQuery table covering the period from the start of 2019 up to today. I can't find anything on whether it is possible to update such a table (I would need to add data for each new day). Is it possible, and if so, how?
I've tried searching Stack Overflow and the BigQuery documentation for an answer, with no results so far.

You could use the UPDATE statement to update this data. Your partitioned table will maintain its properties across all operations that modify it, such as DML and DDL statements, load jobs, and copy jobs. For more information, you can check this document.
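As a sketch of the daily append (the project, dataset, table, and column names here are hypothetical, and the table is assumed to be partitioned on event_date):

-- Minimal sketch with hypothetical names: append one day of data.
-- BigQuery routes the rows to the right partition and keeps the
-- clustering spec automatically.
INSERT INTO `myproject.mydataset.mytable` (event_date, user_id, value)
SELECT event_date, user_id, value
FROM `myproject.mydataset.staging`
WHERE event_date = CURRENT_DATE();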
Hope it helps.

Related

How to get table/column usage statistics in Redshift

I want to find which tables/columns in Redshift remain unused in the database in order to do a clean-up.
I have been trying to parse the queries from the stl_query table, but it turns out this is quite a complex task, and I haven't found any library I can use for it.
Does anyone know if this is somehow possible?
Thank you!
The column question is a tricky one. For table usage information I'd look at stl_scan, which records info about every table scan step performed by the system. Each of these is date-stamped, so you will know when the table was "used". Just remember that system logging tables are pruned periodically and the data only goes back a few days, so you may need a process that captures table usage daily to build an extended history.
I pondered the column question some more. One thought is that query ids are also provided in stl_scan, and these could help in identifying the columns used in the query text: for every query id that scans table_A, search the query text for each column name of the table. It wouldn't be perfect, but it's a start.
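As a starting point for the table side, a sketch like this could work (assuming your user can read the system tables):

-- Sketch: last time each table was scanned, per the system scan log.
-- stl_scan is pruned after a few days, so schedule this and store
-- the results if you need a longer history.
SELECT s.tbl AS table_id,
       TRIM(t.name) AS table_name,
       MAX(s.starttime) AS last_scanned
FROM stl_scan s
JOIN stv_tbl_perm t ON s.tbl = t.id
GROUP BY s.tbl, t.name
ORDER BY last_scanned;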

Change clustered columns in an existing bigquery table

I have a partitioned and clustered table in BigQuery. I would like to add another column to the set of clustering columns. I found out that the way to do this is to create another table, as you can see here: Make existing bigquery table clustered. But I can't do that, because my table is the source of a Data Studio dashboard where I have many calculated fields, and I don't want to lose these fields.
Any suggestion? Thanks a lot!
Gustavo.
You don't need a new table. Although changing the clustering columns was not supported initially, it has been supported since early 2020.
Please check this documentation: https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
Unfortunately, the feature is only available through the API right now.
If you're not familiar with the BigQuery API, it doesn't require you to write code; you can interact with the API through its web interface here. For your one-time maintenance task, it may save you some time.
I don't think BigQuery yet allows renaming a table.
Can you use a view? Copy the data to another table with the required modified clustering, then create a view on the new table with the same name as the old table, so that nothing breaks in Data Studio.
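As a sketch (all names hypothetical):

-- Copy the data into a new table with the modified clustering,
-- drop the original, then create a view under the old name so
-- the Data Studio data source keeps working.
CREATE TABLE `myproject.mydataset.mytable_new`
PARTITION BY DATE(event_ts)
CLUSTER BY col_a, col_b, col_c
AS SELECT * FROM `myproject.mydataset.mytable`;

DROP TABLE `myproject.mydataset.mytable`;

CREATE VIEW `myproject.mydataset.mytable`
AS SELECT * FROM `myproject.mydataset.mytable_new`;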

How to insert/update a partitioned table in Big Query

Problem statement:
I need to insert/update a few columns in a BigQuery table that is partitioned by date. So basically I need to make the necessary changes for each partition date (partitioned by day).
(It's the sessions table that is created automatically by linking the GA view to BQ, so I haven't set up the partitioning manually; it's automatically taken care of by Google.)
query reference from google_docs
my query:
I also tried the below:
Can anyone help me here? Sorry, I am a bit naive with BQ.
You are trying to insert into a wildcard table, a meta-table that is actually composed of multiple tables. A wildcard table is read-only and cannot be inserted into.
As Hua said, ga_sessions_* is not a partitioned table, but represents many tables, each with a different suffix.
You probably want to do this then:
INSERT INTO `p.d.ga_sessions_20191125` (visitNumber, visitId)
SELECT 1, 1574

Google Big Query - Date-Partitioned Tables with Eventual Data

Our use case for BigQuery is a little unique. I want to start using date-partitioned tables, but our data is very much eventual. It doesn't get inserted when it occurs, but only later, when it's provided to the server. At times this can be days or even months after the event. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question is: is there a way I can specify a column that would act like the _PARTITION_LOAD_TIME attribute and still have the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using date-partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you will need to do is insert/load each data batch into its respective partition by referencing not just the table name but the table with a partition decorator, like yourtable$20160718.
This way you can load data into the partition it belongs to.
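For example, with the bq command-line tool (the dataset, table, and file names are hypothetical):

# Sketch: route a late-arriving batch into the partition it
# belongs to via the $YYYYMMDD partition decorator.
bq load --source_format=CSV 'yourdataset.yourtable$20160718' batch_20160718.csv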

I want to create a copy of a table, fewer columns than original... but still updated from it

I have a 30 GB table, which has 30-40 columns. I create reports using this table and it causes performance problems. I only use 4-5 columns of this table for the reports, so I want to create a second table for the reports. But the second table must be updated when the original table changes, without using triggers.
No matter what my query is, when it is executed, SQL tries to cache all 30 GB. When the cache is fully loaded, SQL starts to use disk. This is what I actually want to avoid.
How can I do this?
Is there a way of doing this using SSIS?
Thanks in advance.
CREATE VIEW myView
AS
SELECT
column1,
column3,
column4 * column7 AS column4x7 -- an expression in a view must be given a column name
FROM
yourTable
A view is effectively just a stored query, like a macro. You can then select from that view as if it were a normal table.
Unless you go for materialised views, it's not really a table, it's just a query. So it won't speed anything up, but it does encapsulate code and assist in controlling what data different users/logins can read.
If you are using SQL Server, what you want is an indexed view. Create a view using the columns you want and then place an index on them.
An indexed view stores the data in the view. It should keep the view up-to-date with the underlying table, and it should reduce the I/O for reading the table. Note: this assumes that your 4-5 columns are much narrower than the overall table.
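A minimal sketch, assuming hypothetical table and column names (dbo.BigTable with a unique key id):

-- The view must be schema-bound before it can be indexed.
CREATE VIEW dbo.ReportView
WITH SCHEMABINDING
AS
SELECT id, column1, column3, column4
FROM dbo.BigTable;
GO

-- The unique clustered index is what actually materialises the
-- view's data and keeps it in sync with the base table.
CREATE UNIQUE CLUSTERED INDEX IX_ReportView ON dbo.ReportView (id);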
Dems' answer with the view seems ideal, but if you are truly looking for a new table, create it and have it automatically updated with triggers.
Triggers placed on the primary table can be added for all Insert, Update and Delete actions upon it. When the action happens, the trigger fires and can be used to do additional work, such as updating your new secondary table. You will pull from the Inserted and Deleted tables (MSDN).
There are many great existing articles here on triggers:
Article 1, Article 2, Google Search
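As a sketch, assuming hypothetical names (dbo.BigTable with key id, and a narrow dbo.ReportTable holding the reporting columns); deletes would need a similar trigger:

CREATE TRIGGER trg_BigTable_Sync
ON dbo.BigTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Upsert the changed rows into the narrow reporting table,
    -- reading the new row versions from the Inserted pseudo-table.
    MERGE dbo.ReportTable AS target
    USING (SELECT id, column1, column3, column4 FROM Inserted) AS source
        ON target.id = source.id
    WHEN MATCHED THEN
        UPDATE SET column1 = source.column1,
                   column3 = source.column3,
                   column4 = source.column4
    WHEN NOT MATCHED THEN
        INSERT (id, column1, column3, column4)
        VALUES (source.id, source.column1, source.column3, source.column4);
END;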
You can create that second table just like you're thinking, and use triggers to update table 2 whenever table 1 is updated.
However, triggers present performance problems of their own; the speed of your inserts and updates will suffer. I would recommend looking for more conventional ways to improve query performance in what sounds like SQL Server, since you mentioned SSIS.
Since it's only 4-5 out of 30 columns, have you tried adding an index which covers the query? I'm not sure if there are even more columns in your WHERE clause, but you should try that first. A covering index would actually do exactly what you're describing, since the table would never need to be touched by the query. Of course, this does cost a little in terms of space and insert/update performance. There's always a tradeoff.
On top of that, I can't believe that you would need to pull a large percentage of rows out of a 30 GB table for any given report. It's simply too much data for a report to have. A filtered index can improve query performance even more by only indexing the rows that are most likely to be asked for. If you have a report which lists the results for the past calendar month, you could add a condition to only index the rows WHERE report_date > '5/1/2012', for example.
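Combining the covering and filtered ideas, a sketch with hypothetical names:

-- INCLUDE lets the report query be answered from the index alone;
-- the WHERE clause limits the index to the rows reports ask for.
CREATE NONCLUSTERED INDEX IX_BigTable_Reports
    ON dbo.BigTable (report_date)
    INCLUDE (column1, column3, column4)
    WHERE report_date > '20120501';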