Problem statement:
I need to insert/update a few columns in a BigQuery table that is partitioned by date. So basically I need to make the necessary changes for each partitioned date (partitioned by day).
(It's the sessions table that is created automatically by linking the GA view to BQ, so I haven't set up the partitioning manually; it's automatically taken care of by Google.)
Query reference from the Google docs:
My query:
I also tried the below:
Can anyone help me here? Sorry, I am a bit naive with BQ.
You are trying to insert into a wildcard table, a meta-table that is actually composed of multiple tables. Wildcard tables are read-only and cannot be inserted into.
As Hua said, ga_sessions_* is not a partitioned table, but represents many tables, each with a different suffix.
You probably want to do this then:
INSERT INTO `p.d.ga_sessions_20191125` (visitNumber, visitId)
SELECT 1, 1574
Related
How can I create tables with multiple 'generations' (i.e. like on old mainframe environments with JCL)? I've seen this done with the Firebase Analytics sample data.
e.g. I have the following table: mydataset.mytable (7), as listed on the UI.
If I expand the table details, I can see that I can select from the timestamped tables and preview the data in each.
In BigQuery, how can I go around emulating this? This looks REALLY useful.
EDIT: This is better explained with a picture!
Here's the table with the 7 snapshots:
Here, looking at the schema, I can select the snapshot I want to query:
I can't quite work out how to do this.
best wishes
Dave
You can use snapshot decorators for this.
For example, the below gives you the version of the table as of one hour ago:
#legacySQL
SELECT .... FROM [project:dataset.table#-3600000]
In BigQuery Standard SQL, you can use the below syntax:
#standardSQL
SELECT ... FROM `project.dataset.table` FOR SYSTEM_TIME AS OF <timestamp_expression>
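For instance, a concrete sketch of the Standard SQL form (the table name is a placeholder), reading the table as it existed one hour ago:

```sql
-- Sketch, assuming a table `project.dataset.table`:
-- reads the table as of one hour before the current time
SELECT *
FROM `project.dataset.table`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```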
Update for
Here, looking at the schema, I can select the snapshot I want to query
That drop-down represents actual sharded tables rather than snapshots.
Those are just separate tables with a suffix in the form YYYYMMDD.
Whenever your dataset has tables sharing a common prefix with a YYYYMMDD suffix, the Web UI simply "collapses" them (in the UI only; they are still separate tables) into one entry, with the count of actual tables in round brackets ( ).
Then you can select exactly which table you want to deal with from that drop-down (shown in the image in your question).
Hope this helps!
Our use case for BigQuery is a little unique. I want to start using date-partitioned tables, but our data is very much eventual. It doesn't get inserted when it occurs, but eventually, whenever it's provided to the server. At times this can be days or even months after the fact. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question is: is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still get the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using date-partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
All you need to do is insert/load each batch of data into its respective partition by referencing not just the table name but the table with a partition decorator, like yourtable$20160718.
This way you can load data into the partition it belongs to.
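As a sketch (table and file names are made up), a load into a specific partition with the bq CLI looks like:

```shell
# Hypothetical names: load the 2016-07-18 batch directly into that partition
bq load --source_format=CSV 'mydataset.yourtable$20160718' ./batch_20160718.csv
```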
I have a question regarding Oracle.
I know that Oracle only supports aliases down to the first subquery level. This poses a problem when I want to group more than once while updating a table.
Example: I have some server groups and a database containing information about them. One table holds information about the groups, and another stores, with a timestamp (to be exact, I actually used DATE), the workload of specific servers within the groups.
Now, for performance reasons, I have a denormalized field in the server table containing the highest workload the group had within one day.
What I would like to do is something like
update server_group
set last_day_workload=avg(workload1)
from (select max(workload) workload1
from server_performance
where server_performance.server_group_ID_fk=server_group.ID
and time>sysdate-1
group by server_performance.server_group_ID_fk)
Here, ID is the primary key of server_group, and server_group_ID_fk is a foreign key reference from the server_performance table. The solution I am using so far is to write the first join into a temporary table and update from that temporary table in a second statement. Is there a better way to do this?
In this case it isn't such a problem yet, but as the amount of data grows, using a temporary table costs not only time but also a notable amount of RAM.
Thank you for your answers!
If I were you, I would work out the results that I wanted in a select statement, and then use a MERGE statement to do the necessary update.
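As a sketch only (reusing the table and column names from the question, and reading "highest workload within one day" as a per-group MAX), the MERGE could look something like:

```sql
-- Sketch: aggregate once in the USING clause, then update the matching groups
MERGE INTO server_group sg
USING (SELECT sp.server_group_ID_fk AS group_id,
              MAX(sp.workload)      AS max_workload
         FROM server_performance sp
        WHERE sp.time > SYSDATE - 1
        GROUP BY sp.server_group_ID_fk) perf
   ON (sg.ID = perf.group_id)
 WHEN MATCHED THEN
      UPDATE SET sg.last_day_workload = perf.max_workload;
```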
I have an aggregate data set that spans multiple years. The data for each respective year is stored in a separate table named Data. The data is currently sitting in MS Access tables, and I will be migrating it to SQL Server.
I would prefer that data for each year is kept in separate tables, to be merged and queried at runtime. I do not want to do this at the expense of efficiency, however, as each year is approx. 1.5M records of 40ish fields.
I am trying to avoid having to do an excessive number of UNIONS in the query. I would also like to avoid having to edit the query as each new year is added, leading to an ever-expanding number of UNIONs.
Is there an easy way to do these UNIONs at runtime without an extensive SQL query and high system utility? Or, if all the data should be managed in one large table, is there a quick and easy way to append all the tables together in a single query?
If you really want to store them in separate tables, then I would create a view that does that unioning for you.
create view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
)
But to be honest, if you're going to use it like this, why not put all the data into one table? Then, if you wanted, you could create the views the other way around:
create view Data2001
as
(
select * from AllData
where CreateDate >= '1/1/2001'
and CreateDate < '1/1/2002'
)
A single table is likely the best choice for this type of query. However, you have to balance that against the other work the db is doing.
One choice you did not mention is creating a view that contains the unions and then querying the view. That way you only have to add the union statement to the view each year, and all queries using the view will stay correct. Personally, if I did that, I would write a creation script that creates the new year's table and then adjusts the view to add the union for it. Once it was tested and I knew it would run, I would schedule it as a job to run on the last day of the year.
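The yearly adjustment could be as simple as re-issuing the view definition with the new table appended (sketch, reusing the Data2001...Data2003 naming from the other answer):

```sql
-- Sketch: run by the year-end job after Data2004 has been created
ALTER VIEW AllData
AS
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
union all
select * from Data2004
```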
One way to do this is by using horizontal partitioning.
You basically create a partitioning function that tells the DBMS how to split the table into per-period partitions, each with a boundary guaranteeing that it will only contain data for a specific year.
At query execution time, the optimiser can decide whether it is possible to completely ignore one or more partitions to speed up execution time.
The setup overhead of such a scheme is non-trivial, and it only really makes sense if you have a lot of data. Although 1.5 million rows per year might seem like a lot, depending on your query plans it shouldn't be any big deal for a decently specced SQL Server. Refer to the documentation.
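In SQL Server, a minimal sketch of such a scheme (hypothetical names, one partition per year on a CreateDate column) might look like:

```sql
-- Yearly boundaries; RANGE RIGHT puts each boundary date in the partition to its right
CREATE PARTITION FUNCTION pfYearly (date)
AS RANGE RIGHT FOR VALUES ('2001-01-01', '2002-01-01', '2003-01-01');

-- Map every partition to the same filegroup for simplicity
CREATE PARTITION SCHEME psYearly
AS PARTITION pfYearly ALL TO ([PRIMARY]);

-- The table is then created on the scheme, partitioned by CreateDate
CREATE TABLE dbo.AllData
(
    CreateDate date NOT NULL
    -- ...the other ~40 columns...
) ON psYearly (CreateDate);
```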
I can't add comments due to low rep, but I definitely agree with using one table, and partitioning is helpful for large data sets and is supported in SQL Server, where the data is being migrated to.
If the data is heavily used and frequently updated then monthly partitioning might be useful, but if not, given the size, partitioning probably isn't going to be very helpful.
I have a 30 GB table with 30-40 columns. I create reports using this table, and it causes performance problems. I only use 4-5 of its columns for the reports, so I want to create a second table just for reporting. But the second table must be updated when the original table changes, without using triggers.
No matter what my query is, when it executes, SQL Server tries to cache the whole 30 GB. Once the cache is full, it starts going to disk. This is what I want to avoid.
How can I do this?
Is there a way of doing this using SSIS?
Thanks in advance.
CREATE VIEW myView
AS
SELECT
column1,
column3,
column4 * column7
FROM
yourTable
A view is effectively just a stored query, like a macro. You can then select from that view as if it were a normal table.
Unless you go for materialised views, it's not really a table, just a query. So it won't speed anything up by itself, but it does encapsulate code and helps control what data different users/logins can read.
If you are using SQL Server, what you want is an indexed view. Create a view with the columns you want and then place an index on it.
An indexed view stores the data in the view. It should keep the view up-to-date with the underlying table, and it should reduce the I/O for reading the table. Note: this assumes that your 4-5 columns are much narrower than the overall table.
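A sketch of what that looks like (table and column names are placeholders): the view must be schema-bound, and the first index on it must be a unique clustered index, which is what actually materialises the data.

```sql
CREATE VIEW dbo.ReportView
WITH SCHEMABINDING
AS
SELECT id, column1, column3, column4
FROM dbo.BigTable;   -- two-part names are required under SCHEMABINDING
GO

-- Materialises the view's narrow column set, kept in sync automatically
CREATE UNIQUE CLUSTERED INDEX IX_ReportView ON dbo.ReportView (id);
```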
Dems' answer with the view seems ideal, but if you are truly looking for a new table, create it and have it automatically updated with triggers.
Triggers placed on the primary table can fire for all INSERT, UPDATE and DELETE actions upon it. When the action happens, the trigger fires and can do additional work, such as updating your new secondary table. You will pull from the inserted and deleted pseudo-tables (MSDN).
There are many great existing articles here on triggers:
Article 1, Article 2, Google Search
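A rough sketch (all names hypothetical) of a trigger keeping a narrow reporting table in sync; note that an UPDATE shows up as rows in both the deleted and inserted pseudo-tables:

```sql
CREATE TRIGGER trg_BigTable_Sync
ON dbo.BigTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove rows that were deleted or are about to be re-inserted
    DELETE r
    FROM dbo.ReportTable r
    JOIN deleted d ON d.id = r.id;

    -- Copy the narrow column set for new/updated rows
    INSERT INTO dbo.ReportTable (id, column1, column3)
    SELECT id, column1, column3
    FROM inserted;
END;
```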
You can create that second table just like you're thinking, and use triggers to update table 2 whenever table 1 is updated.
However, triggers present performance problems of their own: the speed of your inserts and updates will suffer. I would recommend looking for more conventional ways to improve query performance in SQL Server (which it sounds like you're using, since you mentioned SSIS).
Since it's only 4-5 out of 30 columns, have you tried adding an index that covers the query? I'm not sure whether there are even more columns in your WHERE clause, but you should try that first. A covering index would do exactly what you're describing, since the table would never need to be touched by the query. Of course, this does cost a little in terms of space and insert/update performance. There's always a tradeoff.
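For example (hypothetical names, assuming the report filters on a date column), a covering index would look like:

```sql
-- Key on the filter column; INCLUDE the report's remaining columns
-- so the query never has to touch the base table
CREATE NONCLUSTERED INDEX IX_BigTable_ReportCover
ON dbo.BigTable (report_date)
INCLUDE (column1, column2, column3, column4);
```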
On top of that, I find it hard to believe that any given report would need to pull a large percentage of the rows out of a 30 GB table; that's simply too much data for one report. A filtered index can improve query performance even more by indexing only the rows that are most likely to be asked for. If you have a report that lists results for the past calendar month, you could add a condition to only index the rows WHERE report_date > '5/1/2012', for example.
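A sketch of that filtered variant (same hypothetical names):

```sql
-- Only rows from 2012-05-01 onward are indexed, keeping the index small
CREATE NONCLUSTERED INDEX IX_BigTable_Recent
ON dbo.BigTable (report_date)
INCLUDE (column1, column2, column3)
WHERE report_date > '20120501';
```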