I want to create a table whose schema is exactly the same as another table's. In other SQL engines, I think I was able to use "CREATE TABLE my_table (LIKE your_table)" or some variation.
I couldn't find the equivalent in BigQuery yet. Is this possible in some fashion?
Use this form:
CREATE TABLE dataset.new_table AS
SELECT *
FROM dataset.existing_table
LIMIT 0
This creates a new table with the same schema as the old one, and there is no cost due to the LIMIT 0.
Note, however, that this does not preserve partitioning, the table description, and other attributes. Another option is to use the CLI (or API) to make a copy of the table and then overwrite its contents, e.g.:
$ bq cp dataset.existing_table dataset.new_table
$ bq query --use_legacy_sql=false --replace --destination_table=dataset.new_table \
"SELECT * FROM dataset.new_table LIMIT 0;"
Now the new table has the same structure and attributes as the original did.
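If you want to sanity-check the result, you can compare the two schemas with INFORMATION_SCHEMA; a minimal sketch, reusing the placeholder names from above:
-- Both tables should return identical rows here.
SELECT table_name, column_name, data_type
FROM dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name IN ('existing_table', 'new_table')
ORDER BY table_name, ordinal_position;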
To create a partitioned and/or clustered table, the syntax would be:
CREATE TABLE project.dataset.clustered_table
PARTITION BY DATE(created_time)
CLUSTER BY
account_id
AS SELECT * FROM project.dataset.example_table LIMIT 0
Related
I am trying to create a temporary table with a specific name so I can use it in other queries. The problem is that when I create a temp table like this:
BEGIN
BEGIN TRANSACTION;
--delete test_temp where 1=1;
create temp table test_temp as SELECT * FROM `analytics.reports.analysis` ;
COMMIT TRANSACTION;
END;
the table name is indeed test_temp, but if I then try select * from test_temp
I get nothing. The table ID always gets a new unique value,
something like analytics-prod:_script3991beb3e9868774854ac09d407f1c397428a4f6.test_temp
Is there a way to make this temp table with a pre-defined name?
What I am trying to achieve, basically, is to create a table from a select every 15 minutes.
I thought to create a scheduled query running this transaction to delete and repopulate the temp table, so I can always reference test_temp without the table ID changing.
According to the docs:
You can refer to a temporary table by name for the duration of the current multi-statement query
So if you want to refer to the table in other queries, an option could be to use a permanent table in a separate dataset:
...
CREATE OR REPLACE TABLE `analytics.my_temp_dataset.test_temp` AS SELECT * FROM `analytics.reports.analysis` ;
...
In that case, you don't need a temp table but a permanent table that you just replace on the next run using the CREATE OR REPLACE TABLE statement. If you are using the temp table as part of a scripting stage, then I suggest running this instead:
BEGIN
CREATE OR REPLACE TEMP TABLE _SESSION.test_temp AS
SELECT * FROM `analytics.reports.analysis`;
-- SELECT COUNT(*) CNT FROM _SESSION.test_temp;
END;
Since it is a temp table you don't need a DROP query at the end.
You can create a temporary table in a session in BigQuery.
You can create a session via the API and then create your temporary tables in the session. The tables will then be visible to all queries using that session.
There are limitations on sessions you should be aware of:
They last up to 24 hours max
No concurrent jobs running in a session
I use temporary tables with sessions to do complex transformations in an ETL pipeline. For example:
Create session
Create temporary tables
Do joins, and other transformations, write results to temporary tables
Do final transformations and output results to permanent tables
End session
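Once a session exists, the middle steps are ordinary SQL statements run inside it. A minimal sketch of that pattern; the table names, columns, and join key below are assumptions for illustration, not part of any real pipeline:
-- Runs inside an active session; raw_orders, customers, and customer_id are hypothetical.
CREATE TEMP TABLE staged_orders AS
SELECT * FROM `myproject.mydataset.raw_orders`;

CREATE TEMP TABLE enriched_orders AS
SELECT o.*, c.segment
FROM staged_orders AS o
JOIN `myproject.mydataset.customers` AS c USING (customer_id);

-- Final output goes to a permanent table.
CREATE OR REPLACE TABLE `myproject.mydataset.orders_report` AS
SELECT segment, COUNT(*) AS order_count
FROM enriched_orders
GROUP BY segment;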
I am writing a Python API for this and will put it on github at some point. I could probably post a gist if that is helpful.
In the meantime, see Google's documentation for sessions in BigQuery:
https://cloud.google.com/bigquery/docs/sessions-intro
I have a table in AWS Athena which contains 2 records. Is there a SQL query I can use to add a new column to the table?
You can find more information about adding columns to a table in the Athena documentation.
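In short, that is the ALTER TABLE ADD COLUMNS statement; a minimal sketch, where the table and column names are placeholders:
-- my_table and new_col are hypothetical names.
ALTER TABLE my_table ADD COLUMNS (new_col string)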
Or you can use CTAS (CREATE TABLE AS SELECT).
For example, say you have a table created with:
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
and you can create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
You can use any valid query after AS.
This is mainly for future readers like me who were struggling to get this working for a Hive table with AVRO data, and who don't want to create a new table, i.e., who want to update the schema of the existing table. It works for CSV using 'add columns', but not for Hive + AVRO. For Hive + AVRO, to append columns at the end, before the partition columns, the solution is available at this link. However, there are a couple of things to note: we need to pass the full schema to the literal attribute, not just the changes; and (not sure why, but) we had to alter the Hive table for all 3 things in the same order - 1. add columns using add columns, 2. set tblproperties, and 3. set serdeproperties. Hopefully it helps someone.
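A minimal sketch of that three-step sequence for a hypothetical table my_avro_table with a single existing id column (the table name, columns, and Avro schema are assumptions; note that the full schema, not just the change, appears in both literals):
-- 1. Append the column at the end, before any partition columns.
ALTER TABLE my_avro_table ADD COLUMNS (new_col string);

-- 2. Set the full Avro schema in tblproperties.
ALTER TABLE my_avro_table SET TBLPROPERTIES (
  'avro.schema.literal' = '{"type":"record","name":"my_avro_table","fields":[{"name":"id","type":"string"},{"name":"new_col","type":["null","string"],"default":null}]}'
);

-- 3. Repeat the same full schema in serdeproperties.
ALTER TABLE my_avro_table SET SERDEPROPERTIES (
  'avro.schema.literal' = '{"type":"record","name":"my_avro_table","fields":[{"name":"id","type":"string"},{"name":"new_col","type":["null","string"],"default":null}]}'
);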
I am trying to find a way to list the partitions of a table created with require_partition_filter = true, but I have not been able to find one yet.
This is the table creation script:
CREATE TABLE mydataset.partitionedtable_partitiontime
(
  x INT64
)
PARTITION BY DATE(_PARTITIONTIME)
OPTIONS(
require_partition_filter = true
);
Some test rows
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-05-01"), 10;
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-04-01"), 20;
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-03-01"), 30;
As expected, if I try the following query to get the partitions, I get an error because I need to use a filter on the partitioning column:
SELECT _PARTITIONTIME as pt, FORMAT_TIMESTAMP("%Y%m%d", _PARTITIONTIME) as partition_id
FROM `mydataset.partitionedtable_partitiontime`
GROUP BY _PARTITIONTIME
ORDER BY _PARTITIONTIME
Error
Cannot query over table 'mydataset.partitionedtable_partitiontime' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
any ideas how to list the partitions?
EDIT: I know that it is possible to add the filter, but I am looking for something like Hive's "SHOW PARTITIONS TABLENAME" to list all the partitions (which are essentially metadata)
Thanks!
Here is the way to do it. Note that the __PARTITIONS_SUMMARY__ meta-table is only available in Legacy SQL:
SELECT partition_id
FROM [mydataset.partitionedtable_partitiontime$__PARTITIONS_SUMMARY__]
The bigquery.jobs.create permission is required.
EDIT: Now is possible to get this information using Standard SQL:
SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'partitionedtable_partitiontime'
As mentioned by hlagos, you can get this data by querying the _PARTITIONTIME pseudo-column if you are using Standard SQL, or the __PARTITIONS_SUMMARY__ meta-table in Legacy SQL.
You can take a look at this GCP documentation, which contains detailed information about the usage of this partitioned-table metadata.
I've got a table with two weeks worth of entries, and I would like to copy those entries into a table partitioned by date (creating it if it does not exist).
I'm writing a luigi task to do this, and I would love for it to be independent of the table schema--i.e. I wouldn't have to specify column names and types, and it would CREATE TABLE IF NOT EXISTS when necessary.
I was hoping I could use:
CREATE TABLE IF NOT EXISTS test_part
COMMENT 'This is a test table to see if partitioning works in this case'
PARTITIONED BY (event_date string)
AS select *, '2014-12-15' from source_db.source_table
where event_at <'2014-12-16' and event_at >='2014-12-15';
But this of course fails with: FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
I tried again with "like", with basically the same results. Is there a way to do this that I am missing? It doesn't have to be atomic; multiple sequential commands are fine.
You do not do a CREATE TABLE AS for this.
You create the table first (describe source_table gives you the columns), and then you load it with an INSERT INTO TABLE ... PARTITION (event_date).
In 2 steps it works better.
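A sketch of those two steps, assuming (hypothetically) that the source table has columns id and event_at:
-- Column list is an assumption; match it to the output of describe source_table.
CREATE TABLE IF NOT EXISTS test_part (
  id string,
  event_at string
)
COMMENT 'This is a test table to see if partitioning works in this case'
PARTITIONED BY (event_date string);

INSERT INTO TABLE test_part PARTITION (event_date = '2014-12-15')
SELECT id, event_at
FROM source_db.source_table
WHERE event_at < '2014-12-16' AND event_at >= '2014-12-15';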
I'm currently working on a project that uses a Redshift table with 51 columns. However, the person who made the table forgot to add a sortkey to our time column which will hurt performance for our use case if we don't add it.
How can I make a version of the table with our time column as the sortkey? I'm aware that you can't make a column a sortkey if it's a member of an existing table, but I was hoping there's a way to do it that doesn't involve writing out the CREATE TABLE syntax by hand; for example, something like this would be nice:
timecube=# CREATE TABLE foo (like bar) sortkey(time);
ERROR: CREATE TABLE LIKE is not supported with DISTSTYLE, DISTKEY(), or SORTKEY() clauses
but as you can see, it's not supported. Is there another way? As we're still developing, we don't need any of the existing data.
Using traditional tools like pg_dump didn't work well because they don't include any of the Redshift extras like encodings.
Redshift supports specifying the DIST and SORT keys as part of CREATE TABLE AS statements, as per the docs.
CREATE TABLE table_name
DISTSTYLE KEY
DISTKEY ( column )
SORTKEY ( column )
AS
(SELECT *
FROM source_table)
;
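Applied to the question's example, a minimal sketch keeping the original names ("time" is quoted defensively):
-- foo, bar, and "time" are the names from the question.
CREATE TABLE foo
SORTKEY ("time")
AS
(SELECT *
FROM bar);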
The first step is to get the CREATE TABLE statement for the existing table. Then create the new table, this time adding the sort key to it.
Check the encodings on the old table (when you load data using the COPY command, it automatically adds compression encodings):
select "column", type, encoding
from pg_table_def where tablename = 'old_table'
When creating the new table, add the encoding type for each column, and create the table with the sort key.
Once the new table is created, use the command below:
INSERT INTO new_table
SELECT * FROM old_table ORDER BY time ASC;
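Putting the steps together, a minimal sketch; the column names, types, and encodings below are assumptions, so substitute the output of the pg_table_def query above:
-- Hypothetical columns and encodings for illustration only.
CREATE TABLE new_table (
  event_id BIGINT ENCODE az64,
  "time" TIMESTAMP ENCODE az64,
  payload VARCHAR(256) ENCODE lzo
)
SORTKEY ("time");

-- Load the data pre-sorted on the sort key.
INSERT INTO new_table
SELECT * FROM old_table
ORDER BY "time" ASC;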