BigQuery command line tool - append to table using query - google-bigquery

Is it possible to append the results of running a query to a table using the bq command line tool? I can't see flags available to specify this, and when I run it, it fails and states "table already exists":
bq query --allow_large_results --destination_table=project:DATASET.table "SELECT * FROM [project:DATASET.another_table]"
BigQuery error in query operation: Error processing job '':
Already Exists: Table project:DATASET.table

Originally BigQuery did not support the standard SQL idiom
INSERT INTO foo SELECT a, b, c FROM bar WHERE d > 0;
and you had to do it their way with --append_table.
But according to Will's answer, it works now.
Originally with bq, there was
bq query --append_table ...
The help for the bq query command is
$ bq query --help
The output shows an --append_table option near the top:
Python script for interacting with BigQuery.
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
query Execute a query.
Examples:
bq query 'select count(*) from publicdata:samples.shakespeare'
Usage:
query <sql_query>
Flags for query:
/home/paul/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_large_results: Enables larger destination table sizes.
--[no]append_table: When a destination table is specified, whether or not to
append.
(default: 'false')
--[no]batch: Whether to run the query in batch mode.
(default: 'false')
--destination_table: Name of destination table for query results.
(default: '')
...
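Putting that flag together with the command from the question, an append should look roughly like this (same hypothetical project and dataset names as above):
bq query --allow_large_results --append_table --destination_table=project:DATASET.table "SELECT * FROM [project:DATASET.another_table]"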
Instead of appending two tables together, you might be better off with a UNION ALL, which is SQL's version of concatenation.
In BigQuery legacy SQL, the comma operator between two tables, as in SELECT something FROM tableA, tableB, is a UNION ALL, NOT a JOIN, or at least it was the last time I looked.
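For example, in legacy SQL the following sketch (reusing the public shakespeare sample mentioned above) returns every row twice, because the comma concatenates the two table references rather than joining them:
SELECT word FROM [publicdata:samples.shakespeare], [publicdata:samples.shakespeare]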

Just in case someone ends up finding this question on Google: BigQuery has evolved a lot since this post, and it now supports standard SQL.
If you want to append the results of a query to a table using the DML syntax of standard SQL, you could do something like:
INSERT dataset.Warehouse (warehouse, state)
SELECT *
FROM UNNEST([('warehouse #1', 'WA'),
('warehouse #2', 'CA'),
('warehouse #3', 'WA')])
As presented in the docs.
For the command line tool it is the same idea; you just need to add the flag --use_legacy_sql=False, like so:
bq query --use_legacy_sql=False "insert into dataset.table (field1, field2) select field1, field2 from table"
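Applied to the original question, and assuming the two tables have the same schema, that would be something along the lines of:
bq query --use_legacy_sql=False "INSERT INTO DATASET.table SELECT * FROM DATASET.another_table"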

According to the current documentation (March 2018): https://cloud.google.com/bigquery/docs/loading-data-local#appending_to_or_overwriting_a_table_using_a_local_file
You should add:
--noreplace or --replace=false
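For example, when appending a local file to an existing table (the file name and schema here are hypothetical):
bq load --noreplace --source_format=CSV DATASET.table ./data.csv field1:STRING,field2:INTEGER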

Related

Using Update statement with the _PARTITIONDATE Pseudo-column

I'm trying to update a table in BigQuery that is partitioned on _PARTITIONTIME and really struggling.
Source is an extract from destination that I need to backfill destination with. Destination is a large partitioned table.
To move data from source to destination, I tried this:
update t1 AS destination
set destination._PARTITIONTIME = '2022-02-09'
from t2 as source
WHERE source.id <> "1";
Because it said that the WHERE clause was required for UPDATE, but when I run it, I get a message that "update/merge must match at most one source row for each target row". I've tried... so many other methods that I can't even remember them all. INSERT INTO seemed like a no-brainer early on but it wants me to specify column names and these tables have about 800 columns each so that's less than ideal.
I would have expected this most recent attempt to work because if I do
select * from source where source.id <> "1";
I do, in fact, get results exactly the way I would expect, so that query clearly functions, but for some reason it can't load the data. This is interesting, because I created the source table by running something along the lines of:
select * from destination where DATE(createddate) = '2022-02-09' and DATE(_PARTITIONTIME) = '2022-02-10'
Is there a way to make Insert Into work for me in this instance? If there is not, does someone have an alternate approach they recommend?
You can use the bq command line tool (usually comes with the gcloud command line utility) to run a query that will overwrite a partition in a target table with your query results:
bq query --allow_large_results --replace --noflatten_results --destination_table 'target_db.target_table$20220209' "select field1, field2, field3 from source_db.source_table where _PARTITIONTIME = '2022-02-09'";
Note the $YYYYMMDD suffix on target_table. It indicates that the partition corresponding to YYYYMMDD is to be overwritten by the query results.
Make sure to explicitly select fields in your query (as a good practice) to avoid surprises. For instance, select field1, field2, field3 from table is far more explicit and readable than select * from table.
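If you need to backfill more than one day, a small shell wrapper can parameterize the date; a rough sketch, using the same placeholder names as above:
PART_DATE="2022-02-09"
PART_SUFFIX=$(echo "$PART_DATE" | tr -d '-')   # 20220209
bq query --allow_large_results --replace --noflatten_results \
  --destination_table "target_db.target_table\$$PART_SUFFIX" \
  "select field1, field2, field3 from source_db.source_table where _PARTITIONTIME = '$PART_DATE'"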

DMV request for "Description" of the table for Power BI dataset

What I am trying to achieve is to add table and column descriptions programmatically to a Power BI dataset.
For this reason, I use SQL Server Analysis Services to get access to the metadata.
I run a simple request:
select *
from $System.TMSCHEMA_PARTITIONS
As a result, I get a table with column names:
ID
TableID
Name
Description
....
Now I want to select where the "Description" is empty.
select *
from $System.TMSCHEMA_PARTITIONS
where Description IS NULL
But I can't, I always get a syntax error:
Query (3, 7) The syntax for 'Description' is incorrect.
The parser reads Description as a keyword, and I don't know how to avoid that.
I have tried adding single and double quotes around the column name, adding a table reference, and all of these combined, but nothing helps.
It works for "TableID" for example.
This one works:
select *
from $System.TMSCHEMA_PARTITIONS
where len([Description]) = 0

Doubts on SELECT of Hive SQL

I am reading some Hive QL script and found this line:
SELECT 'Start time:',from_unixtime(unix_timestamp());
What does it mean? It does not look like a real "select" statement?
This does not query the DB; it is simply a print statement.
If we run the command SELECT 'Start time:',from_unixtime(unix_timestamp());
the output would be:
Col1: Start time:
Col2: 2017-08-29 01:36:11
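In longer scripts this pattern is often used as a lightweight log line before and after a step, for example (the count step here is just a placeholder):
SELECT 'Start time:', from_unixtime(unix_timestamp());
SELECT count(*) FROM some_table;   -- the actual work step, hypothetical
SELECT 'End time:', from_unixtime(unix_timestamp());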
HIVE-178 SELECT without FROM should assume a one-row table with no columns
... People use this all the time with SQL Server. I expect people to
like it if it is added to Hive ...
... The Hive Language Manual now states, "As of Hive 0.13.0, FROM is
optional (for example, SELECT 1+1)."
Nothing really new -- that syntax was introduced by Sybase Transact-SQL 30 years ago, before they sold their "SQL Server" code base (and product name) to Microsoft, and long before they sold the rest to SAP.
The alternate syntax uses a DUAL (Oracle) or SYSDUMMY1 (DB2) pseudo-table.
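For comparison, the equivalents on those engines look something like this:
SELECT SYSDATE FROM DUAL;                        -- Oracle
SELECT CURRENT TIMESTAMP FROM SYSIBM.SYSDUMMY1;  -- DB2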

Rowcounts of all tables in a database in Netezza

I am migrating data from MS SQL to Netezza and so I need to find the row counts of all tables in a database (in Netezza). Any query for the same would be of an immense help to me as I'm completely new to this. Thanks in advance.
This query does it directly from _v_table:
SELECT TABLENAME, RELTUPLES FROM _V_TABLE where objtype = 'TABLE' ORDER BY RELTUPLES
something like this should work:
select 'select '||chr(39)||tablename||chr(39)||' as entity, count(1) from '||tablename||' union all'
from _v_table
where object_type ='TABLE';
Copy/paste the result and remove the last "union all".
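The generated statements look roughly like this (table names are just examples), and can then be run as a single query:
select 'CUSTOMERS' as entity, count(1) from CUSTOMERS union all
select 'ORDERS' as entity, count(1) from ORDERS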
I have never used Netezza but googled and found:
http://www.folkstalk.com/2009/12/netezza-count-analytic-functions.html
SELECT dept_id,
salary,
COUNT(1) OVER() total_cnt
FROM Employees
If you don't know what tables exist:
http://www.folkstalk.com/2009/11/netezza-system-catalog-views.html
select * from _v_table;
Another way to acquire the row counts for a table (if you have access at the operating system level) is to use the Netezza nz_get_table_rowcount command. You can enter "nz_get_table_rowcount -h" to get the full help text for this command, but the format is:
Usage: nz_get_table_rowcount [database] <table>
Purpose: Perform a "SELECT COUNT(*) FROM <table>;" to get its true rowcount.
Thus, this script results in a full table scan being performed.
Inputs: The database name is optional. If not specified, then $NZ_DATABASE
will be used instead.
The table name is required. If only one argument is specified, it
will be taken as the table name.
If two arguments are specified, the first will be taken as the
database name and the second will be taken as the table name.
Outputs: The table rowcount is returned.
Use this command in a shell script to cycle through all of the tables within a database. Use nz_get_table_names to get the list of tables within a database.
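A rough sketch of such a loop, assuming both helper scripts are on the PATH and take the database name as their first argument:
for t in $(nz_get_table_names "$NZ_DATABASE"); do
    echo "$t: $(nz_get_table_rowcount "$NZ_DATABASE" "$t")"
done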

Adding column headers to hive result set

I am using a Hive script on Amazon EMR to analyze some data, and I am transferring the output to an Amazon S3 bucket. The results of the Hive script do not contain column headers.
I have also tried using this:
set hive.cli.print.header=true;
But it does not help. Can you help me out?
Exactly what does your hive script look like?
Does the output from your hive script have the header data in it? Is it then being lost when you copy the output to your s3 bucket?
If you could provide some more details about exactly what you are doing that would be helpful.
Without knowing those details, here is something that you could try.
Create your hive script as follows:
USE dbase_name;
SET hive.cli.print.header=true;
SELECT some_columns FROM some_table WHERE some_condition;
Then run your script:
$ hive -f hive_script.hql > hive_output
Then copy your output to your S3 bucket:
$ aws s3 cp ./hive_output s3://some_bucket_name/foo/hive_output
I guess the direct way is still impossible (see Hive: writing column headers to local file?).
One workaround is to export the result of DESCRIBE table_name to a file:
$ hive -e 'DESCRIBE table_name' > file
and then write a small script that prepends the column names to your data file. GL!
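A rough sketch of that script, assuming tab-separated Hive output and the same placeholder names as above:
$ hive -e 'DESCRIBE table_name' | awk '{print $1}' | paste -sd '\t' - > header
$ cat header hive_output > hive_output_with_header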
I ran into this problem today and was able to get what I needed by doing a UNION ALL between the original query and a new dummy query that creates the header row. I added a sort column on each section and set the header to 0 and the data to a 1 so I could sort by that field and ensure the header row came out on top.
create table new_table as
select
field1,
field2,
field3
from
(
select
0 as sort_col, --header row gets lowest number
'field1_name' as field1,
'field2_name' as field2,
'field3_name' as field3
from
some_small_table --table needs at least 1 row
limit 1 --only need 1 header row
union all
select
1 as sort_col, --original query goes here
field1,
field2,
field3
from
main_table
) a
order by
sort_col --make sure header row is first
It's a little bulky, but at least you can get what you need with a single query.
Hope this helps!
It might be just a typo (or a version-dependent change), but the following works for me:
set hive.cli.print.headers=true;
It's "headers" instead of "header"