How to create a Hive table based on a non-SELECT query - Impala

In my work with Impala, I would like to save the output of show table stats tablename to a Hive table. I tried it this way but got an error message:
[r001d01i1p:21000] > create table mystats as
> show table stats eq_cmplx_exec_master;
error:
Query: create table mystats as show table stats eq_cmplx_exec_master
ERROR: AnalysisException: Syntax error in line 2: show table stats
eq_cmplx_exec_master ^ Encountered: SHOW Expected: SELECT, VALUES,
WITH
CAUSED BY: Exception: Syntax error
Can anyone help me sort this out? Thank you very much.
Ideally I want the new table to contain two extra columns: ID and tablename.

After some research, here is what I ended up with, and it works exactly as I want:
impala-shell -i host-10-17-101-252:25003 -k -q "show table stats dbo" -d default --quiet -B -c -o table_stats.txt
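To also get the extra ID and tablename columns from the original question, one option is to load the dumped file into a staging table and add the columns with a CTAS. This is only a sketch: it assumes the tab-separated table_stats.txt produced above, and the staging column list approximates the columns that show table stats prints for an unpartitioned table.
-- staging table matching the tab-separated dump (column names are assumptions)
CREATE TABLE mystats_staging (
  num_rows BIGINT, num_files BIGINT, total_size STRING, bytes_cached STRING,
  cache_replication STRING, file_format STRING, incremental_stats STRING, table_location STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH 'table_stats.txt' INTO TABLE mystats_staging;
-- add the ID and tablename columns while copying into the final table
CREATE TABLE mystats AS
SELECT row_number() OVER (ORDER BY table_location) AS id,
       'eq_cmplx_exec_master' AS tablename, s.*
FROM mystats_staging s;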

Related

Trying to load data into a table using PostgreSQL?

I am trying to load data into a table using PostgreSQL 10.4 and pgAdmin 4. My query is below. When I try to use \copy I get the error shown below.
CREATE TABLE mydata (TimeDate date, Yield float(3))
SELECT * FROM mydata
\copy mydata FROM 'C:\Users\john\Desktop\Stock Prices.csv' WITH CSV;
The above results in the following error:
ERROR:
ERROR: syntax error at or near "\"
LINE 1: \copy mydata FROM 'C:\Users\john\Desktop\Stock Pri...
You can right-click on that particular table, select the 'Import/Export data' option, and provide the file you want to load into that table.
\copy is a tool particular to the psql program. Other programs have their own variants to accomplish the same thing. In the case of PgAdmin4, you right-click on the table in the tree viewer and select "Import/Export..."
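For example, the same meta-command can be run non-interactively through psql (a sketch; the database name mydb is hypothetical, the file path is from the question):
psql -d mydb -c "\copy mydata FROM 'C:\Users\john\Desktop\Stock Prices.csv' WITH CSV"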

Wildcard tables and differences in table columns

I am trying to use BigQuery to JOIN over a range of tables using Wildcard Tables.
The query works when all tables matched by the wildcard have the column fooid (bar*.fooid). However, this column is a recent addition, and when the wildcard matches tables where the field does not exist, the query fails.
Error: Cannot read non-required field 'fooid' as required INT64.
This is a simplified version of the query to demonstrate the issue; the real query selects more columns from both foo and bar.
SELECT foo.foo_id AS foo
FROM `bar.bar*` AS bar_alias
LEFT JOIN bar.foo_map foo ON (bar_alias.fooid = foo.foo_id)
WHERE (_TABLE_SUFFIX BETWEEN '20170206' AND '20170208')
I've looked at a number of answers, including BigQuery IF field exists THEN, but I can't see how to use them in conjunction with a JOIN, or when the tables missing the column are not known in advance.
Here is an example of how this can arise, and how to fix it by using a reference schema from an empty table where the column/field is NULLABLE. Suppose I have the following two tables:
$ bq query --use_legacy_sql=false \
"CREATE TABLE tmp_elliottb.bar20180328 (y STRING) AS SELECT 'bar';"
$ bq query --use_legacy_sql=false \
"CREATE TABLE tmp_elliottb.bar20180329 " \
"(x INT64 NOT NULL, y STRING) AS SELECT 1, 'foo';"
Column x has the NOT NULL attribute in the second table, but the column is missing from the first table. I get an error when I try to use a table wildcard:
$ bq query --use_legacy_sql=false \
"SELECT * FROM \`tmp_elliottb.bar*\` " \
"WHERE _TABLE_SUFFIX BETWEEN '20180301' AND '20180329';"
Waiting on <job id> ... (0s) Current status: DONE
Error in query string: Error processing job '<job id>': Cannot read non-required field 'x' as required INT64.
Failure details:
- query: Cannot read non-required field 'x' as required INT64.
This makes sense: I said that x is NOT NULL, but the bar20180328 table doesn't have the column. Now if I create a new table that matches the * expansion, but where the column doesn't have NOT NULL:
$ bq query --use_legacy_sql=false \
"CREATE TABLE tmp_elliottb.bar_empty (x INT64, y STRING);"
$ bq query --use_legacy_sql=false \
"SELECT * FROM \`tmp_elliottb.bar*\` " \
"WHERE _TABLE_SUFFIX BETWEEN '20180301' AND '20180329';"
...
+------+-----+
| x | y |
+------+-----+
| 1 | foo |
| NULL | bar |
+------+-----+
I get results instead of an error. In your case, you would create a table matching the * expansion with the expected schema (named bar_empty here, for example), where none of the fields/columns that are missing from the other tables carry a NOT NULL attribute.
With that said, I would strongly recommend using a partitioned table instead, if possible. Among other benefits, partitioned tables are much nicer to work with because they have a consistent schema across all days.
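For instance, a date-partitioned replacement for the sharded tables might be declared like this (a sketch; the table name bar_partitioned and the partitioning column d are hypothetical):
bq query --use_legacy_sql=false \
"CREATE TABLE tmp_elliottb.bar_partitioned (x INT64, y STRING, d DATE) PARTITION BY d;"
Queries then filter on d instead of _TABLE_SUFFIX, and every partition shares a single schema.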

Impala: Show tables like query

I am working with Impala and fetching the list of tables from the database with a pattern like below.
Assume I have a database bank, and the tables under this database are as below.
cust_profile
cust_quarter1_transaction
cust_quarter2_transaction
product_cust_xyz
....
....
etc
Now I am filtering like:
show tables in bank like '*cust*'
It returns the expected results: the tables that have the word cust in their name.
Now my requirement is that I want all the tables that have cust in their name but do not have quarter2.
Can someone please help me solve this?
Execute from the shell and then filter:
impala-shell -q "show tables in bank like '*cust*'" | grep -v 'quarter2'
Or query the metastore directly:
mysql -u root -p -e "select TBL_NAME from metastore.TBLS where TBL_NAME like '%cust%' and TBL_NAME not like '%quarter2%'";

How can I find last modified timestamp for a table in Hive?

I'm trying to fetch the last modified timestamp of a table in Hive.
Please use the below command:
show TBLPROPERTIES table_name ('transient_lastDdlTime');
Get the transient_lastDdlTime from your Hive table.
SHOW CREATE TABLE table_name;
Then copy-paste the transient_lastDdlTime value into the query below to get it as a timestamp.
SELECT CAST(from_unixtime(your_transient_lastDdlTime_value) AS timestamp);
With the help of the above answers, I have created a simple solution for future developers.
# pull the transient_lastDdlTime line out of the SHOW CREATE TABLE output
time_column=`beeline --hivevar db=hiveDatabase --hivevar tab=hiveTable --silent=true --showHeader=false --outputformat=tsv2 -e 'show create table ${db}.${tab}' | egrep 'transient_lastDdlTime'`
# strip the surrounding punctuation and quotes, keeping only the epoch value
time_value=`echo $time_column | sed 's/[|,)]//g' | awk -F '=' '{print $2}' | sed "s/'//g"`
# convert the epoch seconds to a human-readable timestamp
tran_date=`date -d @$time_value +'%Y-%m-%d %H:%M:%S'`
echo $tran_date
I used a beeline alias. Make sure the alias is set up properly before invoking the above script. If no alias is used, replace beeline above with the complete beeline command (including the JDBC connection string). Leave a question in the comments if anything is unclear.
There is already an answer here for how to see the last modified date of a Hive table. I am just sharing how to check the last modified date of a Hive table partition.
Connect to the Hive cluster to run Hive queries. In most cases, you can simply connect by running the hive command: hive
DESCRIBE FORMATTED <database>.<table_name> PARTITION(<partition_column>=<partition_value>);
In the response you will see something like this: transient_lastDdlTime 1631640957
SELECT CAST(from_unixtime(1631640957) AS timestamp);
You may get the timestamp by executing
describe formatted table_name
You can execute the command below and convert the transient_lastDdlTime value in its output from a Unix timestamp to a date. That gives the last modified timestamp of the table.
show create table TABLE_NAME;
If you are using MySQL as the metastore database, use the following (run against MySQL's information_schema):
select TABLE_NAME, UPDATE_TIME, TABLE_SCHEMA from TABLES where TABLE_SCHEMA = 'employees';

BigQuery command line tool - append to table using query

Is it possible to append the results of running a query to a table using the bq command line tool? I can't see flags available to specify this, and when I run it, it fails and states "table already exists".
bq query --allow_large_results --destination_table=project:DATASET.table "SELECT * FROM [project:DATASET.another_table]"
BigQuery error in query operation: Error processing job '':
Already Exists: Table project:DATASET.table
Originally BigQuery did not support the standard SQL idiom
INSERT foo SELECT a,b,c from bar where d>0;
and you had to do it their way with --append_table
But according to #Will's answer, it works now.
Originally with bq, there was
bq query --append_table ...
The help for the bq query command is
$ bq query --help
And the output shows an append_table option in the top 25% of the output.
Python script for interacting with BigQuery.
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
query Execute a query.
Examples:
bq query 'select count(*) from publicdata:samples.shakespeare'
Usage:
query <sql_query>
Flags for query:
/home/paul/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_large_results: Enables larger destination table sizes.
--[no]append_table: When a destination table is specified, whether or not to
append.
(default: 'false')
--[no]batch: Whether to run the query in batch mode.
(default: 'false')
--destination_table: Name of destination table for query results.
(default: '')
...
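Combining that flag with the command from the question, an append would look something like this (a sketch reusing the question's placeholder project, dataset, and table names):
bq query --append_table --allow_large_results --destination_table=project:DATASET.table "SELECT * FROM [project:DATASET.another_table]"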
Instead of appending two tables together, you might be better off with a UNION ALL, which is SQL's version of concatenation.
In BigQuery's legacy SQL, the comma operator between two tables, as in SELECT something FROM tableA, tableB, is a UNION ALL, NOT a JOIN, or at least it was the last time I looked.
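For example (a sketch in legacy SQL syntax, reusing the question's placeholder table names; field1 is a hypothetical column):
SELECT field1 FROM [project:DATASET.table], [project:DATASET.another_table]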
Just in case someone ends up finding this question on Google: BigQuery has evolved a lot since this post, and it now supports standard SQL.
If you want to append the results of a query to a table using the DML syntax feature of the Standard version, you could do something like:
INSERT dataset.Warehouse (warehouse, state)
SELECT *
FROM UNNEST([('warehouse #1', 'WA'),
('warehouse #2', 'CA'),
('warehouse #3', 'WA')])
As presented in the docs.
For the command line tool it follows the same idea, you just need to add the flag --use_legacy_sql=False, like so:
bq query --use_legacy_sql=False "insert into dataset.table (field1, field2) select field1, field2 from table"
According to the current documentation (March 2018): https://cloud.google.com/bigquery/docs/loading-data-local#appending_to_or_overwriting_a_table_using_a_local_file
You should add:
--noreplace or --replace=false
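Those flags apply to bq load; an appending load might look like this (a sketch with hypothetical dataset, file, and schema names):
bq load --noreplace --source_format=CSV DATASET.table ./data.csv field1:STRING,field2:INT64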