Why is Impala not showing all tables created by Hive?

I have imported all tables using Sqoop into a Hive database "sqoop_import" and am able to see that all tables were imported successfully, as below:
hive> use sqoop_import;
OK
Time taken: 0.026 seconds
hive> show tables;
OK
categories
customers
departments
order_items
orders
products
Time taken: 0.025 seconds, Fetched: 6 row(s)
hive>
But when I try the same from impala-shell or Hue using the same user, it shows different results, as below:
[quickstart.cloudera:21000] > use sqoop_import;
Query: use sqoop_import
[quickstart.cloudera:21000] > show tables;
Query: show tables
+--------------+
| name         |
+--------------+
| customers    |
| customers_nk |
+--------------+
Fetched 2 row(s) in 0.01s
[quickstart.cloudera:21000] >
What am I doing wrong?

When you import a new table with Sqoop into Hive, in order to see it through impala-shell you should run INVALIDATE METADATA for that specific table. So from the command line, run:
impala-shell -d DB_NAME -q "INVALIDATE METADATA table_name"
But if you append new data files to an existing table through Sqoop, you need to run REFRESH instead. So from the command line, run:
impala-shell -d DB_NAME -q "REFRESH table_name"
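The same statements can also be run interactively from an open impala-shell session; a minimal sketch against the database and tables from the question:
[quickstart.cloudera:21000] > use sqoop_import;
[quickstart.cloudera:21000] > invalidate metadata categories;
[quickstart.cloudera:21000] > refresh customers;
[quickstart.cloudera:21000] > show tables;
Running INVALIDATE METADATA; with no table name reloads metadata for the whole catalog, which is slower but convenient right after importing many tables at once.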

Related

Recover overwritten BigQuery table with older table schema

I accidentally overwrote an existing table by using it as a temporary table to store the result of another select. Is there a way to roll it back if the old table and the new table have different table structures? Is it possible to prevent someone from overwriting a particular table, to prevent this in the future?
There is a comment in the following question which says it is not possible to recover if the table schema is different. Not sure if that has changed recently.
Is it possible to recover overwritten data in BigQuery
First, overwrite your table again with something (anything) that has the exact same schema as your "lost" table.
Then follow the same steps as in the referenced post, which is:
SELECT * FROM [yourproject:yourdataset.yourtable@<time>]
You can use @0 if your table has not changed for the last week or so, or since creation.
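For illustration, negative decorator values are interpreted as milliseconds relative to now, so the table as it existed one hour ago can be read with a legacy-SQL sketch like this (same placeholder names as above):
SELECT * FROM [yourproject:yourdataset.yourtable@-3600000]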
Or, to avoid cost - do bq cp ....
You could restore in SQL, but this loses column NULLABLE modes and description fields, and incurs query costs:
bq query --use_legacy_sql=false 'CREATE OR REPLACE TABLE `project.dataset.table` AS SELECT * FROM `project.dataset.table` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)'
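Before replacing the table in place, the same FOR SYSTEM_TIME AS OF clause can be used in a plain SELECT to preview what would be restored; a sketch with the same placeholder names:
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM `project.dataset.table` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)'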
I recently found this to be more effective:
Get a Unix timestamp in milliseconds and overwrite the table with itself using bq cp:
bq query --use_legacy_sql=false "SELECT TIMESTAMP_DIFF(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 MINUTE), TIMESTAMP '1970-01-01', MILLISECOND)"
bq cp project:dataset.table@1625288152215 project:dataset.table
Before you do this, you can compare the schemas with the following:
bq show --schema --format=prettyjson project:dataset.table@1625288152215 > schema-a.json
bq show --schema --format=prettyjson project:dataset.table > schema-b.json
diff schema-a.json schema-b.json
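If you would rather not overwrite the current table at all, the snapshot can also be copied into a new table first and inspected there; table_restored below is just a hypothetical name:
bq cp project:dataset.table@1625288152215 project:dataset.table_restored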

Create a date table in hive

How do I create a table in Hive which has all the dates from 1st Jan 2016 till today (01-01-2016 to 12-10-2016)?
The table would have only one column i.e. the date column.
Thanks.
You can generate this data yourself.
Go to the Hive shell and execute:
CREATE TABLE tbl1 (col1 date);
The default format for the Date type in Hive is YYYY-MM-DD, so we will generate data in this format.
Now generate the data using a shell script. Open a terminal and run:
gedit /tmp/test.sh
Copy this code:
#!/bin/bash
# 286 iterations (0..285) cover 2016-01-01 through 2016-10-12
DATE=2016-01-01
for i in {0..285}
do
  # GNU date: add $i days to the start date and print as YYYY-MM-DD
  NEXT_DATE=$(date +%Y-%m-%d -d "$DATE + $i day")
  echo "$NEXT_DATE"
done
You don't have execute permission by default, so run:
chmod 777 /tmp/test.sh
Now run:
/tmp/test.sh > /tmp/test.csv
You now have the data in test.csv:
2016-01-01
2016-01-02
2016-01-03
2016-01-04
........
Now go back to the Hive shell and run:
load data local inpath '/tmp/test.csv' into table tbl1;
Your table with data is ready.
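As a quick sanity check (the loop above produces 286 rows, 2016-01-01 through 2016-10-12):
hive> select count(*), min(col1), max(col1) from tbl1;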
You can download a date dimension in Excel format from the Kimball Group.
Save the Excel file as CSV, put it in HDFS, and create an external table on top of it.
I suggest you create a date_dim table and keep all the columns in it. A date dimension should be in the warehouse. You can select only the date column, or create a view with the necessary columns.
You can also generate a date range directly in Hive; see this answer: https://stackoverflow.com/a/55440454/2700344
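As a rough sketch of that approach (assuming a Hive version that supports a FROM-less SELECT in a subquery, plus space, split, posexplode and date_add), the same range could be generated without any shell script at all:
insert into table tbl1
select cast(date_add('2016-01-01', pe.pos) as date)
from (select split(space(285), ' ') as positions) t
lateral view posexplode(t.positions) pe as pos, val;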

How do I find what user owns a HIVE database?

I want to confirm which user is the owner of a database in Hive. Where would I find this information?
DESCRIBE|DESC DATABASE shows the name of the database, its comment (if one has been set), and its root location on the filesystem. The uses of SCHEMA and DATABASE are interchangeable – they mean the same thing. DESCRIBE SCHEMA is added in Hive 0.15 (HIVE-8803).
EXTENDED also shows the database properties.
DESCRIBE DATABASE [EXTENDED] db_name;
DESCRIBE SCHEMA [EXTENDED] db_name; -- (Note: Hive 0.15.0 and later)
These examples show that the cards database was created by the cloudera user:
hive> SET hive.cli.print.header=true;
hive> describe database cards;
OK
db_name comment location owner_name owner_type parameters
cards hdfs://quickstart.cloudera:8020/user/hive/warehouse/cards.db cloudera USER
Time taken: 0.013 seconds, Fetched: 1 row(s)
hive> desc schema cards;
OK
db_name comment location owner_name owner_type parameters
cards hdfs://quickstart.cloudera:8020/user/hive/warehouse/cards.db cloudera USER
Time taken: 0.022 seconds, Fetched: 1 row(s)
Alternatively,
A Hive database is nothing but an HDFS directory, with a .db extension, in the Hive warehouse directory location. You can get the owner simply from the hadoop fs -ls command:
For a directory it returns the list of its direct children, as in Unix. A directory is listed as:
permissions userid groupid modification_date modification_time dirname
Files within a directory are ordered by filename by default.
Example:
hadoop fs -ls /user/hive/warehouse/*.db |awk '{print $3,$NF}'
Both will solve your problem:
hive> describe database extended db_name;
hive> describe schema extended db_name;
The output will have the owner user name.
If you have configured Hive with an external metastore like MySQL or Derby, you can query the metastore table DBS to get the information.
Query
select NAME,OWNER_NAME,OWNER_TYPE from DBS;
Output
+--------------+------------+------------+
| NAME         | OWNER_NAME | OWNER_TYPE |
+--------------+------------+------------+
| default      | public     | ROLE       |
| employee     | addy       | USER       |
| test         | addy       | USER       |
+--------------+------------+------------+
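For example, against a MySQL-backed metastore the same query can be run straight from the command line; the metastore database name hive and the user hiveuser below are assumptions, so substitute your own:
mysql -u hiveuser -p -e 'SELECT NAME, OWNER_NAME, OWNER_TYPE FROM hive.DBS;'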

When to use Sqoop --create-hive-table

Can anyone tell me the difference between the create-hive-table and hive-import methods? Both will create a Hive table, but what is the significance of each?
hive-import command:
The hive-import command automatically populates the metadata for the imported tables in the Hive metastore. If the table does not exist in Hive yet, Sqoop will simply create it based on the metadata fetched for your table or query. If the table already exists, Sqoop will import data into the existing table. If you're creating a new Hive table, Sqoop will convert the data types of each column from your source table to a type compatible with Hive.
create-hive-table command:
Sqoop can generate a Hive table (using the create-hive-table command) based on a table from an existing relational data source. If set, the job will fail if the target Hive table already exists; by default this property is false.
Using the create-hive-table command involves three steps: importing data into HDFS, creating the Hive table, and then loading the HDFS data into Hive. This can be shortened to one step by using hive-import.
During a hive-import, Sqoop will first do a normal HDFS import to a temporary location. After a successful import, Sqoop generates two queries: one for creating the table and another one for loading the data from the temporary location. You can specify the temporary location using either the --target-dir or --warehouse-dir parameter.
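For instance, the staging location can be pinned explicitly; a sketch reusing the connection string from the examples below, with a hypothetical HDFS path:
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table departments --split-by deptid -m 1 --hive-import --target-dir /tmp/sqoop_staging/departments;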
An example for the above description is added below.
Using the create-hive-table command:
Involves three steps:
Importing data from RDBMS to HDFS
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --split-by empid -m 1;
Creating the Hive table using the create-hive-table command
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/hadoopexample --table employees --fields-terminated-by ',';
Loading data into Hive
hive> load data inpath "employees" into table employees;
Loading data to table default.employees
Table default.employees stats: [numFiles=1, totalSize=70]
OK
Time taken: 2.269 seconds
hive> select * from employees;
OK
1001 emp1 101
1002 emp2 102
1003 emp3 101
1004 emp4 101
1005 emp5 103
Time taken: 0.334 seconds, Fetched: 5 row(s)
Using the hive-import command:
sqoop import --connect jdbc:mysql://localhost:3306/hadoopexample --table departments --split-by deptid -m 1 --hive-import;
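Once the job completes, the imported table can be checked from the Hive shell (a quick sketch; output omitted):
hive> show tables;
hive> describe departments;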
The difference is that create-hive-table will create a table in Hive based on the source table in the database but will NOT transfer any data. The command "import --hive-import" will both create the table in Hive and import data from the source table.

Bash script for Hive queries giving errors

I am trying to execute a bash script containing Hive queries, but when I execute the script it says that the raw_data and central tables are not found. I already have these tables in Hive. Below is the bash script. Kindly suggest what's wrong.
#!/bin/bash
hive -e "CREATE TEMPORARY FUNCTION rowSequence AS 'com.hive.udf.UDFRowSequence';"
hive -e "
create table staging (id String,speed String,time String,time_id int);"
hive -e "
insert into table staging select marker.marker.id,
marker.marker.speed ,
marker.marker.time as time,
rowSequence() as time_id
from raw_data
lateral view explode (raw_data.markers.marker)marker as marker;"
hive -e "
create table processed (plc string,direction string,table int,speed string,time_id string,day int);"
hive -e "
insert into table processed select c.plc,c.direction,c.table,t.speed as speed,t.time_id,0 from central c JOIN staging t ON (t.id = c.boxno);"
Include use [databasename] at the start of each hive -e statement.
E.g.:
hive -e "use dummy_database;create table staging (id String,speed
String,time String,time_id int);"
hive -e "use dummy_database;insert into table staging select
marker.marker.id,
marker.marker.speed ,
marker.marker.time as time,
rowSequence() as time_id
from raw_data
lateral view explode (raw_data.markers.marker)marker as marker;"
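Putting it together, here is a sketch of a corrected script (dummy_database is a placeholder for the actual database name, and /tmp/pipeline.hql is just a hypothetical file name). Note that each hive -e invocation starts a fresh session, so the temporary UDF created in one invocation is not visible to the next; running all statements in a single session avoids that as well:
#!/bin/bash
# Write all statements into one file so they run in a single Hive session:
# the selected database, the temporary UDF and the staging table are then
# visible to every statement that needs them.
cat > /tmp/pipeline.hql <<'HQL'
use dummy_database;

CREATE TEMPORARY FUNCTION rowSequence AS 'com.hive.udf.UDFRowSequence';

create table staging (id String, speed String, time String, time_id int);

insert into table staging
select marker.marker.id,
       marker.marker.speed,
       marker.marker.time as time,
       rowSequence() as time_id
from raw_data
lateral view explode (raw_data.markers.marker) marker as marker;

-- `table` is a reserved word in newer Hive versions, so it is quoted here
create table processed (plc string, direction string, `table` int, speed string, time_id string, day int);

insert into table processed
select c.plc, c.direction, c.`table`, t.speed as speed, t.time_id, 0
from central c join staging t on (t.id = c.boxno);
HQL

hive -f /tmp/pipeline.hql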