Insert overwrite in Hive doesn't work properly

I'm creating external tables in Hive and then using insert overwrite directory ... to add the files. The second time I run my query I expect the old files to be deleted and replaced by the new ones (because I have the overwrite option). However, that is not the case: new files get added to the directory without the old files being removed, which causes inconsistency in the data. What is going wrong here?

I was going to submit a bug, but this is an existing issue: HIVE-13997 -- apply the patch if you want insert overwrite directory to give the expected results.
From testing, what I have found is that overwrite directory and overwrite table work differently: use overwrite table if you want to overwrite the entire directory.
I created this table for the test: create external table t2 (a1 int,a2 string) LOCATION '/user/cloudera/t2';
overwrite directory:
The directory is, as you would expect, overwritten; in other words, if the specified path exists, it is clobbered and replaced with the output.
hive> insert overwrite directory '/user/cloudera/t2' select * from sqoop_import.departments;
So if the above statement writes its output to /user/cloudera/t2/000000_0, then only that file is overwritten.
#~~~~ BEFORE ~~~~
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 17:42 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
-rwxr-xr-x 1 cloudera cloudera 88 2016-07-25 15:48 /user/cloudera/t2/part-m-00000
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 15:48 /user/cloudera/t2/part-m-00001
#~~~~ AFTER: Note the timestamp ~~~~
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 18:01 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
-rwxr-xr-x 1 cloudera cloudera 88 2016-07-25 15:48 /user/cloudera/t2/part-m-00000
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 15:48 /user/cloudera/t2/part-m-00001
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
overwrite table:
The contents of the chosen table or partition are replaced with the output of the corresponding select statement.
hive> insert overwrite table t2 select * from sqoop_import.departments;
Now the entire directory is overwritten:
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 18:03 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
So, in conclusion, overwrite directory only overwrites the direct paths of the generated files, not the whole directory; see Writing-data-into-the-file-system-from-queries in the Hive manual.
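Putting the two tests together, here is a minimal sketch of the pattern that gives the "replace everything on each run" behaviour the question expects, reusing the same table and source as the test above:
-- one-time setup, as in the test above
create external table t2 (a1 int,a2 string) LOCATION '/user/cloudera/t2';
-- overwrite through the table, not the directory, so the whole location is replaced
insert overwrite table t2 select * from sqoop_import.departments;
As the AFTER listing for overwrite table shows, the old part-m-* files are removed on each run, which is exactly what insert overwrite directory does not do.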

Related

Apache Drill and Apache Kudu - not able to run "select * from <some table>" using Apache Drill, for the table created in Kudu through Apache Impala

I'm able to connect to Kudu through Apache Drill and can list tables fine, but when I try to fetch data from the table "impala::default.customer" shown below, none of the options I tried works for me.
The table in Kudu was created through the Impala shell as an external table.
Initial connection to Kudu and listing objects:
ubuntu@ubuntu-VirtualBox:~/Downloads/apache-drill-1.19.0/bin$ sudo ./drill-embedded
Apache Drill 1.19.0
"A Drill is a terrible thing to waste."
apache drill> SHOW DATABASES;
+--------------------+
| SCHEMA_NAME        |
+--------------------+
| cp.default         |
| dfs.default        |
| dfs.root           |
| dfs.tmp            |
| information_schema |
| kudu               |
| sys                |
+--------------------+
7 rows selected (24.818 seconds)
apache drill> use kudu;
+------+----------------------------------+
| ok   | summary                          |
+------+----------------------------------+
| true | Default schema changed to [kudu] |
+------+----------------------------------+
1 row selected (0.357 seconds)
apache drill (kudu)> SHOW TABLES;
+--------------+--------------------------------+
| TABLE_SCHEMA | TABLE_NAME                     |
+--------------+--------------------------------+
| kudu         | impala::default.customer       |
| kudu         | impala::default.my_first_table |
+--------------+--------------------------------+
2 rows selected (9.045 seconds)
Now, when trying to run "select * from impala::default.customer", I'm not able to run it at all.
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `impala::default`.customer;
Error: VALIDATION ERROR: Schema [[impala::default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 8a4ca4da-2488-4775-b2f3-443b8b4b17ef ] (state=,code=0)
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `default`.customer;
Error: VALIDATION ERROR: Schema [[default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: ce96ea13-392f-4910-9f6c-789a6052b5c1 ] (state=,code=0)
apache drill (kudu)>
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `impala`::`default`.customer;
Error: PARSE ERROR: Encountered ":" at line 1, column 23.
SQL Query: SELECT * FROM `impala`::`default`.customer
^
[Error Id: 5aacdd98-db6e-4308-9b33-90118efa3625 ] (state=,code=0)
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `impala::`.`default`.customer;
Error: VALIDATION ERROR: Schema [[impala::, default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 5450bd90-dfcd-4efe-a8d3-b517be85b10a ] (state=,code=0)
>>>>>>>>>>>
In Drill conventions, the first part of the FROM clause is the storage plugin, in this case kudu. When you ran the SHOW TABLES query, you saw that the table name is actually impala::default.my_first_table. If I'm reading that correctly, that whole bit is the table name, and the query below is how you should escape it.
Note the backtick before impala and after my_first_table, but nowhere else.
SELECT *
FROM kudu.`impala::default.my_first_table`
Does that work for you?
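If that works, the same quoting should presumably apply to the customer table from the question as well:
SELECT *
FROM kudu.`impala::default.customer`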

How to load all the mapper output files into a HIVE table?

Hello. Below is the use case.
These are the Sqoop-generated files for the statewise testing details table:
$ hadoop fs -ls /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS
Found 5 items
-rw-r--r-- 1 cloudera supergroup 0 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 19674 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000
-rw-r--r-- 1 cloudera supergroup 19716 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00001
-rw-r--r-- 1 cloudera supergroup 18761 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00002
-rw-r--r-- 1 cloudera supergroup 20176 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00003
I would like to load all of them into a table in HIVE, but I am unable to do so.
HIVE:
load data '/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000' into table STATEWISE_TESTING_DETAILS;
Fails :
FAILED: ParseException line 1:10 missing INPATH at ''/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000'' near ''
It seems like you are missing INPATH in the LOAD DATA statement, so try a command like the one below:
load data inpath '/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000' into table STATEWISE_TESTING_DETAILS;
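If the goal is to load all four part files at once, LOAD DATA also accepts a directory path, in which case Hive moves every file inside that directory into the table. A sketch, assuming the directory contains only the files listed above and no subdirectories:
load data inpath '/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS' into table STATEWISE_TESTING_DETAILS;
The empty _SUCCESS marker is moved along with the part files, but since it is zero bytes it contributes no rows.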
Also, Sqoop is capable of moving data directly into Hive. This way you can have Sqoop do the data load for you, or even create the Hive table. Check out these:
https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_importing_data_into_hive
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-access/content/using_sqoop_to_move_data_into_hive.html

Can I display column headings when querying via gcloud dataproc jobs submit spark-sql?

I'm issuing a spark-sql job to dataproc that simply displays some data from a table:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "select * from mydb.mytable limit 10"
When the data is returned and written to stdout I don't see column headings; I only see the raw data, whitespace delimited. I'd really like the output to be formatted better; specifically, I'd like to see column headings. I tried this:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "SET hive.cli.print.header=true;select * from mydb.mytable limit 10"
But it had no effect.
Is there a way to get spark-sql to display column headings on dataproc?
If there is a way to get the data displayed like so:
+----+-------+
| ID | Name  |
+----+-------+
| 1  | Jim   |
| 2  | Ann   |
| 3  | Simon |
+----+-------+
then that would be even better.
I have been performing some tests with a Dataproc cluster, and it looks like it is not possible to retrieve query results with the column names using Spark SQL. However, this is more of an Apache Spark SQL issue than a Dataproc one, so I have added that tag to your question too, in order for it to receive better attention.
If you get into the Spark SQL console in your Dataproc cluster (by SSHing into the master and typing spark-sql), you will see that the result of a SELECT query does not include the column names:
SELECT * FROM mytable;
18/04/17 10:31:51 INFO org.apache.hadoop.mapred.FileInputFormat: Total input files to process : 3
2 Ann
1 Jim
3 Simon
There's no change if you use SELECT ID FROM mytable; instead. Therefore, the issue is not in the gcloud dataproc jobs submit spark-sql command, but in the fact that Spark SQL does not provide this type of data.
If you do not necessarily have to use Spark SQL, you can try using HIVE instead. HIVE does provide the type of information you want (including the column names, plus prettier formatting):
user@project:~$ gcloud dataproc jobs submit hive --cluster <CLUSTER_NAME> -e "SELECT * FROM mytable;"
Job [JOB_ID] submitted.
Waiting for job output...
+-------------+---------------+--+
| mytable.id  | mytable.name  |
+-------------+---------------+--+
| 2           | Ann           |
| 1           | Jim           |
| 3           | Simon         |
+-------------+---------------+--+
Job [JOB_ID] finished successfully.
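Applied to the exact command from the question (keeping its cluster, region and table names), the Hive equivalent would presumably be:
gcloud dataproc jobs submit hive --cluster mycluster --region europe-west1 -e "select * from mydb.mytable limit 10"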

Presto can't fetch content from HIVE table

My environment:
hadoop 1.0.4
hive 0.12
hbase 0.94.14
presto 0.56
All packages are installed on a pseudo-distributed machine. The services are not running on localhost but on the host name with a static IP.
presto conf:
coordinator=false
datasources=jmx,hive
http-server.http.port=8081
presto-metastore.db.type=h2
presto-metastore.db.filename=/root
task.max-memory=1GB
discovery.uri=http://<HOSTNAME>:8081
In the Presto CLI I can list the Hive tables successfully:
presto:default> show tables;
Table
-------------------
ht1
k_business_d_
k_os_business_d_
...
tt1_
(11 rows)
Query 20140114_072809_00002_5zhjn, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:11 [11 rows, 291B] [0 rows/s, 26B/s]
but when I try to query data from any table, the result is always empty (no error information):
presto:default> select * from k_business_d_;
key | business | business_name | collect_time | numofalarm | numofhost | test
-----+----------+---------------+--------------+------------+-----------+------
(0 rows)
Query 20140114_072839_00003_5zhjn, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
If I execute the same SQL in HIVE, the result shows there is 1 row in the table.
hive> select * from k_business_d_;
OK
9223370648089975807|2 2 测试机 2014-01-04 00:00:00 NULL 1.0 NULL
Time taken: 2.574 seconds, Fetched: 1 row(s)
Why can't Presto fetch from HIVE tables?
It looks like this is an external table that uses HBase via org.apache.hadoop.hive.hbase.HBaseStorageHandler. This is not supported yet, but one mailing list post indicates it might be possible if you copy the appropriate jars to the Hive plugin directory: https://groups.google.com/d/msg/presto-users/U7vx8PhnZAA/9edzcK76tD8J
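One quick way to confirm that the table really is HBase-backed (assuming it was created with a STORED BY clause) is to look at its DDL from the Hive CLI:
hive> SHOW CREATE TABLE k_business_d_;
If the output contains a line like STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler', the table goes through the HBase storage handler and hits the limitation described above.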

Sqoop import lastmodified gives duplicate records. It doesn't merge

I am facing a kind of weird issue. In most places I found that when lastmodified is used, the old and new files are merged to remove duplicates. However, in my case that is not happening.
I used :
sqoop import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root -password password --table LoginRoles --hive-import --create-hive-table --hive-table LoginRoles --hive-delims-replacement " "
The table got created and the data was loaded properly into the /user/hive/warehouse location.
LoginRoleId LoginRole CreatedDate ModifiedDate
1 admin1 2013-09-30 14:21:28 2013-09-30 16:03:39
2 admin2 2013-09-30 14:36:23 2013-09-30 15:53:19
3 admin3 2013-09-30 14:39:13 2013-09-30 14:39:13
4 admin5 2013-09-30 14:40:55 2013-09-30 14:40:55
Now I ran the query below, and ModifiedDate got updated to '2013-09-30 17:03:44':
update loginroles set ModifiedDate=now(),loginrole="admin4" where LoginRoleID=4;
I created the job as below and ran it with sqoop job --exec mymodified:
sqoop job --create mymodified -- import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root -password password --table LoginRoles --hive-import --hive-table LoginRoles --hive-delims-replacement " " --check-column ModifiedDate --incremental lastmodified --last-value '2013-09-30 16:03:39'
I see a total of 5 rows in Hive, as below:
1 admin1 2013-09-30 14:21:28.0 2013-09-30 16:03:39.0
4 admin4 2013-09-30 14:40:55.0 2013-09-30 17:03:44.0
2 admin2 2013-09-30 14:36:23.0 2013-09-30 15:53:19.0
3 admin3 2013-09-30 14:39:13.0 2013-09-30 14:39:13.0
4 admin5 2013-09-30 14:40:55.0 2013-09-30 14:40:55.0
I am sure I am missing something important and subtle.
Version details of sqoop used
Sqoop 1.4.3-cdh4.3.0
git commit id 7a52f9aa97cba43aae8b700f7e93f97dcdb0b21a
Compiled by jenkins on Mon May 27 20:33:21 PDT 2013
This approach does not work at this point in time. I have posted in the Cloudera Google group, and for now this won't work; I will have to use a workaround that creates staging folders and then cleans them up. The link below helped me solve the issue; see the sketch after it for one way the staging-and-merge workaround can be put together.
http://himanshubaweja.com/post/7529434265/analytics-reached-mysql-limit-lets-hive
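For reference, here is a rough sketch of that staging-and-merge workaround using the standard Sqoop tools. The HDFS paths are hypothetical, and the jar/class names come from a sqoop codegen run rather than from anything in the post above:
# 1. Pull only the rows modified since the last run into a fresh staging directory
sqoop import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root --password password \
  --table LoginRoles --target-dir /staging/LoginRoles_delta \
  --check-column ModifiedDate --incremental lastmodified --last-value '2013-09-30 16:03:39'
# 2. Generate the record class that the merge tool needs (codegen prints the path of the jar it writes)
sqoop codegen --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root --password password --table LoginRoles
# 3. Fold the delta into the base data, keeping the newest row per LoginRoleId
sqoop merge --new-data /staging/LoginRoles_delta --onto /base/LoginRoles \
  --target-dir /base/LoginRoles_merged --jar-file /path/to/LoginRoles.jar \
  --class-name LoginRoles --merge-key LoginRoleId
The merged directory can then be loaded into the Hive table (for example with LOAD DATA INPATH) and the staging directory cleaned up, which is the "staging folders and cleaning them" part mentioned above.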