How to load all the mappers into HIVE table? - hive

Hello, below is the use case.
These are the Sqoop-generated files for the table STATEWISE_TESTING_DETAILS:
$ hadoop fs -ls /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS
Found 5 items
-rw-r--r-- 1 cloudera supergroup 0 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 19674 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000
-rw-r--r-- 1 cloudera supergroup 19716 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00001
-rw-r--r-- 1 cloudera supergroup 18761 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00002
-rw-r--r-- 1 cloudera supergroup 20176 2020-11-23 15:38 /STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00003
I would like to load all of them into a table in Hive, but I am unable to do so.
Hive:
load data '/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000' into table STATEWISE_TESTING_DETAILS;
This fails with:
FAILED: ParseException line 1:10 missing INPATH at ''/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000'' near ''

It seems you are missing the INPATH keyword in the LOAD DATA statement, so try the command like below:
load data inpath '/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS/part-m-00000' into table STATEWISE_TESTING_DETAILS;
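Note that LOAD DATA INPATH also accepts a directory, in which case Hive moves every file inside it into the table, so you can load all the mapper outputs in one statement. A sketch along those lines, assuming the table already exists with delimiters matching the Sqoop output (the zero-byte _SUCCESS marker is harmless if it gets moved as well):
load data inpath '/STATEWISE_TESTING_DETAILS_23_Nov/STATEWISE_TESTING_DETAILS' into table STATEWISE_TESTING_DETAILS;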
Also, Sqoop can move data directly into Hive, so you can have Sqoop handle the data load and even the Hive table creation for you.
Check out these:
https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_importing_data_into_hive
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-access/content/using_sqoop_to_move_data_into_hive.html
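For reference, a hedged sketch of what a direct Sqoop-to-Hive import could look like; the connection string, credentials and mapper count are placeholders, not values taken from the question:
sqoop import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username <user> --password <password> --table STATEWISE_TESTING_DETAILS --hive-import --create-hive-table --hive-table STATEWISE_TESTING_DETAILS -m 4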

Related

Hive query fails with file not found error that should not be possible

I have a hive external table that is partitioned by the date and time it was inserted, say for example 20200331_0505 which is in YYYYMMDD_HHMM format.
Currently there is only one partition:
> hdfs dfs -ls /path/to/external/table
-rw-r----- 2020-03-31 05:06 /path/to/external/table/_SUCCESS
drwxr-x--- 2020-03-31 05:06 /path/to/external/table/loaddate=20200331_0505
And if I run a hive query to find the partitions:
select distinct loaddate from table;
+----------------+
| loaddate |
+----------------+
| 20200331_0505 |
+----------------+
That is expected and what I want to see, but if I run this:
select * from table where loaddate=(select max(loaddate) from table);
Then I get this error:
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 3, vertexId=vertex_1585179445264_14095_4_00, diagnostics=[Vertex vertex_1585179445264_14095_4_00 [Map 3] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: <Table> initializer failed, vertex=vertex_1585179445264_14095_4_00 [Map 3], java.lang.RuntimeException: ORC split generation failed with exception: java.io.FileNotFoundException: File hdfs://path/to/external/table/loaddate=20200327_0513 does not exist.
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
So it is trying to load a partition that does not exist (20200327_0513). What could be causing this?
When you delete partitions, either directly with the rm command or with something like a SaveMode.Overwrite write, Hive is not alerted to the change, so its metastore still lists the old partitions and you need to tell Hive that they have changed. There are many ways to do this; the way I chose to fix it was:
msck repair table <table> sync partitions
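As far as I know, the SYNC PARTITIONS clause of MSCK REPAIR TABLE requires Hive 3.0 or later. On older versions, a hedged alternative is to drop the stale partition from the metastore by hand, using the partition value reported in the error above:
alter table <table> drop if exists partition (loaddate='20200327_0513');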

String/INT96 to Datetime - Amazon Athena/SQL - DDL/DML

I have hosted my data in an S3 bucket in Parquet format and I am trying to access it using Athena. I can see that I can successfully access the hosted table, but I noticed something fishy when I tried to access a column called "createdon".
createdon is a timestamp column and it is reflected as such in the Athena table, but when I query it with the SQL below, I get the wrong values.
SELECT createdon FROM "uat-raw"."opportunity" limit 10;
Unexpected output:
+51140-02-21 19:00:00.000
+51140-02-21 21:46:40.000
+51140-02-22 00:50:00.000
+51140-02-22 03:53:20.000
+51140-02-22 06:56:40.000
+51140-02-22 09:43:20.000
+51140-02-22 12:46:40.000
Expected output:
2019-02-21 19:00:00.000
2019-02-21 21:46:40.000
2019-02-22 00:50:00.000
2019-02-22 03:53:20.000
2019-02-22 06:56:40.000
2019-02-22 09:43:20.000
2019-02-22 12:46:40.000
Can anyone help me with this? I have also attached a picture for more information.
I expect an SQL query which I can use to query my data on S3 from Athena.
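Timestamps tens of thousands of years in the future usually mean an epoch value stored in milliseconds is being read as seconds. A hedged sketch of a query that re-scales the value under that assumption, reusing the column and table names from the question:
SELECT from_unixtime(to_unixtime(createdon) / 1000) AS createdon FROM "uat-raw"."opportunity" LIMIT 10;
If that returns sensible dates, the underlying cause is how the timestamp is written to Parquet, and the cast above can serve as a query-side workaround.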

Insert overwrite in hive doesn't work properly

I'm creating external tables in Hive and then using insert overwrite directory ... to add the files. The second time I run my query, I expect the old files to be deleted and replaced by new ones (because I have the overwrite option). However, that is not the case: new files get added to the directory without removing the old ones, which causes inconsistency in the data. What is going wrong here?
I was going to submit a bug, but this is an existing issue: HIVE-13997 -- apply the patch if you wish to use overwrite directory and get the expected results.
From testing, what I have found is:
overwrite directory and overwrite table work differently: you should use overwrite table if you wish to overwrite the entire directory.
I created this table for the test: create external table t2 (a1 int,a2 string) LOCATION '/user/cloudera/t2';
overwrite directory:
The directory is, as you would expect, OVERWRITten; in other words, if the specified path exists, it is clobbered and replaced with the output.
hive> insert overwrite directory '/user/cloudera/t2' select * from sqoop_import.departments;
So if the above statement is going to write its data to /user/cloudera/t2/000000_0, then only that file is overwritten.
#~~~~ BEFORE ~~~~
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 17:42 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
-rwxr-xr-x 1 cloudera cloudera 88 2016-07-25 15:48 /user/cloudera/t2/part-m-00000
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 15:48 /user/cloudera/t2/part-m-00001
#~~~~ AFTER: Note the timestamp ~~~~
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 18:01 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
-rwxr-xr-x 1 cloudera cloudera 88 2016-07-25 15:48 /user/cloudera/t2/part-m-00000
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 15:48 /user/cloudera/t2/part-m-00001
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
overwrite table:
The contents of the chosen table or partition are replaced with the output of corresponding select statement.
hive> insert overwrite table t2 select * from sqoop_import.departments;
Now the entire directory is overwritten:
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 18:03 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
So in conclusion, overwrite directory only overwrites the direct path of the generated file, not the whole directory. See Writing-data-into-the-file-system-from-queries.
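If you are on a Hive build without the HIVE-13997 patch, a hedged workaround (my own suggestion, not part of the original answer) is to clear the old files from the target directory yourself before running the overwrite:
hadoop fs -rm -r /user/cloudera/t2/part-m-*
hive -e "insert overwrite directory '/user/cloudera/t2' select * from sqoop_import.departments;"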

Presto can't fetch content in HIVE table

My environment:
hadoop 1.0.4
hive 0.12
hbase 0.94.14
presto 0.56
All packages are installed on a single pseudo-distributed machine. The services are not running on localhost but on a host name with a static IP.
presto conf:
coordinator=false
datasources=jmx,hive
http-server.http.port=8081
presto-metastore.db.type=h2
presto-metastore.db.filename=/root
task.max-memory=1GB
discovery.uri=http://<HOSTNAME>:8081
In presto cli I can get the table in hive successfully:
presto:default> show tables;
Table
-------------------
ht1
k_business_d_
k_os_business_d_
...
tt1_
(11 rows)
Query 20140114_072809_00002_5zhjn, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:11 [11 rows, 291B] [0 rows/s, 26B/s]
But when I try to query data from any table, the result is always empty (no error information):
presto:default> select * from k_business_d_;
key | business | business_name | collect_time | numofalarm | numofhost | test
-----+----------+---------------+--------------+------------+-----------+------
(0 rows)
Query 20140114_072839_00003_5zhjn, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
If I execute the same SQL in Hive, the result shows there is 1 row in the table:
hive> select * from k_business_d_;
OK
9223370648089975807|2 2 测试机 2014-01-04 00:00:00 NULL 1.0 NULL
Time taken: 2.574 seconds, Fetched: 1 row(s)
Why can't Presto fetch from Hive tables?
It looks like this is an external table that uses HBase via org.apache.hadoop.hive.hbase.HBaseStorageHandler. This is not supported yet, but one mailing list post indicates it might be possible if you copy the appropriate jars to the Hive plugin directory: https://groups.google.com/d/msg/presto-users/U7vx8PhnZAA/9edzcK76tD8J
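For context, a Hive-over-HBase table of this kind is typically declared with the HBase storage handler. A hedged sketch of what the DDL probably looks like; the column types and column-family mappings are guesses based on the question, not the actual table definition:
CREATE EXTERNAL TABLE k_business_d_ (key string, business string, business_name string, collect_time string, numofalarm double, numofhost double, test string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:business,cf:business_name,cf:collect_time,cf:numofalarm,cf:numofhost,cf:test')
TBLPROPERTIES ('hbase.table.name' = 'k_business_d_');
Because the rows live in HBase rather than in files on HDFS, Presto's Hive connector has no data files to split, which is consistent with it listing the table but returning zero rows.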

Sqoop import lastmodified gives duplicate records. It doesn't merge

I am facing a kind of weird issue. In most places I found that when lastmodified is used, old and new files will be merged to remove duplicates. However, in my case that is not happening.
I used:
sqoop import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root -password password --table LoginRoles --hive-import --create-hive-table --hive-table LoginRoles --hive-delims-replacement " "
The table got created and the data was loaded properly into the /user/hive/warehouse location.
LoginRoleId LoginRole CreatedDate ModifiedDate
1 admin1 2013-09-30 14:21:28 2013-09-30 16:03:39
2 admin2 2013-09-30 14:36:23 2013-09-30 15:53:19
3 admin3 2013-09-30 14:39:13 2013-09-30 14:39:13
4 admin5 2013-09-30 14:40:55 2013-09-30 14:40:55
Now I ran the query below and ModifiedDate got updated to '2013-09-30 17:03:44':
update loginroles set ModifiedDate=now(),loginrole="admin4" where LoginRoleID=4;
When I created the job as below and ran it with sqoop job --exec mymodified,
sqoop job --create mymodified -- import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root -password password --table LoginRoles --hive-import --hive-table LoginRoles --hive-delims-replacement " " --check-column ModifiedDate --incremental lastmodified --last-value '2013-09-30 16:03:39'
I see a total of 5 rows in Hive, as below:
1 admin1 2013-09-30 14:21:28.0 2013-09-30 16:03:39.0
4 admin4 2013-09-30 14:40:55.0 2013-09-30 17:03:44.0
2 admin2 2013-09-30 14:36:23.0 2013-09-30 15:53:19.0
3 admin3 2013-09-30 14:39:13.0 2013-09-30 14:39:13.0
4 admin5 2013-09-30 14:40:55.0 2013-09-30 14:40:55.0
I am sure I am missing something important and subtle.
Version details of sqoop used
Sqoop 1.4.3-cdh4.3.0
git commit id 7a52f9aa97cba43aae8b700f7e93f97dcdb0b21a
Compiled by jenkins on Mon May 27 20:33:21 PDT 2013
This approach does not work at this point in time. I have posted in the Cloudera Google group, and for now this won't work. I will have to use a workaround that creates staging folders and cleans them up. The link below helped me solve the issue:
http://himanshubaweja.com/post/7529434265/analytics-reached-mysql-limit-lets-hive
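As an aside, I believe later Sqoop 1.4.x releases added a --merge-key option for --incremental lastmodified imports into a plain HDFS target directory, which flattens old and new records on the given key; to my knowledge it does not combine with --hive-import, so a staging-and-reload workaround like the one above is still needed for Hive tables. A hedged sketch, with placeholder connection details:
sqoop import --connect "jdbc:mysql://<hostname>:3306/<dbname>" --username root --password password --table LoginRoles --target-dir /staging/LoginRoles --check-column ModifiedDate --incremental lastmodified --last-value '2013-09-30 16:03:39' --merge-key LoginRoleId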