Delimiters to be used while creating a Hive table

I imported a logs table from SQL Server to HDFS in compressed (.gz) format:
sqoop import --connect "jdbc:jtds:sqlserver://ServerName:1433/Test" --username sa --password root --table log --target-dir hdfs://localhost:50071/TestMain --fields-terminated-by "¤" --hive-import --create-hive-table --compress --split-by Logid
Then I created an external table in Hive on top of this data:
CREATE EXTERNAL TABLE TestMain(LogMessage varchar(2000))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "¤"
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:50071/TestMain';
The logs have a column of datatype nvarchar(max) in SQL Server. Which datatype should I use for it in Hive?
I tried the string datatype in Hive, but I am facing the issue below:
when running a select query in Hive, I can see only the first few words of the field, not the whole column value.
Example:
That field has the following value in SQL Server:
Message: Procedure or function 'XYZ' expects parameter '#ABC', which was not supplied.
Stacktrace: a full five-line error stack trace.
Value visible while querying in Hive:
Procedure or function 'XYZ' expects parameter '#ABC', which was not supplied.
It seems to be an issue with the field and line delimiters.
Hive supports only newline as the line delimiter, so the line breaks embedded in the stack trace are presumably being treated as row boundaries. I think this is causing the issue.
Kindly suggest a solution or a better way to query this data in HDFS.
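One possible workaround, sketched below under the assumption that the embedded line breaks in the nvarchar(max) column are what splits each log across several Hive rows: Sqoop can strip the characters Hive treats as delimiters (\n, \r, \01) from string fields at import time. The command simply extends the import shown above with --hive-drop-import-delims.
sqoop import --connect "jdbc:jtds:sqlserver://ServerName:1433/Test" --username sa --password root --table log --target-dir hdfs://localhost:50071/TestMain --fields-terminated-by "¤" --hive-import --create-hive-table --compress --split-by Logid --hive-drop-import-delims
If you would rather keep a marker where the line breaks were, --hive-delims-replacement " " substitutes a space instead of dropping the characters. With that in place, string remains a reasonable Hive counterpart for nvarchar(max).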

How to create/copy data to partitions in hive manually

I am working on a Hive solution where I need to append some values to high-volume files. Instead of appending them directly, I am trying a map-reduce style approach.
The approach is below
Table creation:
create external table demo_project_data(data string) PARTITIONED BY (business_date string, src_sys_file_nm string, prd_typ_cd string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
LOCATION '/user/hive/warehouse/demo/project/data';
hadoop fs -mkdir -p /user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd'
echo "ALTER TABLE demo_project_data ADD IF NOT EXISTS PARTITION(business_date='20180707',src_sys_file_nm='a ch_ach_fotp_20180707_1.dat.gz',prd_typ_cd='ach')
LOCATION '/user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd';"|hive
hadoop fs -cp /apps/tdi/data/a_b_c_20180707_1.dat.gz /user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd'
echo "INSERT OVERWRITE DIRECTORY '/user/20180707' select *,'~karthick~kb~demo' from demo_project_data where src_sys_file_nm='a_b_c_20180707_1.dat.gz' and business_date='20180707' and prd_typ_cd='abcd';"|hive
I have data in the file, but I do not see any results from the above query. The files are copied properly under the correct location.
What am I doing wrong? The query itself has no issues.
Also, I will be looping over multiple dates, so I would like to know whether this is the right way to do it.
You can use the command below to register the partitions with the metastore so the query can see them:
MSCK REPAIR TABLE <tablename>;
Refer to MSCK REPAIR TABLE (Recover Partitions):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
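On looping over multiple dates: a minimal sketch, assuming the placeholder names from the question and that each date's file follows the same a_b_c_<date>_1.dat.gz pattern. It copies each file into a key=value partition directory and then registers every new partition in one pass with MSCK REPAIR TABLE instead of issuing one ALTER TABLE per date.
# hypothetical list of business dates to load
for dt in 20180707 20180708; do
  src="a_b_c_${dt}_1.dat.gz"
  part_dir="/user/hive/warehouse/demo/project/data/business_date=${dt}/src_sys_file_nm=${src}/prd_typ_cd=abcd"
  hadoop fs -mkdir -p "${part_dir}"                       # create the key=value partition directory
  hadoop fs -cp "/apps/tdi/data/${src}" "${part_dir}/"    # copy the source file into it
done
hive -e "MSCK REPAIR TABLE demo_project_data;"            # pick up all new partition directories at once
Note that MSCK REPAIR TABLE only discovers directories that follow the key=value naming convention, which the layout above does.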

Hue on Cloudera - NULL values (importing file)

Yesterday I installed the Cloudera QuickStart VM 5.8. After importing files from the database through Hue, some tables ended up with NULL values (the entire column). In the previous steps the data was displayed properly, as it should have been imported.
Can you run the command describe formatted table_name in the Hive shell to see what the field delimiter is, and then go to the warehouse directory and check whether the delimiter in the data matches the one in the table definition? I am fairly sure they will not match, and that is why you see NULLs.
(I am assuming you imported the data into the default warehouse directory.)
Then you can do one of the following:
1) drop your Hive table and recreate it with the delimiter that actually appears in the data (ROW FORMAT DELIMITED FIELDS TERMINATED BY "your delimiter"), giving the location of your data file,
or
2) delete the imported data and run the Sqoop import again, passing --fields-terminated-by "the delimiter in the Hive table definition".
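A minimal sketch of that check (the table name is a placeholder for whatever Hue created, and the file name under the warehouse directory will differ):
hive -e "DESCRIBE FORMATTED my_imported_table;"    # look for field.delim / serialization.format under Storage Desc Params
hadoop fs -cat /user/hive/warehouse/my_imported_table/part-m-00000 | head -5    # inspect the raw delimiter in the data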
Also check the datatypes of the second (col_1) and third (col_2) columns in the original database you are exporting from.
This cannot be a case of a wrong delimiter; otherwise the fourth column (col_3) would not have been populated correctly, which it is.

Creating text table from Impala partitioned parquet table

I have a partitioned Parquet table whose directory layout looks like this:
.impala_insert_staging
yearmonth=2013-04
yearmonth=2013-05
yearmonth=2013-06
...
yearmonth=2016-04
Underneath each of these directories are my Parquet files. I need to get them into another table, which just has an
.impala_insert_staging
entry.
Please help.
The best approach I found is to pull the data down locally and Sqoop it back up into a text table.
To pull the parquet table down I performed the following:
impala-shell -i <ip-addr> -B -q "use default; select * from <table>" -o filename '--output_delimiter=\x1A'
Unfortunately this adds the yearmonth partition value as another column in my output. So I either go into my 750 GB file and sed/awk out that last column, or use mysqlimport (since I'm using MySQL as well) to import only the columns I'm interested in.
Finally, I Sqoop the data up into a new text table:
sqoop import --connect jdbc:mysql://<mysqlip> --table <mysql_table> -uroot -p<pass> --hive-import --hive-table <new_db_text>
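A hedged alternative that avoids the local round trip entirely, assuming both tables live in the same Impala/Hive metastore (table and column names below are placeholders): create the target as a text table and let the engine rewrite the Parquet data with INSERT ... SELECT.
CREATE TABLE my_text_table (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- list the columns explicitly so the yearmonth partition column is only included if the target needs it
INSERT INTO my_text_table SELECT col1, col2 FROM my_parquet_table;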

Incremental updates in Hive using Sqoop append data into the middle of the table

I am trying to append new data from SQL Server to Hive using the following command:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password passwd --table testable --where "ID > 11854" --hive-import -hive-table hivedb.hivetesttable --fields-terminated-by ',' -m 1
This command appends the data.
But when I run
select * from hivetesttable;
it does not show the new data at the end.
This is because the Sqoop import that appends the new data names its mapper output part-m-00000-copy.
So the files in my Hive table directory look like this:
part-m-00000
part-m-00000-copy
part-m-00001
part-m-00002
Is there any way to append the data at the end by changing the name of the mapper output file?
Hive, like any other relational database, doesn't guarantee any order unless you explicitly use an ORDER BY clause.
You're correct in your analysis: the reason the data appears in the "middle" is that Hive reads one file after another in lexicographical order, and Sqoop simply names the new files so that they land somewhere in the middle of that list.
However, this operation is fully valid: Sqoop appended data to the Hive table, and because your query has no explicit ORDER BY clause, the result carries no guarantees with regard to order. In fact, Hive itself could change this behavior and read files based on creation time without breaking any compatibility.
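To illustrate, assuming the table exposes the ID column used in the --where clause, an explicit ordering makes the newly appended rows show up where you expect them:
-- order is only guaranteed when it is requested explicitly
SELECT * FROM hivedb.hivetesttable ORDER BY id;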
I'm also interested in how this affects your use case. I assume the query listing all rows is just a test one; do you have any issues with actual production queries?

Hadoop Hive: create external table with dynamic location

I am trying to create a Hive external table that points to an S3 output file.
The file name should reflect the current date (it is always a new file).
I tried this:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION concat('s3://BlobStore/Exports/Daily_', from_unixtime(unix_timestamp(),'yyyy-MM-dd'));
but I get an error:
FAILED: Parse Error: line 3:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
Is there any way to dynamically specify the table location?
OK, I found the Hive variables feature.
So I pass the location on the CLI as follows:
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
and then use the variable in the Hive command:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${s3file}';
This doesn't work on my side; how did you make this happen?
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
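For reference, a hedged end-to-end sketch of the same pattern (the bucket path and column list are placeholders, and S3 access is assumed to be configured). Note that the variable name is case-sensitive, so the -d definition and the ${...} reference must match exactly.
# `date +%F` expands in the shell before Hive starts; \${s3file} is left for Hive to substitute
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/ -e "
CREATE EXTERNAL TABLE s3_export (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '\${s3file}';"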