Hive: How to load data produced by Apache Pig into a Hive table?

I am trying to load the output of a Pig job into a Hive table. The data is stored with an Avro schema on HDFS. In the Pig job, I am simply doing:
data = LOAD 'path' USING AvroStorage();
data = FILTER data BY some_property;
STORE data INTO 'outputpath' USING AvroStorage();
I am trying to load it into a Hive table by doing:
load data inpath 'outputpath' into table table_with_avro_schema partition(somepartition);
However, I am getting an error saying that:
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Invalid partition key & values; keys [somepartition, ], values [])
Can someone please suggest what I am doing wrong here? Thanks a lot!

I just figured out that this happens because the LOAD operation does not deserialize the data. It simply acts like a copy operation. Thus, in order to fix it, you should follow these steps:
1. CREATE EXTERNAL TABLE some_table LIKE SOME_TABLE_WITH_SAME_SCHEMA;
2. LOAD DATA INPATH 'SOME_PATH' INTO TABLE some_table;
3. INSERT INTO TABLE TARGET_TABLE SELECT * FROM some_table;
Basically, we should first load the data into a staging external table and then insert it into the target Hive table.
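For illustration, here is a minimal sketch of those three steps for a partitioned Avro target; the staging table name, the columns, the HDFS paths and the partition value are all made up:

-- Unpartitioned staging table matching the Avro schema of the Pig output
CREATE EXTERNAL TABLE staging_avro (id BIGINT, name STRING)
STORED AS AVRO
LOCATION '/user/me/staging';

-- Move the Pig output files under the staging table; no deserialization happens here
LOAD DATA INPATH '/user/me/outputpath' INTO TABLE staging_avro;

-- Rewriting the rows into the partitioned target is what deserializes the Avro records
INSERT INTO TABLE table_with_avro_schema PARTITION (somepartition='2019-01-01')
SELECT id, name FROM staging_avro;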

Related

How do I load data into a Cloudera Impala table?

I'm loading data into a Cloudera Impala table over ODBC using a post-SQL statement, but I'm getting a "URI path must be absolute" error. Below is my SQL.
REFRESH sw_cfnusdata.CPN_Sales_Data;
DROP TABLE IF EXISTS sw_cfnusdata.CPN_Sales_Data_parquet;
CREATE TABLE IF NOT EXISTS sw_cfnusdata.CPN_Sales_Data_parquet LIKE
sw_cfnusdata.CPN_Sales_Data STORED AS PARQUET;
REFRESH sw_cfnusdata.CPN_Sales_Data_parquet;
LOAD DATA INPATH 'data/shared_workspace/sw_cfnusdata/Alteryx_CPN_Sales_Data' OVERWRITE INTO TABLE sw_cfnusdata.CPN_Sales_Data_parquet;
REFRESH sw_cfnusdata.CPN_Sales_Data_parquet;
COMPUTE STATS sw_cfnusdata.CPN_Sales_Data;
DROP TABLE sw_cfnusdata.CPN_Sales_Data;
Any ideas on what I'm missing here? I tried the same statement without the COMPUTE STATS step and still got the same error. Thank you in advance.
You need to provide an HDFS path.
Upload the file into HDFS and try the same command with an HDFS path, such as hdfs://DEV/data/sampletable.
Or else you can keep the file on local disk and try the command below:
load data local inpath "/data/sampletable.txt" into table sampletable;
So the statement below needs to be changed: it needs either an absolute HDFS path or a local path.
LOAD DATA INPATH 'data/shared_workspace/sw_cfnusdata/Alteryx_CPN_Sales_Data' OVERWRITE INTO TABLE sw_cfnusdata.CPN_Sales_Data_parquet;
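Assuming the exported files actually live under /data/shared_workspace/sw_cfnusdata on HDFS, the corrected statement would look something like this (the leading slash, i.e. an absolute path, is the part that matters):
LOAD DATA INPATH '/data/shared_workspace/sw_cfnusdata/Alteryx_CPN_Sales_Data' OVERWRITE INTO TABLE sw_cfnusdata.CPN_Sales_Data_parquet;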

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns: id, name, position.
Then I stored the data in S3 in ORC format using Spark.
When I query select * from person, it returns everything.
But when I query it from Presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I have found the answer to the problem: when I stored the data in S3, the file contained one more column than was defined in the Hive table in the metastore.
So when Presto tried to query the data, it found a varchar where it expected an integer.
This can also happen if one record has a type different from what is defined in the metastore.
I had to delete my data and import it again without that extra, unneeded column.
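To make the mismatch concrete, here is a hypothetical version of the setup; the table name, column types and S3 location are invented:

-- What the metastore knows: three columns
CREATE EXTERNAL TABLE person (id BIGINT, name VARCHAR(100), position VARCHAR(100))
STORED AS ORC
LOCATION 's3a://my-bucket/person/';

-- What Spark actually wrote into the ORC files: four columns, e.g.
--   department STRING, id BIGINT, name STRING, position STRING
-- The fields no longer line up with the declared schema, so Presto can hit a
-- varchar value in a position where the metastore promises a numeric type.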

Loading Avro Data into BigQuery via command-line?

I have created an Avro Hive table and loaded data into it from another table using a Hive INSERT OVERWRITE command. I can see the data in the Avro Hive table, but when I try to load it into a BigQuery table, I get an error.
Table schema:
CREATE TABLE `adityadb1.gold_hcth_prfl_datatype_acceptence`(
`prfl_id` bigint,
`crd_dtl` array< struct < cust_crd_id:bigint,crd_nbr:string,crd_typ_cde:string,crd_typ_cde_desc:string,crdhldr_nm:string,crd_exprn_dte:string,acct_nbr:string,cre_sys_cde:string,cre_sys_cde_desc:string,last_upd_sys_cde:string,last_upd_sys_cde_desc:string,cre_tmst:string,last_upd_tmst:string,str_nbr:int,lng_crd_nbr:string>>)
STORED AS AVRO;
Error that I am getting:
Error encountered during job execution:
Error while reading data, error message: The Apache Avro library failed to read data with the follwing error: Cannot resolve:
I am using the following command to load the data into BigQuery:
bq load --source_format=AVRO dataset.tableName avro-filePath
Make sure that there is data available in the GCS folder you are pointing at and that the data contains the schema (it should if you created it from Hive). Here is an example of how to load data:
bq --location=US load --source_format=AVRO --noreplace my_dataset.my_avro_table gs://myfolder/mytablefolder/part-m-00001.avro

Getting ClassCastException in Hive ORC table

Using Cloudera 8.1. In Hive, I loaded a table in ORC format with a CSV file. I am getting this error when attempting to query the loaded table:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.IntWritable
This is a common mistake I see a lot of people make.
You can create a Hive external table in CSV (text) format and then run
"INSERT INTO TABLE FINAL SELECT * FROM TEMP_TABLE", which will copy the CSV data into the ORC table.
With this method, Hive converts the CSV data into ORC using its built-in libraries.
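A minimal sketch of that approach, with made-up table names, columns, and CSV location:

-- Temporary external table that reads the raw CSV file as plain text
CREATE EXTERNAL TABLE temp_table (id INT, name VARCHAR(50))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/me/csv_staging';

-- Final table stored as ORC
CREATE TABLE final (id INT, name VARCHAR(50))
STORED AS ORC;

-- Hive parses the CSV rows and writes real ORC files
INSERT INTO TABLE final SELECT * FROM temp_table;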

Can Pig be used to LOAD from Parquet table in HDFS with partition, and add partitions as columns?

I have a partitioned Impala table, stored as Parquet. Can I use Pig to load data from this table and add the partitions as columns?
The Parquet table is defined as:
create table test.test_pig (
  name string,
  id bigint
)
partitioned by (gender string, age int)
stored as parquet;
And the Pig script is like:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);
However, gender and age are missing when I DUMP A. Only name and id are displayed.
I have tried with:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);
But I receive an error like:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable
schema: left is "name:bytearray,id:long,gender:bytearray,age:int",
right is "name:bytearray,id:long"
Hope to get some advice here. Thank you!
You should try the org.apache.hcatalog.pig.HCatLoader library.
Normally, Pig supports reading from and writing to partitioned tables through it;
read:
This load statement will load all partitions of the specified table.
/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
...
...
If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-RunningPigwithHCatalog
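For instance, with the table from the question, loading through HCatLoader and pruning on the partition columns might look like the following Pig sketch (the filter values are made up):
A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();
-- gender and age are the partition columns and come back as regular fields
B = FILTER A BY gender == 'M' AND age > 30;
DUMP B;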
write:
HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions
However, I think this hasn't yet been properly tested with Parquet files (at least not by the Cloudera folks):
Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html