Can Pig be used to LOAD from a partitioned Parquet table in HDFS, and add the partitions as columns? - apache-pig

I have a partitioned Impala table, stored as Parquet. Can I use Pig to load data from this table, and have the partitions added as columns?
The Parquet table is defined as:
create table test.test_pig (
name string,
id bigint
)
partitioned by (gender string, age int)
stored as parquet;
And the Pig script is like:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);
However, gender and age are missing when I DUMP A; only name and id are displayed.
I have tried with:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);
But I receive an error like:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable
schema: left is "name:bytearray,id:long,gender:bytearray,age:int",
right is "name:bytearray,id:long"
Hope to get some advice here. Thank you!

You should try the org.apache.hcatalog.pig.HCatLoader library.
Normally, Pig supports reading from and writing to partitioned tables through HCatalog:
read:
This load statement will load all partitions of the specified table.
/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
...
...
If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-RunningPigwithHCatalog
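For example, here is a minimal sketch applied to the table from the question (assuming it is visible to HCatalog as test.test_pig, that Pig is started with the -useHCatalog flag, and with made-up filter values; whether this works for a Parquet-backed table is exactly the caveat noted at the end of this answer):
-- run with: pig -useHCatalog myscript.pig
A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();
-- the partition columns show up as ordinary columns in the schema
DESCRIBE A;  -- schema includes name, id, gender and age
-- partition filter placed right after the load, so only matching partitions are read
B = FILTER A BY gender == 'M' AND age == 30;
DUMP B;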
write:
HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions
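As a sketch of the write side under the same assumptions (the input path and its CSV layout are made up), leaving the partition columns out of the HCatStorer constructor falls back to dynamic partitioning, so gender and age are taken from the data itself:
raw = LOAD '/test/input_data' USING PigStorage(',') AS (name: chararray, id: long, gender: chararray, age: int);
-- no static partition spec: HCatStorer partitions dynamically on gender and age
STORE raw INTO 'test.test_pig' USING org.apache.hcatalog.pig.HCatStorer();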
However, I think this hasn't yet been properly tested with Parquet files (at least not by the Cloudera folks):
Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html

Related

Creating external hive table in databricks

I am using Databricks Community Edition.
I am using a Hive query to create an external table. The query runs without any error, but the table is not getting populated with the file specified in the query.
Any help would be appreciated.
From the official docs: make sure your S3/storage location path and schema (with respect to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct.
DROP TABLE IF EXISTS <example-table> // deletes the metadata
dbutils.fs.rm("<your-s3-path>", true) // deletes the data
CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>
// alternative
CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>"

How to load data to Hive table and make it also accessible in Impala

I have a table in Hive:
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',', 'skip.header.line.count'='1',
'quoteChar'= "\"")
Data is loaded into the table this way:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Why is the table only accessible in Hive? When I attempt to access it in the Hue Impala editor, I get the following error:
AnalysisException: Could not resolve table reference: 'sr2015'
which seems to say there is no such table, but the table does show up in the left panel.
In impala-shell, the error is different:
ERROR: AnalysisException: Failed to load metadata for table: 'sr2015'
CAUSED BY: TableLoadingException: Failed to load metadata for table:
sr2015 CAUSED BY: InvalidStorageDescriptorException: Impala does not
support tables of this type. REASON: SerDe library
'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.
I have always thought that a Hive table and an Impala table are essentially the same, and that the difference is just that Impala is a more efficient query engine.
Can anyone help sort it out? Thank you very much.
Assuming that sr2015 is located in a database called db, to make the table visible in Impala you need to issue either
invalidate metadata db;
or
invalidate metadata db.sr2015;
in the Impala shell.
However, in your case the reason is probably the version of Impala you're using, since it doesn't support that table format at all.

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns: id, name, position.
Then I stored the data in S3 in ORC format using Spark.
When I query select * from person it returns everything.
But when I query from presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I found the answer to the problem: when I stored the data in S3, the file contained one more column than was defined in the Hive table metastore.
So when Presto tried to query the data, it found a varchar where it expected an integer.
This might also happen if one record has a type different from what is defined in the metastore.
I had to delete my data and import it again without that extra, unneeded column.

Hive: How to load data produced by apache pig into a hive table?

I am trying to load the output of Pig into a Hive table. The data is stored as Avro on HDFS. In the Pig job, I am simply doing:
data = LOAD 'path' using AvroStorage();
data = FILTER data BY some_condition;
STORE data into 'outputpath' using AvroStorage();
I am trying to load it into a hive table by doing:
load data inpath 'outputpath' into table table_with_avro_schema partition(somepartition);
However, I am getting an error saying that:
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Invalid partition key & values; keys [somepartition, ], values [])
Can someone please suggests what I am doing wrong here? Thanks a lot!
I just figured out that this is because the LOAD operation does not deserialize the data; it simply acts like a copy operation. Thus, to fix it, you should follow these steps:
1. CREATE EXTERNAL TABLE some_table LIKE SOME_TABLE_WITH_SAME_SCHEMA;
2. LOAD DATA INPATH 'SOME_PATH' INTO TABLE some_table;
3. INSERT INTO TARGET_TABLE SELECT * FROM some_table;
Basically, we should first load data into an external table and then insert it into the target hive table.

PigLatin - insert data into existing partition?

I have a file test_file_1.txt containing:
20140101,value1
20140102,value2
and file test_file_2.txt containing:
20140103,value3
20140104,value4
In HCatalog there is a table:
create table stage.partition_pk (value string)
Partitioned by(date string)
stored as orc;
These two scripts work nicely:
Script 1:
LoadFile = LOAD 'test_file_1.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();
Script 2:
LoadFile = LOAD 'test_file_2.txt' using PigStorage(',')
AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();
Table partition_pk contains four partitions - everything is as expected.
But let's say there is another file containing data that should be inserted into one of the existing partitions.
Pig is unable to write into a partition that already contains data (or have I missed something?).
How do you manage loading into existing partitions (or into non-empty non-partitioned tables)?
Do you read the partition, union it with the new data, delete the partition (how?), and insert it as a new partition?
HCatalog's site, https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat, says: "Once a partition is created records cannot be added to it, removed from it, or updated in it." So, by the nature of HCatalog, you can't add data to a partition that already has data in it.
There are bugs around this that they are working on. Some of the bugs were fixed in Hive 0.13:
https://issues.apache.org/jira/browse/HIVE-6405 (Still unresolved) - The bug used to track the other bugs
https://issues.apache.org/jira/browse/HIVE-6406 (Resolved in 0.13) - separate table property for mutable
https://issues.apache.org/jira/browse/HIVE-6476 (Still unresolved) - Specific to dynamic partitioning
https://issues.apache.org/jira/browse/HIVE-6475 (Resolved in 0.13) - Specific to static partitioning
https://issues.apache.org/jira/browse/HIVE-6465 (Still unresolved) - Adds DDL support to HCatalog
Basically, it looks like if you don't want to use dynamic partitioning, then Hive 0.13 might work for you. You just need to remember to set the appropriate table property.
What I've found that works for me is to create another partition key that I call build_num. I then pass the value of this parameter via the command line and set it in the store statement. Like so:
create table stage.partition_pk (value string)
Partitioned by(date string,build_num string)
stored as orc;
STORE LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('build_num=${build_num}');
Just don't include the build_num partition in your queries. I generally set build_num to the timestamp of when I ran the job.
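For reference, a minimal sketch of that setup (the script and file names are just illustrative); date still comes dynamically from the data, while build_num is fixed per run:
-- load_with_build_num.pig, invoked e.g. as:
--   pig -useHCatalog -param build_num=$(date +%Y%m%d%H%M%S) load_with_build_num.pig
LoadFile = LOAD 'new_data.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('build_num=${build_num}');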
Try using multiple partitions:
create table stage.partition_pk (value string) Partitioned by(date string, counter string) stored as orc;
Storing looks like this:
LoadFile = LOAD 'test_file_2.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('date=20161120, counter=0');
So now you can store data into the same date partition again by increasing the counter.
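As a sketch, appending a later file (the file name is made up) to the same date then just means bumping the counter:
LoadFile2 = LOAD 'test_file_3.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
-- same date as before, new counter value, so HCatalog sees this as a brand new partition
store LoadFile2 into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('date=20161120, counter=1');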