PigLatin - insert data into existing partition? - apache-pig

I have a file test_file_1.txt containing:
20140101,value1
20140102,value2
and file test_file_2.txt containing:
20140103,value3
20140104,value4
In HCatalog there is a table:
create table stage.partition_pk (value string)
Partitioned by(date string)
stored as orc;
These two scripts work nicely:
Script 1:
LoadFile = LOAD 'test_file_1.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();
Script 2:
LoadFile = LOAD 'test_file_2.txt' using PigStorage(',')
AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();
Table partition_pk contains four partitions - everything is as expected.
But let's say there is another file containing data that should be inserted into one of the existing partitions.
Pig seems unable to write into a partition that already contains data (or am I missing something?).
How do you manage loading into existing partitions (or into non-empty, nonpartitioned tables)?
Do you read the partition, union it with the new data, delete the partition (how?), and insert it as a new partition?

Coming from HCatalog's site, https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat, it says: "Once a partition is created records cannot be added to it, removed from it, or updated in it." So, by the nature of HCatalog, you can't add data to an existing partition that already has data in it.
There are open issues around this that are being worked on. Some of them were fixed in Hive 0.13:
https://issues.apache.org/jira/browse/HIVE-6405 (Still unresolved) - The bug used to track the other bugs
https://issues.apache.org/jira/browse/HIVE-6406 (Resolved in 0.13) - separate table property for mutable
https://issues.apache.org/jira/browse/HIVE-6476 (Still unresolved) - Specific to dynamic partitioning
https://issues.apache.org/jira/browse/HIVE-6475 (Resolved in 0.13) - Specific to static partitioning
https://issues.apache.org/jira/browse/HIVE-6465 (Still unresolved) - Adds DDL support to HCatalog
Basically, it looks like if you don't need dynamic partitioning, then 0.13 might work for you. You just need to remember to set the appropriate table property.
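If memory serves, the switch added by HIVE-6406 is a table property named immutable (treat the exact name as an assumption and check the JIRA); a sketch would be:
-- assumption: the mutability property from HIVE-6406 is called 'immutable'
alter table stage.partition_pk set tblproperties('immutable'='false');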
What I've found that works for me is to create another partition key that I call build_num. I then pass the value of this parameter via the command line and set it in the store statement. Like so:
create table stage.partition_pk (value string)
Partitioned by(date string,build_num string)
stored as orc;
STORE LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('build_num=${build_num}');
Just don't include the build_num partition in your queries. I generally set build_num to a timestamp of when I ran the job.
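For example, the job might be launched roughly like this (the script name and the timestamp value are just placeholders):
pig -useHCatalog -param build_num=20140105120000 my_script.pig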

Try using multiple partitions:
create table stage.partition_pk (value string) Partitioned by(date string, counter string) stored as orc;
Storing looks like this:
LoadFile = LOAD 'test_file_2.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('date=20161120, counter=0');
So now you can store data into the same date partition again by increasing the counter.
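For example, a later file for the same date could be stored like this (test_file_3.txt is just a stand-in for the new file):
LoadFile2 = LOAD 'test_file_3.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile2 into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('date=20161120, counter=1');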

Related

Unable to load managed table with maptype column (complex datatype) from external table in hive

I have an external table with a complex datatype (map(string, array(struct))), and I'm able to select and query this external table without any issue.
However, if I try to load this data into a managed table, it runs forever. Is there a better approach to load this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL(
id string ,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to warehouse dir restrictions):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(id string,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>> ) ;
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all files from the external table location into the managed table location. This is the most efficient approach, much faster than insert ... select:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table will be selectable. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]
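For the managed table from this example, that would be, for instance:
ANALYZE TABLE db.tbl_managed COMPUTE STATISTICS;
Basic table statistics are the safe choice here; column statistics are generally not supported for complex types like map and array.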

Creating external hive table in databricks

I am using databricks community edition.
I am using a Hive query to create an external table. The query runs without any error, but the table is not getting populated with the file specified in the query.
Any help would be appreciated.
From the official docs ... make sure your S3/storage location path and schema (with respect to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct.
DROP TABLE IF EXISTS <example-table> // deletes the metadata
dbutils.fs.rm("<your-s3-path>", true) // deletes the data
CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>
// alternative
CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>"
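For example, if the file behind the table were CSV, a pattern like this should work (the column names and options here are hypothetical; keep your own storage path):
CREATE TABLE example_csv (id LONG, date STRING)
USING CSV
OPTIONS (header "true")
LOCATION "<your-s3-path>"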

How to deserialize the ProtoBuf serialized HBase columns in Hive?

I have used ProtoBufs to serialize the class and store it in HBase columns.
I want to reduce the number of MapReduce jobs for simple aggregations, so I need a SQL-like tool to query the data.
If I use Hive, is it possible to extend the HBaseStorageHandler and write our own SerDe for each table?
Or is any other good solution available?
Updated:
I created the HBase table as
create 'hive:users' , 'i'
and inserted user data from the Java API:
public static final byte[] INFO_FAMILY = Bytes.toBytes("i");
private static final byte[] USER_COL = Bytes.toBytes(0);
public Put mkPut(User u)
{
Put p = new Put(Bytes.toBytes(u.userid));
p.addColumn(INFO_FAMILY, USER_COL, UserConverter.fromDomainToProto(u).toByteArray());
return p;
}
My scan gave these results:
hbase(main):016:0> scan 'hive:users'
ROW COLUMN+CELL
kim123 column=i:\x00, timestamp=1521409843085, value=\x0A\x06kim123\x12\x06kimkim\x1A\x10kim123#gmail.com
1 row(s) in 0.0340 seconds
When I query the table in Hive, I don't see any records.
Here is the command I used to create table.
create external table users(userid binary, userobj binary)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key, i:0", "hbase.table.default.storage.type" = "binary")
tblproperties("hbase.table.name" = "hive:users");
But when I query the Hive table, I don't see the record inserted from HBase.
Can you please tell me what is wrong here?
You could try writing a UDF which would take the binary protobuf and convert it to some readable structure (comma-separated or JSON). You would have to make sure to map the values as binary data.
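A minimal sketch of how such a UDF could be wired in on the Hive side (the jar path and class name are hypothetical; the UDF itself would deserialize the protobuf bytes and return, say, a JSON string):
ADD JAR /path/to/protobuf-udf.jar;
CREATE TEMPORARY FUNCTION proto_to_json AS 'com.example.hive.ProtoToJsonUDF';
SELECT userid, proto_to_json(userobj) FROM users;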

Writing data using PIG to HIVE external table

I wanted to create an external table and load data into it through a Pig script. I followed the below approach:
OK. Create an external Hive table with a schema layout somewhere in an HDFS directory. Let's say:
create external table emp_records(id int,
name String,
city String)
row format delimited
fields terminated by '|'
location '/user/cloudera/outputfiles/usecase1';
Just create a table like the above; there is no need to load any file into that directory.
Now write a Pig script that reads data from some input directory, and then store the output of that Pig script as below:
A = LOAD 'inputfile.txt' USING PigStorage(',') AS(id:int,name:chararray,city:chararray);
B = FILTER A by id >= 678933;
C = FOREACH B GENERATE id,name,city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');
Ensure that the destination location, the delimiter, and the schema layout of the final FOREACH statement in your Pig script match the Hive DDL schema.
My problem is that when I first created the table, it created a directory in HDFS, and when I tried to store a file using the script, it threw an error saying "folder already exists". It looks like Pig's STORE always has to write to a new directory with a specific name?
Is there any way to avoid this issue?
And are there any other attributes we can use with the STORE command in Pig to write to a specific directory/file every time?
Thanks
Ram
Yes, you can use HCatalog to achieve this.
Remember, you have to run your Pig script like:
pig -useHCatalog your_pig_script.pig
or, if you are using the grunt shell, simply use:
pig -useHCatalog
Next, to store your relation directly into a Hive table, use:
STORE C INTO 'HIVE_DATABASE.EXTERNAL_TABLE_NAME' USING org.apache.hive.hcatalog.pig.HCatStorer();
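Putting it together for the emp_records example above (a sketch; I'm assuming the table sits in a database called default):
A = LOAD 'inputfile.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
B = FILTER A BY id >= 678933;
STORE B INTO 'default.emp_records' USING org.apache.hive.hcatalog.pig.HCatStorer();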

Can Pig be used to LOAD from Parquet table in HDFS with partition, and add partitions as columns?

I have an Impala partitioned table, stored as Parquet. Can I use Pig to load data from this table and add the partitions as columns?
The Parquet table is defined as:
create table test.test_pig (
name string,
id bigint
)
partitioned by (gender string, age int)
stored as parquet;
And the Pig script is like:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long);
However, gender and age are missing when I DUMP A; only name and id are displayed.
I have tried with:
A = LOAD '/test/test_pig' USING parquet.pig.ParquetLoader AS (name: bytearray, id: long, gender: chararray, age: int);
But then I receive an error like:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable
schema: left is "name:bytearray,id:long,gender:bytearray,age:int",
right is "name:bytearray,id:long"
Hope to get some advice here. Thank you!
You should test with the org.apache.hcatalog.pig.HCatLoader library.
Normally, Pig supports reading from and writing into partitioned tables.
read:
This load statement will load all partitions of the specified table.
/* myscript.pig */
A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();
...
...
If only some partitions of the specified table are needed, include a partition filter statement immediately following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-RunningPigwithHCatalog
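Applied to the table in this question, that would look something like the sketch below (the filter values are only illustrative):
A = LOAD 'test.test_pig' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY gender == 'M' AND age >= 18;
DUMP B;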
write
HCatOutputFormat will trigger on dynamic partitioning usage if necessary (if a key value is not specified) and will inspect the data to write it out appropriately.
https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions
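For the write side, a dynamic-partition store would look roughly like this sketch (the relation being stored has to carry the partition columns gender and age as regular fields):
-- assuming relation D includes name, id, gender and age fields
STORE D INTO 'test.test_pig' USING org.apache.hcatalog.pig.HCatStorer();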
However, I think this hasn't yet been properly tested with Parquet files (at least not by the Cloudera guys):
Parquet has not been tested with HCatalog. Without HCatalog, Pig cannot correctly read dynamically partitioned tables; that is true for all file formats.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_parquet.html