I'm pretty new to Pig programming. Can we read data from and write data to a Hive ORC table? Our requirement is to dump data from one Hive ORC table into another ORC table, doing some massaging of the data in a Pig script. Can anyone share samples for reading or writing ORC? Please also share any sites for quickly learning Pig.
A = LOAD 'analysis.bic_orc' USING org.apache.hcatalog.pig.HCatLoader();
I found this on the Hortonworks website; I haven't personally used Pig to load ORC files.
http://hortonworks.com/community/forums/topic/pig-is-much-slower-than-hive-when-reading-orc-files-using-hcatalog/
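For the ORC-to-ORC copy you describe, a minimal sketch with HCatLoader/HCatStorer might look like the following (the target table name and the pass-through transformation are placeholders, the target table has to exist already, and on newer Hive releases the classes live under org.apache.hive.hcatalog.pig instead). Run the script with pig -useHCatalog.

src = LOAD 'analysis.bic_orc' USING org.apache.hcatalog.pig.HCatLoader();
-- massage the data here; this pass-through projection is only a placeholder
massaged = FOREACH src GENERATE *;
STORE massaged INTO 'analysis.bic_orc_target' USING org.apache.hcatalog.pig.HCatStorer();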
Hope this helps.
Could you help me load a couple of Parquet files into Snowflake?
I've got about 250 Parquet files stored in an AWS stage.
250 files = 250 different tables.
I'd like to dynamically load them into Snowflake tables.
So, I need:
Get the schema from the Parquet file... I've read that I could get the schema from a Parquet file using parquet-tools (Apache).
Create a table using the schema from the Parquet file.
Load the data from the Parquet file into this table.
Could anyone help me with how to do that? Is there an efficient way to do it (using the Snowflake GUI, for example)? I can't find one.
Thanks.
If the schema of the files is the same, you can put them in a single stage and use the INFER_SCHEMA function. This will give you the schema of the Parquet files.
https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html
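As a rough sketch of that approach (the stage, file format, and table names below are made up), you can feed INFER_SCHEMA into CREATE TABLE ... USING TEMPLATE and then COPY with MATCH_BY_COLUMN_NAME:

-- describe the staged Parquet files and derive the table layout from them
CREATE OR REPLACE FILE FORMAT my_parquet_format TYPE = PARQUET;

CREATE OR REPLACE TABLE my_table USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(INFER_SCHEMA(
    LOCATION => '@my_stage/path/',
    FILE_FORMAT => 'my_parquet_format'))
);

-- load the data, mapping Parquet columns to table columns by name
COPY INTO my_table
FROM @my_stage/path/
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

For 250 files/tables you would repeat this per file, for example from a small script or a stored procedure that loops over the stage listing.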
If the files all have different schemas, then I'm afraid you have to infer the schema for each file individually.
I want to create a Hive table that stores data in ORC format with Snappy compression. Will Power BI be able to read from that table? Also, do you suggest any other format/compression for my table?
ORC is a special file format that only works with Hive, and it is highly optimized for HDFS read operations. Power BI can connect to Hive using the Hive ODBC data connection, so if you have to use Hive all the time, you can use this format to store the data. But if you want the flexibility of both Hive and Impala, using the Cloudera-provided Impala ODBC driver, you can think about using Parquet.
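For reference, a sketch of the DDL for such a table (table and column names are made up):

CREATE TABLE sales_orc (
  id BIGINT,
  customer STRING,
  amount DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

Power BI would then read it through the Hive ODBC connection mentioned above.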
Both ORC and Parquet have their own advantages and disadvantages. The main deciding factors are the tools that access the data, how nested the data is, and how many columns there are.
If you have many columns with nested data and you want to use both Hive and Impala to access the data, go with Parquet. If you have few columns, a flat data structure, and a huge amount of data, go with ORC.
Is it possible to read a Hive table (or HDFS data in Parquet format) in StreamSets Data Collector? I don't want to use Transformer for this.
Reading the raw Parquet files is counter to the way Data Collector works, so that would be a better use case for Transformer.
But I have successfully used the JDBC origin with either Impala or Hive to achieve this; there are some additional hurdles to jump with the JDBC source.
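For reference, the JDBC origin pointed at HiveServer2 just uses a standard Hive JDBC URL and a query, along these lines (host, port, and table name are placeholders; you also need to make the Hive JDBC driver available to the stage as an external library):

jdbc:hive2://your-hiveserver2-host:10000/default
SELECT * FROM your_parquet_table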
Let's say we have JSON data and we want to generate some results for business users. Does the following seem like a good approach?
Load the data into Hive from HDFS and then analyze it from Pig using HCatalog. I have the following question in this regard.
Q. Is it OK to load data from HCatalog and analyze it in Pig? Will this have a performance overhead compared to reading the data directly in Pig by keeping it in HDFS?
I would personally prefer to do the ETL using Pig. In your case, the JSON data can be loaded using JsonLoader and stored using JsonStorage. So I would load the data using JsonLoader, store it as CSV, and then use Hive to analyze this data.
JSON load
http://joshualande.com/read-write-json-apache-pig/
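A minimal sketch with the built-in JsonLoader (the input path and schema are made up; the built-in loader needs the schema spelled out):

events = LOAD '/data/events.json' USING JsonLoader('id:int, name:chararray, amount:double');
-- store as comma-separated values so Hive can pick it up later
STORE events INTO '/data/events_csv' USING PigStorage(',');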
Alternatively, we can use Twitter's Elephant Bird JSON loader:
http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/
I'm currently importing from MySQL into HDFS using Sqoop in Avro format, and this works great. However, what's the best way to load these files into Hive?
Since Avro files contain the schema, I could pull the files down to the local file system, use avro-tools, and create the table with the extracted schema, but this seems excessive?
Also, if a column is dropped from a table in MySQL, can I still load the old files into a new Hive table created with the new Avro schema (with the dropped column missing)?
As of version 0.9.1, Hive has come packaged with an Avro SerDe. This allows Hive to read from Avro files directly while Avro still "owns" the schema.
For your second question, you can define the Avro schema with column defaults. When you add a new column, just make sure to specify a default, and all your old Avro files will work just fine in a new Hive table.
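For example, the field for such a column can be declared as nullable with a default in the .avsc (the field name here is hypothetical), so older files that lack it still resolve:

{"name": "middle_name", "type": ["null", "string"], "default": null}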
To get started, you can find the documentation here, and the book Programming Hive (available on Safari Books Online) has a section on the Avro SerDe which you might find more readable.
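A sketch of the corresponding DDL, pointing the table at the Sqoop output directory and a schema file (paths and names are placeholders; newer Hive versions also accept STORED AS AVRO):

CREATE EXTERNAL TABLE my_avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/me/sqoop/my_table'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/my_table.avsc');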