We have a requirement to run a HiveQL query incrementally and export the resulting records to a file in Avro format.
Following are the two approaches I am looking at and the challenges I see in using them.
Option 1: using Pig and a custom loader:
a. Write a custom Pig loader which runs the Hive query incrementally.
b. Write a Pig flow and create a relation from the results of the Hive loader.
c. Store the result in an Avro file.
Option 2: Sqoop export - I couldn't find a way to export Hive query results incrementally.
So far, my analysis suggests that Option 1 better suits my requirement.
Can someone please explain whether this can be achieved easily in Sqoop?
Sqoop can export data from an HDFS directory to a target database, not to files. In this case Sqoop cannot:
read incremental results unless you have a separate Hive table or partition (which results in a new directory)
write into external files in Avro format
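For Option 1, one possible shape (a rough sketch, assuming HCatalog is available; the table name, partition column and output path are placeholders, not from the question) is to read the Hive table through HCatLoader, filter out the incremental slice, and store it with AvroStorage:

-- run with: pig -useHCatalog script.pig
rows = LOAD 'mydb.my_table' USING org.apache.hcatalog.pig.HCatLoader();

-- keep only the new increment, e.g. the latest ingest partition (placeholder column/value)
incr = FILTER rows BY ingest_date == '20180101';

STORE incr INTO '/data/export/20180101'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();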
We have a couple of HDFS directories in which data is stored in delimited format. These directories are created as one directory per ingestion date and are added as partitions to a Hive external table.
Directory structure:
/data/table1/INGEST_DATE=20180101
/data/table1/INGEST_DATE=20180102
/data/table1/INGEST_DATE=20180103 etc.
Now we want to process this data in a Spark job. From the program I can read these HDFS directories directly by giving the exact directory path (Option 1), or I can read from Hive into a DataFrame and process it (Option 2).
I would like to know if there is any significant difference between Option 1 and Option 2. Please let me know if you need any other details.
Thanks in Advance
If you want to select a subset of the columns, that is only possible via spark.sql. In your use case I don't think there will be a significant difference.
With Spark SQL you get partition pruning automatically.
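To make the two options concrete, a minimal PySpark sketch (the database/table name, column names and delimiter below are assumptions, not from the question) could look like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: read the delimited files by exact path; the schema and delimiter
# have to be supplied by hand
opt1 = spark.read.csv("/data/table1/INGEST_DATE=20180101", sep="|")

# Option 2: read through the Hive metastore; a filter on the partition column
# lets Spark prune to only the matching INGEST_DATE directories
opt2 = (spark.table("db.table1")
             .where("INGEST_DATE = '20180101'")
             .select("col1", "col2"))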
I'm loading data from HDFS to MySQL using Sqoop. In this data, one record has more than 70 fields, making it difficult to define the schema when creating the table in the RDBMS.
Is there a way to use Avro tables to dynamically create the table with its schema in the RDBMS using Sqoop?
Or is there some other tool which does the same?
This is not supported in Sqoop today. From the Sqoop documentation:
The export tool exports a set of files from HDFS back to an RDBMS. The
target table must already exist in the database. The input files are
read and parsed into a set of records according to the user-specified
delimiters.
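In other words, you create the MySQL table by hand first and then run a plain export. A minimal sketch (the connection string, credentials, paths and delimiter here are placeholders):

# the target table must already exist in MySQL
sqoop export \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table my_table \
  --export-dir /data/my_table \
  --input-fields-terminated-by ','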
I have a CSV in my local directory and I wish to create a Hive table from it. The problem is the CSV has many columns.
In the author's words, Sqoop means SQL-to-Hadoop. You can't use Sqoop to import data from your local filesystem to HDFS in any way.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
Imports individual tables or entire databases to files in HDFS
Generates Java classes to allow you to interact with your imported data
Provides the ability to import from SQL databases straight into your Hive data warehouse
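For example, that last capability (importing straight into Hive) looks roughly like the sketch below; the connection string, credentials and table names are placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table customers \
  --hive-import \
  --hive-table customers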
For more information, follow the links below:
http://blog.cloudera.com/blog/2009/06/introducing-sqoop/
http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html
I've been trying to store CSV data into a database table using a Pig script.
But instead of inserting the data into a database table, I ended up creating a new file in the metastore.
Can someone please let me know if it is possible to insert data into a database table with a Pig script, and if so, what that script might look like?
You can take a look at DBStorage, but be sure to register the JDBC jar in your Pig script and declare the UDF.
The documentation for the storage UDF is here:
http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/DBStorage.html
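A rough sketch of what that can look like (the jar paths, connection details, columns and target table are placeholders):

REGISTER /path/to/mysql-connector-java.jar;
REGISTER /path/to/piggybank.jar;

-- load the CSV with a hand-written schema
data = LOAD 'input.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- the INTO location is a dummy; rows are written through the JDBC connection
STORE data INTO 'dummy' USING org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver',
    'jdbc:mysql://dbhost/mydb',
    'dbuser', 'dbpassword',
    'INSERT INTO my_table (id, name) VALUES (?, ?)');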
You can use:
STORE my_relation INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();
My Hive query has multiple outer joins and takes very long to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to perform the transformations.
Is there a way I could query Hive tables or read Hive table data within a Pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has been renamed to HCatalog.
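If HCatalog is available on your cluster, a sketch of reading a Hive table from Pig looks like this (the table and column names are placeholders); run the script with pig -useHCatalog:

A = LOAD 'mydb.my_table' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY some_column IS NOT NULL;
DUMP B;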