Incremental append to file - hive

I have a table in MySQL and I want to import its data using Sqoop. I imported the data and stored it in HDFS as a file. Now I want to run an incremental update on this
file in HDFS.
Say I have 100 rows in the MySQL table and the file in HDFS contains data for the first 50 rows. How can I incrementally update this file?
I am talking about files, not Hive tables.
I want the incremental data as a separate file, not a merged file. For example, if the first part file contains 50 records, then I need a part file that contains the next 50 records. In other words: can we do incremental updates on files?

You cannot update an HDFS file in place in this case.
But this is a common use case, and the sqoop-merge tool can solve it. You need to perform a Sqoop incremental import and save the output in different HDFS files.
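For example, a minimal incremental append import could look like the following sketch (the connection string, table name, and id column are placeholders, not taken from your setup):
sqoop import --connect jdbc:mysql://dbhost/mydb --username user -P \
  --table mytable --target-dir /data/mytable \
  --incremental append --check-column id --last-value 50
Each such run writes new part files under the target directory, so each batch of rows stays in its own file. If you later need a single consolidated dataset, you can combine them with the merge tool described below.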
As per documentation,
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.
Sample Command:
sqoop merge --new-data newer --onto older --target-dir merged \
--jar-file datatypes.jar --class-name Foo --merge-key id

Related

Spark HDFS Direct Read vs Hive External table read

We have a couple of HDFS directories in which data is stored in delimited format. These directories are created as one directory per ingestion date, and they are added as partitions to a Hive external table.
Directory structure:
/data/table1/INGEST_DATE=20180101
/data/table1/INGEST_DATE=20180102
/data/table1/INGEST_DATE=20180103 etc.
Now we want to process this data in a Spark job. From the program I can read these HDFS directories directly by giving the exact directory path (Option 1), or I can read from Hive into a DataFrame and process it (Option 2).
I would like to know if there is any significant difference between following Option 1 or Option 2. Please let me know if you need any other details.
Thanks in advance
If you want to select only a subset of the columns, that is only possible via spark.sql. In your use case I don't think there will be a significant difference.
With Spark SQL you get partition pruning automatically.
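As a rough PySpark sketch of the two options (the delimiter, column names, and schema handling are assumptions, not from your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: read one ingestion-date directory directly;
# you have to supply the format, delimiter, and schema yourself
df_direct = (spark.read
             .option("delimiter", "|")
             .csv("/data/table1/INGEST_DATE=20180101"))

# Option 2: go through the Hive external table; the filter on the
# partition column INGEST_DATE lets Spark prune partitions automatically
df_hive = spark.sql(
    "SELECT col1, col2 FROM table1 WHERE INGEST_DATE = '20180101'")

With Option 2 you also inherit the column names and types from the Hive metastore, which is usually more convenient than re-declaring the schema in code.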

loading a pg_dump off of s3 into redshift

I'm trying to load a complete database dump into Redshift. Is there a single command to restore the data from a pg_dump living on s3 into Redshift? If not, what are the best steps for tackling this?
Thanks
If you have a non-compressed pg_dump this should be possible using a psql command (you may need to edit it manually to get the right syntax, depending on your versions and the options set).
However, this is a very inefficient and slow way to load Redshift and I do not recommend it. If your tables are large it could take days or weeks!
What you need to do is this:
Create target tables on Redshift based upon the source tables, but taking sort keys and distribution into account.
Unload your Postgres source tables into CSV files using the Postgres "copy" command.
If the source CSV files are very big (e.g. more than say 100 MB), consider splitting them into separate files, as they will load faster (Redshift will parallelize).
gzip the CSV files (recommended but not essential).
Upload these CSV files to S3, with a separate folder per table.
Load the data into Redshift from S3 by using the Redshift COPY command (see the sketch after this list).
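As a rough illustration of the unload and load steps (the table name, bucket, and IAM role below are placeholders):

-- run in psql against the source Postgres database to export one table
\copy my_table TO 'my_table.csv' WITH (FORMAT csv)

-- run on Redshift after gzipping and uploading the files to S3
COPY my_table
FROM 's3://my-bucket/my_table/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV GZIP;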

Merge files from Data lake store

I have a package that imports a file to Data Lake Store daily. So it is the same file with different values (same columns, etc.). My idea is to merge those files into a single file on the Data Lake, for a monthly report. I want to investigate U-SQL, so my questions are:
Is that possible to do with U-SQL?
If it is not possible, are there any other options to do that?
It is very easily possible to merge records from two files and write a new file. Here are the steps
Read all of the new file using EXTRACT
Read all the records of the current master file using EXTRACT
Use UNION ALL to merge the records: https://msdn.microsoft.com/en-us/library/azure/mt621340.aspx
Write output to master file using OUTPUT statement
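For illustration, a minimal U-SQL sketch of those steps (the file paths, column names, and delimiter are assumptions):

// read the newly arrived daily file
@newData =
    EXTRACT Id int, Value string, LoadDate DateTime
    FROM "/input/daily/2018-01-31.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// read the current master file
@master =
    EXTRACT Id int, Value string, LoadDate DateTime
    FROM "/output/master.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// combine the two record sets
@merged =
    SELECT * FROM @master
    UNION ALL
    SELECT * FROM @newData;

// write the combined result out as the new master file
OUTPUT @merged
    TO "/output/master_new.csv"
    USING Outputters.Csv(outputHeader: true);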
For a quick U-SQL tutorial go here: https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-get-started

How does Hive create a table from a file present in HDFS?

I am new to HDFS and Hive. I got an introduction to both after reading some books and documentation. I have a question regarding the creation of a table in Hive for which the file is already present in HDFS.
I have a file with 300 fields in HDFS. I want to create a table accessing this file in HDFS, but I want to make use of only, say, 30 fields from it.
My questions are
1. Does hive create a separate file directory?
2. Do I have to create hive table first and import data from HDFS?
3. Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
Taking your questions one by one:
Does hive create a separate file directory?
YES, if you create a Hive table (managed or external) and load the data using the LOAD command.
NO, if you create an external table and point it at the existing file.
Do I have to create hive table first and import data from HDFS?
Not necessarily; you can create a Hive external table and point it at this existing file.
Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
You can do it easily using HiveQL. Follow the steps below (note: this is not the only approach); a sketch of the statements follows after these steps.
Create an external table with 300 columns and point it at the existing file.
Create another Hive table with the desired 30 columns and insert data into this new table from the 300-column table using "insert into table30col select ... from table300col". Note: Hive will create the file with 30 columns during this insert operation.
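A minimal HiveQL sketch of those two steps (column names, types, and the file location are placeholders; only a few of the 300 columns are shown):

CREATE EXTERNAL TABLE table300col (
  col1 STRING,
  col2 STRING,
  col3 INT
  -- declare all 300 columns here
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/myfile_dir';

CREATE TABLE table30col (
  col1 STRING,
  col3 INT
  -- the 30 columns you actually need
);

INSERT INTO TABLE table30col
SELECT col1, col3 FROM table300col;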
Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
Yes, this can be an alternative.
I personally like the solution mentioned in question 3, as I don't have to recreate the file and I can do all of it in Hadoop without depending on some other system.
You have several options. One is to have Hive simply point to the existing file, i.e. create an external HIVE table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, match your existing file exactly. You must declare all 300 columns. There will be no data duplication; there is only one file, and Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But it is important to understand that neither IMPORT nor LOAD transforms the data, so the resulting table will have exactly the same structure, layout and storage as your original file.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or by going through an intermediate staging table created by one of the methods above (preferably an external reference, to avoid an extra copy). Create the desired table, create the external reference staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (i.e. have only the needed 30 columns) but also, importantly, the storage. Hive has a highly performant columnar storage format, namely ORC, and you should strive to use this storage format because it will give you a tremendous query performance boost.
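As a small illustration of that last point (table and column names are placeholders), declaring the target table with ORC storage is just a matter of the STORED AS clause:

CREATE TABLE target_table (
  col1 STRING,
  col3 INT
)
STORED AS ORC;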

SQOOP Incremental Import with Lastmodified

I am trying to understand Sqoop incremental imports with the "lastmodified" option.
Since HDFS is not meant for file updates, how is this handled internally?
Is it via creating a separate file and pointing the original to this new file?
In the case of append, does it create a new file with the new records? Maybe?
But how does the lastmodified flag update the HDFS data? What is the logic behind this?
--append mode
You are only adding new data. Each Sqoop incremental import operation will add new part files to the HDFS directory, for example part-m-00000, part-m-00001.
--lastmodified mode
There are updates in addition to newly added data. When you try to run this command a second time, it will give you an error (because the target directory is the same):
Error during import: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.
Now if you add --append, it will simply add new files to the same directory, and you then have to manually merge the data in the two files using Sqoop merge.
As per the docs,
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.
Otherwise, you can opt for --merge-key, which will take care of the merging automatically.
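For illustration, a lastmodified import that merges in place might look like the following sketch (the connection string, table, and column names are placeholders):
sqoop import --connect jdbc:mysql://dbhost/mydb --username user -P \
  --table mytable --target-dir /data/mytable \
  --incremental lastmodified --check-column updated_at \
  --last-value "2018-01-01 00:00:00" --merge-key id
With --merge-key, Sqoop runs a merge job after the import, so updated rows replace the older versions of those rows in the target directory instead of accumulating as separate files.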