I am trying to understand Sqoop incremental imports with the "lastmodified" option.
Since HDFS is not meant for file updates, how is this handled internally?
Is it done by creating a separate file and pointing the original to this new file?
In the case of append, does it simply create a new file with the new records?
But how does the lastmodified flag update HDFS data? What is the logic behind this?
--append mode
You are only adding new data. Each Sqoop incremental import operation will add part files to the HDFS directory, for example part-m-00000, part-m-00001.
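For illustration, a minimal sketch of an append-mode import; the connection string, table, column, and last value below are placeholders, not from the question:
# append rows with id greater than the stored last-value as new part files
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /data/orders \
  --incremental append \
  --check-column id \
  --last-value 1000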
--lastmodified mode
There are updates in addition to newly added data. When you try to run this command a second time, it will give you an error (because the target directory is the same):
Error during import: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.
Now if you add --append, it will simply add new files to the same directory. You then have to manually merge the data in the two files using Sqoop merge.
As per the docs,
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.
Otherwise, you can opt for --merge-key, which will take care of the merging automatically.
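For illustration, a rough sketch of a lastmodified import with --merge-key; the connection details, column names, and last value are placeholders (in practice the last value comes from the previous run):
# re-import rows modified since the last run and merge them on the key column
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /data/orders \
  --incremental lastmodified \
  --check-column last_update_ts \
  --last-value "2018-01-03 00:00:00" \
  --merge-key id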
Related
I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times I notice that it creates files with other names like 0000003_0 etc. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent filename like 000000_0? Below is an example of what my code looks like, where tab_content is a Hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
Better not to do this; instead, modify your program to accept any number of files in the directory.
Each reducer (or mapper, if the job is map-only) creates its own file. The reducers know nothing about each other; they are named during creation. Files are named 000001_0, 000002_0, and so on. But a file can also be 000001_1 if attempt number 0 failed and attempt number 1 succeeded. Also, if the table is partitioned and there is no distribute by partition key at the end, each reducer will create its own file in each partition.
You can force it to run through a single final reducer (for example, by adding an order by clause or by setting set mapred.reduce.tasks=1;). But bear in mind that this solution is not scalable, because too much data will cause performance problems on a single reducer. Also, what will happen if attempt 0 fails, is restarted, and attempt 1 succeeds? It will create 000001_1 instead of 000001_0.
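For illustration, a minimal sketch of that single-reducer approach using the tables from the question; it buys a predictable file name per partition at the cost of scalability:
-- force the insert through a single final reducer (not scalable for large data)
set mapred.reduce.tasks=1;
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
SELECT * FROM source
DISTRIBUTE BY datekey;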
I have several CSV files on GCS which share the same schema but have different timestamps, for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should be run every day with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw CSV files into the raw folder. A Cloud Function is then run to move one file from the raw folder into the queue folder if the queue is empty, and do nothing otherwise.
Dataprep scans the queue folder as per the scheduler. If a CSV file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function is run whenever a new file is added to the wrangled folder. This one creates a new BigQuery table named according to the timestamp column in the CSV file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and then moves one file from the raw folder to the queue folder, if there is any (a rough shell sketch of these steps follows this list).
Repeat until all tables are created.
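For reference, a rough shell sketch of the move and load steps above; the bucket and dataset names are hypothetical:
# move one raw file into the queue folder (hypothetical bucket)
gsutil mv gs://my-bucket/raw/data_20180103.csv gs://my-bucket/queue/
# load the wrangled output into a date-named BigQuery table (hypothetical dataset)
bq load --autodetect --source_format=CSV --skip_leading_rows=1 \
  my_dataset.data_20180103 gs://my-bucket/wrangled/data.csv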
This seems overly complicated to me, and I'm not sure how to handle cases where the Cloud Functions fail to do their job.
Any other suggestion for my use-case is appreciated.
I have a table in MySQL and I want to import the data using Sqoop. I imported the data and stored it in HDFS as a file. Now I want to run an incremental update on this file in HDFS.
Say I have 100 rows in the MySQL table. The file in HDFS contains the data for the first 50 rows. How can I incrementally update this file?
I am talking about files, not Hive tables.
I want the incremental data as a separate file, not a merged file. For example, if the first part file contains 50 records, then I need a part file that contains the next 50 records. What I mean is: can we do an incremental update on files?
You cannot update an HDFS file in this case.
But this is a common use case, and the sqoop merge tool can solve it. You need to perform a Sqoop incremental import and save the output in different HDFS files.
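For illustration, a rough sketch of such an import; the connection details and paths are placeholders, with the target directory kept separate so the two datasets can be merged afterwards (as in the sample command below):
# pull only the rows added after the first 50 into a separate HDFS directory
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username sqoop_user -P \
  --table mytable \
  --target-dir /user/data/newer \
  --incremental append \
  --check-column id \
  --last-value 50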
As per documentation,
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.
Sample Command:
sqoop merge --new-data newer --onto older --target-dir merged \
--jar-file datatypes.jar --class-name Foo --merge-key id
I have an Oracle dmp file and I need to import the data into a table.
The data in the dump contains new rows and few updated rows.
I am using the import command with IGNORE=Y, so it imports all the new rows fine, but it doesn't import/overwrite the existing rows (it shows a warning that a unique key constraint was violated).
Is there some option to make the import UPDATE the existing rows with new data?
No. If you were using data pump then you could use the TABLE_EXISTS_ACTION=TRUNCATE option to remove all existing rows and import everything from the dump file, but as you want to update existing rows and leave any rows not in the new file alone - i.e. not delete them (I think, since you only mention updating, though that isn't clear) - that might not be appropriate. And as your dump file is from the old exp tool rather than expdp that's moot anyway, unless you can re-export the data.
If you do want to delete existing rows that are not in the dump then you could truncate all the affected tables before importing. But that would be a separate step that you'd have to perform yourself; it's not something imp will do for you. And the tables would be empty for a while, so you'd have to have downtime to do it.
Alternatively you could import into new staging tables - in a different schema, since imp doesn't support renaming either - and then use those to merge the new data into the real tables. That may be the least disruptive approach. You'd still have to design and write all the merge statements though; there's no built-in way to do this automatically.
You can import into a temp table and then reconcile the records by joining against it.
Use the impdp option REMAP_TABLE to load the existing dump file into a temp table.
impdp .... REMAP_TABLE=TMP_TABLE_NAME
When the load is done, run a MERGE statement on the existing table using the temp table.
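A minimal sketch of such a MERGE, assuming a hypothetical target table my_table with key column id and one data column val:
-- hypothetical table and column names
MERGE INTO my_table t
USING tmp_table_name s
  ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN
  INSERT (id, val) VALUES (s.id, s.val);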
Let's say I have a table:
db.table
I load the table, do some transforms on it, and finally attempt to store it:
mytable = LOAD 'db.table' USING HCatLoader();
.
.
-- My transforms
.
.
STORE mytable_final INTO 'db.table' USING HCatStorer();
But the code complains I'm writing into a table with existing data.
I've looked at this JIRA ticket, which seems to be inactive (I have tried using FORCE and OVERWRITE in several places in the STORE command).
I've also looked at this SO post, but the author is loading from one location and storing in a different location. If I use what is in that post, the result of the transformation contains no data. Deleting the files isn't an option. I'm thinking of storing the files temporarily, but I don't know if this is the best option.
I am trying to get the behavior you get in Hive using INSERT OVERWRITE.
I am not familiar with HCatLoader and HCatStorer. But if you LOAD from and STORE to HDFS, Pig provides shell commands that enable you to do the deleting and moving from within your script.
STORE A INTO '/this/path/is/temporary';
RMF '/this/path/is/permanent';
MV '/this/path/is/temporary' '/this/path/is/permanent';