How to identify deleted records in SQL Server while importing to Hadoop using Sqoop - sql

While importing data from SQL Server or any other RDBMS to Hadoop using Sqoop, we can get newly appended or modified records using incremental append, last-modified mode, or free-form queries.
Is there any way we can identify deleted records, considering that once a record is deleted it no longer exists in the SQL table?
One workaround is to load the full table using Sqoop and compare it with the previous table in Hive.
Is there a better way to do this?

No, you cannot get deleted records using Sqoop.
A better workaround could be:
Create a boolean field status (default true) in your SQL Server table.
Whenever you need to delete a record, don't actually delete it; just update it, setting status to false.
If you are using last-modified incremental import, you will get this changed data in HDFS.
Later (after the Sqoop import) you can delete all the records with status false.
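A minimal SQL sketch of that workaround, assuming a hypothetical student table with a status flag and a last_modified column (all table and column names are illustrative, not from the question):

    -- On the SQL Server source: mark the row as deleted and bump the timestamp
    -- so that a last-modified incremental import picks it up.
    UPDATE dbo.student
    SET    status        = 0,            -- 0 = logically deleted
           last_modified = GETDATE()
    WHERE  student_id    = 42;

    -- On the Hive side, after the Sqoop import: keep only the live rows
    -- (rewriting the table avoids the need for ACID/transactional tables).
    INSERT OVERWRITE TABLE student
    SELECT * FROM student WHERE status = true;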

If you are syncing the entire partition or table, then you can identify deleted records after the Sqoop import, before merging, by using a full join with the existing target partition or table. Records that exist in the target table/partition but do not exist in the imported data are the ones that were deleted in the source database since the last sync.
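For illustration, a hedged HiveQL sketch of that comparison, assuming both the target table and the freshly imported data are keyed by an id column (names are hypothetical):

    -- Rows present in the existing target but absent from the imported data
    -- were deleted on the source since the last sync.
    SELECT t.id
    FROM   target_table t
    FULL OUTER JOIN imported_stage i
           ON t.id = i.id
    WHERE  i.id IS NULL;

The same full join can then drive the merge itself: the rows where i.id IS NULL are deletes, the rows where t.id IS NULL are inserts, and the rest are updates.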

Incremental Sqoop imports do not handle deleted records out of the box. There are two approaches you may want to consider.
Please look at this post.

Related

Checking if a group of data exists in the SQL Server database

I was asked to copy a CSV file (which resides on a server and is updated every day) into the database, but a group of 5-6 fields is the parameter that decides whether we can enter the data or not.
The condition for insertion is: if it is a completely new entry, it will be inserted; if it is an exact copy, skip that row; and if it is a changed entry, update the existing entry in the database.
Can someone help me with how I can do this? I could do an IF EXISTS query, but that would be a costly operation as it has to match every record. Or is there an SSIS activity that might help with this?
You can do this with a MERGE query and, when matched, do nothing.
https://www.mssqltips.com/sqlservertip/1704/using-merge-in-sql-server-to-insert-update-and-delete-at-the-same-time/
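A hedged sketch of such a MERGE for SQL Server, assuming the CSV has first been bulk loaded into a staging table and that three columns make up the deciding business key (all table and column names below are assumptions):

    -- Illustrative only: adjust the key and payload columns to the real 5-6 fields.
    MERGE INTO dbo.target AS t
    USING dbo.csv_stage AS s
          ON  t.key1 = s.key1
          AND t.key2 = s.key2
          AND t.key3 = s.key3
    WHEN MATCHED AND (t.amount <> s.amount OR t.descr <> s.descr) THEN
          UPDATE SET t.amount = s.amount,
                     t.descr  = s.descr
    WHEN NOT MATCHED BY TARGET THEN
          INSERT (key1, key2, key3, amount, descr)
          VALUES (s.key1, s.key2, s.key3, s.amount, s.descr);
    -- Exact duplicates match on the key but fail the extra condition,
    -- so no clause fires and the row is effectively skipped.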

How can I create a Primary Key in Oracle when composite key and a "generated always as identity" option won't work?

I'm working on an SSIS project that pulls data from Excel and loads it to an Oracle database every month. I plan to pull data from the Excel file and load it to an Oracle stage table. I will be using a merge statement because the data that gets loaded each month is a rolling 12-month list and the data can change, so I need to be able to INSERT when records don't match or UPDATE when they do. My control flow looks like this: Truncate Stage Table (to clear out the table from the last package run) ---> Data Flow from Excel to Stage Table ---> Merge to Target Table in Oracle.
My problem is that the data in the source Excel file doesn't have any unique columns from which to select a primary key or a composite key, as it is possible (although very unlikely) that a new record could have exactly the same information. I am unable to use the "generated always as identity" option because my SSIS package needs to truncate the Stage Table at the beginning of each run. This would generate the same ID numbers in the new load and create problems in the Target Table.
Any suggestions as to how I can get around this problem?
Welcome to SO and ETL. Instead of using a staging table, use two sources in SSIS: the Excel file and the existing production table. Sort both inputs and then perform a merge join on the unique identifier. From there, use a derived column transformation to add a new column called 'Action', which marks each row as an INSERT, UPDATE, or DELETE based on whether the join key is NULL. So:
NULL from file means DELETE (not in file, in database)
NULL from database means INSERT (in file, not in database)
Not NULL for both means UPDATE (in file, in database)
From there, use a conditional split to send rows to either an OLE DB Destination (INSERT) or an OLE DB Command (UPDATE or DELETE). You can now remove the stage environment and the MERGE command from your process. This has the added benefit of removing the ETL load from the database server, assuming SSIS is running on a separate server.
Note: The sort transformation has the option to remove duplicates.
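For what it's worth, the same classification can be sketched as set-based SQL, which may make the merge-join-plus-derived-column logic easier to follow (business_key and the table names are illustrative placeholders):

    SELECT COALESCE(f.business_key, d.business_key) AS business_key,
           CASE
               WHEN f.business_key IS NULL THEN 'DELETE'   -- not in file, in database
               WHEN d.business_key IS NULL THEN 'INSERT'   -- in file, not in database
               ELSE 'UPDATE'                               -- in file, in database
           END AS Action
    FROM   file_rows f
    FULL OUTER JOIN db_rows d
           ON f.business_key = d.business_key;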

CrateDB cannot query data in a shard

I have an instance of CrateDB 1.0.2 from which I dropped a table, then re-created a table with the same name and a slightly modified schema. I then imported data using the COPY FROM command. The file given to COPY FROM contains 10,000 records and the command runs fine. When I check the table tab in the Crate web console, it shows many partitions added, each holding a few records. If I add up the number-of-records column on this tab, it comes close to 10k, but when I run "select count(*) from mytable", it returns only around 8,000 records.
On further investigation I found that there are certain partitions whose data cannot be queried at all. Has anyone seen this problem? Does it have anything to do with the table being dropped and re-created with the same name? I also observed that when a table is dropped, not all files related to that table are deleted from path.data. Are these leftover directories the reason those partitions became non-queryable? While importing I saw a "Document already exists" exception, but I know my data does not have any duplicate values for the primary key column.
Some questions to clarify the issue:
Have you run REFRESH TABLE mytable after your COPY FROM command has finished?
Are you sure that, with the new schema of the table, there are no duplicate records?
Since 1.x versions are not supported anymore, could you try CrateDB 2.1.6, which is the current stable version, to see if the problem persists?
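If the count discrepancy is only a visibility issue, the first check would look something like this (the table name is taken from the question; counts taken before a refresh are only eventually consistent):

    -- CrateDB makes newly written rows visible to queries only after a refresh.
    REFRESH TABLE mytable;
    SELECT COUNT(*) FROM mytable;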

Pentaho update/insert

I am trying to have a setup in Pentaho where:
My source data is in MySQL DB and target database is Amazon redshift.
I want to have incremental loads on Redshift database table, based on the last updated timestamp from MySQL DB table.
Primary key is student ID.
Can I implement this using update/insert in Pentaho?
The Insert/Update step in Pentaho Data Integration serves the purpose of inserting a row if it doesn't exist in the destination table or updating it if it's already there. It has nothing to do with incremental loads as such, but if your loads should insert or update records based on some Change Data Capture (CDC) mechanism, then this is the right step to use at the end of the process.
For example, you could go one of two ways:
If you have CDC, then limit the data at the Table Input step for MySQL, since you already know the last time a record was modified (the last load); see the query sketch after this list.
If you don't have CDC and you are comparing entire tables, then join the two sets to produce the rows that have changed and then perform the load (a slower solution).
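For the CDC case, the Table Input query could be as simple as the sketch below, assuming a last_updated column and a ${LAST_LOAD_TS} variable holding the timestamp of the previous run (both are assumptions, and "Replace variables in script" must be enabled on the step):

    -- Pull only the rows changed since the last successful load (illustrative names).
    SELECT student_id, first_name, last_name, last_updated
    FROM   students
    WHERE  last_updated > '${LAST_LOAD_TS}'
    ORDER BY last_updated;

The Insert/Update step downstream can then match on student_id and insert or update the corresponding rows in the Redshift table.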

Is it possible to overwrite with an SSIS Insert or similar?

I have a .csv file that gets pivoted into 6 million rows during an SSIS package run. I have a table in SQL Server 2005 of 25 million+ rows. The .csv file has data that duplicates data in the table; is it possible for rows to be updated if they already exist, or what would be the best method to achieve this efficiently?
Comparing 6 million rows against 25 million rows is not going to be very efficient with a lookup or a per-row SQL command data flow component being called for each row to do an upsert. In cases like this it is often most efficient to load the rows quickly into a staging table and use a single set-based SQL command to do the upsert.
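A hedged sketch of that set-based upsert, using an UPDATE plus INSERT pair because MERGE is not available on SQL Server 2005 (table, key, and column names are illustrative):

    -- 1) Update rows that already exist in the target.
    UPDATE t
    SET    t.col1 = s.col1,
           t.col2 = s.col2
    FROM   dbo.target t
    JOIN   dbo.staging s ON s.id = t.id;

    -- 2) Insert rows that are not in the target yet.
    INSERT INTO dbo.target (id, col1, col2)
    SELECT s.id, s.col1, s.col2
    FROM   dbo.staging s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.target t WHERE t.id = s.id);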
Even if you do decide to do the lookup, split the flow into two streams: one that inserts directly and another that inserts into a staging table for an update operation.
If you don't mind losing the old data (i.e. the latest file is all that matters, not what's in the table), you could delete all the records in the table and insert them again.
You could also load into a temporary table and determine what needs to be updated and what needs to be inserted from there.
You can use the Lookup task to identify any matching rows in the CSV and the table, then pass the output of this to another table or data flow and use a SQL task to perform the required Update.