I have a question about inserting records into a database using a File endpoint.
I want to insert JSON records into a database. I created a JSON file and inserted all of its data into the database. The insert itself succeeds, but the same data keeps being inserted over and over, and then this error occurs: Duplicate entry '1' for key 'PRIMARY'
How can I solve this error? I don't want the data to be inserted repeatedly. How can I make the insert happen only once?
I used the following flow:
**File->Json to Object->Splitter->Database**
Please help me.
You can use an Idempotent Message Filter (after the Splitter) to ensure that duplicate entries are discarded. If your JSON representation has a unique identifier, use the Idempotent Message Filter:
<idempotent-message-filter idExpression="#[entry.id]">
<simple-text-file-store directory="./idempotent"/>
</idempotent-message-filter>
Otherwise, use the Idempotent Secure Hash Message Filter (which filters messages based on their hash value):
<idempotent-secure-hash-message-filter messageDigestAlgorithm="SHA-256">
    <simple-text-file-store directory="./idempotent"/>
</idempotent-secure-hash-message-filter>
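To illustrate what hash-based filtering does, here is a minimal sketch in Python rather than Mule configuration. It is not Mule's implementation: the function names `record_digest` and `filter_new` are hypothetical, and an in-memory set stands in for the file-backed store.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Return a SHA-256 hex digest of the record's canonical JSON form."""
    # sort_keys makes the digest independent of key order in the source file
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_new(records, seen: set):
    """Yield only the records whose digest has not been seen before."""
    for record in records:
        digest = record_digest(record)
        if digest in seen:
            continue  # duplicate: drop it instead of passing it to the DB
        seen.add(digest)
        yield record


seen_digests = set()  # assumption: stands in for a persistent store
batch = [
    {"id": 1, "name": "a"},
    {"name": "a", "id": 1},  # same content, different key order
    {"id": 2, "name": "b"},
]
first_pass = list(filter_new(batch, seen_digests))
second_pass = list(filter_new(batch, seen_digests))  # file read a second time
```

On the second pass every record is dropped, which is the behavior you want when the File endpoint picks up the same data again.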
Please check the following reference for more info.
Personally I would try to avoid an idempotent filter with a simple message store, as it will also block any later updates of the data in the DB.
If your DBMS supports it, I would try using an UPSERT mechanism, which effectively makes your query idempotent. In PostgreSQL this can be done with INSERT ... ON CONFLICT, and in MySQL with INSERT ... ON DUPLICATE KEY UPDATE.
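A minimal sketch of the idea, using Python's built-in sqlite3 module only so that it runs without a server. The table `entry` and its columns are made up for illustration, and the `ON CONFLICT ... DO UPDATE` form shown here assumes SQLite 3.24 or newer; the MySQL syntax differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, name TEXT)")

# Insert a row, or update the existing row when the primary key already exists.
UPSERT = """
INSERT INTO entry (id, name) VALUES (?, ?)
ON CONFLICT(id) DO UPDATE SET name = excluded.name
"""


def load(rows):
    """Apply the upsert to every (id, name) pair in one transaction."""
    with conn:
        conn.executemany(UPSERT, rows)


load([(1, "first"), (2, "second")])
# Loading the same keys again does not raise a duplicate-key error,
# and a changed value is still applied.
load([(1, "first, edited"), (2, "second")])

rows = conn.execute("SELECT id, name FROM entry ORDER BY id").fetchall()
```

Unlike the idempotent filter, re-running the load is harmless and later changes to a record still reach the table.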
You can check for duplicates easily using .ack queries in Mule.
An .ack query is a query that runs automatically, immediately after the normal query.
You need to create an .ack query that runs immediately after your insert query, checks the rows already inserted, and sets the flag.
See here for how to do it with an .ack query:
http://training.middlewareschool.com/mule/database-transport/
and here:
http://www.mulesoft.org/documentation/display/current/JDBC+Transport+Reference#JDBCTransportReference-Acknowledgment
The problem I'm trying to tackle is inserting and/or updating dynamic tables in a sink within an Azure Data Factory data flow. I've managed to get the source data, transform it how I want it and then send it to a sink. The pipeline ran successfully and it said it copied 37 rows (as expected) but investigation showed that no data was actually deposited in the target table. This was because the Table Action on the sink was set to 'None'. So in trying to fix this last part, it seems I don't have the 'Create' option but do have the 'Recreate' option (see screenshot of the sink below) which is not what I want as the datasource will eventually only have changed data. I need the process to create the table if it doesn't exist and then Upsert data. (Recreate drops the table and then creates it).
If I change the sink type from Inline to Dataset, then I can select Insert and Upsert, etc options but this is then not dynamic as I need to select a specific dataset.
Has anyone come across the same issue, and have you managed to have dynamic sinks in your data flow where the table is created if it doesn't exist and the data is then upserted?
I guess I can add a Pre SQL script which takes care of the 'create the table if it doesn't exist' but I still can't select the Upsert option with inline tables.
For the CREATE TABLE IF NOT EXISTS issue, I would recommend a Stored Procedure that is executed in the pipeline prior to the Data Flow.
For Inline vs Dataset, you can make the Dataset very flexible:
The Dataset is then still driven by your runtime table name and carries no schema, so there is no need to target a specific table.
For the UPSERT issue, make sure you have an AlterRow activity before the Sink:
The issue I am facing in my nodejs application is identical to this user's question: Cannot insert new value to BigQuery table after updating with new column using streaming API.
To my understanding, changes such as widening a table's schema may require some period of time before streamed inserts can reference the new columns; otherwise a 'no such field' error is returned. For me this error is not consistent, as sometimes I am able to insert successfully.
However, I specifically wanted to know if you could alternatively use a load job instead of streaming? If so what drawbacks does it have as I am not sure of the difference even having read the documentation.
Alternatively, if I do use streaming but with the ignoreUnknownValues option, does that mean that all of the data is eventually inserted including data referencing new columns? Just that new columns are not queryable until the table schema is finished updating?
I cannot delete the rows selected by the WHERE clause.
My query:
delete from `dataset.events1` as t where t.group='error';
Result:
Error: UPDATE or DELETE statement over table dataset.events1 would affect rows in the streaming buffer, which is not supported.
According to the BQ docs:
Rows that were written to a table recently via streaming (using the tabledata.insertall method) cannot be modified using UPDATE, DELETE, or MERGE statements. Recent writes are typically those that occur within the last 30 minutes. Note that all other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements.
This looks like the error you're facing.
You can check if your table has a streaming buffer attached through the BigQuery API.
This error is expected behavior when modifying rows that were recently streamed into the table; it exists to maintain data consistency. You therefore have to wait until the buffer is flushed. Streamed data can take up to 90 minutes to become available for copy/export and other operations, and until then you will get the same error.
To validate if the table has an active streaming buffer process, you can check the tables.get response and verify if it contains a section named streamingBuffer.
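As a rough sketch of that check, assume the tables.get response has already been fetched and parsed into a Python dict (how you fetch it is up to your client). The helper name `has_streaming_buffer` and the sample values below are made up for illustration.

```python
from typing import Optional


def streaming_buffer_info(table_resource: dict) -> Optional[dict]:
    """Return the streamingBuffer section of a tables.get response, if any."""
    return table_resource.get("streamingBuffer")


def has_streaming_buffer(table_resource: dict) -> bool:
    """True while the table still has rows sitting in the streaming buffer."""
    return streaming_buffer_info(table_resource) is not None


# Hand-written sample responses (illustrative values, not real output).
buffered = {
    "id": "project:dataset.events1",
    "streamingBuffer": {
        "estimatedRows": "37",
        "estimatedBytes": "4096",
        "oldestEntryTime": "1700000000000",
    },
}
flushed = {"id": "project:dataset.events1"}

buffered_state = has_streaming_buffer(buffered)
flushed_state = has_streaming_buffer(flushed)
```

If the section is present, a DELETE that touches recently streamed rows will keep failing; retry once the section has disappeared from the response.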
I'm trying to create a data sync using MuleSoft so that Db1 is checked for any updates based on LastModified Date and, if there are any, the updates are applied to Db2.
I've got the script working to the point where, when it is first started, the data is copied from Db1 to Db2. After that, however, the script constantly updates the records in Db2. (Below is my flow diagram.)
I've tried to set up recordVars in the message enricher (in Batch_Step) to check whether records exist and route them accordingly in the Choice (in Batch_Step1).
I've also enabled the watermark in Poll for the timestamp, but nothing works to stop the constant updating of inserted records.
Below are screenshot of my configs:
Watermark Setup:
Db1 query:
BatchStep Accept Expression:
Message Enricher:
Choice Setup:
Add LastModifiedDate to the SELECT statement from Db1 so that the watermark is able to access the field payload.LastModifiedDate.
Also, what is your query in the Db2 batch_step? Check it, because it might always be returning results, which would make payload.size > 0 always true.
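To make the watermark idea concrete, here is a small sketch in Python with sqlite3 instead of Mule. The tables `db1_items` and `db2_items`, the column names, and the `sync` function are all hypothetical; the point is only that each poll selects rows newer than the stored watermark and then advances it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE db1_items (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT);
CREATE TABLE db2_items (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT);
INSERT INTO db1_items VALUES (1, 'a', '2024-01-01T10:00:00'),
                             (2, 'b', '2024-01-01T11:00:00');
""")


def sync(watermark: str):
    """Copy rows modified after the watermark; return (count, new watermark)."""
    changed = conn.execute(
        "SELECT id, name, last_modified FROM db1_items "
        "WHERE last_modified > ? ORDER BY last_modified",
        (watermark,),
    ).fetchall()
    with conn:
        conn.executemany(
            "INSERT INTO db2_items (id, name, last_modified) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
            "last_modified = excluded.last_modified",
            changed,
        )
    # The watermark only moves forward when something was actually read.
    new_watermark = changed[-1][2] if changed else watermark
    return len(changed), new_watermark


count1, wm = sync("1970-01-01T00:00:00")  # first run copies everything
count2, wm = sync(wm)                     # nothing changed, nothing copied
conn.execute(
    "UPDATE db1_items SET name = 'b2', last_modified = '2024-01-02T09:00:00' "
    "WHERE id = 2"
)
count3, wm = sync(wm)                     # only the modified row is copied
```

If the source query does not return the timestamp column, the watermark can never advance and every poll looks like the first run, which matches the constant updating described in the question.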
I am new to Pentaho Data Integration; I need to integrate one database into another location as an ETL job. I want to count the number of inserts/updates during the ETL job and insert that count into another table. Can anyone help me with this?
I don't think that there's a built-in functionality for returning the number of affected rows of an Insert/Update step in PDI to date.
Nevertheless, most database vendors are able to provide you with the ability to get the number of affected rows from a given operation.
In PostgreSQL, for instance, it would look like this:
/* Count affected rows from INSERT */
WITH inserted_rows AS (
INSERT INTO ...
VALUES
...
RETURNING 1
)
SELECT count(*) FROM inserted_rows;
/* Count affected rows from UPDATE */
WITH updated_rows AS (
UPDATE ...
SET ...
WHERE ...
RETURNING 1
)
SELECT count(*) FROM updated_rows;
However, you're aiming to do that from within a PDI job, so I suggest that you try to get to a point where you control the SQL script.
Suggestion: Save the source data in a file on the target DB server, then use it, perhaps with a bulk loading functionality, to insert/update, then save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step in the Job's scope.
EDIT: the implementation is a matter of chosen design, so the suggested solution is one of many. On a very high level, you could do something like the following.
Transformation I - extract data from source
- Get the data from the source, be it a database or anything else
- Prepare it for output in a way that it fits the target DB's structure
- Save a CSV file using the text file output step on the file system
Parent Job
- If the PDI server is the same as the target DB server:
  - Use the Execute SQL Script step to:
    - Read data from the file and perform the INSERT/UPDATE
    - Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
- If the PDI server is NOT the same as the target DB server:
  - Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
  - Use the Execute SQL Script step to:
    - Read data from the file and perform the INSERT/UPDATE
    - Write the number of affected rows into a table
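The "perform the INSERT/UPDATE, then write the number of affected rows into a table" part can be sketched outside PDI like this. It uses Python's sqlite3 only as a stand-in for the target database; the tables `target` and `load_audit` and the helper `run_and_log` are hypothetical names.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE load_audit (operation TEXT, affected_rows INTEGER, run_at TEXT);
INSERT INTO target VALUES (1, 'old');
""")


def run_and_log(operation: str, sql: str, params=()) -> int:
    """Run one DML statement and record its affected-row count with a timestamp."""
    with conn:
        cursor = conn.execute(sql, params)
        affected = cursor.rowcount  # rows changed by this statement
        conn.execute(
            "INSERT INTO load_audit VALUES (?, ?, ?)",
            (operation, affected, datetime.now(timezone.utc).isoformat()),
        )
    return affected


updated = run_and_log(
    "UPDATE", "UPDATE target SET name = ? WHERE id = ?", ("new", 1)
)
inserted = run_and_log(
    "INSERT",
    "INSERT INTO target (id, name) VALUES (?, ?), (?, ?)",
    (2, "b", 3, "c"),
)
audit = conn.execute(
    "SELECT operation, affected_rows FROM load_audit ORDER BY rowid"
).fetchall()
```

Writing the count in the same transaction as the load keeps the audit table consistent with what was actually committed.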
EDIT 2: another suggested solution
As suggested by @user3123116, you can use the Compare Fields step (if it is not part of your environment, check the marketplace for it).
The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.
Eventually it could look like so (note that this is just the comparison and counting part):
Also note that you can split the input of the source data stream (COPY, not DISTRIBUTE), and do your insert/update, but this stream must wait for the stream of the field comparison to end the query on the target database, otherwise you might end up with the wrong statistics.
The "Compare Fields" step takes 2 streams as input for comparison, and its output is 4 distinct streams for "Identical", "Changed", "Added", and "Removed" records. You can count those 4, and then process the "Changed", "Added", and "Removed" records with an Insert/Update.
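The classification into four groups can be illustrated with a short Python sketch. This is not the PDI step itself; `compare_rows` is a hypothetical helper that assumes both inputs are dictionaries keyed by the primary key.

```python
def compare_rows(reference: dict, incoming: dict) -> dict:
    """Classify keys as identical, changed, added, or removed.

    reference: rows currently in the target, keyed by primary key
    incoming:  rows coming from the source, keyed by primary key
    """
    result = {"identical": [], "changed": [], "added": [], "removed": []}
    for key, row in incoming.items():
        if key not in reference:
            result["added"].append(key)
        elif reference[key] == row:
            result["identical"].append(key)
        else:
            result["changed"].append(key)
    for key in reference:
        if key not in incoming:
            result["removed"].append(key)
    return result


target_rows = {1: ("a",), 2: ("b",), 3: ("c",)}
source_rows = {1: ("a",), 2: ("b-edited",), 4: ("d",)}

diff = compare_rows(target_rows, source_rows)
counts = {group: len(keys) for group, keys in diff.items()}
```

The `counts` dictionary corresponds to the per-stream totals you would write to your statistics table.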
You can do it from the Logging option inside the Transformation settings. Please follow the steps below:
1. Click on the Edit menu --> Settings
2. Switch to the Logging tab
3. Select Step from the left menu
4. Provide the Log Connection and the Log table name (say, StepLog)
5. Select the required fields for logging (LINES_OUTPUT for the inserted count and LINES_UPDATED for the updated count)
6. Click on the SQL button and create the table by clicking on the Execute button
Now all the steps will be logged into the Log table(StepLog), you can use it for further actions.
Enjoy