Azure Data Factory - delete data from a MongoDb (Atlas) Collection - azure-data-factory-2

I'm trying to use Azure Data Factory (V2) to copy data to a MongoDb database on Atlas, using the MongoDB Atlas connector but I have an issue.
I want to do an Upsert but the data I want to copy has no primary key, and as the documentation says:
Note: Data Factory automatically generates an _id for a document if an
_id isn't specified either in the original document or by column mapping. This means that you must ensure that, for upsert to work as
expected, your document has an ID.
This means the first load works fine, but then subsequent loads just insert more data rather than replacing current records.
I also can't find anything native to Data Factory that would allow me to do a delete on the target collection before running the Copy step.
My fallback will be to create a small Function to delete the data in the target collection before inserting fresh, as below. A full wipe and replace. But before doing that I wondered if anyone had tried something similar before and could suggest something within Data Factory that I have missed that would meet my needs.

As per the document, You cannot delete multiple documents at once from the MongoDB Atlas. As an alternative, you can use the db.collection.deleteMany() method in the embedded MongoDB Shell to delete multiple documents in a single operation.
It has been recommended to use Mongo Shell to delete via query. To delete all documents from a collection, pass an empty filter document {} to the db.collection.deleteMany() method.
Eg: db.movies.deleteMany({})

Related

Why isn't there an option to upsert data in Azure Data Factory inline sink

The problem I'm trying to tackle is inserting and/or updating dynamic tables in a sink within an Azure Data Factory data flow. I've managed to get the source data, transform it how I want it and then send it to a sink. The pipeline ran successfully and it said it copied 37 rows (as expected) but investigation showed that no data was actually deposited in the target table. This was because the Table Action on the sink was set to 'None'. So in trying to fix this last part, it seems I don't have the 'Create' option but do have the 'Recreate' option (see screenshot of the sink below) which is not what I want as the datasource will eventually only have changed data. I need the process to create the table if it doesn't exist and then Upsert data. (Recreate drops the table and then creates it).
If I change the sink type from Inline to Dataset, then I can select Insert and Upsert, etc options but this is then not dynamic as I need to select a specific dataset.
So has anyone come across the same issue and have you managed to have dynamic sinks in your data flow where the table is created if it doesn't exist, then upsert data.
I guess I can add a Pre SQL script which takes care of the 'create the table if it doesn't exist' but I still can't select the Upsert option with inline tables.
For the CREATE TABLE IF NOT EXISTS issue, I would recommend a Stored Procedure that is executed in the pipeline prior to the Data Flow.
For Inline vs Dataset, you can make the Dataset very flexible:
So still based on your runtime table name and no schema, so no need to target a specific table.
For the UPSERT issue, make sure you have an AlterRow activity before the Sink:

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There is two columns SLA_Processing_start_time and SLA_Processing_end_time that have the datatype TIME
Somehow, while writing the data to the staged area, the data is changed to something like 0:08:00:00.0000000,0:17:00:00.0000000 and that causes for an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas to why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I have the same error, a leading zero appears ahead of time (0: 08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good, I even had problems to use it for AWS environment, I had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to insert your data on BlobStorage / ADLS (this activity did it anyway) preferably in the parquet file format and a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and do the loading of data into a table from files there, you can use a regular query or write a stored procedure and call it.
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the build-in direct to Snowflake copy feature.
Since I could not figure out, how to control that intermediate blob file, that is created on a direct to Snowflake copy, I ended up writing a plain file into the blob storage, and reading it again, to load into Snowflake
So instead having it all in one step, I manually split it up in two actions
One action that takes the data from the AzureSQL and saves it as a plain text file on the blob storage
And then the second action, that reads the file, and loads it into Snowflake.
This works, and is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

How to performance test an Update endpoint in JMeter while automatically updating row value in payload

I have an Update entity endpoint in my .NET Core Microservice API that needs to be tested for performance. For all other endpoints, I am able to store the ID in a CSV file and load it before processing, however I want to reuse the values in the CSV for update, which requires updating and keeping track of the Row Version attribute for the ID.
I will be testing using 100 Users and 100 Orders, so I will need to match every user to one order so they don't try updating the same entity.
Steps:
Read CSV with ID and current row version
Call Update endpoint on the ID and row version, read in new Row Version from response body
Store the new row version and the ID within JMeter to reuse in the test
Call Update endpoint on the ID and new row version
The problem with storing inside of the CSV is JMeter will be reading and writing from the same file. I am looking for a way to use a Java like collection inside of my script to not have to read and write from a file.
The dictionary would look like {'q28937-3423572903485-324875', rowVersion: 42}
Add a Post Processor as a child of your HTTP request (the first update) to extract the new ID and rowVersion.
Then in the next update, you should use Jmeter variables ${ID} and ${rowVersion} which holds the new values that you extracted using Post Processor.
Note that variables are not shared between threads, from Jmeter user manual best practices - 16.13
Variables are local to a thread; a variable set in one thread cannot be read in another. This is by design.
Also check
Using RegEx (Regular Expression Extractor) with JMeter guide.
Using CSV DATA SET CONFIG guide.
CSV data set config Jmeter User Manual

BigQuery: Cannot insert new value after updating several schema fields using streaming API

The issue I am facing in my nodejs application is identical to this user's question: Cannot insert new value to BigQuery table after updating with new column using streaming API.
To my understanding changes such as widening a table's schema may require some period of time before streamed inserts can reference the new columns otherwise a 'no such field' error is returned. For me this error is not always consistent as sometimes I am able to successfully insert.
However, I specifically wanted to know if you could alternatively use a load job instead of streaming? If so what drawbacks does it have as I am not sure of the difference even having read the documentation.
Alternatively, if I do use streaming but with the ignoreUnknownValues option, does that mean that all of the data is eventually inserted including data referencing new columns? Just that new columns are not queryable until the table schema is finished updating?

Doctrine schema changes while keeping data?

We're developing a Doctrine backed website using YAML to define our schema. Our schema changes regularly (including fk relations) so we need to do a lot of:
Doctrine::generateModelsFromYaml(APPPATH . 'models/yaml', APPPATH . 'models', array('generateTableClasses' => true));
Doctrine::dropDatabases();
Doctrine::createDatabases();
Doctrine::createTablesFromModels();
We would like to keep existing data and store it back in the re-created database. So I copy the data into a temporary database before the main db is dropped.
How do I get the data from the "old-scheme DB copy" to the "new-scheme DB"? (the new scheme only contains NEW columns, NO COLUMNS ARE REMOVED)
NOTE:
This obviously doesn't work because the column count doesn't match.
SELECT * FROM copy.Table INTO newscheme.Table
This obviously does work, however this is consuming too much time to write for every table:
SELECT old.col, old.col2, old.col3,'somenewdefaultvalue' FROM copy.Table as old INTO newscheme.Table
Have you looked into Migrations? They allow you to alter your database schema in programmatical way. WIthout losing data (unless you remove colums, of course)
How about writing a script (using the Doctrine classes for example) which parses the yaml schema files (both the previous version and the "next" version) and generates the sql scripts to run? It would be a one-time job and not require that much work. The benefit of generating manual migration scripts is that you can easily store them in the version control system and replay version steps later on. If that's not something you need, you can just gather up changes in the code and do it directly through the database driver.
Of course, the more fancy your schema changes becomes, the harder the maintenance will get i.e. column name changes, null to not null etc.