Referring to an existing DynamoDB table in Lambda functions - amazon-s3

I'm trying to read an existing DynamoDB table in a Lambda function, but the resources section in my YAML template creates a new table. If this is possible, could someone please explain how? I also need to use an existing S3 bucket.

If you change your resources frequently, or even occasionally, you should use Parameter Store. This will allow your Lambda function to pick up the correct table names at runtime.
Any time you change your table to have a new name, you just update the value in Parameter Store and your Lambda will automatically refer to the new table.
https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html
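A minimal sketch of what that could look like inside the handler, assuming a parameter named /myapp/table-name holds the table name and an environment variable EXISTING_BUCKET holds the bucket name (both names, the key attribute "id", and the object key are hypothetical), and assuming the function's role has ssm:GetParameter, dynamodb:GetItem and s3:GetObject permissions:

import os
import boto3

ssm = boto3.client("ssm")
dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Resolve the table name at runtime instead of hard-coding it in the template;
    # moving this call outside the handler would cache it across warm invocations.
    table_name = ssm.get_parameter(Name="/myapp/table-name")["Parameter"]["Value"]
    table = dynamodb.Table(table_name)

    # Read from the existing table; "id" is a hypothetical key attribute.
    item = table.get_item(Key={"id": event["id"]}).get("Item")

    # Read from the existing bucket; referencing it by name creates nothing new.
    obj = s3.get_object(Bucket=os.environ["EXISTING_BUCKET"], Key="some/key.json")
    return {"found": item is not None, "s3_size": obj["ContentLength"]}

Because the table and bucket are only referenced by name at runtime, they do not need to be declared under resources in the YAML, so no new table or bucket is created.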

Related

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue with a source table where two columns, SLA_Processing_start_time and SLA_Processing_end_time, have the datatype TIME.
Somehow, while the data is written to the staging area, it is changed to something like 0:08:00:00.0000000, 0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas to why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I get the same error: a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it with a Snowflake environment hosted on AWS, and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to land your data in Blob Storage / ADLS (the direct copy was doing this anyway), preferably in the Parquet file format and with a self-designed folder structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data from those files into a table; you can use a regular query or write a stored procedure and call it.
Thanks to this, you will have more control over what is happening, and you will build a Data Lake solution for your organization.
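For the Snowflake side of this, a rough sketch of the stage and load statements, run here through the snowflake-connector-python package; the connection details, stage name, table name, container URL, SAS token and file format options are all placeholders that depend on your environment:

import snowflake.connector

# All connection details below are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()

# One-time setup: a permanent stage pointing at the Blob Storage / ADLS container
# that the Copy data activity writes the Parquet files into.
cur.execute("""
    CREATE STAGE IF NOT EXISTS my_adls_stage
      URL = 'azure://myaccount.blob.core.windows.net/my-container/fact/'
      CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
      FILE_FORMAT = (TYPE = PARQUET)
""")

# The load step that the Lookup activity (or a stored procedure it calls) would run.
cur.execute("""
    COPY INTO my_table
    FROM @my_adls_stage
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
conn.close()

With a permanent stage you also keep full control over how the TIME columns are interpreted, since you choose the file format and COPY options yourself.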
My own solution is pretty close to the accepted answer, but I still believe there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created by a direct-to-Snowflake copy, I ended up writing a plain file into blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it up into two actions:
One action that takes the data from Azure SQL and saves it as a plain text file on the blob storage.
And then a second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

Azure Data Factory - delete data from a MongoDb (Atlas) Collection

I'm trying to use Azure Data Factory (V2) to copy data to a MongoDb database on Atlas, using the MongoDB Atlas connector but I have an issue.
I want to do an Upsert but the data I want to copy has no primary key, and as the documentation says:
Note: Data Factory automatically generates an _id for a document if an _id isn't specified either in the original document or by column mapping. This means that you must ensure that, for upsert to work as expected, your document has an ID.
This means the first load works fine, but then subsequent loads just insert more data rather than replacing current records.
I also can't find anything native to Data Factory that would allow me to do a delete on the target collection before running the Copy step.
My fallback will be to create a small Function to delete the data in the target collection before inserting fresh, as below. A full wipe and replace. But before doing that I wondered if anyone had tried something similar before and could suggest something within Data Factory that I have missed that would meet my needs.
As per the documentation, you cannot delete multiple documents at once from MongoDB Atlas through Data Factory. As an alternative, you can use the db.collection.deleteMany() method in the embedded MongoDB Shell to delete multiple documents in a single operation.
It is recommended to use the Mongo Shell to delete via a query. To delete all documents from a collection, pass an empty filter document {} to the db.collection.deleteMany() method.
E.g.: db.movies.deleteMany({})
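If you go down the route of the small Function mentioned in the question, a minimal sketch of the wipe using pymongo; the connection string, database and collection names are placeholders:

from pymongo import MongoClient

# Placeholder Atlas connection string.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
collection = client["my_database"]["my_collection"]

# Full wipe before the Copy activity re-inserts fresh data:
# an empty filter {} matches every document.
result = collection.delete_many({})
print(f"Deleted {result.deleted_count} documents")

In Data Factory this could be wrapped in an Azure Function and called with an Azure Function activity placed just before the Copy activity.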

Azure Data Factory V2: How to pass a file name to stored procedure variable

I have a big fact Azure SQL table with the following structure:
Company   Revenue
-----------------
A         100
B         200
C         100
...       ...
I am now building a stored procedure in Azure Data Factory V2 that will delete all records of a specific company from the Azure SQL fact table above on a monthly basis. For this exercise, this company is identified by the variable @company. The structure of the stored procedure was created as:
@company NVARCHAR(5)
DELETE FROM table
WHERE [company] = @company
As I will have different Excel files from each company that will be inserting data into this table on a monthly basis (with a Copy activity), I want to use the stored procedure above to delete the old data from that company before I add the most recent data.
I would then like to pass the name of that Excel file (stored in a blob container) to the variable "@company", so that the stored procedure knows which data to delete from the fact table. For example: if the Excel file is "A", the stored procedure shall run "delete from table where company = A".
Any ideas on how to pass the Excel file names to the variable "@company" and set this up in Azure Data Factory V2?
Any ideas on how to pass the Excel file names to the variable "@company" and set this up on Azure Data Factory V2?
Based on your description, an event-based trigger in Azure Data Factory may meet your needs. An event-based trigger runs pipelines in response to an event, such as the arrival of a file, or the deletion of a file, in Azure Blob Storage.
So, when the new Excel file is created in the blob storage (by the way, this only supports V2 storage accounts; more details in this article), you can get @triggerBody().folderPath and @triggerBody().fileName. To use the values of these properties in a pipeline, you must map the properties to pipeline parameters. After mapping the properties to parameters, you can access the values captured by the trigger through the @pipeline().parameters.parameterName expression throughout the pipeline. (doc)
You could get the filename and pass it into your stored procedure. Then do the deletion and copy activities.
If you use the Stored Procedure activity, you can fill the parameters from pipeline parameters: https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-stored-procedure
You can click "Import" to automatically add the parameters after selecting a stored procedure, or add them by hand. Then you can use a pipeline expression to fill them dynamically, e.g., as suggested by Jay Gong, using the trigger properties @triggerBody().folderPath and @triggerBody().fileName.
Alternatively, instead of using a stored procedure, you could add a pre-copy script to your copy activity:
This shows up only for appropriate sinks, such as a database table. You can fill this script dynamically as well. In your case, this could look like:
@{concat('DELETE FROM table WHERE [company] = ''', triggerBody().fileName, '''')}
It might also make sense to add a parameter containing the filename to the pipeline and set it to @triggerBody().fileName, or to a more complex expression, in case you are using it multiple times.

Enumerate blob names in Azure Data Factory v2

I need to enumerate all the blob names that sit in an Azure Blobs container and dump the list to a file in another blob storage.
The part that I cannot master is the enumeration.
Thanks.
Get metadata activity is what you want.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
Please use childItems to get all the files, and then use a ForEach activity to iterate over the childItems.
Inside the ForEach activity, you may want to check whether each item is a file. You could use an If Condition activity with an expression like @equals(item().type, 'File').
Then in the "If true" activity, assume you want to copy data, you could use #item().name to get each of your file name.
You can find more documentation at the link above.
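If you would rather do the enumeration outside Data Factory, for example in a small Azure Function called from the pipeline, a rough sketch using the azure-storage-blob SDK; the connection string, container names and output blob name are placeholders:

from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container names.
service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("source-container")

# Enumerate every blob name in the source container.
names = [blob.name for blob in source.list_blobs()]

# Dump the list as a text file into another container.
target = service.get_blob_client(container="output-container", blob="blob-names.txt")
target.upload_blob("\n".join(names), overwrite=True)

If the target lives in a different storage account, create a second BlobServiceClient for that account and write through it instead.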

AWS Lambda - Is there a way to pass parameters to a lambda function when an event occurs

I have a DynamoDB table, and whenever a new record is added, I want to archive the old data to S3, so I thought I could use AWS Lambda. The Lambda function will get the record that was added/modified, but I also want to pass the function an additional parameter: the S3 path to which the record has to be uploaded.
One way is to store whatever I want to pass to the Lambda function in another table or in S3. But this parameter will change as each record is inserted into the main table, so I can't reliably read it from my Lambda function. (By the time the Lambda function gets executed for the first inserted record, a few more records would have been inserted.)
Is there a way to pass params to the lambda function?
P.S: I want to execute the lambda asynchronously.
Thanks...
Why not add this parameter (the S3 path) to your DynamoDB table itself, on the row that is being added? Not in another table, but in the same table that the Lambda is listening on.
You can now accomplish this by:
Attaching a DynamoDB stream to your DynamoDB table with a view type of NEW_AND_OLD_IMAGES
Creating an event source mapping on your Lambda function to read the DynamoDB stream
Adding an environment variable to your Lambda function to indicate where to write the data in S3
You'll still have to derive the details of where to store the record from the record itself, but you can indicate the bucket or table name in the environment variable.
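A minimal sketch of such a handler; the environment variable name ARCHIVE_BUCKET and the key layout are made up for illustration:

import json
import os
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The target bucket comes from the function's configuration,
    # not from the invocation event.
    bucket = os.environ["ARCHIVE_BUCKET"]

    for record in event["Records"]:
        # With the NEW_AND_OLD_IMAGES view type, updates and deletes carry the
        # previous version of the item in OldImage.
        old_image = record["dynamodb"].get("OldImage")
        if not old_image:
            continue  # e.g. an INSERT has no old data to archive

        key = f"archive/{record['eventID']}.json"
        s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(old_image))

If the path needs to vary per record rather than per function, derive it from an attribute stored on the item itself (as the other answer suggests) instead of from the environment variable.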