How can I do contract testing for data? - datacontract

Our application ingests files from various data sources into a SaaS-based structured document store.
Occasionally one of our data providers might change the file from our expected format and mess up our ETL jobs.
It would be great if there were some form of contract testing for files, like what Spring Cloud Contract provides for APIs.
Does anyone know of anything like that?

Related

Best Practice for deploying ADF pipelines

I am brand new to ADF and am creating my very first data factory. I am using the UI option (if anyone can point me to any documents for using code I'd be most grateful).
I will have three different environments: dev, test and prod. Each of these has a slightly different config (yes, I know!), so my datasets and linked services will need to change for each environment. What is the best way to do this? How would you approach it?
(P.S. We also have BitBucket and Jenkins/Octopus for CI/CD, so ideally I would like to create scripts to automate this if possible.)
Thank you
You can create a data factory using code. You can find code with detailed information here.
There are two approaches to deploying ADF pipelines:
1. ARM template
2. Custom approach (JSON files, via the REST API). With this approach, we can fully automate the CI/CD process, because the collaboration branch is the source for deployment. This is why the approach is also known as (direct) deployment from code (JSON files).
Refer to this blog by Kamil Nowinski.
The scope of the question is broad, but this video by Mohamed Radwan shows in practice how you can deploy and manage three different environments, i.e. ADF-DEV, ADF-PROD and ADF-UAT.
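As a rough illustration of the JSON-based approach, here is a minimal Python sketch that takes one linked service definition and applies per-environment overrides before deployment. Everything named here is an assumption for illustration: the `ENV_OVERRIDES` table, the example endpoints, and the `BlobStore` definition are hypothetical, and the URL shape reflects my understanding of the Azure management REST API, so check it against the official reference before relying on it. Nothing is sent over the network.

```python
import copy
import json

# Hypothetical per-environment overrides; keys are dotted paths into the JSON.
ENV_OVERRIDES = {
    "dev": {"properties.typeProperties.serviceEndpoint": "https://devstore.example.net"},
    "test": {"properties.typeProperties.serviceEndpoint": "https://teststore.example.net"},
    "prod": {"properties.typeProperties.serviceEndpoint": "https://prodstore.example.net"},
}


def set_path(doc, dotted_path, value):
    """Set a nested value such as 'a.b.c', creating intermediate dicts."""
    keys = dotted_path.split(".")
    node = doc
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def render_for_env(definition, env):
    """Return a copy of the definition with the overrides for one environment."""
    rendered = copy.deepcopy(definition)
    for path, value in ENV_OVERRIDES[env].items():
        set_path(rendered, path, value)
    return rendered


def linked_service_url(subscription_id, resource_group, factory, name):
    """Build the PUT target for a linked service (assumed URL shape)."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/linkedservices/{name}?api-version=2018-06-01"
    )


# Illustrative definition, as it might be stored in the collaboration branch.
base = {
    "name": "BlobStore",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"serviceEndpoint": "https://placeholder.example.net"},
    },
}

prod = render_for_env(base, "prod")
url = linked_service_url("<subscription-id>", "rg-prod", "ADF-PROD", prod["name"])
print(url)
print(json.dumps(prod, indent=2))
```

A Jenkins or Octopus step could run a script like this once per environment and then send each rendered body with an authenticated PUT request.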

Can we automate ETL in Azure?

I am currently working on a very interesting ETL project using Azure to transform my data manually. However, transforming data manually becomes exhausting and slow once I have many source files to process. My pipeline works fine for now because I only have a few files to transform, but what if I have thousands of Excel files?
What I want to achieve is to extend the project so that a Logic App extracts the Excel files arriving by email, and ETL is then applied directly on top of them. Is there any way I can automate ETL in Azure? Can I do ETL without manually modifying the pipeline for each different type of data? How can I make my pipeline flexible enough to handle data transformation for various types of source data?
Thank you in advance for your help.
Can I do ETL without modifying the pipeline for a different type of data manually?
From your description, I assume you already know that the ADF connector is supported in Logic Apps. You can execute an ADF pipeline from the Logic App flow and even pass parameters into the pipeline.
Normally, the source and sink services are fixed within one copy activity, but you can define a dynamic file path in the datasets, so you don't need to create multiple copy activities.
If the data types differ, you could pass a parameter from the Logic App into ADF. Then, before the data is copied, you could use a Switch activity to route the run into different branches.
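The routing idea can be shown outside of ADF as well. The following is a minimal Python sketch of the control flow only, not ADF pipeline JSON: a `file_type` parameter (standing in for the value the Logic App would pass) selects a branch, with a default branch for unknown types. The handler names and file paths are hypothetical.

```python
def transform_csv(path):
    # Placeholder for the CSV-specific transformation branch.
    return f"csv branch handled {path}"


def transform_excel(path):
    # Placeholder for the Excel-specific transformation branch.
    return f"excel branch handled {path}"


def default_branch(path):
    # Fallback for types no branch is registered for.
    return f"no branch for {path}; file skipped"


# One entry per case, similar in spirit to the cases of a Switch activity.
BRANCHES = {
    "csv": transform_csv,
    "excel": transform_excel,
}


def route(file_type, path):
    """Pick the branch for the given type; the pipeline itself never changes."""
    handler = BRANCHES.get(file_type.lower(), default_branch)
    return handler(path)


print(route("Excel", "inbox/report.xlsx"))
print(route("parquet", "inbox/data.parquet"))
```

Supporting a new file type then means adding one branch rather than editing the existing ones.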

Accessing a single RavenDB from different applications

I have a web project that stores objects in RavenDB. For simplicity, the classes live in the web project.
I now have a batch job that is a separate application that will need to query the same database and extract information from it.
Is there a way I can tell Raven to map the documents to classes in the batch job project that have the same properties as those in the web project?
I could create a shared DLL containing just these classes if that's needed, but it seems like an unnecessary hassle.
As long as the structure of the classes you are deserializing into partially matches the structure of the data, it shouldn't make a difference.
The RavenDB server doesn't care at all what classes you use in the client. You certainly could share a dll, or even share a portable dll if you are targeting a different platform. But you are correct that it is not necessary.
However, you should be aware of the Raven-Clr-Type metadata value. The RavenDB client sets this when storing the original document. It is consumed back by the client to assist with deserialization, but it is not fully enforced. The logic basically is this:
Is there ClrType metadata?
If yes, do we have that type loaded in the current app domain?
If yes, then deserialize into that type.
If none of the above, then deserialize dynamically and cast into the type requested (basically, duck-typing).
You can review this bit of the internals in the source code on GitHub here.
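The decision logic above can be sketched in a few lines. This is a Python analogy, not RavenDB client code: `LOADED_TYPES` stands in for the types loaded in the app domain, and the class names and metadata value are made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class BatchOrder:
    # The batch job's own class; it only partially matches the stored document.
    id: str = ""
    total: float = 0.0


# Stand-in for "types loaded in the current app domain" (hypothetical names).
LOADED_TYPES = {"BatchApp.BatchOrder": BatchOrder}


def build(cls, document):
    """Duck-typed construction: copy matching properties, ignore the rest."""
    names = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in document.items() if k in names})


def deserialize(document, metadata, requested_type):
    clr_type_name = metadata.get("Raven-Clr-Type")
    if clr_type_name is not None:
        known = LOADED_TYPES.get(clr_type_name)
        if known is not None:
            # The recorded type is available locally, so use it.
            return build(known, document)
    # No metadata, or the type is not loaded: fall back to the requested type.
    return build(requested_type, document)


# Document written by the web project, whose class is not loaded here.
doc = {"id": "orders/1", "total": 12.5, "notes": "gift wrap"}
meta = {"Raven-Clr-Type": "WebApp.Order, WebApp"}

order = deserialize(doc, meta, BatchOrder)
print(order)
```

Because the web project's type is not loaded in the batch job, the fallback path is taken and the extra `notes` property is simply dropped.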

How to handle multiple data sources in one WCF Domain Service?

I'm working on creating a WCF Domain Service which at the moment provides access to a database. I created the Entity Model, added the DomainService (LinqToEntitiesDomainService) and everything works so far.
But there are cases when my data doesn't come from the DB but from somewhere else (for instance, an uploaded file). Are there any best practices for handling these different data sources properly without resorting to writing two completely different data providers? It would be great to access both types through one interface. Is there already something I can use?
I'm fairly new to this so any advice apart from that is highly appreciated.
In how many cases does the data come from a file? How many files are there? How will you know if a file is there? Are you going to poll the directory? What format are the files in? (XML support is possible.)
Microsoft's documentation suggests that you can create a custom host endpoint, but I don't know what limitations there are.

Website data retrieval

A recent article has prompted me to pick up a project I have been working on for a while. I want to create a web service front end for a number of sites to allow automated completion of forms and data retrieval from the results and from other areas of the site. I have achieved a degree of success using Selenium and custom code; however, I am looking to extend this to a stage where adding additional sites is a trivial task (maybe even one which doesn't require a developer).
The Kapow web data server looks to achieve a lot of this; however, I am told it is quite expensive (currently awaiting a quote). Has anyone had experience with it, or can anyone suggest any alternatives (ideally open source)?
Disclaimer: I realise the potential legal issues around automating data retrieval from third-party websites. This tool is designed to be used in a price comparison system, and all of the websites integrated with it will be done with the express permission of the owners. Where the sites provide an API, this will clearly be the favoured approach.
Thanks
I realise it's been a while since I posted this; however, should anyone come across it, I have had a lot of success using the WSO2 framework (particularly the mashup server) for this. For data mining tasks I have also used a Java library that it wraps, webharvest, which has achieved everything I needed.