SQL-Pandas: transform with Sum on past and new data

I have user records in an xlsx file, and I store those records in a SQL database, applying pandas groupby-transform-Sum() when there are multiple instances of a particular combination on a monthly basis (e.g. Id-Location-Designation-Day-PolicySold).
Until now, all past and newly added data was maintained in the xlsx file only, but going forward only the most recent 3 months of data will be available. I need to store this new data in the SQL DB while keeping the past data already present there (covering several months or years) intact and ensuring no duplicate entries.
Can anyone suggest an efficient approach to handle this?
My current approach is:
1. Read the past data from the SQL table before performing the new write operation.
2. Read the new data from the xlsx file.
3. Merge both.
4. Apply groupby-transform with Sum() to convert daily data to monthly data (see the sketch below).
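A minimal sketch of those four steps in pandas/SQLAlchemy, using the Id-Location-Designation-Day-PolicySold columns from the example. The connection string and the monthly_policy table name are assumptions to adjust to your schema:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server"
)
keys = ["Id", "Location", "Designation", "Month"]  # assumed grouping keys

# 1. Read only the window that can overlap the new file (last 3 months),
#    not the whole history.
past = pd.read_sql(
    "SELECT * FROM monthly_policy WHERE Month >= DATEADD(month, -3, GETDATE())",
    engine,
)

# 2. Read the new daily records and roll the Day column up to a month.
new = pd.read_excel("new_records.xlsx")
new["Month"] = pd.to_datetime(new["Day"]).dt.to_period("M").dt.to_timestamp()

# 3 + 4. Merge both and re-aggregate, so a month that already has partial
# totals is updated rather than duplicated.
combined = pd.concat([past, new[keys + ["PolicySold"]]], ignore_index=True)
monthly = combined.groupby(keys, as_index=False)["PolicySold"].sum()

# Replace only the overlapping window; older history stays untouched.
# (Assumes the new file never reaches further back than the window read above.)
cutoff = monthly["Month"].min()
with engine.begin() as conn:
    conn.execute(text("DELETE FROM monthly_policy WHERE Month >= :cutoff"),
                 {"cutoff": cutoff})
monthly.to_sql("monthly_policy", engine, if_exists="append", index=False)

Deleting and re-inserting only the overlapping window keeps the load idempotent: re-running it for the same file cannot create duplicates.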

Related

Azure Data Factory - copying data from ADLS to my DB in ascending order of date (not working)

I am trying to copy the past 6 months of data from ADLS to my Azure SQL DB using a Data Factory pipeline.
When I enter the start time and end time in the source filter, it copies data starting from the current date and going back to the oldest date (descending order), but my requirement is to copy data from the oldest date to the current date (ascending order). Please suggest how to get the data in ascending date order.
There is no option to sort the source files in the Copy data activity. Instead, you can store the file names with their dates in a SQL table and pull the list sorted by date to process the files in order:
1. Get the list of item names using a Get Metadata activity.
2. Pass the child items to a ForEach activity.
3. Inside the ForEach, get the last modified date of each item using another Get Metadata activity.
4. Pass the item name and the Get Metadata output date to a Stored procedure activity that inserts them into the SQL table.
5. Add a Lookup activity after the ForEach activity to get the files ordered by date (see the sketch below).
6. Pass the Lookup activity output to another ForEach activity and copy each file's data to the sink.
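A sketch of the SQL side of those steps: the tracking table the Stored procedure activity fills, and the ordered query the Lookup activity runs. The table name, column names, and connection details are assumptions:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=server;DATABASE=db;"
    "UID=user;PWD=pass"
)
cur = conn.cursor()

# Tracking table for file names and dates (created once).
cur.execute("""
    IF OBJECT_ID('dbo.FileQueue') IS NULL
        CREATE TABLE dbo.FileQueue (
            FileName     NVARCHAR(400) NOT NULL,
            LastModified DATETIME2     NOT NULL
        )
""")

# What the Stored procedure activity does for each child item:
cur.execute(
    "INSERT INTO dbo.FileQueue (FileName, LastModified) VALUES (?, ?)",
    "sales_january.csv", "2023-01-31T00:00:00",
)
conn.commit()

# What the Lookup activity runs: file names in ascending date order,
# ready to drive the second ForEach + Copy activity.
cur.execute("SELECT FileName FROM dbo.FileQueue ORDER BY LastModified ASC")
ordered_files = [row.FileName for row in cur.fetchall()]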

Azure Data Factory CDS Connector Random Records

I have an online implementation of Dynamics 365 where I need to use Azure Data Factory to do the following:
1 - pull some records based on a FetchXML criteria
2 - select a random 100 records out of these
3 - insert them into a different entity in D365
I am using the ADF CDS connector, which only supports the Copy activity (it does not support data flows as yet).
What I am hoping I can do is the following:
Task 1 - copy all records into a CSV file, adding an extra column that contains a random ID.
Issue here: when I do this and use the rand() function, all the numbers returned are the same.
The same issue happens if I try to use guid(): all values come back the same.
Question 1 - Is there a reason why rand() and guid() are returning the same values for all records, and is there a way to work around it?
Question 2 - Is there another way, that I can't think of, to achieve what I am trying to do: pick a random x number of records from a dataset?
Is there a reason why rand() and guid() are returning the same values for all records and is there a way to work around it?
This is because ADF evaluates the expression once, up front, and then uses its result as the column value for every record, so you get the same data. As a workaround, copy the records into a CSV file first, then use that CSV file as the source in a Data Flow and add the randID column with a Derived Column transformation. You could also create an Azure Function that adds the randID column and invoke it from ADF.
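A minimal sketch of that script stage (an Azure Function body, for instance), assuming the copy activity produced a file named all_records.csv; the file names and the randID column name are assumptions:

import pandas as pd
import numpy as np

df = pd.read_csv("all_records.csv")

# Evaluated once per row here, unlike the ADF expression, which is
# evaluated once per copy activity run.
rng = np.random.default_rng()
df["randID"] = rng.random(len(df))

# Take 100 random records directly; sorting by randID and keeping the
# top 100 is equivalent.
sample = df.sample(n=min(100, len(df)))
sample.to_csv("random_100.csv", index=False)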

SSIS: check Excel source rows and redirect rows to another table on 'x' number of field matches

I work in a sales-based environment and our data consists of 'leads'.
Let's say we record CompanyName, PhoneNumber, Address1 & PostCode (ZIP). These rows are seeded with a unique ID in the schema.
The leads come in from various sources, are compiled onto a spreadsheet, and are then imported into SQL Server 2012 using SSIS.
After a validation check to see whether the file exists, we use a simple data flow consisting of an Excel Source, Derived Column, Data Conversion, and finally an OLE DB Destination.
I'm sure my requirement has a relatively simple solution, and I understand the first step of what I need to achieve: take a sample of data from the last rolling two months, and if 2 or more fields in the source Excel file match the corresponding fields in the destination SQL table, redirect the row to another table.
I am unsure which combination of components I could use to achieve this. I believe Fuzzy Lookup may not be what I am looking for, since I need exact field matches; I have looked at the Lookup component, but I am unsure if this is the way to go.
Could anyone please advise on how I can best achieve this as simply as possible?
You can use the Lookup to check for matches in your existing table. However, implementing the requirement of checking for any two or more matching fields will be fairly complicated: your expression would be long and complex, basically consisting of
(using pseudo code for readability)
IIF((a=a AND b=b) OR (a=a AND c=c) OR (b=b AND c=c) OR ... and so on for every combination of two columns you want to test)
I would instead import the entire spreadsheet into a staging table and do the existing-rows check in a SQL stored procedure that moves the data to the desired destination table; a sketch of that check follows.
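A sketch of that stored procedure's core query, run here via pyodbc. It replaces the long OR-of-pairs expression with a summed CASE, which is equivalent; the staging/destination table names, the DSN, and the ImportDate column are assumptions:

import pyodbc

MATCH_CHECK = """
INSERT INTO dbo.RedirectedLeads (CompanyName, PhoneNumber, Address1, PostCode)
SELECT s.CompanyName, s.PhoneNumber, s.Address1, s.PostCode
FROM dbo.StagingLeads s
WHERE EXISTS (
    SELECT 1
    FROM dbo.Leads d
    WHERE d.ImportDate >= DATEADD(month, -2, GETDATE())  -- rolling two months
      AND (CASE WHEN s.CompanyName = d.CompanyName THEN 1 ELSE 0 END
         + CASE WHEN s.PhoneNumber = d.PhoneNumber THEN 1 ELSE 0 END
         + CASE WHEN s.Address1    = d.Address1    THEN 1 ELSE 0 END
         + CASE WHEN s.PostCode    = d.PostCode    THEN 1 ELSE 0 END) >= 2
)
"""

conn = pyodbc.connect("DSN=SalesDb")  # assumed DSN
conn.cursor().execute(MATCH_CHECK)
conn.commit()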

VB.Net: compare two Access data tables for differences

I've written some code that populates a data table after looping through some information. I would then like to compare it with the information that was gathered the last time the tool was executed, and copy the differences into a new data table. Finally, the code takes a copy of the newly gathered information so it can be checked against next time. The system should basically work like this:
1. Get new information.
2. Compare against last time's information.
3. Copy the information from task 1, ready for the next time task 2 is done.
I've done some reading and a lot of INNER JOINs are being suggested, but my understanding is that an inner join returns records that are the same, not different.
How would I go about attempting this?
Update
I forgot to mention that I've already achieved steps 1 and 3 - I can store the data and copy it for the next run - but I can't do step 2, comparing the data.
How about using a SQL "Describe table" query to get the structure?
If you are talking about differences in the records contained, then you will need to keep an "old data" table alongside your current data table and do a right or left join to find data that is in one but not the other.
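That "in one but not the other" check, sketched with pandas to show the logic; the equivalent Access SQL is a LEFT JOIN filtered with WHERE the right-side key IS NULL. The ID/Value columns are assumptions:

import pandas as pd

old = pd.DataFrame({"ID": [1, 2, 3], "Value": ["a", "b", "c"]})
new = pd.DataFrame({"ID": [2, 3, 4], "Value": ["b", "c", "d"]})

# Rows gathered this run that were not present last run.
diff = new.merge(old, how="left", on=["ID", "Value"], indicator=True)
added = diff.loc[diff["_merge"] == "left_only"].drop(columns="_merge")

# Rows present last run that are missing now (the reverse join).
diff = old.merge(new, how="left", on=["ID", "Value"], indicator=True)
removed = diff.loc[diff["_merge"] == "left_only"].drop(columns="_merge")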

Derived date calculation

I am currently entering data into a SQL Server database using SSIS. The plan is for this to happen each week, but the day it happens may differ depending on when the data is pushed through.
I use SSIS to grab data from an Excel worksheet and enter each row into the database (about 150 rows per week). The only common denominator between all the rows is the date. I want to add a date to each row on the day it gets pushed through, but because the push date may differ I can't use the current date; I want to use a date one week after the previous date entered for those rows.
Because there are about 150 rows, I don't know how to achieve this. Ideally I could set this up in SQL Server so that every time a new set of rows is entered, it adds 7 days to the date of the previous set of rows, but I would also be happy to do this in SSIS.
Does anyone have any clue how to achieve this? Alternatively, I don't mind doing this in C# either.
Here's one way to do what you want:
1. Create a column for tracking the data-entry date in your target table.
2. Add an Execute SQL Task before the Data Flow Task to retrieve the latest data-entry date + 7 days. The query should be something like:
SELECT DATEADD(day, 7, MAX(trackdate)) FROM targettable
3. Assign the SQL result to a package variable.
4. Add a Derived Column transformation between your Source and Destination components in the Data Flow Task, and create a dummy column holding the tracking date from the variable.
5. When you map the Excel columns to the table in the Data Flow Task, map the dummy column created earlier to the tracking-date column. Now when you write the data to the DB, your tracking column will have the desired date.
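The same logic sketched outside SSIS (the asker also mentioned C#; the shape is identical), assuming the targettable name and trackdate column from the query above, plus an assumed connection string and Excel file name:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server"
)

# Step 2: latest tracked date + 7 days -- the value the Execute SQL Task
# would store in a package variable.
track_date = pd.read_sql(
    "SELECT DATEADD(day, 7, MAX(trackdate)) AS next_date FROM targettable",
    engine,
)["next_date"].iat[0]

# Steps 4-5: stamp every incoming row with that date (the Derived Column
# step) and append to the target table.
rows = pd.read_excel("weekly_rows.xlsx")
rows["trackdate"] = track_date
rows.to_sql("targettable", engine, if_exists="append", index=False)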