I have a requirement to calculate productivity in our issue tracking software (Jira). The idea is to capture data as an issue progresses through different development stages (like an audit trail of a specific field on the issue).
Now I want to build a data model that lets me capture metrics such as the average time it took issues to move between two stages (e.g. In Progress > UAT), the average time per developer, and so on.
That audit trail view would give me data in this format:
+----------+----------+-----------+-------------+--------------------+
| Audit ID | Issue ID | Developer | Issue-stage | Data_Update_dt     |
+----------+----------+-----------+-------------+--------------------+
| A001     | 101      | D01       | In Progress | 31-May-17 00:25:00 |
| A002     | 101      | D01       | UAT         | 31-May-17 06:25:00 |
+----------+----------+-----------+-------------+--------------------+
I am trying to understand how to design this so that I can calculate the difference between A002 and A001, i.e. the time required to move from In Progress to UAT. What is the best way to do it?
Please advise.
I would create a dimensional model/star schema that has an 'accumulating snapshot fact' as its central fact, with one row per issue as it moves through the system.
Accumulating Snapshot Facts
In the fact table you'd include key dates/times for each stage. You'd also be able to add 'lag' measures to the fact, precalculated as the gap between one stage and the next, and/or time spent in a stage.
The fact would be surrounded by dimensions for dates, times and developers.
Then you'd be able to calculate averages on those lags and be able to analyse by developer.
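A minimal sketch of what that fact table and a query on it could look like, assuming hypothetical stage columns and a surrogate developer key (adjust to your actual workflow stages):

-- Hypothetical accumulating snapshot fact: one row per issue,
-- updated in place as the issue reaches each stage.
CREATE TABLE FactIssuePipeline (
    IssueKey             INT          NOT NULL PRIMARY KEY,
    IssueID              VARCHAR(20)  NOT NULL,
    DeveloperKey         INT          NOT NULL,  -- FK to DimDeveloper
    InProgressDateKey    INT          NULL,      -- FK to DimDate
    InProgressTime       TIME         NULL,
    UATDateKey           INT          NULL,
    UATTime              TIME         NULL,
    DoneDateKey          INT          NULL,
    DoneTime             TIME         NULL,
    -- lag measures, precalculated when the later stage is reached
    InProgressToUATHours DECIMAL(9,2) NULL,
    UATToDoneHours       DECIMAL(9,2) NULL
);

-- Average time from In Progress to UAT, analysed by developer
SELECT d.DeveloperID, AVG(f.InProgressToUATHours) AS AvgHours
FROM FactIssuePipeline f
JOIN DimDeveloper d ON d.DeveloperKey = f.DeveloperKey
WHERE f.InProgressToUATHours IS NOT NULL
GROUP BY d.DeveloperID;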
I would like to know how I can get the size (in GB) of my kube-audit log ingested on a daily basis. Is there a KQL query I can run in my Log Analytics workspace to find that out?
The reason I want this is that I would like to calculate the Azure consumption. Thanks.
By using the Usage table, it is possible to review how much data was ingested into a Log Analytics workspace.
The scope spans from solutions down to data types (which usually, but not always, correlates to the destination table).
Kube-audit is exportable by default only to the AzureDiagnostics table, a table shared among many Azure resources, hence it is impossible to differentiate the source of each record within the total count.
For example, I've been using the following query to review how much data was ingested at the scope of my AzureDiagnostics table in the last 10 days:
Usage
| where TimeGenerated > startofday(ago(10d))
| where DataType == 'AzureDiagnostics'
| summarize IngestedGB = sum(Quantity) / 1000 by bin(TimeGenerated, 1h)
| render timechart
In my case all the data originated from Kube-audit logs, but that won't be the case for most users:
AzureDiagnostics
| where TimeGenerated > startofday(ago(10d))
| summarize count() by bin(TimeGenerated, 1h), Category
| render timechart
Say data needs to be kept for 2 years. Then all data that was created 2 years + 1 day ago should no longer be displayed and should be deleted from the server. How do you manually test that?
I'm new to testing and I can't think of any other ways. Also, we cannot do automation due to time constraints.
You can create data in the database backdated by more than two years and test whether it is deleted automatically. Alternatively, you can change the current business date in the database and test it that way.
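For example, a minimal sketch of the backdating approach in SQL Server, assuming a hypothetical user_search table with a created_date column:

-- Insert a record backdated just past the retention boundary (2 years + 1 day ago)
INSERT INTO user_search (search_keyword, created_date)
VALUES ('old search', DATEADD(DAY, -1, DATEADD(YEAR, -2, GETDATE())));

-- After the purge process has run, this should return no rows
SELECT *
FROM user_search
WHERE created_date < DATEADD(YEAR, -2, GETDATE());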
For the data retention functionality, a manual tester needs to remember the search data so that they can perform the test cases for the search retention feature.
Taking a social networking app as an example, as a manual tester you need to remember all the users that you searched for recently.
To check the retention period, you can get help from a backend developer so that they can shorten the period (e.g. from one year to 10 minutes) for testing purposes.
Even if you delete the search history and then start typing an already-entered search term, the related result should appear in the first position of the search results. Data retention policies concern what data should be stored or archived, where that should happen, and for exactly how long. Once the retention period for a particular data set expires, the data can be deleted or moved as historical data to secondary or tertiary storage, depending on the requirement.
Let's understand this with an example: we have the data below in our database table, based on past searches made by users. With the help of this table you can perform the testing with minimum effort and optimum results. The current date is '2022-03-10', and the Status column states whether the data is available in the database: Visible means available, while Expired means deleted from the table.
+----------------+----------------+--------------------+-------------------+
| Search Keyword | Search On Date | Search Expiry Date | Status            |
+----------------+----------------+--------------------+-------------------+
| sport          | 2022-03-05     | 2024-03-04         | Visible           |
| cricket news   | 2020-03-10     | 2022-03-09         | Expired - Deleted |
| holy books     | 2020-03-11     | 2022-03-10         | Visible           |
| dance          | 2020-03-12     | 2022-03-11         | Visible           |
+----------------+----------------+--------------------+-------------------+
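A simple way to cross-check the table above, assuming it is stored as a hypothetical search_history table, is to derive the expected status from the expiry date and compare it with what the application actually shows:

-- Rows whose expiry date is before the current date (2022-03-10) should be gone
SELECT search_keyword,
       search_expiry_date,
       CASE WHEN search_expiry_date < '2022-03-10'
            THEN 'Expected: Expired - Deleted'
            ELSE 'Expected: Visible'
       END AS expected_status
FROM search_history;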
This is a best practice / alternative approach question about using an ADO Enumerator ForEach loop.
My data is financial accounts, coming from a source system into a data warehouse.
The current structure of the data is a list of financial transactions, e.g.:
+-----------------------+----------+-----------+------------+------+
| AccountGUID | Increase | Decrease | Date | Tags |
+-----------------------+----------+-----------+------------+------+
| 00000-0000-0000-00000 | 0 | 100.00 | 01-01-2018 | Val1 |
| 00000-0000-0000-00000 | 200.00 | 0 | 03-01-2018 | Val3 |
| 00000-0000-0000-00000 | 400.00 | 0 | 06-01-2018 | Val1 |
| 00000-0000-0000-00000 | 0 | 170.00 | 08-01-2018 | Val1 |
| 00000-0000-0000-00002 | 200.00 | 0 | 04-01-2018 | Val1 |
| 00000-0000-0000-00002 | 0 | 100.00 | 09-01-2018 | Val1 |
+-----------------------+----------+-----------+------------+------+
My SSIS package currently has two ForEach loops:
All Time Balances
End Of Month Balances
All Time Balances
Passes the AccountGUID into the loop and selects all transactions for that account. It then orders them by date, with the earliest transaction first, and assigns each a sequence number.
Once the sequence numbers are assigned, it calculates the running balances based on the Increase and Decrease columns, along with the Tags column to work out which balance it is dealing with.
It finishes by flagging the latest record as Current.
All Time Balances - Work Flow
->Get All Account ID's in Staging table
|-> Write all Account GUID's to object variable
|--> ADO Enumerator ForEach - Loop Account GUID List - Write GUID to variable
|---> (Data Flow) Select all transactions for Account GUID
|----> (Data Flow) Order all transactions by date and assign Sequence number
|-----> (Data Flow) Run each row through a script component transformation to calculate running totals for each record
|------> (Data Flow) Insert balance data into staging table
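For reference, the running-total step described above can also be expressed set-based with a window function (SQL Server 2012+), which avoids the row-by-row script component; a minimal sketch, assuming the staging table is called StagingTransactions and has the columns shown earlier:

-- Sequence number per account, and running balance per account and tag,
-- computed in one set-based pass instead of a per-account loop
SELECT
    AccountGUID,
    Tags,
    [Date],
    ROW_NUMBER() OVER (PARTITION BY AccountGUID ORDER BY [Date]) AS SequenceNumber,
    SUM(Increase - Decrease) OVER (PARTITION BY AccountGUID, Tags
                                   ORDER BY [Date]
                                   ROWS UNBOUNDED PRECEDING) AS RunningBalance
FROM StagingTransactions;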
End Of Month Balances
The second package, End of Month, does something very similar, with the exception of a second loop. The select finds the earliest transactional record and the latest transactional record. Using those two dates it works out all the months between them and loops over each of those months.
Inside the date loop it does pretty much the same thing: works out the balances based on tags and stamps the end-of-month record for each account.
The Issue/Question
All of this currently works fine, but the performance is horrible.
In one database with approx. 8,000 accounts and 500,000 transactions, this process takes upwards of a day to run. This being one of our smaller clients, I tremble at the idea of running it against our heavier databases.
Is there a better approach to doing this, using SQL cursors or some other neat way I have not seen?
Ok, so I have managed to take my package execution from around 3 days to about 11 minutes all up.
I ran a profiler and standard windows stats while running the loops and found a few interesting things.
Firstly, there was almost no utilization of HDD, CPU, RAM or network during the execution of the packages. It told me what I kind of already knew, that it was not running as quickly as it could.
What I did notice was that between each execution of the loop there was a 1 to 2 ms delay before the next instance of the loop started executing.
Eventually I found that every time a new instance of the loop began, SSIS created a new connection to the SQL database; it appears that this is SSIS's default behaviour. Whenever you create a Source or Destination, you are adding a connection delay to your project.
The Fix:
Now this was an odd fix: you need to go into your connection manager (the odd bit is that it must be the one in the on-screen window, not in the right-hand project manager window).
If you select the connection that is referenced in the loop, then in the properties window on the right side (in my layout, anyway) you will see an option called "RetainSameConnection", which by default is set to False.
By setting this to true, I eliminated the 2ms delay.
Considerations:
In doing this I created a heap of other issues, which really just highlighted areas of my package that I had not thought out well.
Some things that appeared to be impacted by this change were stored procedures that used temp tables; these seemed to break instantly. I assume that is because of how SQL handles temp tables: when the connection is closed and reopened, you can be pretty certain that the temp table is gone. With the connection retained, colliding with leftover temp tables appears to become an issue again.
I removed all temp tables and replaced them with CTEs, which appears to have fixed the issue.
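As an illustration of that swap, a hypothetical before/after using made-up names:

-- Before: intermediate result in a temp table, sensitive to the connection being reused
-- SELECT AccountGUID, SUM(Increase - Decrease) AS Balance
-- INTO   #AccountBalance
-- FROM   StagingTransactions
-- GROUP  BY AccountGUID;

-- After: the same intermediate result as a CTE, scoped to the single statement
WITH AccountBalance AS (
    SELECT AccountGUID, SUM(Increase - Decrease) AS Balance
    FROM   StagingTransactions
    GROUP  BY AccountGUID
)
SELECT *
FROM   AccountBalance;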
The second major issue I found was with tasks that ran in parallel and both used the same connection manager. From this I received an error that SQL was still trying to run the previous statement, which bombed out my package.
To get around this, I created a duplicate connection manager (All up I made three connection managers for the same database).
Once I had my connections set up, I went into each of my parallel Source and Destinations and assigned them their own connection manager. This appears to have resolved the last error I received.
Conclusion:
There may be more unforeseen issues in doing this, but for now my packages are lightning quick, and the exercise highlighted some faults in my design.
So I'm developing a database for an agency that manages many relief staff.
Relief workers set their availability for each day in one of three categories (day, evening, night).
We also need to be able to set some part-time relief workers as busy on weekly, biweekly, and in one instance, on a 9-week rotation. Since we're already developing recurring patterns of availability here, we might as well also give the relief workers the option of setting recurring availability days.
We also need to be able to query the database, and determine if an employee is available for a given day.
But here's the gotcha - we need to be able to use change data capture. So I'm not sure if calculating availability is the best option.
My SQL prototype table looks like this:
TABLE Availability Day
employee_id_fk | workday (DATETIME) | day | eve | night (all booleans) | worksite_code_fk (can be null)
I'm really struggling to wrap my head around recurring events. I could create, say, a year's worth of availability days following a pattern on an 'x'-day cycle. But how far ahead of time do we store information? I can see us running into problems when we reach the end of the data set.
I was thinking of storing, say, 6 months of information, then adding a server-side task that runs monthly to keep the tables topped up with 6 months of data, but my intuition is telling me this is a bad fix.
For absolute flexibility in the future, and to keep the data from bloating, my first thought would be something like:
Calendar Dimension Table - make it for 100 years or whatever you want; include day-of-week information etc.
Time Dimension Table - hour, minutes, every 15 minutes, whatever granularity you need, but only for a 24-hour period.
Shifts Table - 1 record per shift, e.g. Day, Evening, and Night.
Specific Availability Table - related to Calendar & Time with starts & stops; I recommend 1 record per day, so even if they choose a range of 7 days, split that into 1 record per day and 1 record per shift.
Recurring Availability Table - for day of week (1-7), month, week of year, whatever you can think of. Again I am thinking 1 record per value, so if they are available Mondays and Tuesdays that would be 2 rows, and if there are multiple shifts then it would be multiple rows.
Now, and here is perhaps the weird part, I would put an Available column on the Specific and Recurring Availability tables, maybe make it a tinyint and store something like 0 = not available, 1 = available, 2 = maybe available, 3 = available with notice.
If you want to take availability with notice into account, you could add columns for that too, such as x number of days. If you want full flexibility, maybe that becomes a related table too.
The queries would be complex but you could use a stored procedure or a table valued function to handle it fairly routinely.
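A rough sketch of what such a lookup could look like, assuming simplified, made-up table and column names (the real tables would carry the extra columns described above):

-- Is a given employee available for a given shift on a given date?
-- A specific entry, if present, overrides the recurring pattern.
DECLARE @EmployeeID INT  = 42,
        @ShiftID    INT  = 1,            -- e.g. Day
        @TheDate    DATE = '2024-07-01';

SELECT COALESCE(sa.Available, ra.Available, 0) AS Available
FROM   Calendar c
LEFT JOIN SpecificAvailability  sa ON sa.EmployeeID = @EmployeeID
                                  AND sa.ShiftID    = @ShiftID
                                  AND sa.DateKey    = c.DateKey
LEFT JOIN RecurringAvailability ra ON ra.EmployeeID = @EmployeeID
                                  AND ra.ShiftID    = @ShiftID
                                  AND ra.DayOfWeek  = c.DayOfWeek
WHERE  c.CalendarDate = @TheDate;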
I am having trouble coming up with a good way to store a dataset that continually changes.
I want to track and periodically report on the contents of specific websites. For example, for a certain website I want to keep track of all the PDF documents that are available. Then I want to report periodically (say, quarterly) on the number of documents, PDF version numbers and various other statistics. In addition, I want to track the change in these metrics over time, e.g. I want to graph the increase in PDF documents offered on the website over time.
My input is basically a long list of URLs that point to all the PDF documents on the website. These inputs arrive intermittently, but they may not coincide with the dates I want to run the reports on. For example, in Q4 2010 I may get two lists of URLs, several weeks apart. In Q1 2011 I may get just one.
I am having trouble figuring out how to efficiently store this input data in a database of some sorts so that I can easily generate the correct reports.
On the one hand, I could simply insert the complete list into a table each time I receive a new list, along with a date of import. But I fear that the table will grow quite big in a short time, and most of it will be duplicate URLs.
But, on the other hand I fear that it may get quite complicated to maintain a list of unique URLs or documents. Especially when documents are added, removed and then re-added over time. I fear I might get into the complexities of creating a temporal database. And I shudder to think what happens when the document itself is updated but the URL stays the same (in that case the metadata might change, such as the PDF version, file size, etcetera).
Can anyone recommend me a good way to store this data so I can generate reports from it? I would especially like to have the ability to retroactively generate reports. E.g, when I want to track a new website in Q1 2011, I would like to be able to generate a report from both the Q4 2010 data as well, even though the Q1 2011 data has already been imported.
Thanks in advance!
Why not just a single table, called something like URL_HISTORY:
URL VARCHAR (PK)
START_DATE DATE (PK)
END_DATE DATE
VERSION VARCHAR
Have END_DATE as either NULL or a suitable dummy date (eg. 31-Dec-9999) where the version has not been superseded; set END_DATE to the last valid date where the version has been superseded, and create a new record for the new version - eg.
+------------------+-------------+--------------+---------+
| URL              | START_DATE  | END_DATE     | VERSION |
+------------------+-------------+--------------+---------+
| ..\Harry.pdf     | 01-OCT-2009 | 31-DEC-9999  | 1.1.0   |
| ..\SarahJane.pdf | 01-OCT-2009 | 31-DEC-2009  | 1.1.0   |
| ..\SarahJane.pdf | 01-JAN-2010 | 31-DEC-9999  | 1.1.1   |
+------------------+-------------+--------------+---------+
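Retroactive reporting then becomes a simple 'as-of' query; a small sketch, assuming the table above:

-- Documents that were live on the website as of the end of Q4 2010
SELECT COUNT(*) AS DocumentCount
FROM   URL_HISTORY
WHERE  START_DATE <= '2010-12-31'
AND    END_DATE   >= '2010-12-31';  -- the 31-Dec-9999 dummy keeps current versions in scope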
What about using a document database? Instead of saving each URL, you save a document that has a collection of URLs. Then, whenever you execute whatever process iterates over all the URLs, you get all of the documents that exist in a given time frame (or whatever qualification you have) and run over the URLs in each of those documents.
This could also be emulated in SQL Server by serializing your object to JSON or XML and storing the output in a suitable column.
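A minimal sketch of that emulation, assuming SQL Server 2016+ (for OPENJSON) and a made-up snapshot table:

-- One row per imported list; the URLs are kept as a JSON array in one column
CREATE TABLE UrlSnapshot (
    SnapshotID INT IDENTITY PRIMARY KEY,
    ImportDate DATE          NOT NULL,
    UrlsJson   NVARCHAR(MAX) NOT NULL  -- e.g. '["https://example.com/a.pdf","https://example.com/b.pdf"]'
);

-- Shred the array back into rows when you need to iterate over the URLs
SELECT s.ImportDate, j.[value] AS Url
FROM   UrlSnapshot s
CROSS APPLY OPENJSON(s.UrlsJson) AS j
WHERE  s.ImportDate <= '2010-12-31';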