Data traceability to identify the data used for a calculation

We need to perform certain calculations on a set of transactions using custom logic (to be written in Java or Python).
The calculations will be performed on transactions for a specific period (e.g. 1st Jan to 31st 2017) and as at the time of calculation, e.g. 31-Jan-2018.
Users can add (or cancel) back-dated transactions at any time. There will be hundreds of thousands of transactions, and calculation runs can be performed multiple times for the same period.
Therefore, the business needs to know which transactions were used for which calculation run.
Does anyone know of any tools that can assist with this kind of data traceability, i.e. identifying the data that was used for a specific calculation?
I think this is difficult for any tool, as only our custom code knows which data it has used.
We are thinking of storing the transactions (just their identifiers) used by each calculation run in a database, which the business can then query with data visualisation tools. Given the volume of transactions, inserting that many records will take time (maybe hours), but that is acceptable.
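A minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for whatever database is eventually chosen. The table names (calc_run, calc_run_transaction), the function names and the sample dates and ids are all hypothetical, picked only for illustration:

```python
import sqlite3
from datetime import datetime, timezone


def record_run(conn, period_start, period_end, transaction_ids):
    """Store one calculation run and the ids of the transactions it used."""
    cur = conn.execute(
        "INSERT INTO calc_run (period_start, period_end, run_at) VALUES (?, ?, ?)",
        (period_start, period_end, datetime.now(timezone.utc).isoformat()),
    )
    run_id = cur.lastrowid
    # One link row per transaction; executemany batches the inserts.
    conn.executemany(
        "INSERT INTO calc_run_transaction (run_id, transaction_id) VALUES (?, ?)",
        ((run_id, tx_id) for tx_id in transaction_ids),
    )
    conn.commit()
    return run_id


def transactions_for_run(conn, run_id):
    """Return the sorted transaction ids that a given run used."""
    rows = conn.execute(
        "SELECT transaction_id FROM calc_run_transaction "
        "WHERE run_id = ? ORDER BY transaction_id",
        (run_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calc_run (
    run_id       INTEGER PRIMARY KEY,
    period_start TEXT NOT NULL,
    period_end   TEXT NOT NULL,
    run_at       TEXT NOT NULL
);
CREATE TABLE calc_run_transaction (
    run_id         INTEGER NOT NULL REFERENCES calc_run(run_id),
    transaction_id INTEGER NOT NULL,
    PRIMARY KEY (run_id, transaction_id)
);
""")

# Two runs over the same (illustrative) period: between them,
# transaction 102 was cancelled and back-dated transaction 104 was added.
first = record_run(conn, "2017-01-01", "2017-12-31", [101, 102, 103])
second = record_run(conn, "2017-01-01", "2017-12-31", [101, 103, 104])

added = set(transactions_for_run(conn, second)) - set(transactions_for_run(conn, first))
removed = set(transactions_for_run(conn, first)) - set(transactions_for_run(conn, second))
```

Because each run keeps its own set of ids, two runs for the same period can be compared directly, which is what the business would need when back-dated changes alter a result.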
I would appreciate it if anyone who has faced a similar problem could share their experience and how it was resolved. I am not sure there is a standard pattern, as this is probably not a common problem.
Thanks

Related

Rapidly changing large data processing advice

My team has the following dilemma, on which we need some architecture/resources advice:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large dataset that we process during the day
this "process" gets executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use SQL db to store all the data and we retrieve/update as process requires
Unsolved problems and desired goals:
since these processes are user-triggered, we never know when to scale up/down, which causes high spikes; Azure doesn't make it easy to autoscale based on demand without a data warehouse, which we want to stay away from because of its lack of aggregates and various other "buggy" issues
because of constant IO to the db we hit 100% of DTU when 1 process begins (we are using an Azure P1 DB), which of course will force us to grow even larger if multiple processes start at the same time (which is very likely)
we understand that high-compute tasks come at a cost, yet we think there is a better way to go about this (our SQL is about 99% optimized, so there is not much left to do there)
We are looking for some tool that can:
Process a large number of transactions QUICKLY
Handle constant updates to this large amount of data
Support all major aggregations
Be "reasonably" priced (I know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have a ton of experience with HDP, so any pros/cons here will certainly be useful (does the use case fit the tool??)
ArangoDB
seems promising.. Seems fast and has all aggregations we need..
Azure Data Warehouse
too many various issues we ran into, just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
It's hard to try them all and compare which one fits best, as we have a fully functional system and would have to adapt it to whichever option we choose.
Hence, any opinions beforehand are welcome, before we pull the trigger.

Fast reporting with user parameters and temp result sets

I have come across a problem with reporting from SQL Server databases using SSRS, that I wonder if you could help me with.
Suppose you have a huge amount of data in a table and want to select only the rows that meet certain criteria, which the users specify (for example, a start date and an end date). You then want to take the matching data and perform a ton of other transformations on it, producing various temporary result sets along the way (using CTEs, table variables or temp tables), to finally produce the report. This basically takes ages in SQL. You can do it, but your users might have to wait an hour or two from the moment they hit View Report until the report is rendered.
I don't know much about MDX or DAX, cubes or tabular models, but I wonder if there is a quicker way to do what I want. Note the important aspect of the problem: the user specifies criteria that have to be applied all the way back at the original table, and then various transformations (including temp result sets) have to be applied to produce the final report.
What is the best way to do this? Am I doing it the only way possible? I know it's a broad question, but I'd like to know, theoretically, what the answer is. Where should I be looking? Should I be looking at Cubes? Tabular Models? Should I be using R in SQL Server?
There is always a balance when it comes to handling large datasets. Sometimes it makes sense to do some of the work ahead of time so that on-demand reports can run in a reasonable amount of time.
Here are some general guidelines for when a model is a good option:
Many reports would be able to use common attributes from the model
The data involves aggregates, not just lists of records
The data does not need to be live
You have plenty of development and testing time
Anyone who would be using it as a data source will have to be trained on the structure and be at least slightly familiar with MDX
Another option for you to consider is to have a stored procedure that "prepares" the data for you overnight in a separate table. This table can be well indexed, because write time is not as important. The report would then point to this table to quickly retrieve the data it needs to present. This shifts most of the preparation/aggregation work out of report run time. You can of course still have parameters that limit how much of this data you pull back.
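As a rough illustration of the "prepare overnight, report from the prepared table" idea, here is a small sketch using Python's built-in sqlite3 module in place of SQL Server and a stored procedure. The table and column names (raw_sales, daily_summary) and the sample rows are made up for the example:

```python
import sqlite3


def rebuild_daily_summary(conn):
    """Batch step (e.g. nightly): recompute the pre-aggregated table from the raw rows."""
    conn.executescript("""
    DROP TABLE IF EXISTS daily_summary;
    CREATE TABLE daily_summary AS
        SELECT sale_date, COUNT(*) AS row_count, SUM(amount) AS total_amount
        FROM raw_sales
        GROUP BY sale_date;
    CREATE INDEX idx_daily_summary_date ON daily_summary (sale_date);
    """)


def report(conn, start_date, end_date):
    """Report query: the user's parameters filter only the small prepared table."""
    return conn.execute(
        "SELECT sale_date, row_count, total_amount FROM daily_summary "
        "WHERE sale_date BETWEEN ? AND ? ORDER BY sale_date",
        (start_date, end_date),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [
        ("2024-01-01", 10.0),
        ("2024-01-01", 5.0),
        ("2024-01-02", 7.5),
        ("2024-01-03", 2.5),
    ],
)

rebuild_daily_summary(conn)
rows = report(conn, "2024-01-01", "2024-01-02")
```

The expensive GROUP BY runs once in the batch step, and the on-demand query only touches the indexed summary. The trade-off, as noted in the guidelines above, is that the report is only as fresh as the last rebuild.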
Based on the little bit of information you've given us (300 million rows in a single non-normalized table), there is definitely a faster way. However, there will not be any quick solutions and you haven't provided enough information for me to give any recommendations.
I think you may need to seek some professional help to review your infrastructure and needs along with your usage and objectives so you can be pointed in the right direction.

Best way to track sales/inventory history for a POS system?

So, I'm writing a POS system, and I want it to be able to keep track of an inventory and generate reports based on past sales.
I'm pretty familiar with database design and that sort of thing, but I'm not quite sure how to approach this particular problem. The first thing I thought was to have tables that track item sales by day, week, month, and year, and then have the program keep track of how much time has elapsed so it knows when to reset these particular records. But now I'm thinking there's got to be a much simpler approach to it than that.
Another thing I thought of doing is to query the sales transaction table based on time stamps, but I'm not sure if that's a step in the right direction either.
I know that there are simpler ways of doing this for things like orders and order history with customers, but what about for the store itself, if they want to track how much product they've sold over the course of a week, month, year, etc? Is it a similar approach? Different? I can't really find anything that speaks to this particular problem.
I would go with your second thought - create a table for transactions with a timestamp, and use the timestamp to do reports (and partitions if necessary). If you know you will be querying by the timestamp very frequently, you can create an index on it to improve performance.
Whether you are tracking customer orders or store sales shouldn't make a difference in the design unless there is some major requirement difference.
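To make that concrete, here is a small sketch with Python's built-in sqlite3 module. The table name (sale), its columns and the sample data are hypothetical; the point is only that one timestamped table plus an index can answer weekly, monthly or yearly questions without separate per-period tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (
    sale_id  INTEGER PRIMARY KEY,
    item     TEXT NOT NULL,
    quantity INTEGER NOT NULL,
    sold_at  TEXT NOT NULL  -- ISO-8601 timestamp
);
CREATE INDEX idx_sale_sold_at ON sale (sold_at);
""")
conn.executemany(
    "INSERT INTO sale (item, quantity, sold_at) VALUES (?, ?, ?)",
    [
        ("apple", 3, "2024-01-05T09:30:00"),
        ("apple", 2, "2024-01-20T14:00:00"),
        ("pear", 1, "2024-01-21T10:15:00"),
        ("apple", 4, "2024-02-02T11:45:00"),
    ],
)


def units_sold(conn, period_format, start, end):
    """Total units per item per period.

    period_format is a SQLite strftime pattern, e.g. '%Y-%m' for months
    or '%Y' for years; start is inclusive and end is exclusive.
    """
    return conn.execute(
        "SELECT strftime(?, sold_at) AS period, item, SUM(quantity) "
        "FROM sale WHERE sold_at >= ? AND sold_at < ? "
        "GROUP BY period, item ORDER BY period, item",
        (period_format, start, end),
    ).fetchall()


monthly = units_sold(conn, "%Y-%m", "2024-01-01", "2025-01-01")
yearly = units_sold(conn, "%Y", "2024-01-01", "2025-01-01")
```

The same function serves every reporting period, so there is nothing to "reset" when a week or month rolls over.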
Will this be a system where store owners are autonomous or will it be a system with a load of POS terminals that report back to a central hub?
If this is for autonomous store owners you have to start worrying about things like backups and data archiving. Stuff that store owners don't really care about. If you look online you'll probably find some cloud providers that do all this POS stuff for store owners.
On the other hand the general design pattern for larger businesses I have seen is as follows:
On your POS terminals hold minimum required data that is needed at the POS terminal. Minimal reporting is required at the terminal.
Replicate all POS data to a central database server that stores and merges the data from all the POS terminals. This is your detailed operational reporting. Once data is replicated here it can be deleted from the terminal.
Often the store guys aren't too interested in the longer trends but it depends on the business.
Now you can run a report by month or year off the central database server (as can your store owners) and just summarise up to month/year in place. At this point there is no need to create summary tables.
Eventually you'll run into performance issues as data size increases.
The answer to this is not to build summary tables, because then your user/reporting system gets complicated: it has to pick the correct table.
The answer is to apply standard performance tuning techniques such as:
Improving server hardware (just adding RAM is often the most cost-effective option)
Adding indexes (including indexed views)
Implementing partitioning
Considering cubes for reporting
If this is not sufficient you might then want to consider the overhead of batch jobs that populate summary tables. But again Indexed Views can cover this off to a limited extent without requiring summary tables.
You need to understand data sizes, growth and report requirements before considering any design options such as summary tables.

Where does min/max/average calculation take place?

I have a data logging application. I record 10,000 temperatures every 30 seconds. I need to be able to calculate the min/max/average temperature of each of the 10,000 items on an hourly/daily/weekly basis. Can the min/max/average calculation be performed on the server, or does each document need to be downloaded to the client for the calculation to be performed?
Andrew
Either calculate or store a summary in the DB / on the server. Keep the original data as well, if this is important.
Calculating a summary early and sending that to the client / human level is far more efficient than trucking around 10,000 samples that nobody usually wants to drill into.
A really good summary having average, min, max & standard deviation would be statistically comprehensive for almost all purposes.
When the client really wants, then you can bring down the big dataset (10k samples) and display it.
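For what it's worth, such a summary is only a few lines with Python's standard statistics module. This is just a sketch; the function name summarize and the sample readings are made up, and it uses the population standard deviation (pstdev), which is an assumption about which flavour you want:

```python
from statistics import mean, pstdev


def summarize(samples):
    """Reduce a batch of readings to the handful of numbers a client usually needs."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
        "stdev": pstdev(samples),  # population standard deviation
    }


readings = [20.0, 22.0, 21.0, 21.0]
summary = summarize(readings)
```

The server would send the small summary dict to the client and keep the raw readings available for the rare drill-down.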
Definitely you want to calculate it on the server, but there are multiple approaches you might consider:
You could store these in specific documents that you manually update with each sample. This could work, but you would be putting a lot of stress on a single document, and it could lead to concurrency issues.
You can write a Map/Reduce index to calculate the totals. Every time you write a new document, RavenDB will update your index with the new totals. You can divide total value by total count to get an average, and you can easily use min and max functions. Since you want to view these results by different time intervals, you'll need multiple indexes.
I actually wrote a small demo program that does exactly that. Instead of temperature, it's recording PSI values from simulated pressure gauges. But the concepts are identical. There are a few shortcuts in there that you can probably pick up on if you read the comments closely.
Project Site: Raven Sensors
I wrote this when the current version of RavenDB was 2.0.2261. I haven't updated it in a while, but it should still work and be relevant.
I haven't done much with it yet, but RavenDB 2.5 added a feature called Dynamic Aggregation. It is also exposed through the studio as Dynamic Reporting. Essentially, this does the aggregation at query time. You may find it much easier to express the aggregations you are interested in, but it could be considerably slower than a map-reduce approach. You may want to experiment. The performance difference may come down to how many items are in the set being aggregated.
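To show the shape of the map/reduce approach without tying it to any RavenDB version, here is a plain-Python sketch. It is not RavenDB's index syntax; the function names and the (sensor, timestamp, value) layout are assumptions for illustration. Each reading is mapped to a partial aggregate keyed by sensor and hour, and partials are merged with a reduce step, so the average is always total divided by count:

```python
from datetime import datetime


def map_reading(reading):
    """Map step: emit a (sensor, hour) key and a partial aggregate for one reading."""
    sensor_id, timestamp, value = reading
    hour = timestamp.replace(minute=0, second=0, microsecond=0)
    return (sensor_id, hour), {"count": 1, "total": value, "min": value, "max": value}


def reduce_partials(a, b):
    """Reduce step: merge two partial aggregates that share the same key."""
    return {
        "count": a["count"] + b["count"],
        "total": a["total"] + b["total"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


def aggregate(readings):
    """Run map then reduce over all readings, one bucket per (sensor, hour)."""
    buckets = {}
    for reading in readings:
        key, partial = map_reading(reading)
        buckets[key] = reduce_partials(buckets[key], partial) if key in buckets else partial
    return buckets


readings = [
    ("s1", datetime(2024, 1, 1, 10, 0, 0), 20.0),
    ("s1", datetime(2024, 1, 1, 10, 0, 30), 24.0),
    ("s1", datetime(2024, 1, 1, 11, 0, 0), 19.0),
]
hourly = aggregate(readings)
ten_oclock = hourly[("s1", datetime(2024, 1, 1, 10))]
average = ten_oclock["total"] / ten_oclock["count"]
```

Storing count and total rather than the average itself is what lets partial results be merged; a daily or weekly view would use the same reduce step with a coarser key.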

Best way to report comparisons of one agency to the rest of the state/nation

When attempting to do some benchmarking-type reports, I run into extreme slowness due to the amount of data residing in the database, and this will get incrementally worse. I'm curious what would be considered the best approach for reports that show, for example, the percentage of patients entering the hospital within a certain date range who were there due to a specific condition, as well as how that particular hospital compares to the state percentage and the national percentage. Of course this is all based on the hospitals whose data resides in the database. I have just been writing stored procedures to calculate these percentages, but I know this isn't the best approach. I'm curious how other, more experienced reporting professionals would tackle this. I'm currently using SSRS for reporting. I know a little about SSAS, but not enough to know if I should consider it for this type of reporting.
This all depends on the data-structure and the kind of calculations you have to do.
You try to narrow down the amount of data you have to process and the complexity of the operations in every possible way.
If you have lots of data on a slow system, you first try to select the needed data, transfer it to the calculation point and then keep it cached as long as you can.
If you have huge amounts of data, you try to preprocess it as much as you can. E.g. in data warehouses you have a date table with year/month/day/day-of-week/week-of-year etc. in it, and the other tables simply reference it. This way you avoid time-consuming date calculations.
If the operations are complex, you have to analyze them to make them simpler/faster, but at this point it is impossible to predict how much room for improvement there is, if any.
It all depends on your understanding of the data-structure and processes you need them for, in order to improve everything as much as you can.
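As an illustration of the precomputed date table mentioned above, here is a short Python sketch that generates its rows. The column names are my own choice, and it uses ISO week numbering, which may or may not match your reporting conventions:

```python
from datetime import date, timedelta


def date_dimension(start, end):
    """Yield one row per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        iso_year, iso_week, iso_weekday = day.isocalendar()
        yield {
            "date": day.isoformat(),
            "year": day.year,
            "month": day.month,
            "day": day.day,
            "day_of_week": iso_weekday,  # 1 = Monday ... 7 = Sunday
            "week_of_year": iso_week,    # ISO week number
        }
        day += timedelta(days=1)


rows = list(date_dimension(date(2024, 1, 1), date(2024, 1, 7)))
```

Loaded once into the warehouse, these rows let a report group by month or week with a plain join instead of computing date parts for every fact row at query time.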
I haven't worked with SSAS myself yet. It is also a great tool, but (imho) better suited to running lots of different analyses.