I am looking for a tool or system to take a look at the database and identify values that are out of the ordinary. I don't need anything to do real time checks, just a system which does processing overnight or at scheduled points. I am looking for a system at two levels:
Database-wide: e.g. compare the salaries of all employees and identify ones that are unusually far above or below the average.
Per employee: e.g. check an employee's salary history and identify payments that are out of the ordinary for that employee.
The two above are only examples; the same applies to ATM withdrawals, shopping order history, invoice history, etc.
You could use Analysis Services and a data mining model.
Obviously you'd have to adapt the code, but here's a sample from Microsoft:
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=83
"This sample shows how the clustering algorithm can be used to perform automatic data validation through the use of the PredictCaseLikelihood() function. To exercise the sample, enter values into the form and click the submit button. If the combination of values has a reasonable likelihood, the form will accept the values. If not, additional elements of the prediction query indicate the value likely to be unacceptable. Checking the “Show Details” box on the form will show the query that was sent in addition to the probability ratios used to determine the outlying values."
I don't have MySQL installed at the moment, but I guess the first requirement can be achieved with a query similar to this (off the top of my head, not tested, it might not work at all):
SELECT name, salary FROM emp WHERE salary>(SELECT AVG(salary) FROM emp);
Or, a more complex query would be:
SELECT name, salary FROM emp
WHERE ABS(salary - (SELECT AVG(salary) FROM emp)) >
      (SELECT AVG(ABS(e2.salary - (SELECT AVG(salary) FROM emp))) FROM emp e2);
The second one selects the employees whose salaries differ from the average salary by more than the mean absolute deviation of all the salaries. Note the ABS calls: without them the average deviation from the mean is always zero, so the comparison would be meaningless.
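If you want a tunable threshold, a variant based on the standard deviation would also work (again off the top of my head and untested; the factor of 2 is an arbitrary cut-off you can adjust):

SELECT name, salary FROM emp
WHERE ABS(salary - (SELECT AVG(salary) FROM emp)) >
      2 * (SELECT STDDEV_POP(salary) FROM emp);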
Lemme know if it works.
The hard part is defining "out of the ordinary."
What you're trying to do is essentially what fraud detection software (for example, the kind used to spot money laundering) is all about. Your simple example is an easy one. The more complex cases are handled with databases, statistics, data mining, and rules engines containing lots of rules. It's not an easy problem unless you restrict yourself to the trivial case you cited.
If you manage to turn it into an easy problem, you'll be a wealthy person. Good luck.
There are different methods for finding outliers: distance-based, cluster-based, etc.
You could use Data Applied's outlier detection or clustering analytics. The first automatically finds records which are most different from their N closest neighbors. The second finds large groups (clusters) of records and identifies records which don't fit well into any cluster. They make it free for small data sets, and it's online (http://www.data-applied.com). You don't have to write code, but you can use their Web API if you want.
I am new to database design and I am trying to practice with available government statistics for a small country. I have found almost 100 tables that store information collected for given years and months from a specific region. Some tables are updated monthly, while others are updated annually. I believe this means that in each table, there will be a natural composite PK made up of the year and month, or simply a PK made up of the year.
ER Diagram
In the above image, each parent attribute of Trip Survey represents one of the many data tables I have collected from public databanks specific to the region being researched (e.g. satisfaction_level, motivation_level, amount_spent all represent different surveys on the same population). Does it make sense to combine all of the tables into one table (e.g. Trip Survey)?
I'm not sure if my relationships are accurate (total and partial participation). My goal is to be able to query the data to find points of correlation and make predictions about the future. I want to try to connect all of the tables over time.
The surveys collected can cover nearly any topic, but the common thread is they represent a moment in time, either monthly or annually. I want to eventually add a table of significant political events that may reflect outliers from trends.
Example result: when motivation levels were low in 2018, spending was also down and length of stay was shorter relative to period 'n'.
As a newbie, any and all help is greatly appreciated.
Thank you
Simplify simplify simplify.
Start with one table, with at least some columns you comprehend. Load it into some DBMS (pick one with geospatial capabilities and windowing functions, you may want them later: recent versions of MariaDB, MySQL and PostgreSQL are fine choices). Importing the table can be a pain in the neck to get right, but do your best to get it right anyhow.
Don't worry about primary keys or unique indexes when starting out. You're just exploring the data, not building it. Don't worry about buying or renting a server: most laptops can handle this kind of exploration just fine.
Pick a client program that keeps a history of the queries you put into it. HeidiSQL is a good choice. The relatively new DataGrip from JetBrains is worth a look. Avoid Microsoft's SQL Server Management Studio: it has no history feature. (You'll often want to go back to something you tried a few hours or days ago when you're exploring, so the query history feature is vital.)
Then fiddle around with queries, especially aggregates ... e.g.
SELECT COUNT(*), year, origin, destination
FROM trip
GROUP BY year, origin, destination;
Look for interesting stuff you can glean from the one table. Get the hang of it. Then add another table that can be JOINed easily to the first table. Repeat your exploration.
That should get you started. Once you begin to understand your dataset, you can start ranking stuff, working out quintiles, and all that.
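For instance, a quintile ranking with a window function might look like this (a rough sketch: it assumes your trip table also carries an amount_spent column, which may not match your real schema):

SELECT year, origin, destination,
       SUM(amount_spent) AS total_spent,
       NTILE(5) OVER (ORDER BY SUM(amount_spent)) AS spend_quintile
FROM trip
GROUP BY year, origin, destination;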
And, when you have to update or augment your data without reloading it, you'll need various primary / unique keys. That's in the future for you.
Basically we are building a reporting dashboard for our software. We are giving the Clients the ability to view basic reporting information.
Example: (I've removed 99% of the complexity of our actual system out of this example, as this should still get across what I'm trying to do)
One example metric would be the number of unique products viewed over a certain time period. For instance, if 5 products were each viewed by customers 100 times over the course of a month, a report run for that month should just say 5 for the number of products viewed.
Are there any recommendations on how to go about storing the data in such a way that it can be queried for any time range and return a unique count of products viewed? For the sake of this example, let's say there is a rule that the application cannot query the source tables directly, and we have to store summary data in a different database and query it from there.
As a side note, we have tons of other metrics we are storing, which we store aggregated by day. But this particular metric is different because of the uniqueness issue.
I personally don't think it's possible. And our current solution is that we offer 4 pre-computed time ranges where metrics affected by uniqueness are available. If you use a custom time range, then that metric is no longer available because we don't have the data pre-computed.
Your problem is that you're trying to change the grain of the fact table. This can't be done.
Your best option is what I think you are doing now - define aggregate fact tables at the grain of day, week and month to support your performance constraint.
You can address the custom time range simply by advising your users that this will be slower than the standard aggregations. For example, a user wanting to know the counts of unique products sold on Tuesdays can write a query like this, at the expense of some performance loss:
select count(distinct dim_prod.pcode) as unique_products
from fact_sale
join dim_prod on dim_prod.pkey = fact_sale.pkey
join dim_date on dim_date.dkey = fact_sale.dkey
where dim_date.day_name = 'Tuesday';
The query could also be written against a daily aggregate rather than the transactional fact, and since it would scan less data it would run faster, maybe even fast enough to meet your need.
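For example, assuming a daily aggregate table named fact_sale_day with one row per product per day (the table and column names here are assumptions, not your actual schema), the same count could be written as:

select count(distinct f.pkey) as unique_products
from fact_sale_day f
join dim_date d on d.dkey = f.dkey
where d.day_name = 'Tuesday';

The distinct count stays exact after a day-level aggregation because each product still appears at most once per day, so any custom date range can be answered from the smaller table.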
From the information that you have provided, I think you are trying to measure 'number of unique products viewed over a month (for example)'.
I'm not sure if you are using Kimball methodologies to design your fact tables, but I believe the Kimball methodology would recommend an Accumulating Snapshot fact table to meet such a requirement.
I might be preaching to the converted (apologies in that case), but if not, I would suggest the following link, where the experts explain the concept in detail:
http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I have also included another link from Kimball, which explains different types of fact tables in detail:
http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/
Hope that explains the concepts in detail. More than happy to answer any questions (to the best of my ability).
Cheers
Nithin
We have a product backed by a DB (currently Oracle, planning to support MS SQL Server as well) with several dozen tables. For simplicity let's take one table called TASK.
We have a use case when we need to present the user the number of tasks having specific criteria. For example, suppose that among many columns the TASK table has, there are 3 columns suitable for this use case:
PRIORITY- possible values LOW, MEDIUM, HIGH
OWNER - possible values are users registered in the system (can be 10s)
STATUS- possible values IDLE, IN_PROCESS, DONE
So we want to display to the user exactly how many tasks are LOW, MEDIUM, and HIGH, how many of them are owned by some specific user, and how many are in each status. Of course, the basic implementation would be to keep these counts up to date on every modification to the TASK table. What complicates the matter is that the user can additionally filter the result by criteria that may or may not include the columns mentioned above.
For example, the user might want to see those counts only for tasks that are owned by him and were created last month. The number of possible filter combinations is endless here, so needless to say maintaining up-to-date counts for all of them is impossible.
So the question is: how can this problem be solved without a serious impact on DB performance? Can it be solved solely in the database, or should we resort to other data stores, like a sparse data store? It feels like a problem that is present all over, in many companies. For example, in the Amazon store you can see the counts on categories while using arbitrary text search criteria, which means that they also calculate them on the spot instead of maintaining them up to date all the time.
One last thing: we can accept a certain functional limitation, saying that the count should be exact up to 100, but starting from 100 it can just say "over 100 tasks". Maybe this mitigation can allow us to emit more efficient SQL queries.
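For instance, the over-100 limitation could be expressed by counting over a capped subquery, something like this sketch (Oracle 12c+ FETCH FIRST syntax; the WHERE clause is just a placeholder for whatever filters the user applied):

SELECT COUNT(*) AS capped_count
FROM (
  SELECT 1
  FROM TASK
  WHERE PRIORITY = 'HIGH' -- placeholder for the user's filter criteria
  FETCH FIRST 101 ROWS ONLY
) t;

If capped_count comes back as 101, show "over 100 tasks"; otherwise show the exact number.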
Thank you!
As I understand it, you would like to have info about 3 different distributions: across PRIORITY, OWNER and STATUS. I suppose the best way to solve this problem is to maintain 3 different data sources (a SQL query, aggregated info in the DB or Redis, etc.).
The simplest way I see to calculate this data is to build a separate SQL query for each distribution. For example, for priority it would be something like:
SELECT USER_ID, PRIORITY, COUNT(*)
FROM TASK
[WHERE <additional search criteria>]
GROUP BY USER_ID, PRIORITY
Of course, it is not the most efficient approach in terms of database performance, but it keeps the counts up to date.
If you would like to store aggregated values instead, which may significantly decrease database load (depending on the row count), then you probably need to build a cube whose dimensions are the available search criteria. With this approach you can also implement the over-100 limitation.
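A minimal sketch of such a pre-aggregation, assuming the TASK columns from the question and, say, a nightly refresh (GROUP BY CUBE is supported by both Oracle and SQL Server):

SELECT PRIORITY, OWNER, STATUS, COUNT(*) AS TASK_COUNT
FROM TASK
GROUP BY CUBE (PRIORITY, OWNER, STATUS);

Filters that are not part of the cube's dimensions (such as an arbitrary creation-date range) would still have to fall back to the base table.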
I'm reading an article explaining the nested-loop join algorithm, and I don't exactly understand the actual working principle of nested selects. Here is an example provided by the article:
The examples search for employees whose last name starts with 'WIN'
and fetches all SALES for these employees.
And the queries representing the nested-loop join are these:
select employees0_.subsidiary_id as subsidiary1_0_
-- MORE COLUMNS
from employees employees0_
where upper(employees0_.last_name) like ?;
select sales0_.subsidiary_id as subsidiary4_0_1_
-- MORE COLUMNS
from sales sales0_
where sales0_.subsidiary_id=?
and sales0_.employee_id=?;
select sales0_.subsidiary_id as subsidiary4_0_1_
-- MORE COLUMNS
from sales sales0_
where sales0_.subsidiary_id=?
and sales0_.employee_id=?;
As you can see, the last two queries are exactly the same. This is what I'm confused by. Why isn't generating the first two queries enough? Why do we have to generate the third one?
Bear in mind that the code you pasted is the referenced article's example of what not to do – an anti-pattern.
That said, the queries are parameterized, and therefore not actually identical: the ? placeholders in each query are parameters that are bound to a different subsidiary_id/employee_id pair in each iteration of the for loop.
It is not necessary to generate the third query. If you write the SQL manually, you can load all sales for all retrieved employees in a single query. But the "N+1 query" anti-pattern arises when the program code looks like it does in the article:
for (Employees e : emp) {
    // process Employee
    for (Sales s : e.getSales()) {
        // process sale for Employee
    }
}
In that code, the e.getSales() method loads the data for one employee. It does not have enough information to load the sales data for all the other employees too, because the ORM does not have the full list of employees for which sales data needs to be loaded. So the ORM is forced to load each employee's sales data in a separate query.
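For comparison, the hand-written single query mentioned above might look something like this (a sketch based on the article's tables and columns; joining on the same LIKE filter is just one of several ways to batch it):

select s.*
from sales s
join employees e
  on e.subsidiary_id = s.subsidiary_id
 and e.employee_id = s.employee_id
where upper(e.last_name) like 'WIN%';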
Some ORMs can avoid the "N+1 query" problem automatically. For example, in PonyORM (written in Python) the code from the article would look like this:
# the first query loads all necessary employees
employees = select(e for e in Employee if e.lastName.startswith('WIN'))
for e in employees:
    # process Employee
    for sale in e.sales:
        # process sale for Employee
When the program starts looping over the employee query, PonyORM loads all the necessary employees at once. When the sales items for the first employee are requested, PonyORM loads them for this employee only (because the ORM doesn't know our intention and supposes that maybe we need sales data for the first employee only). But when the sales data of the second employee is requested, PonyORM notices the "N+1 query" anti-pattern, sees that we have N employee objects loaded in memory, and loads the sales for all remaining employees in a single query. This behavior can be considered a heuristic: it may load some extra sales objects if our for loop contains a break statement. But typically this heuristic leads to better performance, because it can drastically reduce the number of queries. It is usually not a problem to load some extra data; it is much more important to reduce the number of round trips to the server.
I am working on a project which involves a payroll system for Employees. My main goal is to retrieve the actual work hours for each employee, however I am a bit confused about how I should proceed with the data I have been given. The database I am using is non-relational and it is used to calculate the financial transactions for the company involved. I'm supposed to build a BI-solution using staging tables, dimensions and a data warehouse.
These are the main tables I have to work with:
Timetable
Employee
Transaction
Deviation
I have timetables in the database which give me the actual schedule for each employee, in hours, so calculating the hours they are supposed to work is no problem. In Transaction I can see how much each employee earns, and in Deviation I can see if any abnormalities occur, for example if an employee is ill or on holiday. It also states how much is added to and deducted from the monthly salary (as well as the unit count).
My theory is that I use the transaction/deviation database and compare the results to the actual work schedule - this way I will know if the employee has worked more or less than planned.
Am I on the right track or is there another way of doing this?
I have just started with BI so I appreciate any help I can get!
It sounds like you are on the right path, but really you should confirm the plan with a data expert familiar with the payroll database.
To keep that simple, dummy up some results in Excel first (say, pick a random person from the database) and do the calculations to get their actual hours. Take that to the data expert and get them to confirm whether it is correct, or whether there are exceptions where this business rule does not apply.
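Once the rule is confirmed, a first-cut version of the comparison in the warehouse might look like this sketch (every table and column name here is an assumption based on the description, not your real schema):

SELECT t.employee_id,
       t.period,
       t.scheduled_hours,
       t.scheduled_hours - COALESCE(SUM(d.deviation_hours), 0) AS actual_hours
FROM timetable t
LEFT JOIN deviation d
       ON d.employee_id = t.employee_id
      AND d.period = t.period
GROUP BY t.employee_id, t.period, t.scheduled_hours;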