I am working on a project which involves a payroll system for employees. My main goal is to retrieve the actual work hours for each employee, but I am a bit confused about how to proceed with the data I have been given. The database I am using is non-relational and is used to calculate the financial transactions for the company involved. I'm supposed to build a BI solution using staging tables, dimensions and a data warehouse.
These are the main tables I have to work with:
Timetable
Employee
Transaction
Deviation
The timetables in the database give me the actual schedule for each employee, in hours, so calculating the hours they are supposed to work is no problem. In Transaction I can see how much each employee earns, and in Deviation I can see whether any abnormalities occur - for example if an employee is ill or on holiday. It also states how much is added to or deducted from the monthly salary (as well as the unit count).
My theory is that I use the Transaction/Deviation tables and compare the results to the actual work schedule - this way I will know whether the employee has worked more or less than planned.
Am I on the right track or is there another way of doing this?
I have just started with BI so I appreciate any help I can get!
That sounds like you are on the right path, but really you should be confirming the plan with a data expert familiar with the payroll database.
To make that simple, dummy up some results in Excel first (say pick a random person from the database) and do the calculations to get the actual hours. Take that to the data expert and get them to confirm if this is correct, or perhaps there are exceptions where this business rule does not apply.
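To make the dummy-run concrete, here is a minimal sketch of that comparison in Python. The employee IDs, the scheduled-hours figures, and the idea that Deviation rows carry signed hour adjustments are all assumptions for illustration, not the real payroll schema - this is exactly the kind of calculation to take to the data expert for confirmation:

```python
# Scheduled hours per employee for one month, as they might come from the
# Timetable table (hypothetical IDs and values).
scheduled = {"emp_001": 160, "emp_002": 152}

# Deviations: signed adjustments in hours. Negative = absence (illness,
# holiday), positive = extra hours worked. Assumed shape, not the real table.
deviations = [
    ("emp_001", -8),   # one day ill
    ("emp_001", +4),   # overtime
    ("emp_002", -16),  # two days holiday
]

def actual_hours(scheduled, deviations):
    """Actual hours = scheduled hours plus the sum of signed deviations."""
    actual = dict(scheduled)
    for emp, delta in deviations:
        actual[emp] = actual.get(emp, 0) + delta
    return actual

worked = actual_hours(scheduled, deviations)
for emp, hours in sorted(worked.items()):
    # Positive difference means the employee worked more than planned.
    print(emp, hours, "difference vs plan:", hours - scheduled.get(emp, 0))
```

If the numbers for a hand-picked employee match what payroll actually paid, the business rule holds; if not, you have found one of the exceptions the data expert should explain.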
I am new to database design and I am trying to practice with available government statistics for a small country. I have found almost 100 tables that store information collected for given years and months from a specific region. Some tables are updated monthly, while others are updated annually. I believe this means that in each table, there will be a natural composite PK made up of the year and month, or simply a PK made up of the year.
ER Diagram
In the above image, each parent attribute of Trip Survey represents one of the many data tables I have collected from public databanks specific to the region being researched (e.g. satisfaction_level, motivation_level, amount_spent all represent different surveys on the same population). Does it make sense to combine all of the tables into one table (e.g. Trip Survey)?
I'm not sure if my relationships are accurate (total and partial participation). My goal is to be able to query the data to find points of correlation and make predictions about the future. I want to try to connect all of the tables over time.
The surveys collected can cover nearly any topic, but the common thread is they represent a moment in time, either monthly or annually. I want to eventually add a table of significant political events that may reflect outliers from trends.
Example Result: When motivation levels were low in 2018, spending was also down and length of stay was shorter relative to 'n' period.
As a newbie, any and all help is greatly appreciated.
Thank you
Simplify simplify simplify.
Start with one table, with at least some columns you comprehend. Load it into some DBMS (pick one with geospatial capabilities and windowing functions - you may want them later; recent versions of MariaDB, MySQL and PostgreSQL are fine choices). Import your table. This can be a pain in the neck to get right, but do your best to get it right anyhow.
Don't worry about primary keys or unique indexes when starting out. You're just exploring the data, not building it. Don't worry about buying or renting a server: most laptops can handle this kind of exploration just fine.
Pick a client program that keeps a history of the queries you put into it. HeidiSQL is a good choice. The relatively new DataGrip from JetBrains is worth a look. Avoid Microsoft's SQL Server Management Studio: it has no history feature. (You'll often want to go back to something you tried a few hours or days ago when you're exploring, so the query-history feature is vital.)
Then fiddle around with queries, especially aggregates ... e.g.
SELECT COUNT(*), year, origin, destination
FROM trip
GROUP BY year, origin, destination;
Look for interesting stuff you can glean from the one table. Get the hang of it. Then add another table that can be JOINed easily to the first table. Repeat your exploration.
That should get you started. Once you begin to understand your dataset, you can start ranking stuff, working out quintiles, and all that.
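For instance, the ranking-and-quintiles step can be sketched with a window function. This assumes a DBMS with window-function support (SQLite 3.25+ is used here purely so the example is self-contained; the trip table and its columns are made up, not the asker's real data):

```python
import sqlite3

# Toy trip table: which origin spent how much in a given year (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trip (year INT, origin TEXT, amount_spent REAL)")
conn.executemany(
    "INSERT INTO trip VALUES (?, ?, ?)",
    [(2018, "A", 100), (2018, "B", 250), (2018, "C", 400),
     (2018, "D", 550), (2018, "E", 700)],
)

# NTILE(5) assigns each row to one of five spending quintiles.
rows = conn.execute(
    """
    SELECT origin, amount_spent,
           NTILE(5) OVER (ORDER BY amount_spent) AS spend_quintile
    FROM trip
    ORDER BY amount_spent
    """
).fetchall()
for origin, spent, quintile in rows:
    print(origin, spent, quintile)
```

The same query runs essentially unchanged on MariaDB, MySQL 8+ and PostgreSQL once the real table is loaded.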
And, when you have to update or augment your data without reloading it, you'll need various primary / unique keys. That's in the future for you.
I am designing a database using Microsoft Access for capacity management. I have different types of equipment and need to track the power usage every month. For instance, I have an HVAC system and a Chiller system. Within each system there are different pieces of equipment, like AHU_1, AHU_2, AHU_3, MAU_1, MAU_2, etc. in the HVAC system and CHWP_1, CHWP_2, CWP_1, CWP_2, etc. in the Chiller system.
I need to track the power usage every month. For each system I have a separate table containing its respective equipment. What would be a suitable way to track the usage? I believe there are three options, as in the picture below:
Creating a main table called Chiller_usage which has all the equipment and dates with usage values. The problem I see is that each piece of equipment will be repeated for every date; the pro is that there aren't many tables.
Creating a table per piece of equipment, each with dates and usage. The problem is I have around 60 to 70 pieces of equipment across 5 different major systems, which would lead to a massive number of tables and be very difficult when making queries and reports.
Creating a table per date, with equipment and usage values. This looks promising for now because I will have few tables initially, but as time goes on there will be 12 new tables each year, which is a lot in the long run.
What I'm leaning towards is the first option, since it is easy to manage when making custom queries - I need to perform calculations for costing and usage analysis of each piece of equipment, with graphs and so on. But I believe it will be clumsy because the equipment names repeat across dates. Are there any other viable options? Thank you.
Assuming you need to store monthly energy usage for each piece of equipment: normalize the tables. Neither the person entering the data nor the manager asking for reports needs to see the complexity of the underlying tables. The person entering the data sees a form for adding systems/equipment and a form for entering energy usage per piece of equipment per month. The manager sees a report with just the information he wants, like energy costs per system per year.
The normalized tables can be recombined into human-readable tables with queries. Access tries to make building the forms and reports as simple as clicking on the appropriate queries and then clicking Create Form/Report - in practice some assembly is required. Behind the scenes, the forms put the correct data in the correct tables, and the report shows only the data the manager wants. For instance, here is a normalized table structure based on the data you provided and some assumptions:
The tables are normalized and have the relationships shown. Beneath there is a query to show the total power each system uses between any two dates such as for a yearly report. So data looking like this:
is turned into this:
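The normalized design described above can be sketched in SQL. This uses SQLite rather than Access so the example is self-contained, and the table and column names (System, Equipment, MonthlyUsage, kwh) are assumptions based on the answer, not the asker's real schema:

```python
import sqlite3

# One row per system, one row per piece of equipment, one row per
# equipment-month of usage: no repeated equipment names, no per-year tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE System    (system_id INTEGER PRIMARY KEY, system_name TEXT);
CREATE TABLE Equipment (equipment_id INTEGER PRIMARY KEY,
                        system_id INTEGER REFERENCES System(system_id),
                        equipment_name TEXT);
CREATE TABLE MonthlyUsage (equipment_id INTEGER REFERENCES Equipment(equipment_id),
                           usage_month TEXT,   -- e.g. '2020-01'
                           kwh REAL);
""")
conn.execute("INSERT INTO System VALUES (1, 'HVAC'), (2, 'Chiller')")
conn.execute("INSERT INTO Equipment VALUES "
             "(1, 1, 'AHU_1'), (2, 1, 'MAU_1'), (3, 2, 'CHWP_1')")
conn.executemany("INSERT INTO MonthlyUsage VALUES (?, ?, ?)",
                 [(1, '2020-01', 500), (2, '2020-01', 300),
                  (3, '2020-01', 800), (1, '2020-02', 450)])

# Total power per system between any two months -- the "yearly report" query.
rows = conn.execute("""
SELECT s.system_name, SUM(u.kwh) AS total_kwh
FROM System s
JOIN Equipment e    ON e.system_id = s.system_id
JOIN MonthlyUsage u ON u.equipment_id = e.equipment_id
WHERE u.usage_month BETWEEN '2020-01' AND '2020-12'
GROUP BY s.system_name
""").fetchall()
print(rows)
```

Adding a new piece of equipment is a single row in Equipment, and every existing query and report picks it up automatically.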
I'm creating an inventory management and sales tool for an e-commerce site. I'm somewhat new to programming, and I'm curious what the best way is to keep track of totals. For example, this company sells roughly 200 products a day, and I would like to keep track of the total amount of products sold in dollars and in units, and eventually graph this data. I would like to be able to graph a month's worth of these numbers (May 14: 145 units sold, $14,545, $2,000 profit; May 15: etc.). What is the best way of doing this?
I thought about creating a totals table where every time a new order comes in, the order value is added to the previous total, but this seems like it could get cloudy quickly if an order doesn't get logged.
Selecting everything and adding up the totals for each day of the month seems like it would be bad performance-wise.
What options do I have and what do you recommend as the best solution?
I recommend against creating a totals table. While building a report that summarizes the totals from the transactional data may seem like it would cause a performance problem, in practice it might not be nearly as bad as you think. Two hundred orders per day over thirty days really isn't all that many records for most modern relational database systems.
If you did run into a significant performance issue with this one report, one thing you could do would be to run the report during any off-hours that the business may have and then cache the results of that run in a table for use when someone wants to view the report. However, before going to that trouble I recommend just trying out what was mentioned above and see if performance is really even that much of an issue.
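As a rough sketch of the "just aggregate the transactional data" approach, assuming a hypothetical orders table (the column names and figures are made up to match the question's example):

```python
import sqlite3

# One row per order; daily totals are computed on demand, never stored.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (order_date TEXT, units INT,
                                     revenue REAL, profit REAL)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [("2019-05-14", 100, 10000.0, 1500.0),
                  ("2019-05-14", 45, 4545.0, 500.0),
                  ("2019-05-15", 80, 8000.0, 1200.0)])

# One row per day: units sold, dollars, profit -- ready to graph. A missing
# or late-logged order is fixed by simply re-running the query.
rows = conn.execute("""
SELECT order_date, SUM(units), SUM(revenue), SUM(profit)
FROM orders
GROUP BY order_date
ORDER BY order_date
""").fetchall()
print(rows)
```

At ~200 orders a day this GROUP BY over a month of data is a trivial amount of work for any mainstream RDBMS, and an index on order_date keeps it fast as history grows.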
I don't know a good way to maintain date-dependent sums in a SQL database.
Take a database with two tables:
Client
clientID
name
overdueAmount
Invoice
clientID
invoiceID
amount
dueDate
paymentDate
I need to produce a list of the clients ordered by overdue amount (the sum of the client's unpaid past-due invoices). On a big database it isn't possible to calculate this in real time.
The problem is the maintenance of an overdueAmount field on the client. The value of this field can change at midnight from one day to the next, even if nothing changed on the client's invoices.
This sum changes if an invoice is paid, if a new invoice is created whose due date is already past, if a due date is past today but wasn't yesterday...
The only solution I have found is to recalculate this field every night for every client by summing the invoices that meet the conditions. But that's not efficient on very big databases.
I think it's a common problem and I would like to know if a best practice exists?
You should read about data warehousing. It will help you solve this problem. It addresses exactly what you just said:
"The only solution I found is to recalculate every night this field
on every client by summing the invoices respecting the conditions. But
it's not efficient on very big databases."
But it offers more than that. When you read about it, try to forget about normalization: its main purpose is showing data, not managing it. It may feel strange at the beginning, but once you understand why we need data warehousing, it becomes very, very interesting.
This book is a good start, a classic one: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Firstly, I'd like to understand what you mean by "very big databases" - most RDBMS systems running on decent hardware should be able to calculate this in real time for anything less than hundreds of millions of invoices. I speak from experience here.
Secondly, "best practice" is one of those expressions that mean very little - it's often used to present someone's opinion as being more meaningful than simply an opinion.
In my opinion, by far the best option is to calculate it on the fly.
If your database is so big that you really can't do this, I'd consider a nightly batch (as you describe). Nightly batch runs are a pain - especially for systems that need to be available 24/7, but they have the benefit of keeping all the logic in a single place.
If you want to avoid nightly batches, you can use triggers to populate an "unpaid_invoices" table. When you create a new invoice record, a trigger copies that invoice to the "unpaid_invoices" table; when you update the invoice with a payment, and the payment amount equals the outstanding amount, you delete from the unpaid_invoices table. By definition, the unpaid_invoices table should be far smaller than the total number of invoices; calculating the outstanding amount for a given customer on the fly should be okay.
However, triggers are nasty, evil things, with exotic failure modes that can stump the unsuspecting developer, so only consider this if you have a ninja SQL developer on hand. Absolutely make sure you have a SQL query which checks the validity of your unpaid_invoices table, and ideally schedule it as a regular task.
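Here is a minimal sketch of that trigger approach plus the validity check, in SQLite syntax purely for illustration (the real DDL depends on your RDBMS, and the payment rule is simplified: setting paymentDate is treated as paying the invoice in full, rather than comparing the payment amount to the outstanding amount):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoice (invoiceID INTEGER PRIMARY KEY, clientID INT,
                      amount REAL, dueDate TEXT, paymentDate TEXT);
CREATE TABLE unpaid_invoices (invoiceID INTEGER PRIMARY KEY, clientID INT,
                              amount REAL, dueDate TEXT);

-- New invoices start out unpaid.
CREATE TRIGGER invoice_created AFTER INSERT ON invoice
BEGIN
    INSERT INTO unpaid_invoices VALUES (NEW.invoiceID, NEW.clientID,
                                        NEW.amount, NEW.dueDate);
END;

-- When an invoice is marked paid, drop it from the unpaid list.
CREATE TRIGGER invoice_paid AFTER UPDATE OF paymentDate ON invoice
WHEN NEW.paymentDate IS NOT NULL
BEGIN
    DELETE FROM unpaid_invoices WHERE invoiceID = NEW.invoiceID;
END;
""")
conn.execute("INSERT INTO invoice VALUES (1, 10, 100.0, '2020-01-01', NULL)")
conn.execute("INSERT INTO invoice VALUES (2, 10, 50.0, '2020-02-01', NULL)")
conn.execute("UPDATE invoice SET paymentDate = '2020-01-15' WHERE invoiceID = 1")

# Overdue amount per client is now a scan of the small unpaid_invoices table.
unpaid = conn.execute("""SELECT clientID, SUM(amount) FROM unpaid_invoices
                         GROUP BY clientID""").fetchall()
print(unpaid)

# Validity check: every invoice without a payment, and only those, must
# appear in unpaid_invoices. Schedule this as a regular task.
mismatch = conn.execute("""
SELECT COUNT(*) FROM invoice i
LEFT JOIN unpaid_invoices u ON u.invoiceID = i.invoiceID
WHERE (i.paymentDate IS NULL) <> (u.invoiceID IS NOT NULL)
""").fetchone()[0]
print("mismatched rows:", mismatch)
```

The final query is the safety net the answer recommends: if it ever returns a non-zero count, the triggers have drifted out of sync with the invoice table.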
I am looking for a tool or system to examine the database and identify values that are out of the ordinary. I don't need anything that does real-time checks, just a system which does processing overnight or at scheduled points. I am looking for a system at two levels:
Database wide: Eg: Compare salaries of all employees and identify ones that are too low or too high from the average.
Per employee: Eg: Check salary history for employee and identify payments that are out of the ordinary for the employee.
The two above are only examples, take for instance the case with ATM withdrawals, Shopping order history, Invoice history, etc.
You could use Analysis Services and a data mining model.
Obviously you'd have to adapt the code, but here's a sample from Microsoft:
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=83
"This sample shows how the clustering algorithm can be used to perform automatic data validation through the use of the PredictCaseLikelihood() function. To exercise the sample, enter values into the form and click the submit button. If the combination of values has a reasonable likelihood, the form will accept the values. If not, additional elements of the prediction query indicate the value likely to be unacceptable. Checking the “Show Details” box on the form will show the query that was sent in addition to the probability ratios used to determine the outlying values."
I don't have MySQL installed at the moment, but I guess the first requirement can be achieved with a query similar to this (off the top of my head, not tested, might not work at all):
SELECT name, salary FROM emp WHERE salary>(SELECT AVG(salary) FROM emp);
Or, a more complex query would be:
SELECT name, salary FROM emp WHERE ABS(salary - (SELECT AVG(salary) FROM emp)) >
(SELECT AVG(ABS(salary - (SELECT AVG(salary) FROM emp))) FROM emp);
The second one selects the employees whose salaries differ from the average salary by more than the average absolute deviation of all the employees' salaries. (The ABS is needed because the average of the signed differences is always zero.)
Lemme know if it works.
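For what it's worth, a deviation-based variant of that second query can be sanity-checked in SQLite (the emp rows are made up for illustration; one salary is a deliberate outlier):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("alice", 50000), ("bob", 52000), ("carol", 48000),
                  ("dave", 51000), ("eve", 120000)])  # eve is the outlier

# Flag salaries whose absolute deviation from the mean exceeds the
# average absolute deviation -- catches values both too high and too low.
rows = conn.execute("""
SELECT name, salary FROM emp
WHERE ABS(salary - (SELECT AVG(salary) FROM emp)) >
      (SELECT AVG(ABS(salary - (SELECT AVG(salary) FROM emp))) FROM emp)
""").fetchall()
print(rows)
```

Here the mean is 64,200 and the average absolute deviation is 22,320, so only the 120,000 salary is flagged; the same SQL should run on MySQL unchanged.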
The hard part is defining "out of the ordinary."
What you're trying to do is what fraud-detection software (for figuring out when somebody is laundering money, say) is all about. Your simple example is an easy one. The complex cases are handled with databases, statistics, data mining, and rules engines containing lots of rules. It's not an easy problem, unless you restrict yourself to the trivial case you cited.
If you manage to turn it into an easy problem, you'll be a wealthy person. Good luck.
There are different methods for finding outliers: distance-based, cluster-based, etc.
You could use Data Applied's outlier detection or clustering analytics. The first automatically finds records that are most different from their N closest neighbors. The second finds large groups (clusters) of records and identifies records which don't fit any cluster well. It's free for small data sets and runs online (http://www.data-applied.com). You don't have to write code, but you can use their Web API if you want.