I'm reading the article explaining the nested-loop-join algorithm and I don't exactly understand the actual working principle of nested selects. Here is an example provided by the article:
The example searches for employees whose last name starts with 'WIN'
and fetches all SALES for these employees.
And the queries representing the nested-loop join are these:
select employees0_.subsidiary_id as subsidiary1_0_
-- MORE COLUMNS
from employees employees0_
where upper(employees0_.last_name) like ?;
select sales0_.subsidiary_id as subsidiary4_0_1_
-- MORE COLUMNS
from sales sales0_
where sales0_.subsidiary_id=?
and sales0_.employee_id=?;
select sales0_.subsidiary_id as subsidiary4_0_1_
-- MORE COLUMNS
from sales sales0_
where sales0_.subsidiary_id=?
and sales0_.employee_id=?;
As you can see, the last two queries are exactly the same. This is what I'm confused by. Why isn't generating the first two queries enough? Why do we have to generate the third one?
Bear in mind that the code you pasted is the referenced article's example of what not to do: an anti-pattern.
That said, the queries are parameterized, and therefore not actually identical. The ? placeholders in each query are parameters that are bound to a different employee's subsidiary_id and employee_id in each iteration of the loop.
It is not necessary to generate the third query. If you write the SQL manually, you can load all sales for all retrieved employees in a single query (a sketch follows below). But the "N+1 query" anti-pattern arises when the program code looks like the example in the article:
for (Employees e: emp) {
    // process Employee
    for (Sales s: e.getSales()) {
        // process sale for Employee
    }
}
In that code, the e.getSales() call loads the sales of a single employee. It does not have enough information to load the sales data for the other employees as well, because the ORM does not have the full list of employees for which sales data will be needed. So the ORM is forced to load each employee's sales in a separate query.
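By contrast, a hand-written query can fetch everything in one round trip. A rough sketch, reusing the tables and join columns from the generated queries above (the real column list would be longer, and it assumes employees also carries the employee_id join column):
-- sketch: one query instead of 1 + N
select e.*, s.*
from employees e
join sales s
  on s.subsidiary_id = e.subsidiary_id
 and s.employee_id = e.employee_id
where upper(e.last_name) like 'WIN%';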
Some ORMs can avoid the "N+1 query" problem automatically. For example, in PonyORM (written in Python) the code from the article would look like this:
# the first query loads all necessary employees
employees = select(e for e in Employee if e.lastName.startswith('WIN'))
for e in employees:
    # process Employee
    for sale in e.sales:
        # process sale for Employee
When the program starts iterating over the employee query, PonyORM loads all the necessary employees at once. When the sales items for the first employee are requested, PonyORM loads them for this employee only (because the ORM doesn't know our intention and assumes that maybe we need sales data for the first employee only). But when the sales data of the second employee are requested, PonyORM recognizes the "N+1 query" anti-pattern, sees that we have N employee objects loaded in memory, and loads the sales for all remaining employees in a single query. This behavior can be considered a heuristic: it may load some extra sales objects if our for loop contains a break statement. But typically this heuristic leads to better performance, because it can drastically reduce the number of queries. Loading some extra data is usually not a problem; it is much more important to reduce the number of round trips to the server.
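The shape of that single batched query is conceptually an IN-list over the primary keys of the employees already loaded in memory. Illustrative SQL only, not PonyORM's literal output:
-- illustrative only: the batched query for the remaining employees
select *
from sales
where employee_id in (?, ?, ?, ?);  -- one placeholder per remaining employee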
We have data of different dimensions, for example:
Name by Company
Stock prices by Date, Company
Commodity prices by Date & Commodity
Production volumes by Date, Commodity & Company
We're thinking of the best way of storing these in BigQuery. One potential method is to put them all in the same table, and nest the extra dimensions.
That would mean:
Almost all the data would be nested - e.g. there would be a single 'row' for each Company, and then its prices would be nested by Date.
Data would have to share at least one dimension - I don't think there would be a way of representing Commodity prices in a table whose first column was the company's Name
Are there disadvantages? Are there performance implications? Is it sensible to nest 5000 dates + associated values within each company's row?
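To make the shape concrete, the nested option we have in mind would look roughly like this (dataset and column names are only illustrative):
-- illustrative schema: one row per company, values nested by date
CREATE TABLE mydataset.companies (
  name STRING,
  stock_prices ARRAY<STRUCT<date DATE, price FLOAT64>>,
  production ARRAY<STRUCT<date DATE, commodity STRING, volume FLOAT64>>
);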
It's common to have nested/repeated columns in BigQuery schemas since it makes reasoning about the data easier. Firebase produces schemas with repetition at many levels, for instance. If you flatten everything, the downside is you need some kind of unique ID for each row in order to associate events with each other, and then you'll need aggregations (using the ID as a key) rather than simple filters if you want to do any kind of counting.
As for downsides of nested/repeated schemas, one is that you may find yourself performing complicated transformations of the structure with ARRAY subqueries or STRUCT operators, for instance. These are generally fast, but they do have some overhead relative to queries without any structure imposed on the result at all.
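For example, pulling a flat per-date result back out of a nested schema (say a companies table with a stock_prices ARRAY<STRUCT<date DATE, price FLOAT64>> column) involves an UNNEST of the array; names and the date below are purely illustrative:
-- illustrative only: flatten the nested prices for a date range
SELECT c.name, p.date, p.price
FROM mydataset.companies AS c,
     UNNEST(c.stock_prices) AS p
WHERE p.date >= DATE '2016-01-01';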
My best suggestion would be to load some data and run some experiments. Storage and querying both are relatively cheap, so you can try a few different schema shapes and see which works better for your purposes.
Updating in BigQuery is pretty new, but based on the publicly available info (BigQuery DML) it is currently limited to only 48 UPDATE statements per table per day.
Quotas
DML statements are significantly more expensive to process than SELECT statements.
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
Processing nested data is also very expensive since all of the data from that column is loaded on every query. It is also slow if you are doing a lot of operations on nested data.
Can this kind of logic be implemented in SSIS, and is it possible to do it in near real time?
Users are submitting tables with hundreds of thousands of records and, with the current implementation, waiting for the results for up to 1 hour when the starting table has about 500,000 rows (after STEP 1 and STEP 2 we have millions of records). In the future the amount of data and the user base may grow drastically.
STEP 1
We have a table (A) of around 500,000 rows with the following main columns: ID, AMOUNT
We also have a table (B) with the prop. steps and the following main columns: ID_A, ID_B, COEF
TABLE A:
id,amount
a,1000
b,2000
TABLE B:
id_a,id_b,coef
a,a1,2
a1,a2,2
b,b1,5
We create new records from all the 500,000 records in table A by following the steps in table B and multiplying the AMOUNT by the COEF at each step (a plain T-SQL sketch of this expansion follows the output table):
OUTPUT TABLE:
id, amount
a,1000
a1,2000
a2,4000
b,2000
b1,10000
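A sketch of this expansion as a recursive CTE in plain T-SQL, using the table and column names above (illustrative only, not necessarily how the current implementation works):
-- expand table A by repeatedly applying the steps in table B
WITH chain (id, amount) AS (
    SELECT a.ID, a.AMOUNT             -- anchor: the original rows of table A
    FROM A AS a
    UNION ALL
    SELECT b.ID_B, c.amount * b.COEF  -- recursive step: follow B and multiply
    FROM chain AS c
    JOIN B AS b ON b.ID_A = c.id
)
SELECT id, amount
FROM chain;
-- add OPTION (MAXRECURSION 0) if the chains can be deeper than 100 levels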
STEP 2
Following custom logic, we then assign the amount of every record calculated in step 1 to some other items (the WC column below), multiplying it by a coefficient (a SQL sketch of this join follows the output table):
TABLE A
ID,AMOUNT
a,1000
a1,2000
a2,4000
b,2000
b1,10000
TABLE B
ID,WC,COEF
a,wc1,0.1
a,wc2,1
a,wc3,0.1
a1,wc4,1
a2,wc5,1
b,wc1,1
b1,wc1,1
b1,wc2,1
OUTPUT TABLE:
ID,WC,AMOUNT
a,wc1,100
a,wc2,1000
a,wc3,100
a1,wc4,2000
a2,wc5,4000
b,wc1,2000
b1,wc1,10000
b1,wc2,10000
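A sketch of step 2 as a plain join, where STEP1_OUTPUT and B2 are just illustrative names for the step 1 result and the second mapping table shown above:
-- distribute each step 1 amount across its WC rows
SELECT s.id, b.WC, s.amount * b.COEF AS amount
FROM STEP1_OUTPUT AS s
JOIN B2 AS b ON b.ID = s.id;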
The other steps are just joins and arithmetical operations on the tables, and the overall number of records can't be reduced (the tables also have other fields with metadata).
In my personal experience, that kind of logic can be implemented entirely in SSIS.
I would do it in a Script Task or Script Component, for two reasons:
First, if I understood correctly, you need an asynchronous transformation that outputs more rows than it receives, and scripts can handle multiple, different outputs.
Second, in a script you can implement all those calculations directly; building them out of standard components would require a lot of components and relationships between them. Most importantly, the complexity of the solution stays tied to your own algorithmic design, which can be a huge boost for performance and scalability if you get the complexity right, and those two aspects, if I read the question correctly, are fundamental here.
There are, though, some professionals who have a bad opinion of "complex" scripts and...
The downside of this approach is that you need some ability with .NET and programming, most of your package logic will be concentrated there, and script debugging can be more complex than with other components. But once you start using the .NET features of SSIS, there is no turning back.
Usually getting near real time in SSIS is tricky for big data sets, and sometimes you need to integrate other tools (e.g. StreamInsight) to achieve it.
For one of our applications we have huge data in multiple tables, and every time a user does something a new record is inserted into these tables. There is a reporting screen where we have to do calculations on these tables and show the totals.
For example: Assume two parent tables Employee and Attendance
The Employee table has 100,000 records, and the Attendance table gets a record each day whenever an employee enters or leaves the building; it holds more than 2 million records for one year. I need to calculate the total attendance for each employee, display it on screen for all 100,000 records, and paginate based on employee name. The calculation takes too much time and spikes the DB CPU.
To avoid the runtime calculation of the total, I'm planning to have a separate table with the calculated total for each employee and simply query that table whenever needed. The problem is that the data for previous years is not going to change, but for the current year data is generated whenever an employee records attendance day to day. What is the best option to keep that table updated in real time with the total for every employee whenever new attendance is recorded for the current year?
I thought of using triggers, but triggers are synchronous, so they would either affect the performance of my reporting application whenever I query, or affect the performance of inserts into the parent tables.
Please let me know if there are better ways to update my totals table in real time without impacting the performance of inserts or updates to the parent tables.
This is a perfect case for indexed views. Certainly, the core of your query is a group by such as:
select EmployeeID, count(*)
from AttendanceRecords
group by EmployeeID
Index that view. Its contents will then be available cheaply and updated in real time. There is zero potential for out-of-sync data.
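A sketch of what that indexed view could look like, assuming a dbo.AttendanceRecords table with an EmployeeID column (note that an indexed view requires SCHEMABINDING and COUNT_BIG instead of COUNT):
CREATE VIEW dbo.vEmployeeAttendanceTotals
WITH SCHEMABINDING
AS
SELECT EmployeeID, COUNT_BIG(*) AS TotalAttendance
FROM dbo.AttendanceRecords
GROUP BY EmployeeID;
GO
-- the unique clustered index materializes the view; SQL Server then keeps it
-- up to date automatically as rows are inserted into AttendanceRecords
CREATE UNIQUE CLUSTERED INDEX IX_vEmployeeAttendanceTotals
    ON dbo.vEmployeeAttendanceTotals (EmployeeID);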
One option would be to use SQL Change Tracking:
https://msdn.microsoft.com/en-us/bb933875.aspx
This is not change data capture (which can be quite heavy) - change tracking just lets you know which keys changed so you can act on them. With that information, you could have a regular job that collects those changes and updates your summaries, for example:
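A minimal sketch of that setup, assuming a database named PayrollDB and an attendance table dbo.Attendance with primary key AttendanceID and an EmployeeID column (all names are illustrative):
ALTER DATABASE PayrollDB
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Attendance ENABLE CHANGE_TRACKING;

-- a scheduled job remembers the last synchronized version and asks only for
-- the keys that changed since then, refreshing just those employees' totals
DECLARE @last_sync_version bigint = 0;  -- persisted by the job between runs
SELECT DISTINCT a.EmployeeID
FROM CHANGETABLE(CHANGES dbo.Attendance, @last_sync_version) AS ct
JOIN dbo.Attendance AS a ON a.AttendanceID = ct.AttendanceID;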
...or, if you can use SQL 2014, you could get into Updatable Column Stores and dispense with the summaries.
Would you consider exporting data from previous years and using it to create the total attendance counts for employees in earlier years?
You say you're essentially moving towards having a table acting as a counter. By ensuring your old data conforms to this model as well, it will be much easier to write and maintain the code that interacts with it, and the server load from any individual query should be minimal. For example:
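A sketch of the one-off backfill for closed years, assuming an Attendance table with EmployeeID and AttendanceDate columns and a totals table keyed by employee and year (names and the cutoff date are illustrative):
-- load the immutable history into the counter table once
INSERT INTO dbo.EmployeeAttendanceTotals (EmployeeID, AttendanceYear, Total)
SELECT EmployeeID, YEAR(AttendanceDate), COUNT(*)
FROM dbo.Attendance
WHERE AttendanceDate < '2016-01-01'       -- previous, closed years only
GROUP BY EmployeeID, YEAR(AttendanceDate);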
I am working on a project which involves a payroll system for Employees. My main goal is to retrieve the actual work hours for each employee, however I am a bit confused about how I should proceed with the data I have been given. The database I am using is non-relational and it is used to calculate the financial transactions for the company involved. I'm supposed to build a BI-solution using staging tables, dimensions and a data warehouse.
These are the main tables I have to work with:
Timetable
Employee
Transaction
Deviation
I have timetables in the database which give me the planned schedule for each employee, in hours, so calculating the hours they are supposed to work is no problem. In Transaction I can see how much each employee earns, and in Deviation I can see if any abnormalities occur, for example if an employee is ill or on holiday. It also states how much is added to and deducted from the monthly salary (as well as the unit count).
My theory is to use the Transaction and Deviation tables and compare the results to the planned work schedule; this way I will know if the employee has worked more or less than planned.
Am I on the right track or is there another way of doing this?
I have just started with BI so I appreciate any help I can get!
That sounds like you are on the right path, but really you should be confirming the plan with a data expert familiar with the payroll database.
To make that simple, dummy up some results in Excel first (say, pick a random person from the database) and do the calculations to get the actual hours. Take that to the data expert and have them confirm whether this is correct, or whether there are exceptions where this business rule does not apply.
I am looking for a tool or system to take a look at the database and identify values that are out of the ordinary. I don't need anything to do real time checks, just a system which does processing overnight or at scheduled points. I am looking for a system at two levels:
Database wide: E.g. compare the salaries of all employees and identify ones that are too far below or above the average.
Per employee: E.g. check the salary history for an employee and identify payments that are out of the ordinary for that employee.
The two above are only examples, take for instance the case with ATM withdrawals, Shopping order history, Invoice history, etc.
You could use Analysis Services and a data mining model.
Obviously you'd have to adapt the code, but here's a sample from Microsoft:
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=101&Id=83
"This sample shows how the clustering algorithm can be used to perform automatic data validation through the use of the PredictCaseLikelihood() function. To exercise the sample, enter values into the form and click the submit button. If the combination of values has a reasonable likelihood, the form will accept the values. If not, additional elements of the prediction query indicate the value likely to be unacceptable. Checking the “Show Details” box on the form will show the query that was sent in addition to the probability ratios used to determine the outlying values."
I don't have MySQL installed at the moment, but I guess the first one can be achieved with a query similar to this (off the top of my head, not tested, it might not work at all):
SELECT name, salary FROM emp WHERE salary>(SELECT AVG(salary) FROM emp);
Or, a more complex query would be:
SELECT name, salary FROM emp
WHERE ABS(salary - (SELECT AVG(salary) FROM emp)) >
      (SELECT AVG(ABS(salary - (SELECT AVG(salary) FROM emp))) FROM emp);
The second one selects the employees whose salaries differ from the average salary by more than the average absolute deviation of all the employees' salaries.
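For the per-employee case, the same idea can be applied within each employee's own history. A sketch, assuming a hypothetical salary_history(emp_id, pay_date, amount) table and an arbitrary threshold:
SELECT h.emp_id, h.pay_date, h.amount
FROM salary_history h
JOIN (SELECT emp_id, AVG(amount) AS avg_amount
      FROM salary_history
      GROUP BY emp_id) a ON a.emp_id = h.emp_id
WHERE ABS(h.amount - a.avg_amount) > 0.5 * a.avg_amount;  -- 50% deviation, purely arbitrary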
Lemme know if it works.
The hard part is defining "out of the ordinary."
What you're trying to do is what fraud-detection software (for example, the kind used to figure out when somebody is laundering money) is all about. Your simple example is an easy one. The more complex ones are handled with databases, statistics, data mining, and rules engines containing lots of rules. It's not an easy problem, unless you want to restrict yourself to the trivial case you cited.
If you manage to turn it into an easy problem, you'll be a wealthy person. Good luck.
There are different methods for finding outliers: distance-based, cluster-based, etc.
You could use Data Applied's outlier detection or clustering analytics. The first automatically finds records which are most different from their N closest neighbors. The second finds large groups (clusters) of records and identifies records which don't fit any cluster well. They make it free for small data sets, and it's online (http://www.data-applied.com). You don't have to write code, but you can use their Web API if you want.