Count number of rows returned in a SQL statement

Are there any DB engines that allow you to run an EXPLAIN (or other function) that gives you an approximate count of the rows that would be returned before an aggregation is run (not rows scanned, but rows that would actually be returned)? For example, in the following query:
SELECT gender, COUNT(1)
FROM sales
JOIN (
    SELECT id, person
    FROM sales2
    WHERE country = 'US'
    GROUP BY person_id
) USING (id)
WHERE sales.age > 20
GROUP BY gender
Let's say this query returns 3 rows after being aggregated, but would return 170M rows if unaggregated.
Are there any tools where you can run the query to get this '170M' number, or does this come down to complexity theory (or something similar), where it's almost just as expensive to run the query (without the final aggregation/having/sort/limit/etc.) to get the count? In other words, doing a rewrite to:
SELECT COUNT(1)
FROM sales
JOIN (
    SELECT id, person
    FROM sales2
    WHERE country = 'US'
    GROUP BY person_id
) USING (id)
WHERE sales.age > 20
but having to execute the query nonetheless.
As an example of using the current (MySQL) EXPLAIN to show how 'off' it is from what I'm looking for:
explain select * from movies where title>'a';
# rows=147900
select count(1) from movies where title>'a';
# 144647 --> OK, pretty close
explain select * from movies where title>'u';
# rows=147900
select count(1) from movies where title>'u';
# 11816 --> Not close at all

Assuming you can use MS SQL Server, you could tap into the same data the optimiser uses for cardinality estimation: DBCC SHOW_STATISTICS (table, index) WITH HISTOGRAM
Part of the data set you get back is a per-column histogram, which is essentially the number of rows for each value range found in the table.
You probably want to query the data programmatically; one way to achieve this is to insert it into a temp table:
CREATE TABLE #histogram (
    RANGE_HI_KEY datetime PRIMARY KEY,
    RANGE_ROWS INT,
    EQ_ROWS INT,
    DISTINCT_RANGE_ROWS INT,
    AVG_RANGE_ROWS FLOAT
)
INSERT INTO #histogram
EXEC ('DBCC SHOW_STATISTICS (Users, CreationDate) WITH HISTOGRAM')

SELECT 'Estimate', SUM(RANGE_ROWS + EQ_ROWS)
FROM #histogram
WHERE RANGE_HI_KEY BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
UNION ALL
SELECT 'Actual', COUNT(1)
FROM Users u
WHERE u.CreationDate BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
For example, here is the result of this same query run against the Stack Overflow database:
| Metric   | Rows  |
| -------- | ----- |
| Estimate | 98092 |
| Actual   | 11715 |
The estimate seems way off, but keep in mind that the whole table has almost 15 million records.
A note on precision and other gotchas:
The maximum number of histogram steps is capped at 200, which is not a lot, so you are not getting a guaranteed 10% margin of error; but then, neither does SQL Server.
As you insert data into the table, histograms may get stale, so your results would get skewed even more.
There are different ways to update this data; some are reasonably quick while others effectively require a full table scan (see the sketch below).
Not all columns will have statistics. You can either create them manually or (I believe) they get created automatically if you run a search with the column as a predicate.
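As a minimal sketch of those refresh options (assuming a statistics object named CreationDate on the Users table, as in the DBCC call above):
-- Sampled update: reasonably quick, approximate
UPDATE STATISTICS Users CreationDate;
-- Full-scan update: the variant that effectively requires a full table scan
UPDATE STATISTICS Users CreationDate WITH FULLSCAN;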

MS SQL Server offers "execution plans". In the screenshot (not reproduced here) I have two queries and press Ctrl-L to see their estimated plans.
One query returns all records and the other just the count, using the same table.
Look at the metric the red arrows point to: the estimated number of rows that will be scanned when the queries are run. In this case, that number is the same regardless of whether you select count(*) or *, which is exactly your point!
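A way to get at the same estimate without the GUI, as a sketch using the movies example from the question: SET SHOWPLAN_ALL makes SQL Server return the plan rows (including an EstimateRows column) instead of executing the query.
SET SHOWPLAN_ALL ON;
GO
-- Returns the estimated plan, with EstimateRows per operator, without running the query
SELECT * FROM movies WHERE title > 'u';
GO
SET SHOWPLAN_ALL OFF;
GO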

Related

SQL - Insert using Column based on SELECT result

I currently have a table called tempHouses that looks like:
avgprice | dates | city
dates are stored as yyyy-mm-dd
However I need to move the records from that table into a table called houses that looks like:
city | year2002 | year2003 | year2004 | year2005 | year2006
The information in tempHouses contains average house prices from 1995 - 2014.
I know I can use SUBSTRING to get the year from the dates:
SUBSTRING(dates, 1, 4)
So basically for each city in tempHouses.city I need to get the average house price from the above years into one record.
Any ideas on how I would go about doing this?
This is a SQL Server approach, and a PIVOT may be a better option, but here's one way:
SELECT City,
       AVG(year2002) AS year2002,
       AVG(year2003) AS year2003,
       AVG(year2004) AS year2004
FROM (
    SELECT City,
           CASE WHEN Dates BETWEEN '2002-01-01T00:00:00' AND '2002-12-31T23:59:59' THEN avgprice
                ELSE NULL -- NULL rather than 0, so rows from other years don't drag the average down
           END AS year2002,
           CASE WHEN Dates BETWEEN '2003-01-01T00:00:00' AND '2003-12-31T23:59:59' THEN avgprice
                ELSE NULL
           END AS year2003,
           CASE WHEN Dates BETWEEN '2004-01-01T00:00:00' AND '2004-12-31T23:59:59' THEN avgprice
                ELSE NULL
           END AS year2004
           -- Repeat for each year
    FROM tempHouses
) t
GROUP BY City
The inner query gets the data into the correct format for each record (City, year2002, year2003, year2004), whilst the outer query gets the average for each City.
There may be many ways to do this, and performance may be the deciding factor in which one to choose.
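For reference, a rough sketch of the PIVOT alternative mentioned above (SQL Server syntax; it assumes Dates can be fed to YEAR(), and it is not tested against the real schema):
SELECT City,
       [2002] AS year2002,
       [2003] AS year2003,
       [2004] AS year2004
FROM (
    SELECT City, YEAR(Dates) AS yr, avgprice
    FROM tempHouses
) src
PIVOT (
    AVG(avgprice) FOR yr IN ([2002], [2003], [2004])
) p;
-- Extend the IN list (and the outer SELECT) with one entry per year, just like the CASE version.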
The best way would be to use a script to run the queries for you, because you will need to run them multiple times, extracting the data year by year. Make sure the target table starts out with only the required columns, city and the row id:
http://dev.mysql.com/doc/refman/5.0/en/insert-select.html
INSERT INTO <table> (city) SELECT DISTINCT `city` FROM <old_table>;
Then for each city extract the average values per year, insert them into a temporary table and then insert into the main table.
SELECT avg(avgprice), substring(dates, 1, 4) AS yr FROM <old_table> GROUP BY yr;
Otherwise you're looking at a combination query using joins and potentially unions to extrapolate the data. Because you're flattening the table into a single row per city, it's going to be a little tough to do. You should create an index on the date column first if you don't want the query to fail with memory limits or just take a very long time to execute.
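For completeness, a single-pass version of that combination query is possible with conditional aggregation (a sketch, assuming MySQL and the tempHouses/houses columns from the question):
INSERT INTO houses (city, year2002, year2003, year2004, year2005, year2006)
SELECT city,
       AVG(CASE WHEN SUBSTRING(dates, 1, 4) = '2002' THEN avgprice END) AS year2002,
       AVG(CASE WHEN SUBSTRING(dates, 1, 4) = '2003' THEN avgprice END) AS year2003,
       AVG(CASE WHEN SUBSTRING(dates, 1, 4) = '2004' THEN avgprice END) AS year2004,
       AVG(CASE WHEN SUBSTRING(dates, 1, 4) = '2005' THEN avgprice END) AS year2005,
       AVG(CASE WHEN SUBSTRING(dates, 1, 4) = '2006' THEN avgprice END) AS year2006
FROM tempHouses
GROUP BY city;
-- CASE without ELSE yields NULL for other years, and AVG() ignores NULLs.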

Why can't I use SELECT ... FOR UPDATE with aggregate functions?

I have an application where I find a Sum() of a database column for a set of records and later use that sum in a separate query, similar to the following (made up tables, but the idea is the same):
SELECT Sum(cost)
INTO v_cost_total
FROM materials
WHERE material_id >=0
AND material_id <= 10;
[a little bit of interim work]
SELECT material_id, cost/v_cost_total
INTO v_material_id_collection, v_pct_collection
FROM materials
WHERE material_id >=0
AND material_id <= 10
FOR UPDATE;
However, in theory someone could update the cost column on the materials table between the two queries, in which case the calculated percentages would be off.
Ideally, I would just use a FOR UPDATE clause on the first query, but when I try that, I get an error:
ORA-01786: FOR UPDATE of this query expression is not allowed
Now, the work-around isn't the problem - just do an extra query to lock the rows before finding the Sum(), but that query would serve no other purpose than locking the tables. While this particular example is not time consuming, the extra query could cause a performance hit in certain situations, and it's not as clean, so I'd like to avoid having to do that.
Does anyone know of a particular reason why this is not allowed? In my head, the FOR UPDATE clause should just lock the rows that match the WHERE clause - I don't see why it matters what we are doing with those rows.
EDIT: It looks like SELECT ... FOR UPDATE can be used with analytic functions, as suggested by David Aldridge below. Here's the test script I used to prove this works.
SET serveroutput ON;
CREATE TABLE materials (
material_id NUMBER(10,0),
cost NUMBER(10,2)
);
ALTER TABLE materials ADD PRIMARY KEY (material_id);
INSERT INTO materials VALUES (1,10);
INSERT INTO materials VALUES (2,30);
INSERT INTO materials VALUES (3,90);
<<LOCAL>>
DECLARE
    l_material_id materials.material_id%TYPE;
    l_cost        materials.cost%TYPE;
    l_total_cost  materials.cost%TYPE;
    CURSOR test IS
        SELECT material_id,
               cost,
               Sum(cost) OVER () total_cost
        FROM materials
        WHERE material_id BETWEEN 1 AND 3
        FOR UPDATE OF cost;
BEGIN
    OPEN test;
    FETCH test INTO l_material_id, l_cost, l_total_cost;
    Dbms_Output.put_line(l_material_id||' '||l_cost||' '||l_total_cost);
    FETCH test INTO l_material_id, l_cost, l_total_cost;
    Dbms_Output.put_line(l_material_id||' '||l_cost||' '||l_total_cost);
    FETCH test INTO l_material_id, l_cost, l_total_cost;
    Dbms_Output.put_line(l_material_id||' '||l_cost||' '||l_total_cost);
    CLOSE test;
END LOCAL;
/
Which gives the output:
1 10 130
2 30 130
3 90 130
The syntax SELECT ... FOR UPDATE locks records in a table to prepare for an update. When you do an aggregation, the result set no longer refers to the original rows.
In other words, there are no records in the database to update; there is just a temporary result set.
You might try something like:
<<LOCAL>>
declare
    material_id materials.material_id%Type;
    cost        materials.cost%Type;
    total_cost  materials.cost%Type;
begin
    select material_id,
           cost,
           sum(cost) over () total_cost
    into   local.material_id,
           local.cost,
           local.total_cost
    from   materials
    where  material_id between 1 and 3
    for update of cost;
    ...
end local;
The first row gives you the total cost, but it selects all the rows, and in theory they could all be locked.
I don't know if this is allowed, mind you -- it would be interesting to hear whether it is.
For example, suppose there is a product table with id, name and stock columns, as shown below.
product table:
id | name   | stock
---+--------+------
 1 | Apple  |     3
 2 | Orange |     5
 3 | Lemon  |     8
Then both of the queries below can run sum() and SELECT ... FOR UPDATE together:
SELECT sum(stock) FROM (SELECT * FROM product FOR UPDATE) AS result;
WITH result AS (SELECT * FROM product FOR UPDATE) SELECT sum(stock) FROM result;
Output:
sum
-----
16
(1 row)
For that, you can use the WITH clause.
Example:
WITH result AS (
    -- your select
) SELECT * FROM result GROUP BY material_id;
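As a concrete version of that skeleton against the question's materials table (written here in the PostgreSQL style of the previous answer; whether Oracle accepts FOR UPDATE inside a WITH clause would need to be checked):
WITH locked AS (
    SELECT material_id, cost
    FROM materials
    WHERE material_id >= 0
      AND material_id <= 10
    FOR UPDATE
)
SELECT material_id,
       cost / SUM(cost) OVER () AS pct_of_total
FROM locked;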
Is your problem "However, in theory someone could update the cost column on the materials table between the two queries, in which case the calculated percentages would be off"?
In that case, you can probably just use an inner query:
SELECT material_id,
       cost / (SELECT Sum(cost)
               FROM materials
               WHERE material_id >= 0
                 AND material_id <= 10)
INTO v_material_id_collection, v_pct_collection
FROM materials
WHERE material_id >= 0
  AND material_id <= 10;
Why do you want to lock the table? Other applications might fail if they try to update it during that time, right?

How to design the Tables / Query for (m:n relation?)

I am sorry if the term m:n is not correct; if you know a better term, I will correct it. I have the following situation, this is my original data:
gameID
participID
result
The data itself looks like this:
gameID | participID | result
-------+------------+-------
     1 |          5 |     10
     1 |          4 |    -10
     2 |          5 |    150
     2 |          2 |   -100
     2 |          1 |    -50
When extracted, this table will easily have some 100 million rows and around 1 million participIDs or more.
I will need:
all results of all games from participant x, where participant y was present
Luckily this is only needed for a very limited number of participants, but those are subject to change, so I need a complete table and can reduce it in a second step.
My idea is the following; it just looks very unoptimized.
1) Get the list of games where the "point of view participant" is included:
insert into consolidatedtable (gameid, participid, result)
select gameID, participID, sum(result)
from mastertable
where participID = x and result <> 0
group by gameID, participID
2) Get all games where the other participant is included:
insert into consolidatedtable (gameid, participid, result)
select gameID, participID, sum(result)
from mastertable
where gameID in (select gameID from consolidatedtable)
  and participID = y and result <> 0
group by gameID, participID
3) Delete all games from the consolidated table that have fewer than 2 participants:
delete from consolidatedtable
where gameID in (select gameID
                 from consolidatedtable
                 group by gameID
                 having count(distinct participID) < 2)
The whole thing looks like a children's solution to me:
I need a consolidated table for each player.
I insert way too many games into this table and delete them later on.
The whole thing needs to be run participant by participant over the whole master table; it would not work if I did this for several participants at the same time.
Any better ideas? There must be, this one is just so bad. The master table will be PostgreSQL on the DW server, the consolidated view will be MySQL (but the number crunching will be done in PostgreSQL).
My problems:
1) How do I build the consolidated table(s) (do I need more than one?) without having to run a single query for each player over the whole master table (I need the data for players x, y, z, no matter who else is playing)? This is the consolidation task for the DW server; it should create the (condensed) table for the webserver.
2) How can I then query the data at the webserver quickly (so the table design of (1) should take this into consideration)? We are not talking about a lot of players here, maybe 100, so I could either partition by player ID or just create single tables.
Datawarehouse: PostgreSQL 9.2 (48GB RAM, SSD)
Webserver: MySQL 5.5 (4GB RAM, SSD)
Master table: gameID BIGINT, participID, result INT, foreign key on participID (to the participants table)
The DW server will hold the master table; it should also prepare the consolidated/extracted tables (processing power and SSD space are not an issue).
The webserver should hold the consolidated tables (only for the ~100 players where I need the info) and query this data in a very efficient manner (so an efficient query at the webserver matters much more than the workload of the DW server).
I think this is important, sorry that I didn't include it at the beginning: the data at the DW server updates daily, but I do not need to query the whole master table completely every day. The setup allows me to consolidate only newer values, e.g. yesterday the consolidation was up to ID 500, the current ID is 550, so today I only consolidate 501-550.
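A minimal sketch of that incremental step (it assumes a small watermark table, here called consolidation_watermark, that stores the last consolidated gameID; the table name is hypothetical, and a filter on the players of interest can be added to the WHERE clause):
insert into consolidatedtable (gameid, participid, result)
select gameID, participID, sum(result)
from mastertable
where gameID > (select last_gameid from consolidation_watermark)
  and result <> 0
group by gameID, participID;

update consolidation_watermark
set last_gameid = (select max(gameID) from mastertable);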
Here is another idea that might work, depending on your database (and my understanding of the question):
SELECT *
FROM table a
WHERE participID = 'x'
AND EXISTS (
SELECT 1 FROM table b
WHERE b.participID = 'y'
AND b.gameID=a.gameID
);
Assuming you have indexes on the two columns (participID and gameID), the performance should be good.
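For example, something along these lines (the index name is a placeholder, and mastertable stands in for the real table):
-- covers the participID filter and supplies gameID for the EXISTS probe / join
CREATE INDEX idx_mastertable_particip_game ON mastertable (participID, gameID);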
I'd compare it to this and see which runs faster:
SELECT *
FROM table a
JOIN (
SELECT gameID
FROM table
WHERE participID = 'y'
GROUP BY gameID
) b
ON a.gameID=b.gameID
WHERE a.participID = 'x';
Sounds like you just want a self join:
For all participants:
SELECT x.gameID, x.participID, x.results, y.participID, y.results
FROM table as x
JOIN table as y
    ON x.gameID = y.gameID
WHERE x.participID <> y.participID
The downside of that is you'd get each participant on each side of each game.
For 2 specific participants:
SELECT x.gameID, x.results, y.results
FROM (SELECT gameID, participID, results
      FROM table
      WHERE participID = 'x'
        AND results <> 0) as x
JOIN (SELECT gameID, participID, results
      FROM table
      WHERE participID = 'y'
        AND results <> 0) as y
    ON x.gameID = y.gameID
You might not need to select participID in your query, depending on what you're doing with the results.

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name | dob |
+----+----------+------------+
| 1 | Alice | 01/12/1989 |
| 2 | Bob | 04/06/1990 |
| 3 | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the tables were the size indicated this wouldn't matter, but once you're on your 5th left outer join, using range scans rather than unique scans begins to have an effect.
Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to id, deleted then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries with the additional cost of a range scan rather than a unique index scan in every query.
Create a second table, say student_archive and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information ever you have to use union or an extra left outer join.
For completeness, have a horribly de-normalised data-structure: id, name1, dob, name2, dob2... etc.
Number 1 is not an option, since I never want to lose any information and always want a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150 line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "less DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick with #4, with some modifications. There is no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or deleting) the original record. That can easily be done with a row-level trigger. Retrieving all the information is, in my opinion, not a frequent operation, and I don't see anything wrong with the extra join/union. Also, you can define a view, so all queries will be straightforward from the end user's perspective.
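A minimal sketch of that trigger idea (Oracle syntax, since the question uses Oracle; it assumes student_archive has the same columns as students plus an archived_at timestamp, which is an assumption on my part):
CREATE OR REPLACE TRIGGER students_archive_trg
BEFORE UPDATE OR DELETE ON students
FOR EACH ROW
BEGIN
    -- copy the old values before they are changed or removed
    INSERT INTO student_archive (id, name, dob, archived_at)
    VALUES (:OLD.id, :OLD.name, :OLD.dob, SYSDATE);
END;
/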

Approach to a Bin Packing sql problem

I have a problem in sql where I need to generate a packing list from a list of transactions.
Data Model
The transactions are stored in a table that contains:
transaction id
item id
item quantity
Each transaction can have multiple items (and consequently multiple rows with the same transaction id). Each item then has a quantity from 1 to N.
Business Problem
The business requires that we create a packing list, where each line item in the packing list contains the count of each item in the box.
Each box can only contain 160 items (they all happen to be the same size/weight). Based on the total count of the order we need to split items into different boxes (sometimes splitting even the individual item's collection into two boxes)
So the challenge is to take that data schema and come up with the result set that includes how many of each item belong in each box.
I am currently brute forcing this in some not so pretty ways and wondering if anyone has an elegant/simple solution that I've overlooked.
Example In/Out
We really need to isolate how many of each item end up in each box. For example:
Order 1:
100 of item A
100 of item B
140 of item C
This should result in three rows in the result set:
Box 1: A (100), B (60)
Box 2: B (40), C (120)
Box 3: C (20)
Ideally the query would be smart enough to put all of C together, but at this point we're not too concerned with that.
How about something like
SELECT SUM([Item Quantity]) AS totalItems,
       SUM([Item Quantity]) / 160 AS totalBoxes,
       SUM([Item Quantity]) % 160 AS amountInLastBox -- modulo: T-SQL uses % rather than MOD()
FROM [Transactions]
GROUP BY [Transaction Id]
Let me know what fields you're looking for in the result set and I could come up with a better one.
I was looking for something similar, and all I could achieve was expanding the rows to the number of item counts in a transaction, then grouping them into bins. Not very elegant, though. Moreover, because string aggregation is still very cumbersome in SQL Server (Oracle, I miss you!), I have to leave the last part out, i.e. putting the counts into one single row.
My solution is as follows:
Example transactions table:
INSERT INTO transactions
(trans_id, item, cnt) VALUES
('1','A','50'),
('2','A','140'),
('3','B','100'),
('4','C','80');
GO
Create a dummy sequence table, which contains numbers from 1 to 1000 (I assume that maximum number allowed for an item in a single transaction is 1000):
CREATE TABLE numseq (n INT NOT NULL IDENTITY) ;
GO
INSERT numseq DEFAULT VALUES ;
WHILE SCOPE_IDENTITY() < 1000 INSERT numseq DEFAULT VALUES ;
GO
Now we can generate a temporary table from the transactions table, in which each transaction and item exist "cnt" times in a subquery, then number the bins using division and group by bin number:
SELECT bin_id, item, count(*) count_in_bin
INTO result
FROM (
    SELECT t.item, ((row_number() over (order by t.item, s.n) - 1) / 160) + 1 as bin_id
    FROM transactions t
    INNER JOIN numseq s
        ON t.cnt >= s.n -- join conditionally to repeat transaction rows "cnt" times
) a
GROUP BY bin_id, item
ORDER BY bin_id, item
GO
Result is:
bin_id item count_in_bin
1 A 160
2 A 30
2 B 100
2 C 30
3 C 50
In Oracle, the last step would be as simple as that:
SELECT bin_id, WM_CONCAT(CONCAT(item,'(',count_in_bin,')')) contents
FROM result
GROUP BY bin_id
This isn't the prettiest answer but I am using a similar method to keep track of stock items through an order process, and it is easy to understand, and may lead to you developing a better method than I have.
I would create a table called "PackedItem" or something similar. The columns would be:
packed_item_id (int) - Primary Key, Identity column
trans_id (int)
item_id (int)
box_number (int)
Each record in this table represents 1 physical unit you will ship.
Let's say someone adds a line to transaction 4 with 20 of item 12: I would add 20 records to the PackedItem table, all with the transaction ID, the item ID, and a NULL box number. If a line is updated, you need to add or remove records from the PackedItem table so that there is always a 1:1 correlation.
When the time comes to ship, you can simply
SELECT TOP 160 * FROM PackedItem WHERE trans_id = 4 AND box_number IS NULL
and set the box_number on those records to the next available box number, until no records remain where the box_number is NULL. This is possible using one fairly complicated UPDATE statement inside a WHILE loop - which I don't have the time to construct fully, but a rough sketch follows.
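Something along these lines might work (an untested sketch; it keeps the answer's example of trans_id = 4 and assigns box numbers 160 records at a time through an updatable CTE):
DECLARE @box INT = 1;

WHILE EXISTS (SELECT 1 FROM PackedItem WHERE trans_id = 4 AND box_number IS NULL)
BEGIN
    ;WITH next_box AS (
        SELECT TOP (160) *
        FROM PackedItem
        WHERE trans_id = 4 AND box_number IS NULL
        ORDER BY item_id, packed_item_id -- keeps units of the same item together where possible
    )
    UPDATE next_box SET box_number = @box;

    SET @box = @box + 1;
END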
You can now easily get your desired packing list by querying this table as follows:
SELECT box_number, item_id, COUNT(*) AS Qty
FROM PackedItem
WHERE trans_id = 4
GROUP BY box_number, item_id
Advantages: easy to understand, fairly easy to implement.
Pitfalls: if the table gets out of sync with the lines on the transaction, the final result can be wrong; this table will get many records in it and will be extra work for the server. Each ID field will need to be indexed to keep performance good (see the sketch below).
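For instance, something like this (index names are placeholders):
-- supports the trans_id + box_number lookups used above
CREATE INDEX IX_PackedItem_trans_box ON PackedItem (trans_id, box_number);
-- supports grouping and joining on item_id
CREATE INDEX IX_PackedItem_item ON PackedItem (item_id);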