Interview question: SQL recursion in one statement

Someone I know went to an interview and was given the following problem to solve. I've thought about it for a few hours and believe that it's not possible to do without using some database-specific extensions or features from recent standards that don't have wide support yet.
I don't remember the story behind what is being represented, but it's not relevant. In simple terms, you're trying to represent chains of unique numbers:
chain 1: 1 -> 2 -> 3
chain 2: 42 -> 78
chain 3: 4
chain 4: 7 -> 8 -> 9
...
This information is already stored for you in the following table structure:
id | parent
---+-------
 1 | NULL
 2 | 1
 3 | 2
42 | NULL
78 | 42
 4 | NULL
 7 | NULL
 8 | 7
 9 | 8
There could be millions of such chains and each chain can have an unlimited number of entries. The goal is to create a second table that would contain the exact same information, but with a third column that contains the starting point of the chain:
id | parent | start
---+--------+------
 1 | NULL   | 1
 2 | 1      | 1
 3 | 2      | 1
42 | NULL   | 42
78 | 42     | 42
 4 | NULL   | 4
 7 | NULL   | 7
 8 | 7      | 7
 9 | 8      | 7
The claim (made by the interviewers) is that this can be achieved with just 2 SQL queries. The hint they provide is to first populate the destination table (I'll call it dst) with the root elements, like so:
INSERT INTO dst SELECT id, parent, id FROM src WHERE parent IS NULL
This will give you the following content:
id | parent | start
---+--------+------
 1 | NULL   | 1
42 | NULL   | 42
 4 | NULL   | 4
 7 | NULL   | 7
They say that you can now execute just one more query to get to the goal shown above.
In my opinion, you can do one of two things. Use recursion in the source table to get to the front of each chain, or continuously execute some version of SELECT start FROM dst WHERE dst.id = src.parent after each update to dst (i.e. can't cache the results).
I don't think either of these approaches is supported by common databases like MySQL, PostgreSQL, SQLite, etc. I do know that in PostgreSQL 8.4 you can achieve recursion using a WITH RECURSIVE query, and that in Oracle you have the START WITH and CONNECT BY clauses. The point is that these things are specific to database vendor and version.
Is there any way to achieve the desired result using regular SQL-92 in just one query? The best I could do is fill in the start column for the first child with the following (a LEFT JOIN can achieve the same result):
INSERT INTO dst
SELECT s.id, s.parent,
       (SELECT start FROM dst AS d WHERE d.id = s.parent) AS start
FROM src AS s
WHERE s.parent IS NOT NULL
If there was some way to re-execute the inner select statement after each insert into dst, then the problem would be solved.
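For illustration, here is a minimal sketch of that idea, assuming you are allowed to loop from client code (so it is not the single query the interviewers asked for): re-run one INSERT until it affects zero rows, one pass per chain level. Note that some engines (e.g. older MySQL) restrict selecting from the table you are inserting into.
-- Repeat until zero rows are inserted: each pass copies the next
-- level of children whose parents are already present in dst.
INSERT INTO dst
SELECT s.id, s.parent, d.start
FROM src AS s
INNER JOIN dst AS d ON d.id = s.parent
WHERE NOT EXISTS (SELECT 1 FROM dst AS d2 WHERE d2.id = s.id)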

It cannot be implemented in any static SQL that follows ANSI SQL-92.
But, as you said, it can easily be implemented with Oracle's CONNECT BY:
SELECT id,
       parent,
       CONNECT_BY_ROOT id AS start
FROM src
START WITH parent IS NULL
CONNECT BY PRIOR id = parent
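For comparison, a sketch of the same query in the SQL:1999 WITH RECURSIVE syntax the question mentions (works on PostgreSQL 8.4+ and other engines with recursive CTE support; table and column names taken from the question):
WITH RECURSIVE chains (id, parent, start) AS
(
  -- anchor: the roots of each chain
  SELECT id, parent, id
  FROM src
  WHERE parent IS NULL
  UNION ALL
  -- recursive step: children inherit the root of their parent
  SELECT s.id, s.parent, c.start
  FROM src AS s
  JOIN chains AS c ON s.parent = c.id
)
SELECT id, parent, start
FROM chains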

In SQL Server you would use a Common Table Expression (CTE).
To replicate the stored data, I've created a temporary table:
-- Create a temporary table
CREATE TABLE #SourceData
(
ID INT
, Parent INT
)
-- Setup data (ID, Parent, KeyField)
INSERT INTO #SourceData VALUES (1, NULL);
INSERT INTO #SourceData VALUES (2, 1);
INSERT INTO #SourceData VALUES (3, 2);
INSERT INTO #SourceData VALUES (42, NULL);
INSERT INTO #SourceData VALUES (78, 42);
INSERT INTO #SourceData VALUES (4, NULL);
INSERT INTO #SourceData VALUES (7, NULL);
INSERT INTO #SourceData VALUES (8, 7);
INSERT INTO #SourceData VALUES (9, 8);
Then I create the CTE to compile the data result:
-- Perform CTE
WITH RecursiveData (ID, Parent, Start) AS
(
-- Base query
SELECT ID, Parent, ID AS Start
FROM #SourceData
WHERE Parent IS NULL
UNION ALL
-- Recursive query
SELECT s.ID, s.Parent, rd.Start
FROM #SourceData AS s
INNER JOIN RecursiveData AS rd ON s.Parent = rd.ID
)
SELECT * FROM RecursiveData
Which will output the following (row order may vary):
id | parent | start
---+--------+------
 1 | NULL   | 1
42 | NULL   | 42
 4 | NULL   | 4
 7 | NULL   | 7
 2 | 1      | 1
78 | 42     | 42
 8 | 7      | 7
 3 | 2      | 1
 9 | 8      | 7
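If the goal is to populate a dst table as in the question, the same CTE can drive the interviewers' "second query" directly. A sketch, assuming dst already exists and has been seeded with the root rows by the first query:
WITH RecursiveData (ID, Parent, Start) AS
(
-- Base query
SELECT ID, Parent, ID AS Start
FROM #SourceData
WHERE Parent IS NULL
UNION ALL
-- Recursive query
SELECT s.ID, s.Parent, rd.Start
FROM #SourceData AS s
INNER JOIN RecursiveData AS rd ON s.Parent = rd.ID
)
INSERT INTO dst (id, parent, start)
SELECT ID, Parent, Start
FROM RecursiveData
WHERE Parent IS NOT NULL;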
Then I clean up :)
-- Clean up
DROP TABLE #SourceData

There is no recursive query support in ANSI SQL-92, because it was added in ANSI SQL-99. Oracle has had its own recursive query syntax (CONNECT BY) since v2, and has supported the WITH clause since 9i, but SQL Server is the first I knew of to support the recursive WITH/CTE syntax -- Oracle didn't add it until 11gR2. PostgreSQL added support in 8.4+. MySQL has had a feature request open for WITH support since 2006, and I highly doubt you'll see it in SQLite.
The example you gave is only two levels deep, so you could use:
INSERT INTO dst
SELECT a.id,
       a.parent,
       COALESCE(c.id, b.id) AS start
FROM SRC a
LEFT JOIN SRC b ON b.id = a.parent
LEFT JOIN SRC c ON c.id = b.parent
WHERE a.parent IS NOT NULL
You'd have to add a LEFT JOIN for the number of levels deep, and add them in proper sequence to the COALESCE function.
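For instance, handling chains up to four nodes deep would look like this sketch, with the new, deepest alias placed first in the COALESCE:
INSERT INTO dst
SELECT a.id,
       a.parent,
       COALESCE(d.id, c.id, b.id) AS start
FROM SRC a
LEFT JOIN SRC b ON b.id = a.parent
LEFT JOIN SRC c ON c.id = b.parent
LEFT JOIN SRC d ON d.id = c.parent
WHERE a.parent IS NOT NULL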

Related

HQL, insert two rows if a condition is met

I have the following table called table_persons in Hive:
+--------+------+------------+
| people | type | date |
+--------+------+------------+
| lisa | bot | 19-04-2022 |
| wayne | per | 19-04-2022 |
+--------+------+------------+
If type is "bot", I have to add two rows to the table d1_info; if type is "per", I only have to add one row, so the result is the following:
+---------+------+------------+
| db_type | info | date |
+---------+------+------------+
| x_bot | x | 19-04-2022 |
| x_bnt | x | 19-04-2022 |
| x_per | b | 19-04-2022 |
+---------+------+------------+
How can I add two rows when this condition is met?
With a CASE WHEN, maybe?
You may try using a union to duplicate the rows with type bot. The following example unions a first query, which selects all records, with a second query that selects only the bot records.
Edit
In response to the edited question, I have added a parity column (storing 1 or 0) named original to differentiate the duplicated entries.
SELECT
    p1.*,
    1 as original
FROM
    table_persons p1
UNION ALL
SELECT
    p1.*,
    0 as original
FROM
    table_persons p1
WHERE p1.type='bot'
You may then insert this into your other table d1_info, using the above query as a subquery or CTE and applying the desired transformations with CASE expressions, e.g.
INSERT INTO d1_info
    (`db_type`, `info`, `date`)
WITH merged_data AS (
    SELECT
        p1.*,
        1 as original
    FROM
        table_persons p1
    UNION ALL
    SELECT
        p1.*,
        0 as original
    FROM
        table_persons p1
    WHERE p1.type='bot'
)
SELECT
    CONCAT('x_', CASE
        WHEN m1.type='per' THEN m1.type
        WHEN m1.original=1 AND m1.type='bot' THEN m1.type
        ELSE 'bnt'
    END) as db_type,
    CASE
        WHEN m1.type='per' THEN 'b'
        ELSE 'x'
    END as info,
    m1.date
FROM
    merged_data m1
ORDER BY m1.people, m1.date;
See working demo db fiddle here
I think what you want is to create a new table that captures your logic. This would simplify your query and let you easily add new types without having to edit the logic of a CASE expression. It may also make your logic cleaner to review later.
CREATE TABLE table_persons (
`people` VARCHAR(5),
`type` VARCHAR(3),
`date` VARCHAR(10)
);
INSERT INTO table_persons
VALUES
('lisa', 'bot', '19-04-2022'),
('wayne', 'per', '19-04-2022');
CREATE TABLE info (
`type` VARCHAR(5),
`db_type` VARCHAR(5),
`info` VARCHAR(1)
);
insert into info
values
('bot', 'x_bot', 'x'),
('bot', 'x_bnt', 'x'),
('per','x_per','b');
and then you can easily do a join:
select
    info.db_type,
    info.info,
    persons.date
from
    table_persons persons
    inner join info on info.type = persons.type
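A sketch of feeding that join straight into the target table (assuming d1_info has exactly the three columns shown in the question):
-- populate d1_info from the lookup join above
INSERT INTO d1_info (`db_type`, `info`, `date`)
SELECT
    info.db_type,
    info.info,
    persons.date
FROM
    table_persons persons
    INNER JOIN info ON info.type = persons.type;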

SQL Server Roll-up (Sum) Value from associative table

I have the following situation: a “Client” table (parent) and an “Opportunity” table (child). (See example below.)
Client Table
| Id | Name
------------------
| 1  | Client A
| 2  | Client B
| 3  | Client C
Opportunity Table
| Id | ClientId | Value
---------------------------------
| 10 | 1        | 1000
| 11 | 1        | 3000
| 12 | 2        | 1500
| 13 | 3        | 2000
I want to show the total of all opportunity values (OppValue) on each client record.
Expected Output
| Id | Name     | OppValue
-----------------------------
| 1  | Client A | 4000
| 2  | Client B | 1500
| 3  | Client C | 2000
The business requirement is to filter on “OppValue” (greater than, less than, null, etc.), but not by opportunity create date. We expect users to add about 500 clients and 45,000 new opportunities each year. Based on the above, I can think of three options:
Calculate OppValue using SQL Query (Group By or Partition By)
Create View or Calculated Field using UDF
Create a new field in the “Client” table and populate it using Application business logic (outside SQL).
Which of these solutions, in your opinion, will work best in terms of user experience (speed) and maintenance?
In case there is a better suggestion please let me know.
Many thanks in advance.
Start with a view:
create view client_opp as
select c.*, o.oppvalue
from client c outer apply
     (select sum(o2.value) as oppvalue
      from opportunity o2
      where o2.clientid = c.id
     ) o;
Be sure you have an index on Opportunity(ClientId, Value) -- or at least on Opportunity(ClientId). Note that this uses apply quite deliberately, so the view should work well even when used in a query with additional filtering.
If this works performance-wise, then you are done. Other methods using triggers and UDFs require a bit more maintenance in the database. You can definitely use them, but I would recommend waiting to see if this meets your performance needs.
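For reference, a sketch of such an index (the index name is made up; the columns follow the question's schema):
-- covering index for the per-client SUM lookup
CREATE INDEX IX_Opportunity_ClientId_Value
    ON Opportunity (ClientId, Value);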
You can use apply:
create view client_view as
select c.*, ot.OppValue
from ClientTable c cross apply
( select sum(value) as OppValue
from OpportunityTable ot
where ot.ClientId = c.ClientId
) ot;
Try the following query, using an inner join and the sum() aggregate function.
Create table Client
(Id int, Name Varchar(20))
insert into Client values
(1, 'Client A'),
(2, 'Client B'),
(3, 'Client C')
create table Opportunity
(Id int, ClientId int, Value int)
insert into Opportunity values
(10, 1, 1000 ),
(11, 1, 3000 ),
(12, 2, 1500 ),
(13, 3, 2000 )
Select Client.Id, Client.Name, sum(Value) as Value
from Client
inner join Opportunity on Client.Id = Opportunity.ClientId
group by Client.Id, Client.Name
Output
Id  Name      Value
----------------------
1   Client A  4000
2   Client B  1500
3   Client C  2000
To create a view you can use the following create view example.
Syntax:
Create View <ViewName>
as
<View Query>
Example:
create view MyView
as
Select Client.Id, Client.Name, sum(Value) as Value
from Client
inner join Opportunity on Client.Id = Opportunity.ClientId
group by Client.Id, Client.Name
Then select the result from the view created above:
select * from MyView
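The business filter from the question can then be applied directly on top of the view, for example (threshold value chosen arbitrarily):
-- only clients whose opportunity total exceeds some threshold
select * from MyView where Value > 1500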

Include data in a table looking in every insert if there is a match with the table values

I need to insert data from one table into another, but the insert must look at the receiving table to determine whether there is already a match, and if there is, not insert the new row.
So, I have the following tables (NODE_ID refers to the values in NODE1 and NODE2; think of lines, each with two nodes):
Table A:
| ARC | NODE1 | NODE2 | STATE |
| x   | 1     | 2     | A     |
| y   | 2     | 3     | A     |
| z   | 3     | 4     | B     |
Table B:
| NODE_ID | VALUE |
| 1       | N     |
| 2       | N     |
| 3       | N     |
| 4       | N     |
And I want the following result, which relates each NODE_ID to the arcs and writes into the result table the STATE value from the arcs table, with only one row per node (otherwise I would have more than one row for the same node):
Table C result:
| NODE_ID | STATE    |
| 1       | A        |
| 2       | A        |
| 3       | A (or B) |
I tried to do this with a CASE expression, EXISTS, IF, NVL2(), and so on in the select, but have had no result so far.
Any idea about how I could write this query?
Thank you very much for your help
OK, I'm editing my message to explain how I finally did it. I've also changed my first message a little to make it clearer, because we had problems with that.
So in the end I used this query, which @mathguy introduced to me:
merge into Table_C c
using (select distinct b.NODE_ID as nodes, a.STATE
       from Table_A a, Table_B b
       where b.NODE_ID = a.NODE1 or b.NODE_ID = a.NODE2) s
on (s.nodes = c.NODE_ID)
when not matched then
  insert (NODE_ID, STATE)
  values (s.nodes, s.STATE)
That's all
This can be done with insert, but often when you update one table with values from another, the merge statement is more powerful (more flexible).
merge into table_c c
using ( select arc, min(state) as state from table_a group by arc ) s
on (s.arc = c.node_id)
when not matched then insert (node_id, state)
values (s.arc, s.state)
;
Thanks to @Boneist and @ThorstenKettner for pointing out several syntax errors (now fixed).
If table C does not yet exist, use a create select statement:
create table c as select arc as node_id, state from a;
In case there can be duplicate arc (not shown in your sample) you'd need aggregation:
create table c as select arc as node_id, min(state) as state from a group by arc;

PostgreSQL 9.3 - Compare two sets of data without duplicating values in first set

I have a group of tables that define some rules that need to be followed, for example:
CREATE TABLE foo.subrules (
subruleid SERIAL PRIMARY KEY,
ruleid INTEGER REFERENCES foo.rules(ruleid),
subrule INTEGER,
barid INTEGER REFERENCES foo.bars(barid)
);
INSERT INTO foo.subrules(ruleid,subrule,barid) VALUES
(1,1,1),
(1,1,2),
(1,2,2),
(1,2,3),
(1,2,4),
(1,3,3),
(1,3,4),
(1,3,5),
(1,3,6),
(1,3,7);
What this is defining is a set of "subrules" that need to be satisfied... if all "subrules" are satisfied then the rule is also satisfied.
In the above example, "subruleid" 1 can be satisfied with a "barid" value of 1 or 2.
Additionally, "subruleid" 2 can be satisfied with a "barid" value of 2, 3, or 4.
Likewise, "subruleid" 3 can be satisfied with a "barid" value of 3, 4, 5, 6, or 7.
I also have a data set that looks like this:
primarykey | resource | barid
------------|------------|------------
1 | A | 1
2 | B | 2
3 | C | 8
The tricky part is that once a "subrule" is satisfied with a "resource", that "resource" can't satisfy any other "subrule" (even if the same "barid" would satisfy the other "subrule")
So, what I need is to evaluate and return the following results:
ruleid | subrule | barid | primarykey | resource
------------|------------|------------|------------|------------
1 | 1 | 1 | 1 | A
1 | 1 | 2 | NULL | NULL
1 | 2 | 2 | 2 | B
1 | 2 | 3 | NULL | NULL
1 | 2 | 4 | NULL | NULL
1 | 3 | 3 | NULL | NULL
1 | 3 | 4 | NULL | NULL
1 | 3 | 5 | NULL | NULL
1 | 3 | 6 | NULL | NULL
1 | 3 | 7 | NULL | NULL
NULL | NULL | NULL | 3 | C
Interestingly, if "primarykey" 3 had a "barid" value of 2 (instead of 8) the results would be identical.
I have tried several methods, including a plpgsql function that groups by "subruleid" with ARRAY_AGG(barid), builds an array from barid, and checks in a loop whether each element of the barid array is in the "subruleid" group, but it just doesn't feel right.
Is a more elegant or efficient option available?
The following fragment finds solutions, if there are any. The number of resources (three) is hardcoded. If only one solution is needed, some symmetry-breaker should be added.
If the number of resources is not bounded, I think there could be a solution by enumerating all possible tableaux (Hilbert? mixed-radix?) and selecting from them after pruning the non-satisfying ones.
-- the data
CREATE TABLE subrules
( subruleid SERIAL PRIMARY KEY
, ruleid INTEGER -- REFERENCES foo.rules(ruleid),
, subrule INTEGER
, barid INTEGER -- REFERENCES foo.bars(barid)
);
INSERT INTO subrules(ruleid,subrule,barid) VALUES
(1,1,1), (1,1,2),
(1,2,2), (1,2,3), (1,2,4),
(1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7);
CREATE TABLE resources
( primarykey INTEGER NOT NULL PRIMARY KEY
, resrc varchar
, barid INTEGER NOT NULL
);
INSERT INTO resources(primarykey,resrc,barid) VALUES
(1, 'A', 1) ,(2, 'B', 2) ,(3, 'C', 8)
-- ################################
-- uncomment next line to find a (two!) solution(s)
-- ,(4, 'D', 7)
;
-- all matching pairs of subrules <--> resources
WITH pairs AS (
SELECT sr.subruleid, sr.ruleid, sr.subrule, sr.barid
, re.primarykey, re.resrc
FROM subrules sr
JOIN resources re ON re.barid = sr.barid
)
SELECT
p1.ruleid AS ru1 , p1.subrule AS sr1 , p1.resrc AS one
, p2.ruleid AS ru2 , p2.subrule AS sr2 , p2.resrc AS two
, p3.ruleid AS ru3 , p3.subrule AS sr3 , p3.resrc AS three
-- self-join the pairs, excluding the ones that
-- use the same subrule or resource
FROM pairs p1
JOIN pairs p2 ON p2.primarykey > p1.primarykey -- tie-breaker
JOIN pairs p3 ON p3.primarykey > p2.primarykey -- tie breaker
WHERE 1=1
AND p2.subruleid <> p1.subruleid
AND p2.subruleid <> p3.subruleid
AND p3.subruleid <> p1.subruleid
;
Result (after uncommenting the line with the missing resource):
ru1 | sr1 | one | ru2 | sr2 | two | ru3 | sr3 | three
-----+-----+-----+-----+-----+-----+-----+-----+-------
1 | 1 | A | 1 | 1 | B | 1 | 3 | D
1 | 1 | A | 1 | 2 | B | 1 | 3 | D
(2 rows)
The resources {A,B,C} could of course be hard-coded, but that would prevent the 'D' record (or any other) from serving as the missing link.
Since you are not clarifying the question, I am going with my own assumptions.
subrule numbers are ascending without gaps for each rule.
(subrule, barid) is UNIQUE in table subrules.
If there are multiple resources for the same barid, assignments are arbitrary among these peers.
As commented, the number of resources matches the number of subrules (which has no effect on my suggested solution).
The algorithm is as follows:
Pick the subrule with the smallest subrule number.
Assign a resource to the lowest barid possible (the first that has a matching resource), which consumes the resource.
After the first resource is matched, skip to the next higher subruleid and repeat 2.
Append all remaining resources after last subrule.
You can implement this with pure SQL using a recursive CTE:
WITH RECURSIVE cte AS ((
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN '{}'::int[]
ELSE ARRAY[r.resourceid] END AS consumed
FROM subrules s
LEFT JOIN resource r USING (barid)
WHERE s.ruleid = 1
ORDER BY s.subrule, r.barid, s.barid
LIMIT 1
)
UNION ALL (
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN c.consumed
ELSE c.consumed || r.resourceid END
FROM cte c
JOIN subrules s ON s.subrule = c.subrule + 1
LEFT JOIN resource r ON r.barid = s.barid
AND r.resourceid <> ALL (c.consumed)
ORDER BY r.barid, s.barid
LIMIT 1
))
SELECT ruleid, subrule, barid, resourceid, resource FROM cte
UNION ALL -- add unused rules
SELECT s.ruleid, s.subrule, s.barid, NULL, NULL
FROM subrules s
LEFT JOIN cte c USING (subruleid)
WHERE c.subruleid IS NULL
UNION ALL -- add unused resources
SELECT NULL, NULL, r.barid, r.resourceid, r.resource
FROM resource r
LEFT JOIN cte c USING (resourceid)
WHERE c.resourceid IS NULL
ORDER BY subrule, barid, resourceid;
Returns exactly the result you have been asking for.
SQL Fiddle.
Explanation
It's basically an implementation of the algorithm laid out above.
Only take a single match on a single barid per subrule. Hence the LIMIT 1, which requires the additional parentheses.
Collect "consumed" resources in the array consumed and exclude them from repeated assignment with r.resourceid <> ALL (c.consumed). Note in particular how I avoid NULL values in the array, which would break the test.
The CTE only returns matched rows. Add rules and resources without match in the outer SELECT to get the complete result.
Or you open two cursors on the tables subrule and resource and implement the algorithm with any decent programming language (including PL/pgSQL).
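For completeness, here is a rough, untested PL/pgSQL sketch of that procedural alternative, using the same table and column names as the recursive CTE above and the same ordering for tie-breaks. It returns only the matched pairs; unmatched subrules and leftover resources would still be appended with outer queries as before.
CREATE OR REPLACE FUNCTION assign_resources(p_ruleid int)
  RETURNS TABLE (o_subrule int, o_resourceid int)
  LANGUAGE plpgsql AS
$$
DECLARE
   consumed int[] := '{}';
BEGIN
   -- walk the subrules of this rule in ascending order
   FOR o_subrule IN
      SELECT DISTINCT s.subrule
      FROM   subrules s
      WHERE  s.ruleid = p_ruleid
      ORDER  BY s.subrule
   LOOP
      -- lowest matching barid that still has an unconsumed resource
      SELECT r.resourceid INTO o_resourceid
      FROM   subrules s
      JOIN   resource r ON r.barid = s.barid
      WHERE  s.ruleid  = p_ruleid
      AND    s.subrule = o_subrule
      AND    r.resourceid <> ALL (consumed)
      ORDER  BY r.barid
      LIMIT  1;

      IF o_resourceid IS NOT NULL THEN
         consumed := consumed || o_resourceid;  -- consume the resource
         RETURN NEXT;                           -- emit the matched pair
      END IF;
   END LOOP;
END;
$$;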

In SQL, what's the difference between count(column) and count(*)?

I have the following query:
select column_name, count(column_name)
from table
group by column_name
having count(column_name) > 1;
What would be the difference if I replaced all calls to count(column_name) to count(*)?
This question was inspired by How do I find duplicate values in a table in Oracle?.
To clarify the accepted answer (and maybe my question), replacing count(column_name) with count(*) would return an extra row in the result that contains a null and the count of null values in the column.
count(*) counts NULLs and count(column) does not
[edit] added this code so that people can run it
create table #bla(id int,id2 int)
insert #bla values(null,null)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,null)
select count(*),count(id),count(id2)
from #bla
results
7 3 2
Another minor difference, between using * and a specific column, is that in the column case you can add the keyword DISTINCT, and restrict the count to distinct values:
select column_a, count(distinct column_b)
from table
group by column_a
having count(distinct column_b) > 1;
A further and perhaps subtle difference is that in some database implementations the count(*) is computed by looking at the indexes on the table in question rather than the actual data rows. Since no specific column is specified, there is no need to bother with the actual rows and their values (as there would be if you counted a specific column). Allowing the database to use the index data can be significantly faster than making it count "real" rows.
The explanation in the docs, helps to explain this:
COUNT(*) returns the number of items in a group, including NULL values and duplicates.
COUNT(expression) evaluates expression for each row in a group and returns the number of nonnull values.
So count(*) includes nulls, the other method doesn't.
We can use the Stack Exchange Data Explorer to illustrate the difference with a simple query. The Users table in Stack Overflow's database has columns that are often left blank, like the user's Website URL.
-- count(column_name) vs. count(*)
-- Illustrates the difference between counting a column
-- that can hold null values, a 'not null' column, and count(*)
select count(WebsiteUrl), count(Id), count(*) from Users
If you run the query above in the Data Explorer, you'll see that the count is the same for count(Id) and count(*) because the Id column doesn't allow null values. The WebsiteUrl count is much lower, though, because that column allows null.
COUNT(*) tells SQL Server to count all the rows in a table, including NULLs.
COUNT(column_name) counts only the rows that have a non-null value in that column.
Please see the following code for test executions on SQL Server 2008:
-- Variable table
DECLARE @Table TABLE
(
    CustomerId int NULL
    , Name nvarchar(50) NULL
)
-- Insert some records for tests
INSERT INTO @Table VALUES( NULL, 'Pedro')
INSERT INTO @Table VALUES( 1, 'Juan')
INSERT INTO @Table VALUES( 2, 'Pablo')
INSERT INTO @Table VALUES( 3, 'Marcelo')
INSERT INTO @Table VALUES( NULL, 'Leonardo')
INSERT INTO @Table VALUES( 4, 'Ignacio')
-- Count all rows by indicating *
SELECT COUNT(*) AS 'AllRowsCount'
FROM @Table
-- Count only rows with content ( exclude NULLs )
SELECT COUNT(CustomerId) AS 'OnlyNotNullCounts'
FROM @Table
COUNT(*) – Returns the total number of records in a table (including NULL-valued records).
COUNT(Column Name) – Returns the total number of non-NULL records; that is, it ignores NULL-valued records in that particular column.
Basically, COUNT(*) counts all the rows of a table, whereas COUNT(COLUMN_NAME) does not: it excludes null values, as everyone here has also answered.
But the most interesting part is that, to keep queries and the database optimized, it is better to use COUNT(*) rather than COUNT(COLUMN_NAME) unless you are doing multiple counts or a complex query. Otherwise, it can really lower your DB performance when dealing with a huge amount of data.
Further elaborating upon the answers given by @SQLMenace and @Brannon, making use of the GROUP BY clause, which was mentioned by the OP but is not present in @SQLMenace's answer:
CREATE TABLE table1 (
id INT
);
INSERT INTO table1 VALUES
(1),
(2),
(NULL),
(2),
(NULL),
(3),
(1),
(4),
(NULL),
(2);
SELECT * FROM table1;
+------+
| id |
+------+
| 1 |
| 2 |
| NULL |
| 2 |
| NULL |
| 3 |
| 1 |
| 4 |
| NULL |
| 2 |
+------+
10 rows in set (0.00 sec)
SELECT id, COUNT(*) FROM table1 GROUP BY id;
+------+----------+
| id | COUNT(*) |
+------+----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 3 |
| 3 | 1 |
| 4 | 1 |
+------+----------+
5 rows in set (0.00 sec)
Here, COUNT(*) counts the number of occurrences of each type of id including NULL
SELECT id, COUNT(id) FROM table1 GROUP BY id;
+------+-----------+
| id | COUNT(id) |
+------+-----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 0 |
| 3 | 1 |
| 4 | 1 |
+------+-----------+
5 rows in set (0.00 sec)
Here, COUNT(id) counts the number of occurrences of each type of id but does not count the number of occurrences of NULL
SELECT id, COUNT(DISTINCT id) FROM table1 GROUP BY id;
+------+--------------------+
| id | COUNT(DISTINCT id) |
+------+--------------------+
| NULL | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
+------+--------------------+
5 rows in set (0.00 sec)
Here, COUNT(DISTINCT id) counts the number of occurrences of each type of id only once (does not count duplicates) and also does not count the number of occurrences of NULL
It is best to use
Count(1) in place of a column name or *
to count the number of rows in a table; it is faster than any other format because it never goes to check whether the column name exists in the table.
There is no difference if one column is fixed in your table; if you want to use more than one column, then you have to specify how many columns you need to count.
Thanks,
As mentioned in the previous answers, Count(*) counts even the rows with NULL values, whereas count(Columnname) counts only the rows where the column has a value.
It's always best practice to avoid * (Select *, count *, …)