SQL Query Problems With Counting Frequency - sql

I've been trying to list all yelp users who haven't reviewed any businesses but have provided at least 2 comments on other user's reviews for the following table:
But I've been having some issues. These issues mainly are derived from my attempts to count the elements listed as a varchar. For example, the questions states that I need to return users who have commented on atleast two other user's reviews. Currently I have the List_Of_Comments stored as a varchar with the characters looking like the following: "Y3, Y2". How am I supposed to determine how often a user posts a comment through a varchar?This is what I have so far:
SELECT U.YELP_ID FROM REVIEWS R, YELP_USER U
WHERE R.Author = U.YELP_ID AND R.Author = NULL AND R.Number_Of_Comments >= 2;
Assuming the following tables:
CREATE TABLE REVIEWS (
REVIEW_ID VARCHAR(3),
Stars INT,
Author VARCHAR(3),
Publish_Date VARCHAR(22),
BUSSINESS_ID VARCHAR(3),
List_Of_Comments VARCHAR(7),
Number_Of_Comments INT
);
CREATE TABLE YELP_USER (
YELP_ID VARCHAR(3),
Email VARCHAR(17),
First_Name VARCHAR(8),
Last_Name VARCHAR(17),
DOB DATE,
BirthPlace VARCHAR(3),
Gender VARCHAR(1),
Friendlist VARCHAR(9),
Complimented_Friendlist VARCHAR(6),
Checkedin_Businesses VARCHAR(36)
);
If anyone could help me figure this out I would greatly appreciate it. I've been stuck on this for hours. Thanks!

To answer what I think you are asking... How to count the number of entries in a comma separated list:
Oracle Setup:
INSERT INTO REVIEWS VALUES ( 1, 1, 'A1', DATE '2016-02-02', 'B1', 'C1,C2', NULL );
INSERT INTO REVIEWS VALUES ( 2, 1, 'A2', DATE '2016-02-01', 'B1', 'C3', NULL );
INSERT INTO REVIEWS VALUES ( 3, 1, 'A3', DATE '2016-02-01', 'B1', NULL, NULL );
Query:
SELECT REVIEW_ID,
COALESCE( REGEXP_COUNT( List_of_comments, '[^,]+' ), 0 ) AS Number_of_comments
FROM REVIEWS;
Results:
REVIEW_ID NUMBER_OF_COMMENTS
--------- ------------------
1 2
2 1
3 0
A better solution:
Storing it how you are doing with a VARCHAR2(7) column for a list of comments will only allow you to store, at most, 4 comment IDs (if each ID is a single character).
It would be better to move them to their own tables using something like:
CREATE TABLE REVIEW_COMMENTS (
COMMENT_ID NUMBER(8,0) PRIMARY KEY,
REVIEW_ID VARCHAR2(3) REFERENCES REVIEWS( REVIEW_ID ),
YELP_ID VARCHAR2(3) REFERENCES YELP_USER( YELP_ID ),
COMMENT_VALUE VARCHAR2(140)
);
COMMENT ON TABLE REVIEW_COMMENTS IS 'The comments on a review by a user.';
COMMENT ON COLUMN REVIEW_COMMENTS( COMMENT_ID ) IS 'A unique identifier for the comment by a user on a review.';
COMMENT ON COLUMN REVIEW_COMMENTS( REVIEW_ID ) IS 'The identifier for the review the comment was left against.';
COMMENT ON COLUMN REVIEW_COMMENTS( YELP_ID ) IS 'The identifier for the user who left the comment.';
COMMENT ON COLUMN REVIEW_COMMENTS( COMMENT_VALUE ) IS 'The text of the comment.';
Also, do not store dates as VARCHAR2 column.

Your data structure only allows 2 comments for each review (at least 2 chars for each review, plus two commas. Any other comment will not fit in 7 chars), but assuming that's what you want, you could try to get all the users that are not in the review table
... from yelp_users yu
where not exists in (select 1 from reviews r where r.author = yu.yelp_id)
and his id is on the list of comments. I would search it using instr:
and exists (
select 1
from reviews r
where instr(',' || r.list_of_comments || ',', ',' || yu.yelp_id || ',' , 1, 1) > 0)
I've concatenated ',' to avoid you the case where you look for Y1 and end up getting a false positive when Y11 is the one that commented.
As your goal is to get users who commented at least twice, you could move the review table to from and put all the SQL in a subquery, grouping the user id on the external SQL.
=)

Related

How to I get distinct combinations of one XRef column related to any value in the other XRef column

I need to select the count of unique value combinations of column B in an XRef table which is grouped by column A.
Consider the following schema and data, which represents a simple family structure. Each child has a father and mother:
TABLE Father
FatherID
Name
1
Alex
2
Bob
TABLE Mother
MotherID
Name
1
Alice
2
Barbara
TABLE Child
ChildID
FatherID
MotherID
Name
1
1 (Alex)
1 (Alice)
Adam
2
1 (Alex)
1 (Alice)
Billy
3
1 (Alex)
2 (Barbara)
Celine
4
2 (Bob)
2 (Barbara)
Derek
The distinct combinations of mothers for each father are:
Alex (Alice, Barbara)
Bob (Barbara)
In all there are two distinct combinations of mothers:
Alice, Barbara
Barbara
The query I want to write would return the count of those distinct combinations of mother, regardless of which father they are associated with:
UniqueMotherGroups
2
I was able to do this successfully using the STRING_AGG function, but it feels clunky. It also needs to operate over millions of rows and is quite slow at the moment. Is there a more idiomatic way to do this with set operations instead?
Here is my working example:
-- Drop pre-existing tables
DROP TABLE IF EXISTS dbo.Child;
DROP TABLE IF EXISTS dbo.Father;
DROP TABLE IF EXISTS dbo.Mother;
-- Create family tables.
CREATE TABLE dbo.Father
(
FatherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
PRIMARY KEY CLUSTERED (FatherID);
ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Mother
(
MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
PRIMARY KEY CLUSTERED (MotherID);
ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Child
(
ChildID INT NOT NULL
, FatherID INT NOT NULL
, MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
PRIMARY KEY CLUSTERED (ChildID);
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
FOREIGN KEY (FatherID)
REFERENCES dbo.Father (FatherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
FOREIGN KEY (MotherID)
REFERENCES dbo.Mother (MotherID);
-- Insert two children with the same parents
INSERT INTO dbo.Father
(
FatherID
, Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')
INSERT INTO dbo.Mother
(
MotherID
, Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');
INSERT INTO dbo.Child
(
ChildID
, FatherID
, MotherID
, Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');
-- CTE Gets distinct combinations of parents
WITH distinctParentCombinations (FatherID, MotherID)
AS (SELECT children.FatherID
, children.MotherID
FROM dbo.Child as children
GROUP BY children.FatherID
, children.MotherID
)
-- CTE Gets uses STRING_AGG to get unique combinations of mothers.
, motherGroups (Mothers)
AS (SELECT STRING_AGG(CONVERT(VARCHAR(MAX), distinctParentCombinations.MotherID), '-') WITHIN GROUP (ORDER BY distinctParentCombinations.MotherID) AS Mothers
FROM distinctParentCombinations
GROUP BY distinctParentCombinations.FatherID
)
-- Remove the COUNT function to see the actual combinations
SELECT COUNT(motherGroups.Mothers) AS UniqueMotherGroups
FROM motherGroups
-- Clean up the example
DROP TABLE IF EXISTS dbo.Child;
DROP TABLE IF EXISTS dbo.Father;
DROP TABLE IF EXISTS dbo.Mother;
You have a great explanation and setup of your "problem case".
Your setup runs great in (for example) tempdb.
You have solved the problem in a nice way, and I don't think you can optimize it much further if you are going to calculate the mother groups every time you run the query.
There is one small mistake though; You must do a COUNT(DISTINCT motherGroups.Mothers) in your final count.
Since you mention milions of rows, I would suggest a slightly different approach.
If you aggregate the mother groups as soon as there is a change in the Child table, your query can run fast every time - even with millions of rows.
The kind of queries you want to run is seldom run only once, so it would be nice if the heavy work is already done.
Usually I prefer not to use triggers, because you get extra logic in a place where it could be hard to find and debug.
But sometimes triggers are nice to have, especially when you are not able to change the source code running on the clients.
So, my solution is to add a new column to the Father table and to create a trigger which (re)generates the mother group each time there is a change in the Child table.
This way, the hard aggregation work for each father is done as soon there is a change, and you don't have to aggregate when you run your query.
Since you already have millions of rows, we also have to update these existing rows.
I have used SQL Server 2019 for this solution.
*** The solution ***
Add 1 or 2 new columns to the Father table.
If you should add 1 or 2, it depends on what your preferences are:
"Do I want to see the aggregated mother groups for debugging purpose, or do I just trust the hashed values?"
Column 1: Hashed value of the aggregated mother group for each Father row.
The hashed value is VARBINARY and is at least 32 bytes, but we will use VARBINARY(1600):
1600 is less than 1700 which is the max nonclustered index size, so we will not have any problems indexing the column.
Since the hash value is in blocks of 32 bytes, a value of 1600 will cover a really, really, really long aggreated mother group.
-- Column 1: Hashed value of the aggregated mother group for each Father row.
alter table Father add MotherHash varbinary(1600)
create index IX_MotherHash on Father(MotherHash)
Column 2: This column is more optional, and depends on your preferences.
The column could be nice to have for debugging purpose if any questions are made about the result.
Which VARCHAR-length you should use depends on your real data.
MAX? Then you have no problems storing the mother groups, but you might have problems indexing it, since 1700 is the max for an unclustered index. But maybe you don't need to index it?
1700? Then you are able to index the column, but depending on your real data, will this cover the biggest mother group?
Why indexing? If you want to list the aggregated mother groups, it could be faster to read the index than the whole table.
As said; this depends on you (and your data). If we have no need to see the aggregated mother groups, then we don't need this column at all.
For this demo/solution we will add the column for debugging purpose, without any indexing.
-- Column 2: This column is more optional, and depends on your preferences.
alter table Father add MotherGroup varchar(MAX)
go
Create a trigger on the Child table.
It will handle all inserts, updates and deletes in the Child table.
create or alter trigger trIUD_Child on Child
after insert, update, delete
as
begin
set nocount on
-- Get all FatherIDs from the Inserted and Deleted table.
-- An ordinary Temp table is created with a clustered index to get SEEK performance later.
-- The table might also have more than 100 rows, where table variables are not recommended.
declare #numRowsInInsertedDeleted int
create table #rowsInInsertedDeleted(rowId int identity(1, 1), FatherID int)
create unique clustered index ix on #rowsInInsertedDeleted(rowId)
insert #rowsInInsertedDeleted(FatherID)
select distinct f.FatherID
from
(
select i.FatherID from inserted i
union all
select i.FatherID from deleted i
) f
select #numRowsInInsertedDeleted = max(rowId) from #rowsInInsertedDeleted
-- We have to loop each of the FatherIDs, since we might have several rows in the Inserted and Deleted tables.
declare #rowId int = 0
while (#rowId < #numRowsInInsertedDeleted)
begin
-- Get the father for the next row.
select #rowId += 1
declare #fatherId int
select #fatherId = r.FatherID
from #rowsInInsertedDeleted r
where r.rowId = #rowId
-- Aggregate the mothers for this father.
declare #motherGroup varchar(max) = ''
select #motherGroup += ',' + cast(c.MotherID as varchar)
from Child c
where c.FatherID = #fatherId
group by c.MotherID
order by c.MotherID
-- Update the father record.
-- Any empty strings are handled automatically, skip the leading ','.
update Father
set MotherGroup = substring(#motherGroup, 2, 2147483647),
MotherHash = HASHBYTES('SHA2_256', #motherGroup)
where FatherID = #fatherId
end
end
go
Updating existing rows
Since you already have millions of rows, we must aggregate the mother groups for these existing rows.
If you don't have the disk space for logging the update of the whole table, maybe you should take your database out of AG and switch to Simple recovery model for this task?
In that case you should also modify the update with a WHERE clause to update only parts of the table, and run the update for each part until the whole table is updated.
Example: update Child set FatherID = FatherID where FatherID between 1 and 1000000
Note: This update statement could block access to the Child table for other users.
-- Aggregate the mother groups for the existing rows.
-- This could takes minutes to complete, depending on the number of rows.
-- NOTE: This update statement could block access to the Child table for other users.
update Child set FatherID = FatherID
That's it!
You should now be able to quickly get the mother groups on existing rows, and also after future changes in the Child table.
-- Voila - now you can get the unique mother groups any time at a fast speed.
select count(distinct MotherHash) from Father
Thank you for posting such a comprehensive setup for the test data. However, I'm not running any CREATE/DROP statements against my DB so I converted those tables into table variables. Using your data, I came up with the following query. Just change the table names back to your dbo. names and you should be able to test in your environment. I basically concatenate every father/mother combo into a text string using FOR XML PATH. Then I count up all the distinct combos. If you find error in my logic, let me know. I'm just tossing this in the ring of possible solutions.
WITH distinctCombos AS (
SELECT DISTINCT
c.FatherID, c.MotherID
FROM #Child as c
) , motherComboCount AS (
SELECT
f.FatherID
, f.[Name]
, STUFF((
SELECT
',' + CAST(dc.MotherID as nvarchar)
FROM distinctCombos as dc
WHERE dc.FatherID = f.FatherID
ORDER BY dc.MotherID ASC
FOR XML PATH('')
),1,1,'') as motherList
FROM #Father as f
)
SELECT
COUNT(DISTINCT motherList) as UniqueMotherGroups
FROM motherComboCount as mcc
To save a bit of compute power, remove the STUFF function as it's not necessary for the comparison... it just makes the list nicer to look at if displaying... and I'm in the habit of using it.
It looks like the main differences between our methods is the use of FOR XML PATH vs STRING_AGG (I'm still on older SQL.) And I use DISTINCT twice instead of GROUP BY. If you have a larger dataset to test against, let me know how the 2 methods compare. I'm trying to think of a completely set-based method but I can't see it at the moment.
Update: Method 2.
Here's an idea I had using recursive CTEs to build the distinct mother combinations. In your example data, there are only 2 mothers per father. So there would be a total of 4 set-based queries performed (first CTE, 2 queries in the recursive CTE and the final SELECT).
WITH uniqueCombo as (
SELECT DISTINCT
c.FatherID
, c.MotherID
, ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.MotherID) as row_num
FROM #Child as c
), combos as (
SELECT
uc.FatherID
, uc.MotherID
, CAST(uc.MotherID as nvarchar(max)) as [path]
, row_num
, 0 as hierarchy_num
FROM uniqueCombo as uc
WHERE uc.row_num = 1
UNION ALL
SELECT
uc.FatherID
, uc.MotherID
, co.[path] + ',' + CAST(uc.MotherID as nvarchar(max))
, uc.row_num
, co.hierarchy_num + 1 as heirarchy_num
FROM uniqueCombo as uc
INNER JOIN combos as co
ON co.FatherID = uc.FatherID
--AND co.MotherID <> uc.MotherID
AND co.row_num + 1 = uc.row_num
), rankedCombos as (
SELECT
c.[path]
, ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.hierarchy_num DESC) as row_num
FROM combos as c
)
SELECT COUNT(DISTINCT rc.[path]) as UniqueMotherGroups
FROM rankedCombos as rc
WHERE rc.row_num = 1
Update 2:
I had another idea to use a PIVOT to transpose the records so that the FatherID would be in the left-most column with the MotherIDs as the column headers. To make that work with a dynamic list of MotherIDs, you have to use a dynamic PIVOT/dynamic SQL. (FatherID isn't really needed in the PIVOT so it's not included in the PIVOT query. I just had to describe what the goal is.) After the pivot, you can SELECT DISTINCT to get the unique mother combinations. Then the last SELECT is to get the COUNT. This one I ran in SQL Fiddle:
SQL Fiddle
MS SQL Server 2017 Schema Setup:
-- Create family tables.
CREATE TABLE dbo.Father
(
FatherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
PRIMARY KEY CLUSTERED (FatherID);
ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Mother
(
MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
PRIMARY KEY CLUSTERED (MotherID);
ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Child
(
ChildID INT NOT NULL
, FatherID INT NOT NULL
, MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
PRIMARY KEY CLUSTERED (ChildID);
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
FOREIGN KEY (FatherID)
REFERENCES dbo.Father (FatherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
FOREIGN KEY (MotherID)
REFERENCES dbo.Mother (MotherID);
-- Insert two children with the same parents
INSERT INTO dbo.Father
(
FatherID
, Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')
INSERT INTO dbo.Mother
(
MotherID
, Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');
INSERT INTO dbo.Child
(
ChildID
, FatherID
, MotherID
, Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');
Query 1:
DECLARE #cols AS nvarchar(MAX)
DECLARE #query AS nvarchar(MAX)
SET #cols = STUFF((
SELECT DISTINCT ',' + QUOTENAME(m.MotherID)
FROM Mother as m
FOR XML PATH(''))
,1,1,'')
SET #query = 'SELECT COUNT(mCount) as UniqueMotherGroups FROM (
SELECT DISTINCT ' + #cols + ', 1 as mCount FROM (
SELECT ' + #cols + '
FROM (
SELECT
c.FatherID
, c.MotherID
, 1 as mID
FROM child as c
) x
PIVOT
(
MAX(mID)
FOR MotherID in (' + #cols + ')
) p
) as m
) as mg'
--SELECT #query
Exec(#query)
Results:
| UniqueMotherGroups |
|--------------------|
| 3 |
UPDATE 3: Here's one other idea... create a results table with a unique constraint and with IGNORE_DUP_KEY=ON. You could use this in a function or stored procedure, or, setup a trigger to put the mother combinations into a unique-combo-holding-table. With IGNORE_DUP_KEY=ON, you can insert every combo and only the unique combos will remain. Then just do a count of all the rows.
--Create a table to hold the results:
CREATE TABLE results (
ChildID int not null
, UniqueCombos nvarchar(50) not null
PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)
);
--Insert all combos into the results table. The unique constraint will cause only unique entries to remain.
INSERT INTO results (ChildID, UniqueCombos)
SELECT DISTINCT
c.ChildID
, (
SELECT ',' + CAST(MotherID as nvarchar(500))
FROM Child as c2
WHERE c2.ChildID = c.ChildID
ORDER BY c2.MotherID
FOR XML PATH('')
) as mother_combos
FROM Child as c
;
--Count up all the rows in the results table. Since these are all unique combinations, it should be fast to sum.
SELECT COUNT(*)
FROM results;
If you accept to define a maximum number of mothers per father (here 7) you may try:
select count(*) as UniqueMotherGroups from (
select distinct m1, m2, m3, m4, m5, m6, m7 from (
select FatherID, row_number() over(partition by FatherID order by motherid) as rn, motherid
from (
select distinct FatherID, MotherID
from t_Child
)
)
pivot (
max(motherid) for rn in (1 as m1,2 as m2,3 as m3,4 as m4,5 as m5,6 as m6,7 as m7)
)
)
;
UNIQUEMOTHERGROUPS
------------------
3
Here is one idea. Instead of using precise STRING_AGG you can calculate a hash / checksum of the group. You don't need to know the exact composition of the group, you just need to distinguish between different groups. Calculating of the hash may be faster than concatenating strings.
SQL Server has a function CHECKSUM_AGG
You can write your own hashing function with CLR.
Sample data
CREATE TABLE #Child
(
ChildID INT NOT NULL IDENTITY PRIMARY KEY
,FatherID INT NOT NULL
,MotherID INT NOT NULL
,Name VARCHAR(50) NOT NULL
);
INSERT INTO #Child
(
FatherID
,MotherID
,Name
)
VALUES
(1, 1, 'Adam')
,(1, 1, 'Billy')
,(1, 2, 'Celine')
,(2, 2, 'Derek')
,(3, 1, 'Eric')
,(4, 1, 'A')
,(4, 1, 'B')
,(4, 2, 'C')
,(4, 2, 'D')
,(4, 2, 'E')
,(5, 2, 'F')
,(6, 2, 'G')
;
Query
WITH
distinctParentCombinations
AS
(
SELECT
FatherID
,MotherID
FROM #Child
GROUP BY
FatherID
,MotherID
)
,motherGroups
AS
(
SELECT
FatherID
,CHECKSUM_AGG(MotherID) AS MotherGroup
FROM distinctParentCombinations
GROUP BY
FatherID
)
SELECT COUNT(DISTINCT MotherGroup) AS UniqueMotherGroups
FROM motherGroups
;
Result
+--------------------+
| UniqueMotherGroups |
+--------------------+
| 3 |
+--------------------+
You need to compare performance of all methods on your actual data.
Obviously, with CHECKSUM_AGG it is possible that some of the groups will be missed. There is a chance that two different groups will generate the same checksum.
You know better if this is acceptable.
General way to speed up calculations is to have some of the results already pre-calculated. In your case, for the first part you can create indexed view as follows:
CREATE OR ALTER VIEW vw_distinctParentCombinations WITH SCHEMABINDING AS
SELECT children.FatherID
, children.MotherID
,COUNT_BIG(*) AS [wifes_count]
FROM dbo.Child as children
GROUP BY children.FatherID
, children.MotherID
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_distinctParentCombinations ON vw_distinctParentCombinations
(
FatherID,MotherID
);
Then in your initial query, you can avoid the first CTE:
-- CTE Gets distinct combinations of parents
WITH motherGroups (Mothers)
AS
(SELECT STRING_AGG(CONVERT(VARCHAR(MAX), distinctParentCombinations.MotherID), '-') WITHIN GROUP (ORDER BY distinctParentCombinations.MotherID) AS Mothers
FROM vw_distinctParentCombinations distinctParentCombinations WITH(NOEXPAND)
GROUP BY distinctParentCombinations.FatherID
)
-- Remove the COUNT function to see the actual combinations
SELECT COUNT(motherGroups.Mothers) AS UniqueMotherGroups
FROM motherGroups;
This will avoid the initial read of the large table and depending the distinct combinations of the pairs (father - mother) it can reduce the view size significantly.
Unfortunately, there are a lot of limitations in order to create an indexed view, and you are not able to create such for the second CTE.
If we change our mind and look this issue in different view, simply we can get the count of mothers with this query:
SELECT Count(distinct ConcatMothers) UniqueMothersCount from(
SELECT FatherID, concat(FatherID,'-',SUM(MotherID)) ConcatMothers
FROM dbo.Child
GROUP BY FatherID) t;
Or even you can use Dense_Rank() like this:
SELECT Max(RankMothers) UniqueMothersCount from(
SELECT FatherID, DENSE_RANK() over (order by concat(FatherID,'-',SUM(MotherID))) RankMothers
FROM dbo.Child
GROUP BY FatherID) t;
For the performance it is hard to measure because dataset is small but since we have one column in the group by and the motherId is in the select maybe we can change index as below:
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID) Include(MotherID);
but you need to check it on your dataset.

Unpivot multiple columns in Snowflake

I have a table that looks as follows:
I need to unpivot the Rating and the Comments as follows:
What is the best way to do this in Snowflake?
Note: there are some cells in the comment columns that are NULL
Adding details:
create or replace table reviews(name varchar(50), acting_rating int, acting_comments text, comedy_rating int, comedy_comments text);
insert into reviews values
('abc', 4, NULL, 1, 'NO'),
('xyz', 3, 'some', 1, 'haha'),
('lmn', 1, 'what', 4, NULL);
select * from reviews;
select name, skill, skill_rating, comments
from reviews
unpivot(skill_rating for skill in (acting_rating, comedy_rating))
unpivot(comments for skill_comments in (acting_comments,comedy_comments))
--Following where clause is added to filter the irrelevant comments due to multiple unpivots
where substr(skill,1,position('_',skill)-1) = substr(skill_comments,1,position('_',skill_comments)-1)
order by name;
will produce produce the desired results, but with data that has NULLs, the unpivoted rows that have NULLs go missing from the output:
NAME SKILL SKILL_RATING COMMENTS
abc COMEDY_RATING 1 NO
lmn ACTING_RATING 1 what
xyz ACTING_RATING 3 some
xyz COMEDY_RATING 1 haha
If all you need to solve is for the table specified in the question - you can do it manually with a set of UNION ALL:
select NAME
, 'ACTING_RATING' as SKILL, ACTING_RATING as SKILL_RATING, ACTING_COMMENTS as SKILL_COMMENTS
from DATA
union all
select NAME
, 'COMEDY_RATING', COMEDY_RATING, COMEDY_COMMENTS
from DATA
union all
select NAME
, 'MUSICAL_PERFORMANCE_RATING', MUSICAL_PERFORMANCE_RATING, MUSICAL_PERFORMANCE_COMMENTS
from DATA
This is a basic script and should give the desired output
create or replace table reviews(name varchar(50), acting_rating int, acting_comments text, comedy_rating int, comedy_comments text);
insert into reviews values
('abc', 4, 'something', 1, 'NO'),
('xyz', 3, 'some', 1, 'haha'),
('lmn', 1, 'what', 4, 'hahaha');
select * from reviews;
select name, skill, skill_rating, comments
from reviews
unpivot(skill_rating for skill in (acting_rating, comedy_rating))
unpivot(comments for skill_comments in (acting_comments,comedy_comments))
--Following where clause is added to filter the irrelevant comments due to multiple unpivots
where substr(skill,1,position('_',skill)-1) = substr(skill_comments,1,position('_',skill_comments)-1)
order by name;
If the goal is to store the unpivoted result as a table then INSERT ALL could be used to unpivot mutliple columns at once:
Setup:
create or replace table reviews(
name varchar(50), acting_rating int,
acting_comments text, comedy_rating int, comedy_comments text);
insert into reviews values
('abc', 4, NULL, 1, 'NO'),
('xyz', 3, 'some', 1, 'haha'),
('lmn', 1, 'what', 4, NULL);
select * from reviews;
Query:
CREATE OR REPLACE TABLE reviews_transposed(
name VARCHAR(50)
,skill TEXT
,skill_rating INT
,skill_comments TEXT
);
INSERT ALL
INTO reviews_transposed(name, skill, skill_rating, skill_comments)
VALUES (name, 'ACTING_RATING', acting_rating, acting_comments)
INTO reviews_transposed(name, skill, skill_rating, skill_comments)
VALUES (name, 'COMEDY_RATING', comedy_rating, comedy_comments)
SELECT *
FROM reviews;
SELECT *
FROM reviews_transposed;
Before:
After:
This approach has one significant advantage over UNION ALL approach proposed by Felippe, when saving into table (the number of table scans and thus partition read is growing for each UNION ALL wheareas INSERT ALL scans source table only once.
INSERT INTO reviews_transposed
select NAME
, 'ACTING_RATING' as SKILL, ACTING_RATING as SKILL_RATING, ACTING_COMMENTS as SKILL_COMMENTS
from reviews
union all
select NAME
, 'COMEDY_RATING', COMEDY_RATING, COMEDY_COMMENTS
from reviews;
vs INSERT ALL
Back in TSQL days i'd just use a CROSS APPLY. The nearest equivalent in snowflake would be something like:
create or replace TEMPORARY table reviews(name varchar(50), acting_rating int, acting_comments text, comedy_rating int, comedy_comments text);
insert into reviews values
('abc', 4, NULL, 1, 'NO'),
('xyz', 3, 'some', 1, 'haha'),
('lmn', 1, 'what', 4, NULL);
SELECT R.NAME
,P.VALUE:SKILL::VARCHAR(100) AS SKILL
,P.VALUE:RATING::NUMBER AS RATING
,P.VALUE:COMMENTS::VARCHAR(1000) AS COMMENTS
FROM reviews R
,TABLE(FLATTEN(INPUT => ARRAY_CONSTRUCT(
OBJECT_CONSTRUCT('SKILL','COMEDY','RATING',R.COMEDY_RATING,'COMMENTS',R.COMEDY_COMMENTS),
OBJECT_CONSTRUCT('SKILL','ACTING','RATING',R.ACTING_RATING,'COMMENTS',R.ACTING_COMMENTS)
)
)) AS P;
This only hits the source table once and preserves NULLs.
ResultSet
I've had same problem,
Here is my solution for unpivoting by two categories AND keeping nulls:
First you replace NULL's with some string, for example: 'NULL'
Then brake the two unpivots into two separate cte's and create common category column to join them again later, 'skill' in your case.
Lastly, join the two cte's by name and skill category, replace the 'NULL' string with actual NULL
create or replace table reviews(name varchar(50), acting_rating int, acting_comments text, comedy_rating int, comedy_comments text);
insert into reviews values
('abc', 4, 'something', 1, 'NO'),
('xyz', 3, 'some', 1, 'haha'),
('lmn', 1, 'what', 4, 'hahaha');
WITH base AS (SELECT name
, acting_rating
, IFNULL(acting_comments, 'NULL') AS acting_comments
, comedy_rating
, IFNULL(comedy_comments, 'NULL') AS comedy_comments
FROM reviews
)
, skill_rating AS (SELECT name
, REPLACE(skill, '_RATING', '') AS skill
, skill_rating
FROM base
UNPIVOT (skill_rating FOR skill IN (acting_rating, comedy_rating))
)
, comments AS (SELECT name
, REPLACE(skill_comments, '_COMMENTS', '') AS skill
, comments
FROM base
UNPIVOT (comments FOR skill_comments IN (acting_comments,comedy_comments))
)
SELECT s.name
, s.skill
, s.skill_rating
, NULLIF(c.comments, 'NULL') AS comments
FROM skill_rating AS s
JOIN comments AS c
ON s.name = c.name
AND s.skill = c.skill
ORDER BY name;
The result:
name skill skill_rating comments
abc ACTING 4 <null>
abc COMEDY 1 NO
lmn ACTING 1 what
lmn COMEDY 4 <null>
xyz ACTING 3 some
xyz COMEDY 1 haha

SQL: Stored procedure

I am implementing a library management system in SQL. I have the following table structure and some values inserted in them:
create table books
(
IdBook number(5),
NameBook varchar2(35),
primary key(IdBook)
);
create table users
(
IdUsers number(5),
NameUser varchar2(20),
primary key(IdUsers)
);
create table borrowed
(
IdBorrowed number(5),
IdUsers number(5),
IdBook number(5),
DueDate date,
DateReturned date,
constraint fk_borrowed foreign key(IdUsers) references users(IdUsers),
constraint fk_borrowed2 foreign key(IdBook) references books(IdBook)
);
insert into books values(0,'FairyTale');
insert into books values(1,'Crime and Punishment');
insert into books values(2,'Anna Karenina');
insert into books values(3,'Norwegian Wood');
insert into users values(01,'Robb Dora');
insert into users values(02,'Pop Alina');
insert into users values(03,'Grozavescu Teodor');
insert into users values(04,'Popa Alin');
insert into borrowed values(10,02,3,'22-Jan-2017',null);
insert into borrowed values(11,01,1,'25-Jan-2017','19-Dec-2016');
insert into borrowed values(12,01,3,'22-Jan-2017',null);
insert into borrowed values(13,04,2,'22-Jan-2017','13-Dec-2016');
What I want now is that my db to allow "borrowing" books for the users(i.e insert into the borrowed table) that have no unreturned books(i.e date returned is not null) and if they have unreturned books I want to abandon the whole process. I thought to implement this in the following way:
create or replace procedure borrowBook(IdBorrowed in number,IdUsers number,IdBook number,DueDate date,DateReturned date) as begin
if exists (SELECT u.IdUsers, u.NameUser, b.DateReturned
FROM users u, borrowed b
WHERE u.IDUSERS = b.IdUsers and DateReturned is not null),
insert into borrowed values(IdBorrowed,IdUsers,IdBook,DueDate,DateReturned);
end borrowBook;
The above procedure does not check if the parameter I pass to this function is the same as the one in my select and I do not know how to do this and correctly insert a value in my table.
Any help would be much appreciated. Thank in advance!
You should not name your parameters the same as columns also used inside the procedure.
You can also simplify your procedure to a single INSERT statement, no IF required:
create or replace procedure borrowBook(p_idborrowed in number, p_idusers number, p_idbook number, p_duedate date, p_datereturned date)
as
begin
insert into borrowed (idborrowed, idusers, idbook, duedate, datereturned)
select p_idborrowed, p_idusers, p_idbook, p_duedate, p_datereturned
from dual
where not exists (select *
from users u
join borrowed b on u.idusers = b.idusers
and b.datereturned is not null);
end borrowBook;
It's also good coding style to explicitly list the columns for an INSERT statement. And you should get used to the explicit JOIN operator instead of using implicit joins in the where clause.
What about this one:
create or replace procedure borrowBook( p_IdBorrowed in number ,
p_IdUsers number ,
p_IdBook number ,
p_DueDate date ,
p_DateReturned date )
as
begin
if (SELECT COUNT(*)
FROM borrowed
WHERE IDUSERS = p_IdUsers
AND DateReturned IS NULL) = 0 THEN
insert into borrowed values (p_IdBorrowed ,
p_IdUsers ,
p_IdBook ,
p_DueDate ,
p_DateReturned );
end if ;
end borrowBook;
You would seem to want something like this:
create or replace procedure borrowBook (
in_IdBorrowed in number,
in_IdUsers number,
in_IdBook number,
in_DueDate date,
in_DateReturned date
) as
v_flag number;
begin
select (case when exists (select 1
from borrowed b
where b.IdUsers = in_IdUsers and b.DateReturned is not null
)
then 1 else 0
end)
into v_flag
from dual;
if (flag = 0) then
insert into borrowed
values(in_IdBorrowed, in_IdUsers, in_IdBook, in_DueDate, v_DateReturned);
end if
end -- borrowBook;

SQL Server constraint to check other columns

Product: SQL Server
Is it possible to write a constraint, that checks the values of other columns? In my specific case I will give you an example:
Let's imagine I have a table with 5 Columns:
Name | Hobby1 | Hobby2 | Hobby3 | Hobby4
Lets say there are the following values in it:
John Doe| fishing | reading| swimming| jogging
What I try to reach is the following:
If someone trys to Insert : John Doe, fishing,reading
It should be blocked, cause I don't want the same combination in the first 3 Columns.
Can I realise that with a constraint or do I need a Primary key combination for the first 3 columns?
Thanks for your replies.
Add unique constraint to your table for first three columns.
ALTER TABLE YourTableName
ADD CONSTRAINT UK_Name_Hobby1_Hobby2 UNIQUE (Name, Hobby1,Hobby2);
Well, you could do as #jarlh says (in comments) and ensure the hobby columns are ordered and this will satisfy your requirements:
CREATE TABLE Hobbies
(
Name VARCHAR(35) NOT NULL,
Hobby1 VARCHAR(35) NOT NULL,
Hobby2 VARCHAR(35) NOT NULL,
Hobby3 VARCHAR(35),
Hobby4 VARCHAR(35),
CHECK ( Hobby1 < Hobby2 ),
CHECK ( Hobby2 < Hobby3 ),
CHECK ( Hobby3 < Hobby4 ),
UNIQUE ( Name, Hobby1, Hobby2 ),
);
...However, this would allow the following which feels like it should be invalid data:
INSERT INTO Hobbies VALUES ( 'John Doe', 'fishing', 'jogging', 'reading', 'swimming' );
INSERT INTO Hobbies VALUES ( 'John Doe', 'jogging', 'reading', 'swimming', NULL );

Insert values of one table in a database to another table in another database

I would like to take some data from a table from DB1 and insert some of that data to a table in DB2.
How would one proceed to do this?
This is what I've got so far:
CREATE VIEW old_study AS
SELECT *
FROM dblink('dbname=mydb', 'select name,begins,ends from study')
AS t1(name varchar(50), register_start date, register_end date);
/*old_study now contains the data I wanna transfer*/
INSERT INTO studies VALUES (nextval('studiesSequence'),name, '',3, 0, register_start, register_end)
SELECT name, register_start, register_end from old_study;
This is how my table in DB2 looks:
CREATE TABLE studies(
id int8 PRIMARY KEY NOT NULL,
name_string VARCHAR(255) NOT NULL,
description VARCHAR(255),
field int8 REFERENCES options_table(id) NOT NULL,
is_active INTEGER NOT NULL,
register_start DATE NOT NULL,
register_end DATE NOT NULL
);
You should include the column names in both the insert and select:
insert into vip_employees(name, age, occupation)
select name, age, occupation
from employees;
However, your data structure is suspect. Either you should use a flag in employees to identify the "VIP employees". Or you should have a primary key in employees and use this primary key in vip_employees to refer to employees. Copying over the data fields is rarely the right thing to do, especially for columns such as age which are going to change over time. Speaking of that, you normally derive age from the date of birth, rather than storing it directly in a table.
INSERT INTO studies
(
id
,name_string
,description
,field
,is_active
,register_start
,register_end
)
SELECT nextval('studiesSequence')
,NAME
,''
,3
,0
,register_start
,register_end
FROM dblink('dbname=mydb', 'select name,begins,ends from study')
AS t1(NAME VARCHAR(50), register_start DATE, register_end DATE);
You can directly insert values that retured by dblink()(that means no need to create a view)
Loop and cursor are weapons of last resort. Try to avoid them. You probably want INSERT INTO ... SELECT:
INSERT INTO x(x, y, z)
SELECT x, y, z
FROM t;
SqlFiddleDemo
EDIT:
INSERT INTO vip_employees(name, age, occupation) -- your column list may vary
SELECT name, age, occupation
FROM employees;
Your syntax is wrong. You cannot have both, a values clause for constant values and a select clause for a query in your INSERT statement.
You'd have to select constant values in your query:
insert into studies
(
id,
name_string,
description,
field,
is_active,
register_start,
register_end
)
select
studiesSequence.nextval,
name,
'Test',
null,
0,
register_start,
register_end
from old_study;