Postgres union query returns strange result

I have the following tables:
CREATE TABLE geodat(
vessel UUID NOT NULL,
trip UUID NOT NULL,
geom geometry(LineString,4326),
PRIMARY KEY(vessel,trip)
);
CREATE TABLE areas(
gid SERIAL NOT NULL,
/* --other columns of little interest-- */
geom geometry(MultiPolygon,3035),
PRIMARY KEY(gid)
);
The following query is supposed to return the area that has been crossed the least, as well as how many times it was crossed and by which vessels.
SELECT vessel,MIN(cnt) as min_crossing,gid
FROM (
SELECT vessel,COUNT(*) as cnt, gid
FROM (
SELECT vessel, null as geo1, geom as geo2, null as gid
FROM geodat
UNION ALL
SELECT null,geom,null,gid FROM areas ) as P
WHERE ST_Crosses(geo1,geo2) AND geo1 IS NOT NULL AND geo2 IS NOT NULL
GROUP BY gid,vessel) as P1
GROUP BY gid,vessel
Theoretically, this query should solve the question above. The problem is that I am getting (0 rows) as an answer, although I have been assured as to the opposite. I discovered it has something to do with the null values the UNION produced, but I don't have a clue how to fix this.
Any ideas?
NOTE: The two tables have 31822 rows and 27308 rows respectively, which makes a JOIN impractical.

You have the condition
WHERE ST_Crosses(geo1,geo2) AND geo1 IS NOT NULL AND geo2 IS NOT NULL
However, each branch of the UNION ALL explicitly sets either geo1 or geo2 to NULL, so no row ever has both geometries populated and the WHERE clause filters out every row. Hence the query returns 0 rows.
You can change the second AND in the WHERE condition above to OR, which would return rows.
WHERE ST_Crosses(geo1,geo2) AND geo1 IS NOT NULL OR geo2 IS NOT NULL
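For illustration only (this is a separate route, not the tweak above): ST_Crosses can only compare a track with an area when both geometries sit on the same row, which a join between the two tables produces. A rough sketch, noting that the two tables declare different SRIDs (4326 vs 3035), so one side has to be transformed:
-- hedged sketch, not part of the original answer
SELECT g.vessel, a.gid, COUNT(*) AS crossings
FROM geodat g
JOIN areas a
ON ST_Crosses(g.geom, ST_Transform(a.geom, 4326))
GROUP BY g.vessel, a.gid
ORDER BY crossings;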

Redshift create list and search different table with it

I think there are a few ways to tackle this, but I'm not sure how to do any of them.
I have two tables; the first has IDs and Numbers. The IDs and Numbers can potentially be listed more than once, so I create a result table that lists the unique Numbers grouped by ID.
My second table has 100 million rows, with the ID and Numbers again. I need to search that table for any ID that has a Number not in the list of Numbers from the result table.
Can Redshift do a query based on whether the ID matches and the Number exists in the list from the result table? Can this all be done in memory / in one statement?
DROP TABLE IF EXISTS `myTable`;
CREATE TABLE `myTable` (
`id` mediumint(8) unsigned NOT NULL auto_increment,
`ID` varchar(255),
`Numbers` mediumint default NULL,
PRIMARY KEY (`id`)
) AUTO_INCREMENT=1;
INSERT INTO `myTable` (`ID`,`Numbers`)
VALUES
("CRQ44MPX1SZ",1890),
("UHO21QQY3TW",4370),
("JTQ62CBP6ER",1825),
("RFD95MLC2MI",5014),
("URZ04HGG2YQ",2859),
("CRQ44MPX1SZ",1891),
("UHO21QQY3TW",4371),
("JTQ62CBP6ER",1826),
("RFD95MLC2MI",5015),
("URZ04HGG2YQ",2860),
("CRQ44MPX1SZ",1892),
("UHO21QQY3TW",4372),
("JTQ62CBP6ER",1827),
("RFD95MLC2MI",5016),
("URZ04HGG2YQ",2861);
SELECT ID, listagg(distinct Numbers,',') as Number_List, count(Numbers) as Numbers_Count
FROM myTable
GROUP BY ID
AS result
DROP TABLE IF EXISTS `myTable2`;
CREATE TABLE `myTable2` (
`id` mediumint(8) unsigned NOT NULL auto_increment,
`ID` varchar(255),
`Numbers` mediumint default NULL,
PRIMARY KEY (`id`)
) AUTO_INCREMENT=1;
INSERT INTO `myTable2` (`ID`,`Numbers`)
VALUES
("CRQ44MPX1SZ",1870),
("UHO21QQY3TW",4350),
("JTQ62CBP6ER",1825),
("RFD95MLC2MI",5014),
("URZ04HGG2YQ",2859),
("CRQ44MPX1SZ",1891),
("UHO21QQY3TW",4371),
("JTQ62CBP6ER",1826),
("RFD95MLC2MI",5015),
("URZ04HGG2YQ",2860),
("CRQ44MPX1SZ",1882),
("UHO21QQY3TW",4372),
("JTQ62CBP6ER",1827),
("RFD95MLC2MI",5016),
("URZ04HGG2YQ",2861);
Pseudo Code
Select ID, listagg(distinct Numbers) as Violation
Where Numbers IN NOT IN result.Numbers_List
or possibly: WHERE Numbers NOT LIKE '%' || result.Numbers_List|| '%'
Desired Output
(“CRQ44MPX1SZ”, ”1870,1882”)
(“UHO21QQY3TW”, ”4350”)
EDIT
Going the JOIN route, I am not getting the right results...but I'm pretty sure my WHERE implementation is wrong.
SELECT mytable1.ID, listagg(distinct mytable2.Numbers, ',') as unauth_list, count(mytable2.Numbers) as unauth_count
FROM mytable1
LEFT JOIN mytable2 on mytable1.id = mytable2.id
WHERE (mytable1.id = mytable2.id)
AND (mytable1.Numbers <> mytable2.Numbers)
GROUP BY mytable1.id
Expected output:
(“CRQ44MPX1SZ”, ”1870,1882”, 2)
(“UHO21QQY3TW”, ”4350”, 1)
Just LEFT JOIN the two tables on ID and Numbers and check in the WHERE clause whether a match wasn't found. There shouldn't be a need for listagg() and complex comparing. Or did I miss part of the question?
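A minimal sketch of that anti-join, using the table and column names from the question (whether you still want listagg() depends on the output you need; the DISTINCT form is assumed here):
-- rows of myTable2 whose (ID, Numbers) pair has no match in myTable
SELECT t2.ID,
       LISTAGG(DISTINCT t2.Numbers, ',') AS violation_list,
       COUNT(t2.Numbers) AS violation_count
FROM myTable2 t2
LEFT JOIN myTable t1
       ON t1.ID = t2.ID AND t1.Numbers = t2.Numbers
WHERE t1.Numbers IS NULL   -- no match found in the allowed list
GROUP BY t2.ID;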

How can I stop my Postgres recursive CTE from indefinitely looping?

Background
I'm running Postgres 11 on CentOS 7.
I recently learned the basics of recursive CTEs in Postgres thanks to S-Man's answer to my recent question.
The problem
While working on a closely related issue (counting parts sold within bundles and assemblies) and using this recursive CTE, I ran into a problem where the query looped indefinitely and never completed.
I tracked this down to the presence of non-spurious 'self-referential' entries in the relator table, i.e. rows with the same value for parent_name and child_name.
I know that these are the source of the problem because when I recreated the situation with test tables and data, the undesired looping behavior occurred when these rows were present, and disappeared when these rows were absent or when UNION (which excludes duplicate returned rows) was used in the CTE rather than UNION ALL.
I think the data model itself probably needs adjusting so that these 'self-referential' rows aren't necessary, but for now, what I need to do is get this query to return the desired data on completion and stop looping.
How can I achieve this result? All guidance much appreciated!
Tables and test data
CREATE TABLE the_schema.names_categories (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
thing_name TEXT NOT NULL,
thing_category TEXT NOT NULL
);
CREATE TABLE the_schema.relator (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
parent_name TEXT NOT NULL,
child_name TEXT NOT NULL,
child_quantity INTEGER NOT NULL
);
/* NOTE: listing_name below is like an alias of relator.parent_name as it appears in a catalog;
it matters here because it is these listing_names that sales.sold_name refers to */
CREATE TABLE the_schema.catalog_listings (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
listing_name TEXT NOT NULL,
parent_name TEXT NOT NULL
);
CREATE TABLE the_schema.sales (
id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMPTZ DEFAULT now(),
sold_name TEXT NOT NULL,
sold_quantity INTEGER NOT NULL
);
CREATE VIEW the_schema.relationships_with_child_category AS (
SELECT
c.listing_name,
r.parent_name,
r.child_name,
r.child_quantity,
n.thing_category AS child_category
FROM
the_schema.catalog_listings c
INNER JOIN
the_schema.relator r
ON c.parent_name = r.parent_name
INNER JOIN
the_schema.names_categories n
ON r.child_name = n.thing_name
);
INSERT INTO the_schema.names_categories (thing_name, thing_category)
VALUES ('parent1', 'bundle'), ('child1', 'assembly'), ('child2', 'assembly'),('subChild1', 'component'),
('subChild2', 'component'), ('subChild3', 'component');
INSERT INTO the_schema.catalog_listings (listing_name, parent_name)
VALUES ('listing1', 'parent1'), ('parent1', 'child1'), ('parent1','child2'), ('child1', 'child1'), ('child2', 'child2');
INSERT INTO the_schema.catalog_listings (listing_name, parent_name)
VALUES ('parent1', 'child1'), ('parent1','child2');
/* note the two 'self-referential' entries */
INSERT INTO the_schema.relator (parent_name, child_name, child_quantity)
VALUES ('parent1', 'child1', 1), ('child1', 'subChild1', 1), ('child1', 'subChild2', 1),
('parent1', 'child2', 1), ('child2', 'subChild1', 1), ('child2', 'subChild3', 1), ('child1', 'child1', 1), ('child2', 'child2', 1);
INSERT INTO the_schema.sales (sold_name, sold_quantity)
VALUES ('parent1', 1), ('parent1', 2), ('listing1', 1);
The present query, which loops indefinitely with the required UNION ALL:
WITH RECURSIVE cte AS (
SELECT
s.sold_name,
s.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category as category
FROM
the_schema.sales s
JOIN the_schema.relationships_with_child_category r
ON s.sold_name = r.listing_name
UNION ALL
SELECT
cte.sold_name,
cte.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category
FROM cte
JOIN the_schema.relationships_with_child_category r
ON cte.child_name = r.parent_name
)
SELECT
child_name,
SUM(sold_quantity * child_quantity)
FROM cte
WHERE category = 'component'
GROUP BY child_name
;
In the catalog_listings table, listing_name and parent_name are the same for child1 and child2.
In the relator table, parent_name and child_name are likewise the same for child1 and child2.
These rows are creating the recursion cycle.
Just remove those two rows from each table:
delete from catalog_listings where id in (4,5)
delete from relator where id in (7,8)
Then your desired output will be as below:
 child_name | sum
------------+-----
 subChild2  |   8
 subChild3  |   8
 subChild1  |  16
Is this the result you are looking for?
If you can't delete the rows, you can add a parent_name <> child_name condition to the joins to skip those rows:
WITH RECURSIVE cte AS (
SELECT
s.sold_name,
s.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category as category
FROM
the_schema.sales s
JOIN the_schema.relationships_with_child_category r
ON s.sold_name = r.listing_name AND r.parent_name <> r.child_name
UNION ALL
SELECT
cte.sold_name,
cte.sold_quantity,
r.child_name,
r.child_quantity,
r.child_category
FROM cte
JOIN the_schema.relationships_with_child_category r
ON cte.child_name = r.parent_name AND r.parent_name <> r.child_name
)
SELECT
child_name,
SUM(sold_quantity * child_quantity)
FROM cte
WHERE category = 'component'
GROUP BY child_name ;
You may be able to avoid infinite recursion simply by using UNION instead of UNION ALL.
The documentation describes the implementation:
Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
So long as the working table is not empty, repeat these steps:
Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
"Getting rid of the duplicates" should cause the intermediate table to be empty at some point, which ends the iteration.

How to join two tables together and return all rows from both tables, and to merge some of their columns into a single column

I'm working with SQL Server 2012 and wish to query the following:
I've got 2 tables with mostly different columns (one table has 10 columns, the other has 6).
However, they both contain an ID column and a category_name column.
The ID numbers may overlap between the tables (e.g. one table may have 200 distinct IDs and the other 900, but only 120 of the IDs are in both).
The category names are different and unique to each table.
Now I wish to have a single table that includes all the rows of both tables, with a single ID column and a single Category_name column (14 columns in total).
So in case the same ID has 3 records in table 1 and another 5 records in table 2, I wish to have all 8 records (8 rows).
The complex thing here, I believe, is to have a single "Category_name" column.
I tried the following, but when an ID exists in both tables (so neither side is NULL) I get only one record instead of both:
SELECT isnull(t1.id, t2.id) AS [id]
,isnull(t1.[category], t2.[category_name]) AS [category name]
FROM t1
FULL JOIN t2
ON t1.id = t2.id;
Any suggestions on the correct way to have it done?
Make your FULL JOIN ON 1=0
This will prevent rows from combining and ensure that you always get 1 copy of each row from each table.
Further explanation:
A FULL JOIN gets rows from both tables, whether they have a match or not, but when they do match, it combines them on one row.
You wanted a full join where you never combine the rows, because you wanted every row in both tables to appear one time, no matter what. 1 can never equal 0, so doing a FULL JOIN on 1=0 will give you a full join where none of the rows match each other.
And of course you're already doing the ISNULL to make sure the ID and Name columns always have a value.
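A sketch of what that looks like with the question's t1/t2 (the eight extra columns of t1 and the four of t2 are hypothetical stand-ins, since the real names aren't given):
SELECT ISNULL(t1.id, t2.id) AS id,
       ISNULL(t1.category, t2.category_name) AS category_name,
       t1.a1, t1.a2, t1.a3, t1.a4, t1.a5, t1.a6, t1.a7, t1.a8,  -- t1's other 8 columns
       t2.b1, t2.b2, t2.b3, t2.b4                               -- t2's other 4 columns
FROM t1
FULL JOIN t2
       ON 1 = 0;  -- never true, so no rows are ever combined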
SELECT ID, Category_name, (then the other 8 columns), NULL, NULL, NULL, NULL
FROM t1
UNION ALL
SELECT ID, Category_name, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, (then the other 4 columns)
FROM t2
This demonstrates how you can use a UNION ALL to combine the row sets from two tables, TableA and TableB, and insert the set into TableC.
Create two source tables with some data:
CREATE TABLE dbo.TableA
(
id int NOT NULL,
category_name nvarchar(50) NOT NULL,
other_a nvarchar(20) NOT NULL
);
CREATE TABLE dbo.TableB
(
id int NOT NULL,
category_name nvarchar(50) NOT NULL,
other_b nvarchar(20) NOT NULL
);
INSERT INTO dbo.TableA (id, category_name, other_a)
VALUES (1, N'Alpha', N'ppp'),
(2, N'Bravo', N'qqq'),
(3, N'Charlie', N'rrr');
INSERT INTO dbo.TableB (id, category_name, other_b)
VALUES (4, N'Delta', N'sss'),
(5, N'Echo', N'ttt'),
(6, N'Foxtrot', N'uuu');
Create TableC to receive the result set. Note that columns other_a and other_b allow null values.
CREATE TABLE dbo.TableC
(
id int NOT NULL,
category_name nvarchar(50) NOT NULL,
other_a nvarchar(20) NULL,
other_b nvarchar(20) NULL
);
Insert the combined set of rows into TableC:
INSERT INTO dbo.TableC (id, category_name, other_a, other_b)
SELECT id, category_name, other_a, NULL AS 'other_b'
FROM dbo.TableA
UNION ALL
SELECT id, category_name, NULL, other_b
FROM dbo.TableB;
Display the results:
SELECT *
FROM dbo.TableC;

A query to find if any two fields in a row are equal?

I have to maintain a scary legacy database that is very poorly designed. All the tables have more than 100 columns - one has 650. The database is very denormalized and I have found that often the same data is expressed in several columns in the same row.
For instance, here is a sample of columns for one of the tables:
[MEMBERADDRESS] [varchar](331) NULL,
[DISPLAYADDRESS] [varchar](max) NULL,
[MEMBERINLINEADDRESS] [varchar](max) NULL,
[DISPLAYINLINEADDRESS] [varchar](250) NULL,
[__HISTISDN] [varchar](25) NULL,
[HISTISDN] [varchar](25) NULL,
[MYDIRECTISDN] [varchar](25) NULL,
[MYISDN] [varchar](25) NULL,
[__HISTALT_PHONE] [varchar](25) NULL,
[HISTALT_PHONE] [varchar](25) NULL,
It turns out that MEMBERADDRESS and DISPLAYADDRESS have the same value for all rows in the table. The same is true for the other clusters of fields I have shown here.
It will be very difficult and time consuming to identify all cases like this manually. Is it possible to create a query that would identify if two fields have the same value in every row in a table?
If not, are there any existing tools that will help me identify these sorts of problems?
There are two approaches I see to simplify this query:
Write a script that generates your queries - feed your script the name of the table and the suspected columns, and let it produce a query that checks each pair of columns for equality. This is the fastest approach to implement in a one-off situation like yours.
Write a query that "normalizes" your data, and search against it - self-join the query to itself, then filter out the duplicates.
Here is a quick illustration of the second approach:
SELECT id, name, val FROM (
SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
UNION ALL
SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
UNION ALL
SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
UNION ALL
...
) first
JOIN (
SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
UNION ALL
SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
UNION ALL
SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
UNION ALL
...
) second ON first.id=second.id AND first.val=second.val
There is a lot of manual work for 100 columns (at least it does not grow as N^2, as in the first approach, but it is still a lot of manual typing). You may be better off generating the selects connected with UNION ALL using a small script.
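A sketch of that generation step using SQL Server's catalog views (the table name MyTable follows the illustration above; the dbo schema and the id column are assumptions):
-- emit one UNION ALL branch per column; paste the output into the two
-- derived tables above and drop the trailing UNION ALL from the last line
SELECT 'SELECT id, ' + QUOTENAME(c.name) + ' AS val, '''
     + c.name + ''' AS name FROM MyTable UNION ALL'
FROM sys.columns c
WHERE c.object_id = OBJECT_ID('dbo.MyTable')
  AND c.name <> 'id';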
The following approach uses unpivot to create triples. It makes some assumptions: values are not null; each row has an id; and columns have compatible types.
select t.which, t2.which
from (select id, which, value
from MEMBERADDRESS
unpivot (value for which in (<list of columns here>)) up
) t full outer join
(select id, which, value
from MEMBERADDRESS
unpivot (value for which in (<list of columns here>)) up
) t2
on t.id = t2.id and t.which <> t2.which
group by t.which, t2.which
having sum(case when t.value = t2.value then 1 else 0 end) = count(*)
It works by creating a new table with three columns: id, which column, and the value in the column. It then does a self join on id (to keep comparisons within one row) while pairing each column with every other column (t.which <> t2.which). This self-join should always match, because the columns are the same in the two halves of the query.
The having then counts the number of values that are the same on both sides for a given pair of columns. When all these are the same, then the match is successful.
You can also leave out the having clause and use something like:
select t.which, t2.which, sum(case when t.value = t2.value then 1 else 0 end) as NumMatches,
count(*) as NumRows
To get more complete information.
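Spelled out, that variant might look like the following; dbo.Members is a hypothetical table name (the question doesn't give one), and only the four varchar(25) ISDN columns from the question are unpivoted here, since UNPIVOT wants the listed columns to share a type:
select t.which, t2.which,
       sum(case when t.value = t2.value then 1 else 0 end) as NumMatches,
       count(*) as NumRows
from (select id, which, value
      from dbo.Members
      unpivot (value for which in (__HISTISDN, HISTISDN, MYDIRECTISDN, MYISDN)) up
     ) t full outer join
     (select id, which, value
      from dbo.Members
      unpivot (value for which in (__HISTISDN, HISTISDN, MYDIRECTISDN, MYISDN)) up
     ) t2
  on t.id = t2.id and t.which <> t2.which
group by t.which, t2.which;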

SQL Server NULLABLE column vs SQL COUNT() function

Could someone help me understand something? When I can, I usually avoid (*) in an SQL statement. Well, today was payback. Here is a scenario:
CREATE TABLE Tbl (Id INT IDENTITY(1, 1) PRIMARY KEY, Name NVARCHAR(16))
INSERT INTO Tbl VALUES (N'John')
INSERT INTO Tbl VALUES (N'Brett')
INSERT INTO Tbl VALUES (NULL)
I could count the number of records where Name is NULL as follows:
SELECT COUNT(*) FROM Tbl WHERE Name IS NULL
While avoiding the (*), I discovered that the following two statements give me two different results:
SELECT COUNT(Id) FROM Tbl WHERE Name IS NULL
SELECT COUNT(Name) FROM Tbl WHERE Name IS NULL
The first statement correctly returns 1 while the second yields 0. Why, or how?
That's because
The COUNT(column_name) function returns the number of values (NULL
values will not be counted) of the specified column
so when you count Id you get the expected result, while counting Name you do not; the answer provided by the query is nevertheless correct.
Everything is described in COUNT (Transact-SQL).
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
ALL - is default
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates expression for each row in a group and returns the number of nonnull values.
"COUNT()" does not count NULL values. So basically:
SELECT COUNT(Id) FROM Tbl WHERE Name IS NULL
will return the number of lines where ("ID" IS NOT NULL) AND ("Name" IS NULL); result is "1"
While:
SELECT COUNT(Name) FROM Tbl WHERE Name IS NULL
will count the lines where ("Name" IS NOT NULL) AND ("Name" IS NULL); result will always be 0
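Putting the three variants side by side against the sample data above (expected results as comments):
SELECT COUNT(*) FROM Tbl WHERE Name IS NULL;    -- 1: counts the row itself
SELECT COUNT(Id) FROM Tbl WHERE Name IS NULL;   -- 1: Id is never NULL here
SELECT COUNT(Name) FROM Tbl WHERE Name IS NULL; -- 0: every counted Name is NULL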
As was said, COUNT(column_name) doesn't count NULL values.
If you don't want to use COUNT(*) then use COUNT(1), but actually you will not see any difference in performance.
"Always avoid using *" is one of those blanket statements that people blindly follow. If you knew the reasons why you were avoiding * then you would know that none of those reasons apply when doing count(*).
The * in COUNT(*) is not the same * in SELECT * FROM...
SELECT COUNT(*) FROM T; very specifically means the cardinality of the table expression T.
SELECT COUNT(1) FROM T; will generate the same results as COUNT(*) but if the contents of the parentheses is not * then it must be parsed.
SELECT COUNT(c) FROM T; where c is a nullable column in table T will count the non-null values.
P.S. I'm comfortable with using SELECT * FROM... in the right circumstances.
P.P.S. Your 'table' has no natural key: consider that INSERT INTO Tbl VALUES (N'John'), (N'John'), (N'John'), (NULL), (NULL), (NULL); would be allowed, but the results would be nonsense.