I have a SQL table in the form of:
| value | row_loc | column_loc |
|-------|---------|------------|
| a | 0 | 1 |
| b | 1 | 1 |
| c | 1 | 0 |
| d | 0 | 0 |
I would like to find a way to map it onto a table/grid, given the indices, using SQL. Something like:
| d | a |
| c | b |
(The context being, I would like to create a colour map with colours corresponding to values a, b, c, d, in the locations specified)
I would be able to do this iteratively in python, but cannot figure out how to do it in SQL, or if it is even possible! Any help or guidance on this problem would be greatly appreciated!
EDIT: a, b, c, d are examples of numeric values (which could not be selected using named variables in practice, so I'm relying on selecting them based on location). Also worth noting: the number of rows and columns will always be the same. The value column is not the primary key of this table, so it is not necessarily unique; it is just a continuous value.
Yes, it is possible, assuming the number of columns is known in advance, since a SQL query returns a fixed set of columns. The number of rows in the result set depends on the number of distinct row_loc values, so we group by row_loc. Then pick the value for each column using a simple case expression.
with t (value, row_loc, column_loc) as (
select 'a', 0, 1 from dual union all
select 'b', 1, 1 from dual union all
select 'c', 1, 0 from dual union all
select 'd', 0, 0 from dual
)
select max(case column_loc when 0 then value else null end) as column0
, max(case column_loc when 1 then value else null end) as column1
from t
group by row_loc
order by row_loc
I tested it on Oracle. I wasn't sure what to do if multiple values share the same coordinate, so I chose max. Other vendors offer special clauses you could use instead, such as count ... filter (where ...). The Oracle pivot clause can also be used.
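Since the column list has to be fixed before the query runs, one way to cope with a grid of unknown width is to generate the case expressions from client code. The following is a minimal sketch of that idea, not part of the tested Oracle answer: it uses Python's sqlite3 module in place of Oracle, and the helper name build_pivot_sql is made up for illustration.

```python
import sqlite3


def build_pivot_sql(n_columns: int) -> str:
    """Build one max(case ...) expression per grid column (hypothetical helper)."""
    cols = ",\n       ".join(
        f"max(case column_loc when {c} then value end) as column{c}"
        for c in range(n_columns)
    )
    return f"select {cols}\nfrom t\ngroup by row_loc\norder by row_loc"


conn = sqlite3.connect(":memory:")
conn.execute("create table t (value text, row_loc integer, column_loc integer)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [("a", 0, 1), ("b", 1, 1), ("c", 1, 0), ("d", 0, 0)],
)

# The width of the grid must be known before the pivot query is written.
n_columns = conn.execute("select max(column_loc) + 1 from t").fetchone()[0]

# Each result row is one grid row, ordered by row_loc.
grid = conn.execute(build_pivot_sql(n_columns)).fetchall()
for row in grid:
    print(row)
```

The generated SQL has the same shape as the hand-written query above, so it still uses max to settle cells that receive more than one value.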
I have a structure / tree that looks similar to this.
CostType is mandatory and can exist by itself, but it can have a parent ProfitType or Unit and other CostTypes as children.
Only Units can be duplicated. The other types cannot appear multiple times in the structure.
| ID | name | parent_id | ProfitType | CostType | Unit |
| -: | ------------- | --------: | ---------: | -------: | ---: |
| 1 | Root | (NULL) | | | |
| 2 | 1 | 1 | 300 | | |
| 3 | 1-1 | 2 | | 111 | |
| 4 | 1-1-1 | 3 | | | 8 |
| 5 | 1-2 | 2 | | 222 | |
| 6 | 1-2-1 | 5 | | 333 | |
| 7 | 1-2-1-1 | 6 | | | 8 |
| 8 | 1-2-1-2 | 6 | | | 9 |
| Parameters | should RETURN |
| --------------- | -------------- |
| (300,111,8) | 4 |
| (null,111,8) | 4 |
| (null,null,8) | first match, 4 |
| (null,222,8) | best match, 5 |
| (null,333,null) | 6 |
I am at a loss on how I could create a function that receives (ProfitType, CostType, Unit) and return the best matching ID from the structure.
This doesn't give exactly the answers you provided as examples, but see my comment above: if (null,222,8) should be 7 to match how (null,333,8) returns 4, then this is correct.
Also note that I wrote this using temp tables instead of as a function. I don't want to trip a schema change audit, so I posted what I have as temp tables. I can rewrite it as a function Monday when my DBA is available, but I thought you might need it before the weekend. Just edit the three DECLARE lines to the values you want to test.
I also put in quite a few comments because the logic is tricky, but if they aren't enough, leave a comment and I can expand my explanation.
/*
ASSUMPTIONS:
A tree can be of arbitrary depth, but will not exceed the recursion limit (defaults to 100)
All trees will include at least 1 CostType
All trees will have at most 1 ProfitType
CostType can appear multiple times in a traversal from root to leaf (can units?)
*/
SELECT *
INTO #Temp
FROM (VALUES (1,'Root',NULL, NULL, NULL, NULL)
, (2,'1', 1, 300, NULL, NULL)
, (3,'1-1', 2, NULL, 111, NULL)
, (4,'1-1-1', 3, NULL, NULL, 8)
, (5,'1-2', 2, NULL, 222, NULL)
, (6,'1-2-1', 5, NULL, 333, NULL)
, (7,'1-2-1-1', 6, NULL, NULL, 8)
, (8,'1-2-1-2', 6, NULL, NULL, 9)
) as TempTable(ID, RName, Parent_ID, ProfitType, CostType, UnitID)
--SELECT * FROM #Temp
DECLARE @ProfitType int = NULL--300
DECLARE @CostType INT = 333 --NULL --111
DECLARE @UnitID INT = NULL--8
--SELECT * FROM #Temp
;WITH cteMatches as (
--Start with all nodes that match one criteria, default a score of 100
SELECT N.ID as ReportID, *, 100 as Score, 1 as Depth
FROM #Temp AS N
WHERE N.CostType= @CostType OR N.ProfitType=@ProfitType OR N.UnitID = @UnitID
), cteEval as (
--This is a recursive CTE, it has a (default) limit of 100 recursions
--, but that can be raised if your trees are deeper than 100 nodes
--Start with the base case
SELECT M.ReportID, M.RName, M.ID ,M.Parent_ID, M.Score
, M.Depth, M.ProfitType , M.CostType , M.UnitID
FROM cteMatches as M
UNION ALL
--This is the recursive part, add to the list of matches the match when
--its immediate parent is also considered. For that match increase the score
--if the parent contributes another match. Also update the ID of the match
--to the parent's IDs so recursion can keep adding if more matches are found
SELECT M.ReportID, M.RName, N.ID ,N.Parent_ID
, M.Score + CASE WHEN N.CostType= @CostType
OR N.ProfitType=@ProfitType
OR N.UnitID = @UnitID THEN 100 ELSE 0 END as Score
, M.Depth + 1, N.ProfitType , N.CostType , N.UnitID
FROM cteEval as M INNER JOIN #Temp AS N on M.Parent_ID = N.ID
)SELECT TOP 1 * --Drop the "TOP 1 *" to see debugging info (runners up)
FROM cteEval
ORDER BY SCORE DESC, DEPTH
DROP TABLE #Temp
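To make the scoring easier to follow, here is a small Python sketch of the same walk-up-the-tree logic. It is an illustration under assumptions rather than a replacement for the query: the names Node, matches and best_match are invented here, and ties are broken by the lowest ID, which the SQL above does not guarantee.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    id: int
    name: str
    parent_id: Optional[int]
    profit_type: Optional[int]
    cost_type: Optional[int]
    unit_id: Optional[int]


NODES = {n.id: n for n in [
    Node(1, "Root", None, None, None, None),
    Node(2, "1", 1, 300, None, None),
    Node(3, "1-1", 2, None, 111, None),
    Node(4, "1-1-1", 3, None, None, 8),
    Node(5, "1-2", 2, None, 222, None),
    Node(6, "1-2-1", 5, None, 333, None),
    Node(7, "1-2-1-1", 6, None, None, 8),
    Node(8, "1-2-1-2", 6, None, None, 9),
]}


def matches(node, profit_type, cost_type, unit_id) -> bool:
    # Mirrors SQL comparison semantics: a NULL (None) parameter never matches.
    return ((cost_type is not None and node.cost_type == cost_type)
            or (profit_type is not None and node.profit_type == profit_type)
            or (unit_id is not None and node.unit_id == unit_id))


def best_match(profit_type, cost_type, unit_id) -> Optional[int]:
    candidates = []  # (negated score, depth, starting node id)
    for start in NODES.values():
        if not matches(start, profit_type, cost_type, unit_id):
            continue
        score, depth, current = 100, 1, start
        candidates.append((-score, depth, start.id))
        # Walk towards the root, adding 100 for every ancestor that matches.
        while current.parent_id is not None:
            current = NODES[current.parent_id]
            depth += 1
            if matches(current, profit_type, cost_type, unit_id):
                score += 100
            candidates.append((-score, depth, start.id))
    # Highest score first, then shallowest depth, then lowest id (assumed tie-break).
    return min(candidates)[2] if candidates else None


print(best_match(None, 333, None))
```

With the sample tree, best_match(None, 222, 8) returns 7 rather than 5, which is the difference from the question's examples mentioned above.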
I'm sorry I don't have enough rep to comment.
You'll have to define "best answer" (for example, why isn't the answer to null,222,8 7 or null instead of 5?), but here's the approach I'd use:
Derive a new table where ProfitType and CostType are listed explicitly instead of only by inheritance. I would approach that by using a cursor (how awful, I know) and following the parent_id until a ProfitType and CostType is found -- or the root is reached. This presumes an unlimited amount of child/grandchild levels for parent_id. If there is a limit, then you can instead use N self joins where N is the number of parent_id levels allowed.
Then you run multiple queries against the derived table. The first query would be for an exact match (and then exit if found). The next query would be for the "best" partial match (then exit if found), followed by queries for 2nd best, 3rd best, etc. until you've exhausted your "best" match criteria.
If you need nested parent CostTypes to be part of the "best match" criteria, then I would make duplicate entries in the derived table for each row that has multiple CostTypes, with a CostType "level". Level 1 is the actual CostType, level 2 is that CostType's parent, level 3 its grandparent, etc. Then your best match queries would return multiple rows and you'd need to pick the row with the lowest level (which is the closest parent/grandparent).
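As a rough illustration of the derived-table idea, the Python sketch below fills in each node's ProfitType and CostType from its nearest ancestors and then runs only the exact-match step. The names ROWS, inherited and exact_match are invented for this example, and the partial-match tiers are left out because "best" is not yet defined.

```python
from typing import Optional

# (id, parent_id, profit_type, cost_type, unit), taken from the question's table
ROWS = [
    (1, None, None, None, None),
    (2, 1, 300, None, None),
    (3, 2, None, 111, None),
    (4, 3, None, None, 8),
    (5, 2, None, 222, None),
    (6, 5, None, 333, None),
    (7, 6, None, None, 8),
    (8, 6, None, None, 9),
]
BY_ID = {row[0]: row for row in ROWS}


def inherited(node_id):
    """Return (profit_type, cost_type, unit) with gaps filled from the nearest ancestors."""
    _, parent_id, profit, cost, unit = BY_ID[node_id]
    # Follow parent_id until both values are known or the root is reached.
    while parent_id is not None and (profit is None or cost is None):
        _, next_parent, parent_profit, parent_cost, _ = BY_ID[parent_id]
        profit = profit if profit is not None else parent_profit
        cost = cost if cost is not None else parent_cost
        parent_id = next_parent
    return profit, cost, unit


# The "derived table": one explicit (profit_type, cost_type, unit) triple per node.
DERIVED = {node_id: inherited(node_id) for node_id in BY_ID}


def exact_match(profit, cost, unit) -> Optional[int]:
    """First query of the cascade: return the lowest id whose triple matches exactly."""
    for node_id in sorted(DERIVED):
        if DERIVED[node_id] == (profit, cost, unit):
            return node_id
    return None


print(DERIVED[4], exact_match(300, 111, 8))
```

The later "2nd best, 3rd best" queries would be further lookups against DERIVED, tried in order until one of them returns a row.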
For somewhat insane business reasons related to getting rather messy data from our customer, I have the following problem:
1) I have a table with 6 semi-unique identifiers and one automatically incrementing unique ID. The table has more fields, but those aren't important to this discussion. Nor is the exact type of data the fields hold.
2) I want to get a list of the unique IDs of all rows that participate in at least one duplicate relationship. (There's no additional value in identifying all of the row pairs that indicate a duplication. But if a solution provides that, it's fairly trivial to retrieve the set of duplicate rows, so that would also be fine.)
3) A duplicate is defined as:
3a) For each of these 6 fields, record A must either match record B or one of them must be null
3b) At least one field must match exactly (i.e. neither is null)
4) All of the potential duplicate fields of interest are strings and are not empty strings. Many rows have at least one of the fields of interest as null but (at least assuming our ingest logic is working) none of them can have more than 3 of these fields as null.
5) Exact string content matching is fine. We don't need any regex-based or case-insensitive sort of matching.
6) Actual duplicates in the table are fairly rare.
7) We are running PostgreSQL 9. Using database-specific functionality is acceptable.
8) The table has 500,000 rows, so the naive query I started out with, provided below, takes far too long to be viable. Presumably it runs in roughly quadratic time, since the self-join compares every pair of rows. Ideally, the results should return in less than a minute, running on a midrange server.
SELECT a.id
FROM myTable a
JOIN myTable b ON a.id < b.id
AND (a.field1 = b.field1 OR a.field1 IS NULL OR b.field1 IS NULL )
AND (a.field2 = b.field2 OR a.field2 IS NULL OR b.field2 IS NULL)
....
WHERE
a.field1 = b.field1 OR a.field2 = b.field2 ...
9) I also looked into using "group by". But "group by" does not consider two rows to be equal if a grouped column in one contains null and the other contains a value. Unless there is a way to achieve that behavior, group by won't work for my "both equal or at least one is null" logic.
10) The set of values that might be expected to appear in each row can be assumed non-overlapping with other columns, i.e., other than null, you will not expect a value from field 1 to appear in any rows for field 2.
Update: Sorry for the lack of information. I'll provide as close an approximation of the table schema as I can. Unfortunately, the project in question is in defense and even just the field names of the table could reveal information about operational security.
CREATE TABLE a (
id serial NOT NULL PRIMARY KEY,
f1 character varying,
f2 character varying,
f3 character varying,
f4 character varying,
f5 character varying,
f6 character varying,
...Other columns that aren't really relevant
)
CREATE INDEX f1_idx
ON public.a
USING btree
(f1 COLLATE pg_catalog."default");
...Same index for the other 5 fields.
For ease of reference, I'll copy Lorenze Albe's question and answer it here.
If you have the three rows
(1, 2, 3, 4, NULL, 6)
(1, 2, 3, NULL, 5, NULL)
(1, 2, 3, 4, 7, NULL)
which are duplicates?
(1, 2, 3, NULL, 5, NULL)
and
(1, 2, 3, 4, 7, NULL)
are not duplicates because field 5 is non-null in both and they are not equal. The other two pairings (row 1 with row 2, and row 1 with row 3) are duplicates.
I'll give a few more examples of my own for clarity. (Just for completeness, I'll provide my row examples as strings. But, like I said, their string-iness isn't really important because we require exact string matches.)
("1", "2", "3", "4", NULL, NULL)
AND
("1","2","3",NULL,"9",NULL)
are duplicates because columns 4, 5, and 6 are null in at least one and all other fields are equal.
("1", "2", "3", "4", NULL, "6")
AND
("1","2","3",NULL,"9","7")
are not duplicates because field 6 differs and neither is null
And two examples more typical of the actual data:
(NULL, NULL, "3", NULL, "5", "6")
and
("1", "2", NULL, "4", NULL, "6")
are duplicates because in every field where they differ, at least one side is null (and field 6 matches exactly).
Yes, that does mean that
(NULL, NULL, NULL, "4", "5", "6")
and
("1", "2", "3", NULL, NULL, NULL)
would be duplicates if not for the requirement that at least one field match exactly. Which fields are null and which aren't is very nearly random. All that we require from our data provider is that at least 2 of the 6 fields must be provided.
Another Update: I've updated point 2 to reflect the fact that I want all rows that participate in at least one duplicate pair. So, for the three rows
(1, 2, 3, 4, NULL, 6)
(1, 2, 3, NULL, 5, NULL)
(1, 2, 3, 4, 7, NULL)
all three would be returned. Even though rows 2 and 3 are not duplicates of each other, row pairs 1,2 and 1,3 are duplicates, so all three participate in a duplicate relationship.
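For reference, rules 3a and 3b can be written down as a small predicate. The Python sketch below is only an illustration of the definition and not a tested solution for the 500,000-row table; the function names are made up. It also shows one possible way to avoid comparing every pair: because rule 3b requires at least one exact match, only rows that share a non-null value in the same field need to be compared. That shortcut assumes no single value is shared by a very large number of rows.

```python
from collections import defaultdict
from itertools import combinations


def is_duplicate(a, b) -> bool:
    """Rule 3a: every field matches or is null on one side. Rule 3b: one exact match."""
    compatible = all(x is None or y is None or x == y for x, y in zip(a, b))
    shares_value = any(x is not None and x == y for x, y in zip(a, b))
    return compatible and shares_value


def duplicate_ids(rows) -> set:
    """rows maps id -> tuple of 6 fields; returns ids in at least one duplicate pair."""
    # Bucket ids by (field position, value) so only plausible pairs are compared.
    buckets = defaultdict(list)
    for row_id, fields in rows.items():
        for position, value in enumerate(fields):
            if value is not None:
                buckets[(position, value)].append(row_id)
    found = set()
    for ids in buckets.values():
        for first, second in combinations(ids, 2):
            if is_duplicate(rows[first], rows[second]):
                found.update((first, second))
    return found


example = {
    1: ("1", "2", "3", "4", None, "6"),
    2: ("1", "2", "3", None, "5", None),
    3: ("1", "2", "3", "4", "7", None),
}
print(sorted(duplicate_ids(example)))
```

Rows 2 and 3 fail the predicate against each other, yet both appear in the result because each one pairs with row 1.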
Use count() over(partition by ...) then filter that result for any counts greater than 1:
CREATE TABLE mytable(
ID INTEGER NOT NULL PRIMARY KEY
,col1 VARCHAR(2) NOT NULL
,col2 VARCHAR(2)
,col3 VARCHAR(2) NOT NULL
,col4 VARCHAR(2)
,col5 VARCHAR(2)
,col6 VARCHAR(2)
);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1001,'a1','b1','c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1002,'a1',NULL,'c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1003,'a1','b1','c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1004,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1005,'a1','b1','c1',NULL,'e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1006,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1007,'f1',NULL,'b1','c1','d1','e1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1008,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1009,'c1','d1','e1','f1',NULL,NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1010,'c1','d1','e1','f1',NULL,'a1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1011,'a1','b1','c1','d1','e1','f1');
select
*
, count(*) over(partition by
coalesce(col1,'NULL')
, coalesce(col2,'NULL')
, coalesce(col3,'NULL')
, coalesce(col4,'NULL')
, coalesce(col5,'NULL')
, coalesce(col6,'NULL')
) cv
from mytable
id | col1 | col2 | col3 | col4 | col5 | col6 | cv
---: | :--- | :--- | :--- | :--- | :--- | :--- | -:
1001 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1003 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1011 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1005 | a1 | b1 | c1 | null | e1 | f1 | 1
1002 | a1 | null | c1 | d1 | e1 | f1 | 1
1008 | b1 | c1 | d1 | e1 | f1 | null | 3
1004 | b1 | c1 | d1 | e1 | f1 | null | 3
1006 | b1 | c1 | d1 | e1 | f1 | null | 3
1010 | c1 | d1 | e1 | f1 | null | a1 | 1
1009 | c1 | d1 | e1 | f1 | null | null | 1
1007 | f1 | null | b1 | c1 | d1 | e1 | 1
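Use the approach above as a subquery, and then filter with where cv > 1 to locate all rows that have "duplicates" in those 6 columns. Note that coalesce makes a null match only another null in the same column, as rows 1002 and 1005 in the output show.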
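The subquery step described above could look roughly like the following. This is a sketch run through Python's sqlite3 module rather than PostgreSQL, so it assumes a SQLite build with window function support (3.25 or later); the alias counted and the variable names are chosen here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (id integer primary key, col1 text not null, col2 text,"
    " col3 text not null, col4 text, col5 text, col6 text)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?, ?, ?, ?, ?)",
    [
        (1001, "a1", "b1", "c1", "d1", "e1", "f1"),
        (1002, "a1", None, "c1", "d1", "e1", "f1"),
        (1003, "a1", "b1", "c1", "d1", "e1", "f1"),
        (1004, "b1", "c1", "d1", "e1", "f1", None),
        (1005, "a1", "b1", "c1", None, "e1", "f1"),
        (1006, "b1", "c1", "d1", "e1", "f1", None),
        (1007, "f1", None, "b1", "c1", "d1", "e1"),
        (1008, "b1", "c1", "d1", "e1", "f1", None),
        (1009, "c1", "d1", "e1", "f1", None, None),
        (1010, "c1", "d1", "e1", "f1", None, "a1"),
        (1011, "a1", "b1", "c1", "d1", "e1", "f1"),
    ],
)

# Wrap the windowed count in a subquery, then keep only groups larger than one.
DUPLICATES_SQL = """
select id
from (
    select id
         , count(*) over (partition by
               coalesce(col1, 'NULL')
             , coalesce(col2, 'NULL')
             , coalesce(col3, 'NULL')
             , coalesce(col4, 'NULL')
             , coalesce(col5, 'NULL')
             , coalesce(col6, 'NULL')
           ) as cv
    from mytable
) counted
where cv > 1
order by id
"""
duplicate_ids = [row[0] for row in conn.execute(DUPLICATES_SQL)]
print(duplicate_ids)
```

The filter has to sit outside the subquery because a window function result cannot be referenced in the where clause of the same select.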
Please note the power of having some sample data to work with. Really, it is your responsibility to provide sample data with your question (as you already own that data anyway). Do NOT try to explain in words alone; use data to illustrate the "as is" and the "to be", and you will find your questions easier to prepare and faster to answer. See "Why should I provide a MCVE".
I have two separate tables, each containing character-separated cells. One table contains all the key data, the other all the val data. In the application, both tables are loaded into recordsets, the two cells are split into two arrays, and the arrays are then used side by side to create key/val pairs.
Trying to decouple the database from the application, I will basically create a view that emulates this behavior.
I created some sample tables and data to better illustrate it.
/* Create tables */
CREATE TABLE [dbo].[tblKeys](
[KeyId] [int] NOT NULL, /*PK*/
[KeyData] [nvarchar](max) NULL
)
CREATE TABLE [dbo].[tblValues](
[ValId] [int] NOT NULL, /*PK*/
[KeyId] [int] NOT NULL, /*FK*/
[Language] [nvarchar](5) NOT NULL,
[ValData] [nvarchar](max) NULL
)
/* Populate tables */
INSERT INTO [dbo].[tblKeys] ([KeyId], [KeyData]) VALUES
(1, '1|2|3'),
(2, '1|2|3|4'),
(3, '2|1')
INSERT INTO [dbo].[tblValues] ([ValId], [KeyId], [Language], [ValData]) VALUES
(1, 1, 'en', 'Apple|Orange|Pear'),
(2, 1, 'sv-se', 'Äpple|Apelsin|Päron'),
(3, 2, 'en', 'Milk|Butter|Cheese|Cream'),
(4, 2, 'sv-se', 'Mjölk|Smör|Ost|Grädde'),
(5, 3, 'en', 'Male|Female'),
(6, 3, 'sv-se', 'Man|Kvinna')
The desired end result in the view looks like this:
| KeyId | KeyData | Language | ValData  |
+-------+---------+----------+----------+
| 1 | 1 | en | Apple |
| 1 | 2 | en | Orange |
| 1 | 3 | en | Pear |
| 1 | 1 | sv-se | Äpple |
| 1 | 2 | sv-se | Apelsin |
| 1 | 3 | sv-se | Päron |
...
etc.
I've seen similar questions here on StackOverflow, but they all deal with a single table being tilted in similar ways. I need to tilt both tables, using the positions of the items within the two columns KeyData and ValData to recombine them into proper key/val pairs.
How would I do this in an efficient manner?
[Edit]: The database design is not mine. It's some old crap I inherited. I know it's bad. Bad as in horrendous.
This would work for you. But: you really should change your data design!
;WITH Splitted AS
(
SELECT v.Language
,v.ValData
,v.ValId
,k.KeyData
,k.KeyId
,CAST('<x>' + REPLACE(v.ValData,'|','</x><x>') + '</x>' AS XML) AS ValData_XML
,CAST('<x>' + REPLACE(k.KeyData,'|','</x><x>') + '</x>' AS XML) AS KeyData_XML
FROM tblValues AS v
INNER JOIN tblKeys AS k ON v.KeyId=k.KeyId
)
,NumberedValData AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY KeyId,Language ORDER BY (SELECT NULL)) AS RowNumber
,KeyId
,Language
,A.B.value('.','varchar(max)') AS ValD
FROM Splitted
CROSS APPLY Splitted.ValData_XML.nodes('/x') AS A(B)
)
,NumberedKeyData AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY KeyId,Language ORDER BY (SELECT NULL)) AS RowNumber
,KeyId
,Language
,A.B.value('.','varchar(max)') AS KeyD
FROM Splitted
CROSS APPLY Splitted.KeyData_XML.nodes('/x') AS A(B)
)
,Combined AS
(
SELECT nk.KeyId
,nk.KeyD
,nk.Language
,nv.ValD
FROM NumberedKeyData AS nk
INNER JOIN NumberedValData AS nv ON nk.KeyId=nv.KeyId AND nk.Language=nv.Language AND nk.RowNumber=nv.RowNumber
)
SELECT Combined.KeyId
,Combined.KeyD AS KeyData
,Splitted.Language
,Combined.ValD AS ValData
FROM Splitted
INNER JOIN Combined ON Splitted.KeyId=Combined.KeyId AND Splitted.Language=Combined.Language
ORDER BY Splitted.KeyId,Splitted.
The result
KeyId KeyData Language ValData
1 1 en Apple
1 2 en Orange
1 3 en Pear
1 1 sv-se Äpple
1 2 sv-se Apelsin
1 3 sv-se Päron
2 1 en Milk
2 2 en Butter
[...]
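For comparison, the application-side behaviour that the view has to emulate (split both cells, then pair the pieces by position) can be sketched in a few lines of Python. This is an illustration only and not part of the T-SQL answer; the function name unpivot and the in-memory data layout are assumptions.

```python
# Sample data copied from the question's INSERT statements.
keys = {1: "1|2|3", 2: "1|2|3|4", 3: "2|1"}
values = [
    (1, 1, "en", "Apple|Orange|Pear"),
    (2, 1, "sv-se", "Äpple|Apelsin|Päron"),
    (3, 2, "en", "Milk|Butter|Cheese|Cream"),
    (4, 2, "sv-se", "Mjölk|Smör|Ost|Grädde"),
    (5, 3, "en", "Male|Female"),
    (6, 3, "sv-se", "Man|Kvinna"),
]


def unpivot(keys, values, sep="|"):
    """Pair the n-th key item with the n-th value item for every value row."""
    result = []
    for _val_id, key_id, language, val_data in values:
        key_parts = keys[key_id].split(sep)
        val_parts = val_data.split(sep)
        # zip pairs items by position and stops at the shorter of the two lists.
        for key_part, val_part in zip(key_parts, val_parts):
            result.append((key_id, key_part, language, val_part))
    return result


pairs = unpivot(keys, values)
for pair in pairs[:3]:
    print(pair)
```

The row numbers in the query above play the role that list positions play here, which is why both numbered CTEs must number their items in the original string order.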