BINARY_CHECKSUM - different result depending on number of rows - sql

I wonder why the BINARY_CHECKSUM function returns different results for the same input:
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL),
(3, 1, 2)) s(id,a,b);
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL)) s(id,a,b);
Output:
+-----+----+------+-------------+
| id | a | b | bc |
+-----+----+------+-------------+
| 1 | | 100 | -109 |
| 2 | | | -2147483640 |
| 3 | 1 | 2 | 18 |
+-----+----+------+-------------+
-- -109 vs 100
+-----+----+------+------------+
| id | a | b | bc |
+-----+----+------+------------+
| 1 | | 100 | 100 |
| 2 | | | 2147483647 |
+-----+----+------+------------+
For the second sample I get what I would expect:
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, 1, 100),
(2, 3, 4),
(3,1,1)) s(id,a,b);
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, 1, 100),
(2, 3, 4)) s(id,a,b);
Output for the first two rows of both queries:
+-----+----+------+-----+
| id | a | b | bc |
+-----+----+------+-----+
| 1 | 1 | 100 | 116 |
| 2 | 3 | 4 | 52 |
+-----+----+------+-----+
db<>fiddle demo
It has strange consequences when I want to compare two tables/queries:
WITH t AS (
SELECT 1 AS id, NULL AS a, 100 b
UNION ALL SELECT 2, NULL, NULL
UNION ALL SELECT 3, 1, 2 -- comment this out
), s AS (
SELECT 1 AS id ,100 AS a, NULL as b
UNION ALL SELECT 2, NULL, NULL
UNION ALL SELECT 3, 2, 1 -- comment this out
)
SELECT t.*,s.*
,BINARY_CHECKSUM(t.a, t.b) AS bc_t, BINARY_CHECKSUM(s.a, s.b) AS bc_s
FROM t
JOIN s
ON s.id = t.id
WHERE BINARY_CHECKSUM(t.a, t.b) = BINARY_CHECKSUM(s.a, s.b);
db<>fiddle demo2
For 3 rows I get single result:
+-----+----+----+-----+----+----+--------------+-------------+
| id | a | b | id | a | b | bc_t | bc_s |
+-----+----+----+-----+----+----+--------------+-------------+
| 2 | | | 2 | | | -2147483640 | -2147483640 |
+-----+----+----+-----+----+----+--------------+-------------+
but for 2 rows I also get id = 1:
+-----+----+------+-----+------+----+-------------+------------+
| id | a | b | id | a | b | bc_t | bc_s |
+-----+----+------+-----+------+----+-------------+------------+
| 1 | | 100 | 1 | 100 | | 100 | 100 |
| 2 | | | 2 | | | 2147483647 | 2147483647 |
+-----+----+------+-----+------+----+-------------+------------+
Remarks:
I am not looking for alternatives such as HASHBYTES/MD5/CHECKSUM.
I am aware that BINARY_CHECKSUM can lead to collisions (two different inputs produce the same output), but the scenario here is a bit different:
For this definition, we say that null values, of a specified type,
compare as equal values. If at least one of the values in the
expression list changes, the expression checksum can also change.
However, this is not guaranteed. Therefore, to detect whether values
have changed, we recommend use of BINARY_CHECKSUM only if your
application can tolerate an occasional missed change.
It is strange to me that a hash function returns different results for the same input arguments.
Is this behaviour by design, or is it some kind of glitch?
EDIT:
As @scsimon points out, it works for materialized tables but not for a CTE.
db<>fiddle actual table
Metadata for cte:
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL),
(3, 1, 2)) s(id,a,b)', NULL,0);
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL)) s(id,a,b)', NULL,0)
-- working workaround
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, cast(NULL as int), 100),
(2, NULL, NULL)) s(id,a,b)', NULL,0)
In all cases every column is reported as INT, but with an explicit CAST it behaves as it should.
db<>fiddle metadata

This has nothing to do with the number of rows. It is because the values in one of the columns of the 2-row version are always NULL. The default type of NULL is int and the default type of a numeric constant (of this length) is int, so these should be comparable. But from a values() derived table, these are (apparently) not exactly the same type.
In particular, a column with only typeless NULLs from a derived table is not comparable, so it is excluded from the binary checksum calculation. This does not occur in a real table, because all columns have types.
The rest of the answer illustrates what is happening.
The code behaves as expected with type conversions:
SELECT *, BINARY_CHECKSUM(a, b) AS bc
FROM (VALUES(1, cast(NULL as int), 100),
(2, NULL, NULL)
) s(id,a,b);
Here is a db<>fiddle.
Actually creating tables with the values suggests that columns with only NULL values have exactly the same type as columns with explicit numbers. That suggests that the original code should work. But an explicit cast also fixes the problem. Very strange.
This is really, really strange. Consider the following:
select v.*, checksum(a, b), checksum(c,b)
FROM (VALUES(1, NULL, 100, NULL),
(2, 1, 2, 1.0)
) v(id, a, b, c);
The change in type for "c" affects the checksum() for the second row, but not for the first.
This is my conclusion. When all the values in a column are NULL, then binary_checksum() is aware of this and the column is in the category of "noncomparable data type". The checksum is then based on the remaining columns.
You can validate this by seeing the error when you run:
select v.*, binary_checksum(a)
FROM (VALUES(1, NULL, 100, NULL),
(2, NULL, 2, 1.0)
) v( id,a, b, c);
It complains:
Argument data type NULL is invalid for argument 1 of checksum function.
Ironically, this is not a problem if you save the results into a table and use binary_checksum(). The issue appears to be some interaction with values() and data types -- but something that is not immediately obvious in the information_schema.columns table.
The happyish news is that the code should work on tables, even if it does not work on values() generated derived tables -- as this SQL Fiddle demonstrates.
I also learned that a column filled with NULLs really is typeless. The assignment of the int data type in a select into seems to happen when the table is being defined. The "typeless" type is converted to an int.
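The conclusion above (an untyped, all-NULL column is dropped from the checksum input) can be mimicked with a small toy model. This is only a sketch: the rotate-by-4-and-XOR step and the 0x7FFFFFFF stand-in for a typed NULL are assumptions inferred from the numbers shown in the question, not documented SQL Server behaviour, and toy_binary_checksum is a hypothetical name.

```python
MASK = 0xFFFFFFFF
NULL_WORD = 0x7FFFFFFF  # assumed stand-in for a typed NULL (inferred, not documented)


def toy_binary_checksum(values, typed):
    """Toy 32-bit checksum over int columns; untyped columns are skipped.

    values: one row's column values (int or None)
    typed:  one flag per column; False models an all-NULL column that
            never received a data type and is therefore ignored
    """
    acc = 0
    for value, is_typed in zip(values, typed):
        if not is_typed:
            continue  # the column is not part of the checksum input at all
        acc = ((acc << 4) | (acc >> 28)) & MASK  # rotate left by 4 bits
        acc ^= NULL_WORD if value is None else value & MASK
    # reinterpret the 32-bit pattern as a signed integer, like SQL's int
    return acc - (1 << 32) if acc & 0x80000000 else acc


both_typed = (True, True)   # 3-row query: column a contains a real int
a_dropped = (False, True)   # 2-row query: column a holds only bare NULLs

print(toy_binary_checksum((None, 100), both_typed))   # -109
print(toy_binary_checksum((None, 100), a_dropped))    # 100
print(toy_binary_checksum((None, None), both_typed))  # -2147483640
print(toy_binary_checksum((None, None), a_dropped))   # 2147483647
```

Under these assumptions the model reproduces the values from the question, including the join case: with one column dropped on each side, (NULL, 100) and (100, NULL) both reduce to a checksum over the single value 100.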

For the literal NULL without the CAST (and without any typed values in the column) it entirely ignores it and just gives you the same result as BINARY_CHECKSUM(b).
This seems to happen very early on. The initial tree representation output from
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL)) s(id,a,b)
OPTION (RECOMPILE, QUERYTRACEON 8605, QUERYTRACEON 3604);
Already shows that it has decided to just use one column as input to the function
ScaOp_Intrinsic binary_checksum
ScaOp_Identifier COL: Union1008
This compares with the following output for your first query
ScaOp_Intrinsic binary_checksum
ScaOp_Identifier COL: Union1011
ScaOp_Identifier COL: Union1010
If you try and get the BINARY_CHECKSUM with
SELECT *, BINARY_CHECKSUM(a) AS bc
FROM (VALUES(1, NULL, 100)) s(id,a,b)
It gives the error
Msg 8184, Level 16, State 1, Line 8 Error in binarychecksum. There are
no comparable columns in the binarychecksum input.
This is not the only place where an untyped NULL constant is treated differently from an explicitly typed one.
Another case is
SELECT COALESCE(CAST(NULL AS INT),CAST(NULL AS INT))
vs
SELECT COALESCE(NULL,NULL)
I'd err on the side of "glitch" in this case rather than "by design" though as the columns from the derived table are supposed to be int before they get to the checksum function.
SELECT COALESCE(a,b)
FROM (VALUES(NULL, NULL)) s(a,b)
Does work as expected without this glitch.

Related

Is it possible to map values onto a table given corresponding row and column indices in SQL?

I have a SQL table in the form of:
| value | row_loc | column_loc |
|-------|---------|------------|
| a | 0 | 1 |
| b | 1 | 1 |
| c | 1 | 0 |
| d | 0 | 0 |
I would like to find a way to map it onto a table/grid, given the indices, using SQL. Something like:
| d | a |
| c | b |
(The context being, I would like to create a colour map with colours corresponding to values a, b, c, d, in the locations specified)
I would be able to do this iteratively in python, but cannot figure out how to do it in SQL, or if it is even possible! Any help or guidance on this problem would be greatly appreciated!
EDIT: a, b, c, d are examples of numeric values (which would not be able to be selected using named variables in practice, so I'm relying on selecting them based on location. Also worth noting, the number of rows and columns will always be the same. The value column is also not the primary key to this table, so is not necessarily unique, it is just as a continuous value.
Yes, it is possible, provided the number of columns is known in advance, since a SQL query must declare a fixed set of output columns. The number of rows in the result set depends on the number of distinct row_loc values, so we group by row_loc and pick the value for each column with a simple CASE expression.
with t (value, row_loc, column_loc) as (
select 'a', 0, 1 from dual union all
select 'b', 1, 1 from dual union all
select 'c', 1, 0 from dual union all
select 'd', 0, 0 from dual
)
select max(case column_loc when 0 then value else null end) as column0
, max(case column_loc when 1 then value else null end) as column1
from t
group by row_loc
order by row_loc
I tested it on Oracle. Not sure what to do if multiple values match on same coordinate, I chose max. For different vendors you could also utilize special clauses such as count ... filter (where ...). Or the Oracle pivot clause can also be used.
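For readers without an Oracle instance, the same conditional-aggregation idea can be tried with Python's built-in sqlite3 module. This is a minimal sketch: SQLite is my assumption rather than the engine used above, pivot_grid is a hypothetical helper name, and the two output columns are hard-coded just as in the query above.

```python
import sqlite3


def pivot_grid(cells):
    """Pivot (value, row_loc, column_loc) triples into a two-column grid."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE t (value TEXT, row_loc INTEGER, column_loc INTEGER)"
        )
        conn.executemany("INSERT INTO t VALUES (?, ?, ?)", cells)
        # One output row per row_loc; each CASE keeps only the value that
        # belongs to its column, and MAX collapses the group to that value.
        return conn.execute(
            """
            SELECT MAX(CASE column_loc WHEN 0 THEN value END) AS column0,
                   MAX(CASE column_loc WHEN 1 THEN value END) AS column1
            FROM t
            GROUP BY row_loc
            ORDER BY row_loc
            """
        ).fetchall()
    finally:
        conn.close()


cells = [("a", 0, 1), ("b", 1, 1), ("c", 1, 0), ("d", 0, 0)]
for row in pivot_grid(cells):
    print(row)  # ('d', 'a') then ('c', 'b')
```

A wider grid needs one more MAX(CASE ...) expression per column, which is why the column count has to be known when the query is written.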

Use IN to compare Array of Values against a table of data

I want to compare an array of values against the rows of a table and return only the rows in which the data are different.
Suppose I have myTable:
| ItemCode | ItemName | FrgnName |
|----------|----------|----------|
| CD1 | Apple | Mela |
| CD2 | Mirror | Specchio |
| CD3 | Bag | Borsa |
Now using the SQL instruction IN I would like to compare the rows above against an array of values pasted from an excel file and so in theory I would have to write something like:
WHERE NOT IN (
ARRAY[CD1, Apple, Mella],
ARRAY[CD2, Miror, Specchio],
ARRAY[CD3, Bag, Borsa]
)
The query should return rows 1 and 2, since "Mella" and "Miror" are in fact typos.
You could use a VALUES expression to emulate a table of those arrays, like so:
... myTable AS t
LEFT JOIN (
VALUES (1, 'CD1','Apple','Mella')
, (1, 'CD2', 'Miror', 'Specchio')
, (1, 'CD3', 'Bag', 'Borsa')
) AS v(rowPresence, a, b, c)
ON t.ItemCode = v.a AND t.ItemName = v.b AND t.FrgnName = v.c
WHERE v.rowPresence IS NULL
Technically, in your scenario you can do without the "rowPresence" field I added: since none of the values in your arrays are NULL, any of the other columns would do. I added it mainly to cover the more general case.
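The anti-join above can be tried end to end with Python's sqlite3 module. This is a sketch under a few assumptions: SQLite stands in for whatever database the question uses, the VALUES list is wrapped in a CTE because SQLite does not accept the v(rowPresence, a, b, c) alias form on a derived table, and mismatched_rows is a hypothetical helper name.

```python
import sqlite3


def mismatched_rows(table_rows, pasted_rows):
    """Return the rows of myTable that have no exact match in pasted_rows.

    pasted_rows must contain at least one row, otherwise the VALUES list
    would be empty and the statement would not parse.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE myTable (ItemCode TEXT, ItemName TEXT, FrgnName TEXT)"
        )
        conn.executemany("INSERT INTO myTable VALUES (?, ?, ?)", table_rows)
        # One "(1, ?, ?, ?)" group per pasted row; the leading 1 is rowPresence.
        placeholders = ", ".join(["(1, ?, ?, ?)"] * len(pasted_rows))
        params = [field for row in pasted_rows for field in row]
        sql = f"""
            WITH v(rowPresence, a, b, c) AS (VALUES {placeholders})
            SELECT t.ItemCode, t.ItemName, t.FrgnName
            FROM myTable AS t
            LEFT JOIN v
              ON t.ItemCode = v.a AND t.ItemName = v.b AND t.FrgnName = v.c
            WHERE v.rowPresence IS NULL
            ORDER BY t.ItemCode
        """
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


table = [("CD1", "Apple", "Mela"), ("CD2", "Mirror", "Specchio"), ("CD3", "Bag", "Borsa")]
pasted = [("CD1", "Apple", "Mella"), ("CD2", "Miror", "Specchio"), ("CD3", "Bag", "Borsa")]
print(mismatched_rows(table, pasted))
# [('CD1', 'Apple', 'Mela'), ('CD2', 'Mirror', 'Specchio')]
```

Binding the pasted values as parameters, instead of concatenating them into the SQL text, avoids quoting problems with values copied from a spreadsheet.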

Find best match in tree given a combination of multiple keys

I have a structure / tree that looks similar to this.
CostType is mandatory and can exist by itself, but it can have a parent ProfitType or Unit and other CostTypes as children.
Only Units can be duplicated; the other values cannot appear more than once in the structure.
| ID | name | parent_id | ProfitType | CostType | Unit |
| -: | ------------- | --------: | ---------: | -------: | ---: |
| 1 | Root | (NULL) | | | |
| 2 | 1 | 1 | 300 | | |
| 3 | 1-1 | 2 | | 111 | |
| 4 | 1-1-1 | 3 | | | 8 |
| 5 | 1-2 | 2 | | 222 | |
| 6 | 1-2-1 | 5 | | 333 | |
| 7 | 1-2-1-1 | 6 | | | 8 |
| 8 | 1-2-1-2 | 6 | | | 9 |
Parameters | should RETURN |
(300,111,8) | 4 |
(null,111,8) | 4 |
(null,null,8) | first match, 4 |
(null,222,8) | best match, 5 |
(null,333,null) | 6 |
I am at a loss on how I could create a function that receives (ProfitType, CostType, Unit) and return the best matching ID from the structure.
This isn't giving exactly the answers you provided as example, but see my comment above - if (null,222,8) should be 7 to match how (null,333,8) returns 4 then this is correct.
Also note that I wrote this using temp tables instead of as a function. I don't want to trip a schema change audit, so I posted what I have as temp tables; I can rewrite it as a function Monday when my DBA is available, but I thought you might need it before the weekend. Just edit the "DECLARE @ProfitType int = ..." lines to the values you want to test.
I also put in quite a few comments because the logic is tricky, but if they aren't enough leave a comment and I can expand my explanation
/*
ASSUMPTIONS:
A tree can be of arbitrary depth, but will not exceed the recursion limit (defaults to 100)
All trees will include at least 1 CostType
All trees will have at most 1 ProfitType
CostType can appear multiple times in a traversal from root to leaf (can units?)
*/
SELECT *
INTO #Temp
FROM (VALUES (1,'Root',NULL, NULL, NULL, NULL)
, (2,'1', 1, 300, NULL, NULL)
, (3,'1-1', 2, NULL, 111, NULL)
, (4,'1-1-1', 3, NULL, NULL, 8)
, (5,'1-2', 2, NULL, 222, NULL)
, (6,'1-2-1', 5, NULL, 333, NULL)
, (7,'1-2-1-1', 6, NULL, NULL, 8)
, (8,'1-2-1-2', 6, NULL, NULL, 9)
) as TempTable(ID, RName, Parent_ID, ProfitType, CostType, UnitID)
--SELECT * FROM #Temp
DECLARE @ProfitType INT = NULL --300
DECLARE @CostType INT = 333 --NULL --111
DECLARE @UnitID INT = NULL --8
--SELECT * FROM #Temp
;WITH cteMatches as (
--Start with all nodes that match one criteria, default a score of 100
SELECT N.ID as ReportID, *, 100 as Score, 1 as Depth
FROM #Temp AS N
WHERE N.CostType = @CostType OR N.ProfitType = @ProfitType OR N.UnitID = @UnitID
), cteEval as (
--This is a recursive CTE, it has a (default) limit of 100 recursions
--, but that can be raised if your trees are deeper than 100 nodes
--Start with the base case
SELECT M.ReportID, M.RName, M.ID ,M.Parent_ID, M.Score
, M.Depth, M.ProfitType , M.CostType , M.UnitID
FROM cteMatches as M
UNION ALL
--This is the recursive part, add to the list of matches the match when
--its immediate parent is also considered. For that match increase the score
--if the parent contributes another match. Also update the ID of the match
--to the parent's IDs so recursion can keep adding if more matches are found
SELECT M.ReportID, M.RName, N.ID ,N.Parent_ID
, M.Score + CASE WHEN N.CostType = @CostType
OR N.ProfitType = @ProfitType
OR N.UnitID = @UnitID THEN 100 ELSE 0 END as Score
, M.Depth + 1, N.ProfitType , N.CostType , N.UnitID
FROM cteEval as M INNER JOIN #Temp AS N on M.Parent_ID = N.ID
)SELECT TOP 1 * --Drop the "TOP 1 *" to see debugging info (runners up)
FROM cteEval
ORDER BY SCORE DESC, DEPTH
DROP TABLE #Temp
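The scoring rule used by the recursive CTE above can also be followed step by step in plain Python. This is a sketch of my reading of that query, not a replacement for it: the NODES data mirrors the temp table, best_match and matches are hypothetical names, and ties are broken by the lower ID, which the SQL version leaves undefined.

```python
NODES = {
    # id: (parent_id, ProfitType, CostType, UnitID), mirroring #Temp above
    1: (None, None, None, None),
    2: (1, 300, None, None),
    3: (2, None, 111, None),
    4: (3, None, None, 8),
    5: (2, None, 222, None),
    6: (5, None, 333, None),
    7: (6, None, None, 8),
    8: (6, None, None, 9),
}


def matches(node_id, profit_type, cost_type, unit_id):
    """True if the node equals at least one non-NULL search parameter."""
    _, node_profit, node_cost, node_unit = NODES[node_id]
    wanted = ((profit_type, node_profit), (cost_type, node_cost), (unit_id, node_unit))
    return any(param is not None and param == actual for param, actual in wanted)


def best_match(profit_type, cost_type, unit_id):
    """Start at every matching node, walk up to the root, add 100 per matching ancestor."""
    candidates = []
    for start in NODES:
        if not matches(start, profit_type, cost_type, unit_id):
            continue
        score, depth = 100, 1
        candidates.append((-score, depth, start))
        parent = NODES[start][0]
        while parent is not None:
            if matches(parent, profit_type, cost_type, unit_id):
                score += 100
            depth += 1
            candidates.append((-score, depth, start))
            parent = NODES[parent][0]
    # highest score first, then shallowest depth, then lowest ID (assumed tie-break)
    return min(candidates)[2] if candidates else None


print(best_match(300, 111, 8))     # 4
print(best_match(None, 222, 8))    # 7
print(best_match(None, 333, None)) # 6
```

As noted above, (null,222,8) comes out as 7 rather than the 5 listed in the question, because node 7 matches the unit and has the matching CostType 222 among its ancestors.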
I'm sorry I don't have enough rep to comment.
You'll have to define "best answer" (for example, why isn't the answer to null,222,8 7 or null instead of 5?), but here's the approach I'd use:
Derive a new table where ProfitType and CostType are listed explicitly instead of only by inheritance. I would approach that by using a cursor (how awful, I know) and following the parent_id until a ProfitType and CostType is found -- or the root is reached. This presumes an unlimited amount of child/grandchild levels for parent_id. If there is a limit, then you can instead use N self joins where N is the number of parent_id levels allowed.
Then you run multiple queries against the derived table. The first query would be for an exact match (and then exit if found). Then next query would be for the "best" partial match (then exit if found), followed by queries for 2nd best, 3rd best, etc. until you've exhausted your "best" match criteria.
If you need nested parent CostTypes to be part of the "best match" criteria, then I would make duplicate entries in the derived table for each row that has multiple CostTypes with a CostType "level". level 1 is the actual CostType. level 2 is that CostType's parent, level 3 etc. Then your best match queries would return multiple rows and you'd need to pick the row with the lowest level (which is the closest parent/grandparent).

SQL Duplicate With Nulls

For somewhat insane business reasons related to getting rather messy data from our customer, I have the following problem;
1)I have a table with 6 semi-unique identifiers and one automatically incrementing unique ID. The table has more fields. But, those aren't important to this discussion. Nor is the exact type of data the fields hold.
2)I want to get a list of the unique IDs of all rows that participate in at least one duplicate relationship. (There's not any additional value in identifying all of the row pairs that indicate a duplication. But, if a solution provides that, it's fairly trivial to retrieve the set of duplicate rows. So, that would also be fine)
3)A duplicate is defined as;
3a)For each of these 6 fields, record A must either match record B or one of them must be null
3b)At least one field must match exactly (i.e. neither is null)
4)All of the potential duplicate fields of interest are strings and are not empty strings. Many rows have at least one of the fields of interest as null but (at least assuming our ingest logic is working) none of them can have more than 3 of these fields as null.
5)Exact string content matching is fine. We don't need any regex-based, case-insensitive... sort of matching.
6)Actual duplicates in the table are fairly rare.
7)We are running PostgreSQL 9. Using database-specific functionality is acceptable.
8)The table has 500,000 rows. So, the naive query I started out with, provided below, takes far too long to be viable. Presumably, it operates principally in exponential time. Ideally, the results should return in less than a minute, running on a midrange server.
SELECT a.id
FROM myTable a
JOIN myTable b ON a.id < b.id
AND (a.field1 = b.field1 OR a.field1 IS NULL OR b.field1 IS NULL )
AND (a.field2 = b.field2 OR a.field2 IS NULL OR b.field2 IS NULL)
....
WHERE
a.field1 = b.field1 OR a.field2 = b.field2 ...
9)I also looked into using "group by". But, "group by" does not consider two rows to be equal if a grouped column in one contains null and the other contains a value. Unless there is a way to achieve that behavior, group by won't work for my "both equal or at least one is null" logic.
10)The set of values that might be expected to appear in each row can be assumed non-overlapping with other columns. i.e., other than null, you will not expect a value from field 1 to appear in any rows for field 2.
Update: Sorry for the lack of information. I'll provide as close an approximation of the table schema as I can. Unfortunately, the project in question is in defense and even just the field names of the table could reveal information about operational security.
CREATE TABLE a (
id serial NOT NULL PRIMARY KEY,
f1 character varying,
f2 character varying,
f3 character varying,
f4 character varying,
f5 character varying,
f6 character varying,
...Other columns that aren't really relevant
)
CREATE INDEX f1_idx
ON public.a
USING btree
(f1 COLLATE pg_catalog."default");
...Same index for the other 5 fields.
For ease of reference, I'll copy Lorenze Albe's question and answer it here.
If you have the three rows
(1, 2, 3, 4, NULL, 6)
(1, 2, 3, NULL, 5, NULL)
(1, 2, 3, 4, 7, NULL)
which are duplicates?
(1, 2, 3, NULL, 5, NULL)
and
(1, 2, 3, 4, 7, NULL)
are not duplicates because field 5 is non-null in both and they are not equal. The other two are duplicates.
I'll give a few more examples of my own for clarity. (Just for completeness, I'll provide my row examples as strings. But, like I said, their string-iness isn't really important because we require exact string matches.)
("1", "2", "3", "4", NULL, NULL)
AND
("1","2","3",NULL,"9",NULL)
are duplicates because columns 4, 5, and 6 are null in at least one and all other fields are equal.
("1", "2", "3", "4", NULL, "6")
AND
("1","2","3",NULL,"9","7")
are not duplicates because field 6 differs and neither is null
And two examples more typical of the actual data;
(NULL, NULL, "3", NULL, "5", "6")
and
("1", "2", NULL, "4", NULL, "6")
are duplicates because all fields wherein they differ, at least one side is null.
Yes, that does mean that
(NULL, NULL, NULL, "4", "5", "6")
and
("1", "2", "3", NULL, NULL, NULL)
would be duplicates if not for the requirement that at least one field match exactly. Which fields are null and which aren't is very nearly random. All that we require from our data provider is that at least 2 of the 6 fields must be provided.
Another Update: I've updated point 2 to reflect the fact that I want all rows that participate in at least one duplicate pair. So, for the three rows
(1, 2, 3, 4, NULL, 6)
(1, 2, 3, NULL, 5, NULL)
(1, 2, 3, 4, 7, NULL)
all three would be returned because even though rows 2 and 3 would not be considered duplicates of each other, row pairs 1,2 are duplicates and 1,3 are duplicates and therefore all three participate in a duplicate relationship and therefore would be returned.
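The duplicate rule in points 3a and 3b, together with the "participates in at least one pair" requirement, can be pinned down in a few lines of Python. This is only a sketch to make the definition testable, not a tuned PostgreSQL solution: is_duplicate and duplicate_ids are hypothetical names, and the bucketing step relies on rule 3b (two duplicates must share at least one non-null value in the same field), so only rows that share a value are ever compared.

```python
from collections import defaultdict
from itertools import combinations


def is_duplicate(a, b):
    """Rule 3a: every field matches or is NULL on one side. Rule 3b: one exact match."""
    pairs = list(zip(a, b))
    compatible = all(x is None or y is None or x == y for x, y in pairs)
    shares_value = any(x is not None and x == y for x, y in pairs)
    return compatible and shares_value


def duplicate_ids(rows):
    """rows maps id -> tuple of the six fields; returns ids in at least one duplicate pair."""
    buckets = defaultdict(list)
    for row_id, fields in rows.items():
        for position, value in enumerate(fields):
            if value is not None:
                buckets[(position, value)].append(row_id)
    found = set()
    # Only rows sharing a (field position, value) can satisfy rule 3b,
    # so the full check runs inside each bucket instead of on all pairs.
    for ids in buckets.values():
        for left, right in combinations(ids, 2):
            if is_duplicate(rows[left], rows[right]):
                found.update((left, right))
    return found


rows = {
    1: (1, 2, 3, 4, None, 6),
    2: (1, 2, 3, None, 5, None),
    3: (1, 2, 3, 4, 7, None),
}
print(sorted(duplicate_ids(rows)))  # [1, 2, 3]
```

How much this saves depends on the data: if a single value is shared by a large fraction of the rows, its bucket is still compared pairwise.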
Use count() over(partition by ...) then filter that result for any counts greater than 1:
CREATE TABLE mytable(
ID INTEGER NOT NULL PRIMARY KEY
,col1 VARCHAR(2) NOT NULL
,col2 VARCHAR(2)
,col3 VARCHAR(2) NOT NULL
,col4 VARCHAR(2)
,col5 VARCHAR(2)
,col6 VARCHAR(2)
);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1001,'a1','b1','c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1002,'a1',NULL,'c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1003,'a1','b1','c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1004,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1005,'a1','b1','c1',NULL,'e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1006,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1007,'f1',NULL,'b1','c1','d1','e1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1008,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1009,'c1','d1','e1','f1',NULL,NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1010,'c1','d1','e1','f1',NULL,'a1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1011,'a1','b1','c1','d1','e1','f1');
select
*
, count(*) over(partition by
coalesce(col1,'NULL')
, coalesce(col2,'NULL')
, coalesce(col3,'NULL')
, coalesce(col4,'NULL')
, coalesce(col5,'NULL')
, coalesce(col6,'NULL')
) cv
from mytable
id | col1 | col2 | col3 | col4 | col5 | col6 | cv
---: | :--- | :--- | :--- | :--- | :--- | :--- | -:
1001 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1003 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1011 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1005 | a1 | b1 | c1 | null | e1 | f1 | 1
1002 | a1 | null | c1 | d1 | e1 | f1 | 1
1008 | b1 | c1 | d1 | e1 | f1 | null | 3
1004 | b1 | c1 | d1 | e1 | f1 | null | 3
1006 | b1 | c1 | d1 | e1 | f1 | null | 3
1010 | c1 | d1 | e1 | f1 | null | a1 | 1
1009 | c1 | d1 | e1 | f1 | null | null | 1
1007 | f1 | null | b1 | c1 | d1 | e1 | 1
Use the approach above as a subquery, and then use where cv > 1 to locate all rows that have "duplicates" in those 6 columns.
db<>fiddle here
Please note the power of having some sample data to work with. Really it is your responsibility to provide sample data with your question (as you already own that data anyway). Do NOT try to explain in words alone, use data to illustrate the "as is" and the "to be", you will find your questions easier to prepare and faster to answer. See Why should I provide a MCVE

How do I split two columns in two separate tables into joined multiple rows in a view?

I have two separate tables, each containing character-separated cells. One table contains all the key data, the other all the val data. In the application, both tables are loaded into recordsets, the two cells are split into two arrays and the arrays are then used 'side-by side' to create key/val pair.
Trying to decouple the database from the application, I will basically create a view that emulates this behavior.
I created some sample tables and data to better illustrate it.
/* Create tables */
CREATE TABLE [dbo].[tblKeys](
[KeyId] [int] NOT NULL, /*PK*/
[KeyData] [nvarchar](max) NULL
)
CREATE TABLE [dbo].[tblValues](
[ValId] [int] NOT NULL, /*PK*/
[KeyId] [int] NOT NULL, /*FK*/
[Language] [nvarchar](5) NOT NULL,
[ValData] [nvarchar](max) NULL
)
/* Populate tables */
INSERT INTO [dbo].[tblKeys] ([KeyId], [KeyData]) VALUES
(1, '1|2|3'),
(2, '1|2|3|4'),
(3, '2|1')
INSERT INTO [dbo].[tblValues] ([ValId], [KeyId], [Language], [ValData]) VALUES
(1, 1, 'en', 'Apple|Orange|Pear'),
(2, 1, 'sv-se', 'Äpple|Apelsin|Päron'),
(3, 2, 'en', 'Milk|Butter|Cheese|Cream'),
(4, 2, 'sv-se', 'Mjölk|Smör|Ost|Grädde'),
(5, 3, 'en', 'Male|Female'),
(6, 3, 'sv-se', 'Man|Kvinna')
The desired end result in the view looks like this:
| KeyId | KeyData | Language | ValData
+-------+---------+----------+----------+
| 1 | 1 | en | Apple |
| 1 | 2 | en | Orange |
| 1 | 3 | en | Pear |
| 1 | 1 | sv-se | Äpple |
| 1 | 2 | sv-se | Apelsin |
| 1 | 3 | sv-se | Päron |
...
etc.
I've seen similar questions here on StackOverflow, but they all deal with a single table being tilted in similar ways. I need to tilt both tables, using the positions of the data in the two columns KeyData and ValData to recombine them into proper key/val pairs.
How would I do this in an efficient manner?
[Edit]: The database design is not mine. It's some old crap I inherited. I know it's bad. Bad as in horrendous.
This would work for you. But: you really should change your data design!
;WITH Splitted AS
(
SELECT v.Language
,v.ValData
,v.ValId
,k.KeyData
,k.KeyId
,CAST('<x>' + REPLACE(v.ValData,'|','</x><x>') + '</x>' AS XML) AS ValData_XML
,CAST('<x>' + REPLACE(k.KeyData,'|','</x><x>') + '</x>' AS XML) AS KeyData_XML
FROM tblValues AS v
INNER JOIN tblKeys AS k ON v.KeyId=k.KeyId
)
,NumberedValData AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY KeyId,Language ORDER BY (SELECT NULL)) AS RowNumber
,KeyId
,Language
,A.B.value('.','varchar(max)') AS ValD
FROM Splitted
CROSS APPLY Splitted.ValData_XML.nodes('/x') AS A(B)
)
,NumberedKeyData AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY KeyId,Language ORDER BY (SELECT NULL)) AS RowNumber
,KeyId
,Language
,A.B.value('.','varchar(max)') AS KeyD
FROM Splitted
CROSS APPLY Splitted.KeyData_XML.nodes('/x') AS A(B)
)
,Combined AS
(
SELECT nk.KeyId
,nk.KeyD
,nk.Language
,nv.ValD
FROM NumberedKeyData AS nk
INNER JOIN NumberedValData AS nv ON nk.KeyId=nv.KeyId AND nk.Language=nv.Language AND nk.RowNumber=nv.RowNumber
)
SELECT Combined.KeyId
,Combined.KeyD AS KeyData
,Splitted.Language
,Combined.ValD AS ValData
FROM Splitted
INNER JOIN Combined ON Splitted.KeyId=Combined.KeyId AND Splitted.Language=Combined.Language
ORDER BY Splitted.KeyId,Splitted.
The result
KeyId KeyData Language ValData
1 1 en Apple
1 2 en Orange
1 3 en Pear
1 1 sv-se Äpple
1 2 sv-se Apelsin
1 3 sv-se Päron
2 1 en Milk
2 2 en Butter
[...]
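For comparison, the positional pairing that the application currently does, and that the view above emulates with ROW_NUMBER, looks roughly like this in Python. It is a sketch only: pair_by_position and expand are hypothetical names, and raising an error on lists of different lengths is my assumption about how bad data should be handled.

```python
def pair_by_position(key_data, val_data, sep="|"):
    """Split both strings and pair the N-th key with the N-th value."""
    keys = key_data.split(sep)
    values = val_data.split(sep)
    if len(keys) != len(values):
        raise ValueError("KeyData and ValData have different element counts")
    return list(zip(keys, values))


def expand(keys_table, values_table):
    """keys_table: {KeyId: KeyData}; values_table: [(KeyId, Language, ValData)]."""
    rows = []
    for key_id, language, val_data in values_table:
        for key, value in pair_by_position(keys_table[key_id], val_data):
            rows.append((key_id, key, language, value))
    return rows


keys_table = {1: "1|2|3", 3: "2|1"}
values_table = [(1, "en", "Apple|Orange|Pear"), (3, "sv-se", "Man|Kvinna")]
for row in expand(keys_table, values_table):
    print(row)
# (1, '1', 'en', 'Apple'), (1, '2', 'en', 'Orange'), (1, '3', 'en', 'Pear'),
# (3, '2', 'sv-se', 'Man'), (3, '1', 'sv-se', 'Kvinna')
```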