Query to get count of distinct items in groupings - sql

I have a table that stores created groupings for items from another table, like this:
table1:
id  | group  | qty
111 | cups   | 1
222 | plates | 2
333 | spoons | 5
444 | null   | 2
555 | knives | 2
table2:
group_id | categories               | count_inventory
A1       | {"x": ["cups","plates"]} | 3
B1       | {"x": ["cups"]}          | 1
C1       | {"x": ["cups","spoons"]} | 6
C4       | {"x": ["spoons"]}        | 5
Given the above, I want to write a query that returns the count of items from table1 for which a grouping has been created.
It may sound like the query below, but that is actually not what I'm looking for, because the groups have to be manually created before they appear in table2, so you may have an item from table1 that doesn't exist in table2 because its grouping hasn't been created yet (e.g. id 555).
SELECT count(id)
FROM table1
WHERE "group" IS NOT NULL
The above will return 4, but I need something that looks at table2 and returns 3, which is the count of items from table1 whose group exists in the categories column of table2.
My real tables can be pretty large, up to 100k+ rows, so I don't think it is efficient to check whether each group string from table1 exists in table2 one by one, as that would probably take forever to run - or is that the only viable solution?
PS: the categories column is not of JSON type, it's just a string.

Not sure that this will be faster, but you can prepare an aggregate of the existing categories. Something like this (you can also try set_union instead of array_agg with flatten and array_distinct):
SELECT array_distinct(flatten(array_agg(CAST(JSON_EXTRACT(categories, '$.x') as ARRAY(VARCHAR)))))
FROM table2
Then check that each group is in the result.
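For instance, a minimal sketch of that check, assuming Presto's contains function and the table shapes above (the uncorrelated scalar subquery should be evaluated once, not per row):
SELECT count(*)
FROM table1
WHERE "group" IS NOT NULL
  AND contains(
        (SELECT array_distinct(flatten(array_agg(
                    CAST(json_extract(categories, '$.x') AS ARRAY(VARCHAR)))))
         FROM table2),
        "group");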

Assuming that table2 does not contain any groups in its arrays that are not in table1, you can try the following:
WITH table1(id, "group", qty) AS (
    SELECT *
    FROM (VALUES (111, 'cups', 1),
                 (222, 'plates', 2),
                 (333, 'spoons', 5),
                 (444, null, 2),
                 (555, 'knives', 2))
),
table2(group_id, categories, count_inventory) AS (
    SELECT *
    FROM (VALUES ('A1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups', 'plates']]) AS JSON), 3),
                 ('B1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups']]) AS JSON), 1),
                 ('C1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups', 'spoons']]) AS JSON), 6),
                 ('C4', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['spoons']]) AS JSON), 5))
)
SELECT reduce(
           array_agg(CAST(json_extract(categories, '$.x') AS ARRAY(VARCHAR))),
           CAST(ARRAY[] AS ARRAY(VARCHAR)), -- typed empty array as the initial state
           (s, x) -> array_union(s, x),
           x -> cardinality(x)
       )
FROM table2
WHERE categories IS NOT NULL;
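If that assumption doesn't hold, a sketch of a more direct approach (same table shapes as above) is to unnest the categories once and join against table1; for the sample data this returns 3, since cups, plates and spoons appear in table2 while knives does not:
SELECT count(DISTINCT t1.id)
FROM table1 t1
JOIN (
    SELECT DISTINCT category
    FROM table2
    CROSS JOIN UNNEST(CAST(json_extract(categories, '$.x') AS ARRAY(VARCHAR))) AS t(category)
) c ON t1."group" = c.category;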


SQL Query for insert many values in a table and take only a value from another table

I'm looking to insert many values into a table, taking the ID reference from another table. I have tried different ways, and finally I have found this, which works.
INSERT INTO tblUserFreeProperty (id, identname, val, pos)
VALUES ((SELECT id FROM tblpart where tblPart.ordernr=N'3CFSU05'),N'DSR_Mag.G', N'??_??#False', 1),
((SELECT id FROM tblpart where tblPart.ordernr=N'3CFSU05'),N'DSR_Mag.Qta_C', N'??_??#0', 2),
((SELECT id FROM tblpart where tblPart.ordernr=N'3CFSU05'),N'DSR_Mag.Qta_M', N'??_??#0', 3),
((SELECT id FROM tblpart where tblPart.ordernr=N'3CFSU05'),N'DSR_Mag.UbicM', N'??_??#No', 4),
((SELECT id FROM tblpart where tblPart.ordernr=N'3CFSU05'),N'DSR_Mag.UbicS', N'??_??#', 5),
((SELECT id FROM tblpart where tblPart.ordernr=N'3CFSU05'),N'DSR_Mag.UbicP', N'??_??#', 6),
((SELECT id FROM tblpart where tblPart.ordernr=N'3CFSU05'),N'DSR_Mag.UbicC', N'??_??#', 7);
This works, but I'm looking for an "easier query" because I need to write the command from Visual Studio.
The link I noted earlier should have sufficed to explain the correct syntax.
Insert into ... values ( SELECT ... FROM ... )
But seeing as there has been much misinformation on this post, I will show how you should do it.
INSERT INTO tblUserFreeProperty (id, identname, val, pos)
SELECT p.id, v.identname, v.val, v.pos
FROM (VALUES
(N'DSR_Mag.G', N'??_??#False', 1),
(N'DSR_Mag.Qta_C', N'??_??#0', 2),
(N'DSR_Mag.Qta_M', N'??_??#0', 3),
(N'DSR_Mag.UbicM', N'??_??#No', 4),
(N'DSR_Mag.UbicS', N'??_??#', 5),
(N'DSR_Mag.UbicP', N'??_??#', 6),
(N'DSR_Mag.UbicC', N'??_??#', 7)
) AS v(identname, val, pos)
JOIN tblpart p ON p.ordernr = N'3CFSU05';
Note the use of a standard JOIN clause; there are no subqueries. Note also the use of short, meaningful table aliases.
As far as the VALUES table constructor goes, it can also be replaced with a temp table, a table variable, or a table-valued parameter. Or indeed another table.
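For example, a minimal sketch of the table-variable variant (the @v name and column sizes are assumptions):
DECLARE @v TABLE (identname NVARCHAR(100), val NVARCHAR(100), pos INT);

INSERT INTO @v (identname, val, pos)
VALUES (N'DSR_Mag.G', N'??_??#False', 1),
       (N'DSR_Mag.Qta_C', N'??_??#0', 2); -- ...and the remaining rows as above

INSERT INTO tblUserFreeProperty (id, identname, val, pos)
SELECT p.id, v.identname, v.val, v.pos
FROM @v AS v
JOIN tblpart AS p ON p.ordernr = N'3CFSU05';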
Side note: I don't know what you are storing in those columns, but it appears you have multiple pieces of info in each. Do not do this. Store each atomic value in its own column.
INSERT INTO tblUserFreeProperty (id, identname, val, pos)
SELECT tblPart.id, X.A, X.B, X.C
FROM (
    VALUES (N'DSR_Mag.G0', N'??_??#True', 1),
           (N'DSR_Mag.G1', N'??_??#True', 2),
           (N'DSR_Mag.G2', N'??_??#False', 3)
) X(A, B, C)
CROSS JOIN tblPart
WHERE tblPart.ordernr = N'555'

How to use LIMIT in PostgreSQL correctly to only query one row per ID

I have a query where users enter a list of stocks called {placeholders}, which is stored in a Python variable. The query pulls 9 columns from t1 and 1 column from t2.
f''' SELECT t1.id, cast(t1.enterprisevalue as money), ROUND(t1.enterprise_value_revenue, 2),
ROUND(t1.revenuepershare, 2),
ROUND(t1.debt_to_equity, 2),
ROUND(t1.profitmargin, 2),
ROUND(t1.price_to_sales, 2),
ROUND(t1.price_to_book, 2),
ROUND(t1.put_call_ratio, 2),
t2.employees,
cast(ROUND(t1.revenue_per_employee, 2) as money)
FROM
security_advanced_stats as t1
LEFT JOIN security_stats as t2 USING (id)
WHERE id IN ({placeholders})
ORDER BY id LIMIT 1;
'''
I want ONE row per stock symbol in {placeholders}, which is why I'm using LIMIT here. However, the syntax is wrong: the query is limiting the whole result set to a single row, so the output only shows data for one stock symbol and not the others in {placeholders}.
If I take away LIMIT, then I get all of the rows in the database, when I'm only looking for the most recent record for each of my stocks (which I label as id).
This is what happens when I take out LIMIT: notice there are two symbols, EXPD and VFC, but each has multiple entries with the same data. I only want the most recent row for EXPD and VFC in the case above.
How can I fix my query?
The DISTINCT ON feature is very well suited for this. Basically, you choose the fields which you don't want duplicated, and you get only the first row for each combination, per your sort order.
(I'm assuming you have some kind of timestamp column, so we can get the most recent row for each id.)
SELECT DISTINCT ON (t1.id)
t1.id,
cast(t1.enterprisevalue as money),
ROUND(t1.enterprise_value_revenue, 2),
ROUND(t1.revenuepershare, 2),
ROUND(t1.debt_to_equity, 2),
ROUND(t1.profitmargin, 2),
ROUND(t1.price_to_sales, 2),
ROUND(t1.price_to_book, 2),
ROUND(t1.put_call_ratio, 2),
t2.employees,
cast(ROUND(t1.revenue_per_employee, 2) as money)
FROM security_advanced_stats as t1
LEFT JOIN security_stats as t2 USING (id)
WHERE id IN ({placeholders})
ORDER BY id, timestamp DESC;
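A portable alternative, sketched under the same assumption of a timestamp column, is to rank the rows per id and keep only the first:
SELECT id, enterprisevalue, employees -- plus the other rounded columns as above
FROM (
    SELECT t1.*, t2.employees,
           row_number() OVER (PARTITION BY t1.id ORDER BY timestamp DESC) AS rn
    FROM security_advanced_stats AS t1
    LEFT JOIN security_stats AS t2 USING (id)
    WHERE id IN ({placeholders})
) sub
WHERE rn = 1;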
You can use first_value as a window function; every row in a partition gets the same first_value result, so DISTINCT collapses each id to one row (note that the CAST wraps the whole window expression, and the window needs an ORDER BY to define which row comes first):
SELECT DISTINCT t1.id,
       cast(first_value(t1.enterprisevalue) OVER w AS money),
       round(first_value(t1.enterprise_value_revenue) OVER w, 2),
       ...
FROM security_advanced_stats AS t1
LEFT JOIN security_stats AS t2 USING (id)
WHERE id IN ({placeholders})
WINDOW w AS (PARTITION BY t1.id ORDER BY timestamp DESC);

Presto filter an array during aggregation

I would like to filter an aggregated array depending on all the values associated with an id. The values are strings and can be of three types: all-x:y, x:y, and empty (here x and y are arbitrary substrings of the values).
I have a few conditions:
If an id has x:y then the result should contain x:y.
If an id always has all-x:y then the resulting aggregation should have all-x:y
If an id sometimes has all-x:y then the resulting aggregation should have x:y
For example, with the following:
WITH
my_table(id, my_values) AS (
    VALUES
        (1, ['all-a','all-b']),
        (2, ['all-c','b']),
        (3, ['a','b','c']),
        (1, ['all-a']),
        (2, []),
        (3, ['all-c'])
)
The result should be:
(1, ['all-a','b']),
(2, ['c','b']),
(3, ['a','b','c']),
I have worked multiple hours on this, but it seems like it's not feasible.
I came up with the following, but it feels like it cannot work, because I would need to check the presence of all-x in all the arrays, which is what should go in <<IN ALL>>:
SELECT
    id,
    SET_UNION(
        CASE
            WHEN SPLIT_PART(my_values, '-', 1) = 'all' THEN
                CASE
                    WHEN <<my_values IN ALL>> THEN my_values
                    ELSE REPLACE(my_values, 'all-')
                END
            ELSE my_values
        END
    ) AS my_values
FROM my_table
GROUP BY 1
I would need to check that all the arrays for a specific id contain all-x, and that's where I'm struggling to find a solution.
After a few hours of searching I am starting to believe that it is not feasible.
Any help is appreciated. Thank you for reading.
This should do what you want:
WITH my_table(id, my_values) AS (
    VALUES
        (1, array['all-a','all-b']),
        (2, array['all-c','b']),
        (3, array['a','b','c']),
        (1, array['all-a']),
        (2, CAST(array[] AS array(varchar))), -- typed empty array so VALUES can infer the column type
        (3, array['all-c'])
),
with_group_counts AS (
    SELECT *, count(*) OVER (PARTITION BY id) group_count -- rows per id, to see if the number of all-X occurrences matches the number of rows for a given id
    FROM my_table
),
normalized AS (
    SELECT
        id,
        if(
            count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'), -- it's an all-X value and every original row for the given id contains it
            value,
            if(starts_with(value, 'all-'), substr(value, 5), value)) AS extracted
    FROM with_group_counts
    CROSS JOIN UNNEST(with_group_counts.my_values) t(value)
)
SELECT id, array_agg(DISTINCT extracted)
FROM normalized
GROUP BY id
The trick is to compute the total number of rows for each id in the original table via the count(*) OVER (PARTITION BY id) expression in the with_group_counts subquery. We can then use that value to determine whether a given value should be treated as an all-x or have the x extracted. That's handled by the following expression:
if(
count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'),
value,
if(starts_with(value, 'all-'), substr(value, 5), value))
For more information about window functions in Presto, check out the documentation. You can find the documentation for UNNEST here.

Testing equality of referencing rows

I have a number of tables that follow this rather common pattern: A <-->> B. I would like to find the pairs of matching rows in table A where certain columns have equal values and which also have referencing rows in B where certain columns have equal values. In other words, a pair of rows (R, S) in A matches iff, for given sets of columns {a1, a2, …, an} in A and {b1, b2, …, bn} in B:
We have R.a1 = S.a1, R.a2 = S.a2, …, R.an = S.an.
For every referencing row T of R in B there exists a referencing row U of S in B such that T.b1 = U.b1, T.b2 = U.b2, …, T.bn = U.bn.
(R, S) matches iff (S, R) matches.
(I'm not very familiar with relational algebra, so my definition above might not follow any convention.)
The approach that I came up with was:
Find pairs (R, S) that have matching columns.
See if there's an equal number of (any) referencing rows of R and S in B.
For each row in B find a matching row, group by the referencing row in A and count. Check that there are as many matching rows as referencing rows.
However, the query that I wrote (below) for steps 2 and 3, to find matching rows in B, is quite complex. Is there a better solution?
-- Tables similar to those that I have.
CREATE TABLE a (
id INTEGER PRIMARY KEY,
data TEXT
);
CREATE TABLE b (
id INTEGER PRIMARY KEY,
a_id INTEGER REFERENCES a (id),
data TEXT
);
SELECT DISTINCT dup.lhs_parent_id, dup.rhs_parent_id
FROM (
SELECT DISTINCT
MIN(lhs.a_id, rhs.a_id) AS lhs_parent_id, -- Normalize.
MAX(lhs.a_id, rhs.a_id) AS rhs_parent_id,
COUNT(*) AS count
FROM b lhs
INNER JOIN b rhs USING (data)
WHERE NOT (lhs.id = rhs.id OR lhs.a_id = rhs.a_id) -- Remove self-matching rows and duplicate values with the same parent.
GROUP BY lhs.a_id, rhs.a_id
) dup
INNER JOIN ( -- Check that lhs has the same number of rows.
SELECT
a_id AS parent_id,
COUNT(*) AS count
FROM b
GROUP BY a_id
) lhs_ct ON (
dup.lhs_parent_id = lhs_ct.parent_id AND
dup.count = lhs_ct.count
)
INNER JOIN ( -- Check that rhs has the same number of rows.
SELECT
a_id AS parent_id,
COUNT(*) AS count
FROM b
GROUP BY a_id
) rhs_ct ON (
dup.rhs_parent_id = rhs_ct.parent_id AND
dup.count = rhs_ct.count
);
-- Test data.
-- Expected query result is three rows with values (1, 2), (1, 3) and (2, 3) for a_id,
-- since the first three rows (with values 'row 1', 'row 2' and 'row 3')
-- have referencing rows, each of which has a matching pair. The fourth row
-- ('row 4') only has one referencing row, with the value 'foo', so it doesn't have a
-- pair for the other rows' referencing rows with the value 'bar'.
INSERT INTO a (id, data) VALUES
(1, 'row 1'),
(2, 'row 2'),
(3, 'row 3'),
(4, 'row 4');
INSERT INTO b (id, a_id, data) VALUES
(1, 1, 'foo'),
(2, 1, 'bar'),
(3, 2, 'foo'),
(4, 2, 'bar'),
(5, 3, 'foo'),
(6, 3, 'bar'),
(7, 4, 'foo');
I'm using SQLite.
To find matching and differing rows it is easier to use INTERSECT and EXCEPT operations than joins.
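For instance, a minimal sketch of that set-based check for a single candidate pair (the ids 1 and 2 are assumed for illustration):
-- Returns 1 if parents 1 and 2 reference identical sets of b.data values, 0 otherwise.
SELECT NOT EXISTS (SELECT data FROM b WHERE a_id = 1
                   EXCEPT
                   SELECT data FROM b WHERE a_id = 2)
   AND NOT EXISTS (SELECT data FROM b WHERE a_id = 2
                   EXCEPT
                   SELECT data FROM b WHERE a_id = 1);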
But when only one field is actually used in the comparison, a JOIN solution looks better:
Select b1.a_id, b2.a_id
From (
    Select Data, A_Id, Count(Id) A_Count
    From B
    Group By Data, A_Id
) b1
Inner Join (
    Select Data, A_Id, Count(Id) A_Count
    From B
    Group By Data, A_Id
) b2 On b1.data = b2.data And b1.a_count = b2.a_count And b1.a_id <> b2.a_id
Group By b1.a_id, b2.a_id
Having Count(*) = (Select Count(Distinct data) From b Where a_id = b1.a_id) -- every data value of the left parent matched
   And Count(*) = (Select Count(Distinct data) From b Where a_id = b2.a_id) -- and every data value of the right parent matched
As I understand it, you need to find the pairs of different a_id values which have the same data values and the same count for each.
The result of my script gives the possible pairs in both directions, which leaves room for optimization with SQLite-specific syntax.
Result example:
{1,2}, {1,3}, {2,1}, {2,3}, {3,2}, {3,1}

MSSQL ORDER BY Passed List

I am using Lucene to perform queries on a subset of SQL data, which returns a scored list of RecordIDs, e.g. 11, 4, 5, 25, 30.
I want to use this list to retrieve a set of results from the full SQL Table by RecordIDs.
So SELECT * FROM MyFullRecord
where RecordID in (11,5,3,25,30)
I would like the retrieved list to maintain the scored order.
I can do it by using an ORDER BY like so:
ORDER BY (CASE WHEN RecordID = 11 THEN 0
WHEN RecordID = 5 THEN 1
WHEN RecordID = 3 THEN 2
WHEN RecordID = 25 THEN 3
WHEN RecordID = 30 THEN 4
END)
I am concerned about the load on the server, especially if I am passing long lists of RecordIDs. Does anyone have experience with this, or know how I can determine an optimum list length?
Are there any other ways to achieve this functionality in MSSQL?
Roger
You can load your list into a temp table or table variable with sorting priorities, and then join your table with this sorting one.
CREATE TABLE #tSortOrder (RecordID INT, SortOrder INT)
INSERT INTO #tSortOrder (RecordID, SortOrder)
SELECT 11, 1 UNION ALL
SELECT 5, 2 UNION ALL
SELECT 3, 3 UNION ALL
SELECT 25, 4 UNION ALL
SELECT 30, 5
SELECT *
FROM yourTable T
LEFT JOIN #tSortOrder S ON T.RecordID = S.RecordID
ORDER BY S.SortOrder
Instead of a searched CASE expression in the ORDER BY, you could create an in-memory table to join. It's easier on the eyes and definitely scales better.
SQL Statement
SELECT mfr.*
FROM MyFullRecord mfr
INNER JOIN (
SELECT *
FROM (VALUES (1, 11),
(2, 5),
(3, 3),
(4, 25),
(5, 30)
) q(ID, RecordID)
) q ON q.RecordID = mfr.RecordID
ORDER BY
q.ID
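As a side note, the inner derived table isn't strictly required; SQL Server also accepts the VALUES constructor directly as a join source, which trims the query a bit:
SELECT mfr.*
FROM MyFullRecord mfr
INNER JOIN (VALUES (1, 11),
                   (2, 5),
                   (3, 3),
                   (4, 25),
                   (5, 30)
           ) q(ID, RecordID) ON q.RecordID = mfr.RecordID
ORDER BY q.ID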
Something like:
SELECT * FROM MyFullRecord
WHERE RecordID IN (11, 5, 3, 25, 30)
ORDER BY CHARINDEX(',' + CAST(RecordID AS varchar) + ',',
                   ',' + '11,5,3,25,30' + ',')