SQL group by values stored in arrays - sql

I use OrientDB.
I have a table like this:
NAME | CATEGORIES
-------------------
N1 | [A,B]
N2 | [C]
N3 | [C,A]
N4 | [A,B]
And I would like to build a query that returns a list of categories, and for each category a list of related names, like this:
CATEGORY | NAMES
-----------------------
A | [N1,N3,N4]
B | [N1,N4]
C | [N2,N3]
If "categories" wasn't an array, I could achieve it with:
SELECT
Categories as Category,
set(Name) as Names
FROM Table
GROUP BY Categories
But because Categories holds arrays, this is what I get:
CATEGORY | NAMES
--------------------
[A,B] | [N1,N4]
[C] | [N2]
[C,A] | [N3]
What query should I write instead?

I reproduced your structure with these commands
create class test extends v
create property test.name string
create property test.categories embeddedlist string
insert into test(name,categories) values ("N1",["A","B"]),("N2",["C"]),("N3",["C","A"]),("N4",["A","B"])
and I used this query
select categories,$a.name as name from (select distinct(categories) as categories from (select categories from test order by categories unwind categories))
let $a = (select name from test where categories contains $parent.$current.categories)
Hope it helps.
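For readers who are not on OrientDB, the idea behind the unwind step can be sketched in plain Python: flatten every row into (category, name) pairs, then group by category. This is only an illustration of the technique, not OrientDB code; the `rows` list and the helper name `group_names_by_category` are made up for the example.

```python
from collections import defaultdict

# Rows from the question: (name, categories)
rows = [
    ("N1", ["A", "B"]),
    ("N2", ["C"]),
    ("N3", ["C", "A"]),
    ("N4", ["A", "B"]),
]


def group_names_by_category(rows):
    """Unwind each row into (category, name) pairs, then group by category."""
    grouped = defaultdict(list)
    for name, categories in rows:
        for category in categories:  # the "unwind" step: one pair per array element
            grouped[category].append(name)
    # Sort keys and values so the output is deterministic
    return {category: sorted(names) for category, names in sorted(grouped.items())}


result = group_names_by_category(rows)
for category, names in result.items():
    print(category, names)
```

With the sample data this yields A with N1, N3, N4; B with N1, N4; and C with N2, N3, which is the shape the question asks for.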

Related

SQL query filter list within a list

Consider a db with 2 tables like below. CompanyId is a foreign key in Department table pointing to Company. Company names are unique but the department names are not.
Table : Department
+-------+-----------+-----------+-------------------+
| id | name | CompanyId | phone |
+-------+-----------+-----------+-------------------+
| 1 | Sales | 1 | 214-444-1934 |
| 2 | R&D | 1 | 555-111-1834 |
| 3 | Sales | 2 | 214-222-1734 |
| 4 | Finance | 2 | 817-333-1634 |
| 5 | Sales | 3 | 214-555-1434 |
+-------+-----------+-----------+-------------------+
Table : Company
+-------+-----------+
| id | name |
+-------+-----------+
| 1 | Best1 |
| 2 | NewTec |
| 3 | JJA |
+-------+-----------+
I have a filter like the one below. When departmentName is null (empty), all departments of that company should be included in the result; when it is a list, only the listed departments should be included.
[ {
companyName: "Best1",
departmentName: ["Sales", "R&D"]
},
{
companyName: "NewTec",
departmentName: ["Finance"]
} ,
{
companyName: "JJA",
departmentName: null
}
]
Note: The filter is dynamic (a request to an API endpoint) and may include thousands of companies and departments.
I want a SQL query that returns all department ids matching the criteria. For this example the result would be "1,2,4,5" (all department ids except 3, the NewTec Sales department).
I'm looking for an efficient SQL and/or LINQ query to return the result.
I can loop through the companies and filter the departments for each one, but that means one round trip to the database per company when using an ORM. Is there a better way to handle this case?
Here is the SQL query you want:
SELECT
d.id
FROM Department d
INNER JOIN Company c
ON d.CompanyId = c.id
WHERE
(c.name = 'Best1' AND d.name IN ('Sales', 'R&D')) OR
(c.name = 'NewTec' AND d.name = 'Finance') OR
c.name = 'JJA';
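Since the filter is dynamic, a WHERE clause of this shape has to be generated from the request. A minimal sketch in Python follows, using the standard sqlite3 module as a stand-in database so it can be run as-is. The helper name `build_department_query` is hypothetical, and treating an empty list the same as null is an assumption based on the "null (empty)" wording in the question.

```python
import sqlite3


def build_department_query(filters):
    """Return (sql, params) selecting the department ids that match the filters."""
    clauses, params = [], []
    for f in filters:
        departments = f["departmentName"]
        params.append(f["companyName"])
        if departments:
            # One bound parameter per listed department
            placeholders = ", ".join("?" for _ in departments)
            clauses.append(f"(c.name = ? AND d.name IN ({placeholders}))")
            params.extend(departments)
        else:
            # Assumption: None or an empty list means "all departments"
            clauses.append("(c.name = ?)")
    where = " OR ".join(clauses) if clauses else "0"  # no filters: match nothing
    sql = (
        "SELECT d.id FROM Department d "
        "INNER JOIN Company c ON d.CompanyId = c.id "
        f"WHERE {where} ORDER BY d.id"
    )
    return sql, params


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Company (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT, CompanyId INTEGER, phone TEXT);
INSERT INTO Company VALUES (1, 'Best1'), (2, 'NewTec'), (3, 'JJA');
INSERT INTO Department VALUES
  (1, 'Sales', 1, '214-444-1934'), (2, 'R&D', 1, '555-111-1834'),
  (3, 'Sales', 2, '214-222-1734'), (4, 'Finance', 2, '817-333-1634'),
  (5, 'Sales', 3, '214-555-1434');
""")

filters = [
    {"companyName": "Best1", "departmentName": ["Sales", "R&D"]},
    {"companyName": "NewTec", "departmentName": ["Finance"]},
    {"companyName": "JJA", "departmentName": None},
]
sql, params = build_department_query(filters)
ids = [row[0] for row in conn.execute(sql, params)]
print(ids)
```

All values are passed as bound parameters rather than concatenated into the string. With thousands of companies the statement grows accordingly, and databases typically cap the number of bound parameters per statement, so this approach suits small to medium filters.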
You want to deal with a variable number of conditions. There are mainly two ways to solve this:
Build the query string dynamically from your criteria.
Put the criteria in a separate table and query against that table.
With a filter table as such:
COMPANY_NAME | DEPARTMENT_NAME
-------------+----------------
Best1 | Sales
Best1 | R&D
NewTec | Finance
JJA | (null)
The unvarying (!) query would be:
SELECT *
FROM Department d
INNER JOIN Company c ON d.CompanyId = c.id
WHERE EXISTS
(
SELECT *
FROM Filter f
WHERE f.company_name = c.name
AND (f.department_name = d.name OR f.department_name IS NULL)
);
Here is a demo: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=49ff15426776536acc6f5bd7f88aaf8f (I've hijacked Tim's demo for this :-)
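The filter-table approach can be sketched end to end from client code. The example below uses Python's sqlite3 module as a stand-in database (an assumption; the question does not name a client language beyond LINQ) and a temporary table called Filter. The request is flattened into (company_name, department_name) rows, with a NULL department meaning "all departments", and the same fixed query runs regardless of the filter size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Company (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT, CompanyId INTEGER, phone TEXT);
INSERT INTO Company VALUES (1, 'Best1'), (2, 'NewTec'), (3, 'JJA');
INSERT INTO Department VALUES
  (1, 'Sales', 1, '214-444-1934'), (2, 'R&D', 1, '555-111-1834'),
  (3, 'Sales', 2, '214-222-1734'), (4, 'Finance', 2, '817-333-1634'),
  (5, 'Sales', 3, '214-555-1434');
CREATE TEMP TABLE Filter (company_name TEXT, department_name TEXT);
""")

filters = [
    {"companyName": "Best1", "departmentName": ["Sales", "R&D"]},
    {"companyName": "NewTec", "departmentName": ["Finance"]},
    {"companyName": "JJA", "departmentName": None},
]

# Flatten the request: one row per (company, department); None becomes NULL
filter_rows = [
    (f["companyName"], department)
    for f in filters
    for department in (f["departmentName"] or [None])
]
conn.executemany("INSERT INTO Filter VALUES (?, ?)", filter_rows)

# The query text never changes; only the contents of Filter do
ids = [
    row[0]
    for row in conn.execute("""
        SELECT d.id
        FROM Department d
        INNER JOIN Company c ON d.CompanyId = c.id
        WHERE EXISTS (
            SELECT 1
            FROM Filter f
            WHERE f.company_name = c.name
              AND (f.department_name = d.name OR f.department_name IS NULL)
        )
        ORDER BY d.id
    """)
]
print(ids)
```

Loading the filter costs one batched insert and the lookup costs one query, so the number of round trips does not grow with the number of companies.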
And here is an idea how to combine the two approaches mentioned: Make the filter table a temporary view, i.e. put it in a WITH clause. But in order to keep the query unvarying, you fill that WITH clause from a stored procedure which does the dynamic part. The stored procedure would read an XML containing the criteria and select rows from it. Thus you may be able to have your ORM call this procedure to get the results you are after.
Here is a thread explaining how to build a query in a stored procedure taking input from an XML: Creating a query from XML input parameter in SQL Server stored procedure and verifying output
The query would then look something like:
WITH Filter AS (SELECT company_name, department_name FROM dbm.GetMyFilterDataFromXml(@Xml))
SELECT *
FROM Department d
INNER JOIN Company c ON d.CompanyId = c.id
WHERE EXISTS
(
SELECT *
FROM Filter f
WHERE f.company_name = c.name
AND (f.department_name = d.name OR f.department_name IS NULL)
);

Select all rows that have at least a list of features with wildcard support

given a table definition:
Objects:
obj_id | obj_name
-------|--------------
1 | object1
2 | object2
3 | object3
Tags:
tag_id | tag_name
-------|--------------
1 | code:python
2 | code:cpp
3 | color:green
4 | colorful
5 | image
objects_tags:
obj_id | tag_id
-------|---------
1 | 1
1 | 2
2 | 1
2 | 3
3 | 1
3 | 2
3 | 3
I'd like to select objects that have all tags from a given list, where the list entries may contain wildcards. Similar questions have been asked several times, and the answer to the simpler variant looks more or less like this:
SELECT obj_id,count(*) c FROM objects_tags
INNER JOIN objects USING(obj_id)
INNER JOIN tags USING(tag_id)
WHERE (tag_name GLOB 'code*' OR tag_name GLOB 'color*')
GROUP BY obj_id
HAVING (c==2)
However, this solution doesn't work with wildcards. Is it possible to write a similar query that returns only objects having at least one tag for each given wildcard pattern? Checking c>=2 doesn't work, because one wildcard can match several tags while another matches none, so the object passes even though it shouldn't.
I considered having the client build a dynamic query consisting of N INTERSECTs (one per pattern), since there probably won't be many of them, but that sounds like a really dirty solution, and if there is a more SQL-like way I'd prefer to use it.
SQLite supports the WITH clause, so I would use it to determine all matching tags first and then use those tags to find the objects, as shown below.
The example (demo) is written for PostgreSQL because I could not run SQLite on any online tester, but I believe you can convert it easily to SQLite.
This query retrieves all tags:
WITH tagss AS (
SELECT * FROM Tags
WHERE tag_name LIKE 'code:%' OR tag_name LIKE 'color:%'
)
SELECT * FROM tagss;
| tag_id | tag_name |
|--------|-------------|
| 1 | code:python |
| 2 | code:cpp |
| 3 | color:green |
and the final query uses the above subquery in this way:
WITH tagss AS (
SELECT * FROM Tags
WHERE tag_name LIKE 'code:%' OR tag_name LIKE 'color:%'
)
SELECT obj_id,count(*) c
FROM objects_tags
INNER JOIN tagss USING(tag_id)
WHERE tag_name IN ( SELECT tag_name FROM tagss)
GROUP BY obj_id
HAVING count(*) >= (
SELECT count(*) FROM tagss
)
| obj_id | c |
|--------|---|
| 3 | 3 |
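Note that the query above keeps only objects that carry every tag matched by any pattern, which is why object 2 (code:python and color:green) is missing from the result. If the requirement is "at least one tag per pattern", as the question states, one possible variation is to put the patterns in their own table and count the distinct patterns each object satisfies. The sketch below runs it in SQLite through Python's sqlite3 module; the `patterns` table and its contents are assumptions introduced for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
CREATE TABLE objects_tags (obj_id INTEGER, tag_id INTEGER);
CREATE TABLE patterns (pattern TEXT PRIMARY KEY);  -- hypothetical helper table
INSERT INTO tags VALUES
  (1, 'code:python'), (2, 'code:cpp'), (3, 'color:green'),
  (4, 'colorful'), (5, 'image');
INSERT INTO objects_tags VALUES
  (1, 1), (1, 2), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3);
INSERT INTO patterns VALUES ('code:*'), ('color:*');
""")

# An object qualifies when the number of distinct patterns matched by its
# tags equals the total number of patterns.
matching = [
    row[0]
    for row in conn.execute("""
        SELECT ot.obj_id
        FROM objects_tags ot
        JOIN tags t ON t.tag_id = ot.tag_id
        JOIN patterns p ON t.tag_name GLOB p.pattern
        GROUP BY ot.obj_id
        HAVING COUNT(DISTINCT p.pattern) = (SELECT COUNT(*) FROM patterns)
        ORDER BY ot.obj_id
    """)
]
print(matching)
```

Object 1 has two tags, but both match only the code pattern, so it is excluded; objects 2 and 3 each cover both patterns. Counting distinct patterns instead of rows is what keeps several hits on one wildcard from compensating for zero hits on another.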

Counting SQLite rows that might match multiple times in a single query

I have a SQLite table which has a column containing categories that each row may fall into. Each row has a unique ID, but may fall into zero, one, or more categories, for example:
|-------+-------|
| name | cats |
|-------+-------|
| xyzzy | a b c |
| plugh | b |
| quux | |
| quuux | a c |
|-------+-------|
I'd like to obtain counts of how many items are in each category. In other words, output like this:
|------------+-------|
| categories | total |
|------------+-------|
| a | 2 |
| b | 2 |
| c | 2 |
| none | 1 |
|------------+-------|
I tried to use the case statement like this:
select case
when cats like "%a%" then 'a'
when cats like "%b%" then 'b'
when cats like "%c%" then 'c'
else 'none'
end as categories,
count(*)
from test
group by categories
But the problem is this only counts each row once, so it can't handle multiple categories. You then get this output instead:
|------------+-------|
| categories | total |
|------------+-------|
| a | 2 |
| b | 1 |
| none | 1 |
|------------+-------|
One possibility is to use as many union statements as you have categories:
select case
when cats like "%a%" then 'a'
end as categories, count(*)
from test
group by categories
union
select case
when cats like "%b%" then 'b'
end as categories, count(*)
from test
group by categories
union
...
but this seems really ugly and the opposite of DRY.
Is there a better way?
Fix your data structure! You should have a table with one row per name and per category:
create table nameCategories (
name varchar(255),
category varchar(255)
);
Then your query would be easy:
select category, count(*)
from namecategories
group by category;
Why is your data structure bad? Here are some reasons:
A column should contain a single value.
SQL has pretty lousy string functionality.
SQL queries to do what you want cannot be optimized.
SQL has a great data structure for storing lists. It is called a table, not a string.
With that in mind, here is one brute force method for doing what you want:
with categories as (
select 'a' as category union all
select 'b' union all
. . .
)
select c.category, count(t.name)
from categories c left join
test t
on ' ' || t.cats || ' ' like '% ' || c.category || ' %'
group by c.category;
If you already have a table of valid categories, then the CTE is not needed.
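As a rough sketch of the suggested restructuring, the snippet below migrates the space-separated cats column into the nameCategories table and then counts per category, using Python's sqlite3 module. The LEFT JOIN with a 'none' bucket is an addition to reproduce the output requested in the question; it is not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (name TEXT PRIMARY KEY, cats TEXT);
INSERT INTO test VALUES
  ('xyzzy', 'a b c'), ('plugh', 'b'), ('quux', ''), ('quuux', 'a c');
CREATE TABLE nameCategories (name TEXT, category TEXT);
""")

# One-off migration: split each cats string into one row per (name, category)
pairs = [
    (name, category)
    for name, cats in conn.execute("SELECT name, cats FROM test").fetchall()
    for category in (cats or "").split()
]
conn.executemany("INSERT INTO nameCategories VALUES (?, ?)", pairs)

# Count per category; names without any category fall into 'none'
counts = conn.execute("""
    SELECT COALESCE(nc.category, 'none') AS categories, COUNT(*) AS total
    FROM test t
    LEFT JOIN nameCategories nc ON nc.name = t.name
    GROUP BY COALESCE(nc.category, 'none')
    ORDER BY 1
""").fetchall()
for category, total in counts:
    print(category, total)
```

After the migration the counting query needs no string matching at all, and an index on nameCategories(category) can be used for lookups.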

postgres - pivot query with array values

Suppose I have this table:
Content
+----+---------+
| id | title |
+----+---------+
| 1 | lorem |
+----|---------|
And this one:
Fields
+----+------------+----------+-----------+
| id | id_content | name | value |
+----+------------+----------+-----------+
| 1 | 1 | subtitle | ipsum |
+----+------------+----------+-----------|
| 2 | 1 | tags | tag1 |
+----+------------+----------+-----------|
| 3 | 1 | tags | tag2 |
+----+------------+----------+-----------|
| 4 | 1 | tags | tag3 |
+----+------------+----------+-----------|
The thing is: I want to query the content, transforming all the rows from "Fields" into columns, to get something like:
+----+-------+----------+---------------------+
| id | title | subtitle | tags |
+----+-------+----------+---------------------+
| 1 | lorem | ipsum | [tag1,tag2,tag3] |
+----+-------+----------+---------------------|
Also, subtitle and tags are just examples. I can have as many fields as I want, whether they are arrays or not.
But I haven't found a way to convert the repeated "name" values into an array, let alone without turning "subtitle" into an array as well. If that's not possible, "subtitle" could also become an array and I could change it later in code, but I need at least to group everything somehow. Any ideas?
You can use array_agg, e.g.
SELECT id_content, array_agg(value)
FROM fields
WHERE name = 'tags'
GROUP BY id_content
If you need the subtitle, too, use a self-join. I have a subselect to cope with subtitles that don't have any tags without returning arrays filled with NULLs, i.e. {NULL}.
SELECT f1.id_content, f1.value, f2.value
FROM fields f1
LEFT JOIN (
SELECT id_content, array_agg(value) AS value
FROM fields
WHERE name = 'tags'
GROUP BY id_content
) f2 ON (f1.id_content = f2.id_content)
WHERE f1.name = 'subtitle';
See http://www.postgresql.org/docs/9.3/static/functions-aggregate.html for details.
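Another way to get scalar and array columns from one pass is conditional aggregation: aggregate each field name separately inside a single GROUP BY. The sketch below demonstrates the shape of the query with Python's sqlite3 module, so SQLite's group_concat stands in for array_agg and the tags come back as a comma-separated string that is split in code. In PostgreSQL the equivalent would presumably use array_agg with a CASE expression (or a FILTER clause on 9.4 and later); that variant is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fields (id INTEGER PRIMARY KEY, id_content INTEGER, name TEXT, value TEXT);
INSERT INTO content VALUES (1, 'lorem');
INSERT INTO fields VALUES
  (1, 1, 'subtitle', 'ipsum'),
  (2, 1, 'tags', 'tag1'), (3, 1, 'tags', 'tag2'), (4, 1, 'tags', 'tag3');
""")

# Each CASE yields NULL for rows of other field names; MAX and
# group_concat both skip NULLs, so every column sees only its own rows.
row = conn.execute("""
    SELECT c.id,
           c.title,
           MAX(CASE WHEN f.name = 'subtitle' THEN f.value END) AS subtitle,
           group_concat(CASE WHEN f.name = 'tags' THEN f.value END) AS tags
    FROM content c
    LEFT JOIN fields f ON f.id_content = c.id
    GROUP BY c.id, c.title
""").fetchone()

content_id, title, subtitle, tags_csv = row
# Element order within group_concat is not guaranteed, so sort after splitting
tags = sorted(tags_csv.split(",")) if tags_csv else []
print(content_id, title, subtitle, tags)
```

The column list is still fixed in the query text, so every additional field name needs its own expression; a fully dynamic set of columns would require generating the statement.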
If you have access to the tablefunc module, another option is to use crosstab as pointed out by Houari. You can make it return arrays and non-arrays with something like this:
SELECT id_content, unnest(subtitle), tags
FROM crosstab('
SELECT id_content, name, array_agg(value)
FROM fields
GROUP BY id_content, name
ORDER BY 1, 2
') AS ct(id_content integer, subtitle text[], tags text[]);
However, crosstab requires that the values always appear in the same order. For instance, if the first group (with the same id_content) doesn't have a subtitle and only has tags, the tags will be unnested and will appear in the same column with the subtitles.
See also http://www.postgresql.org/docs/9.3/static/tablefunc.html
If the subtitle value is the only "constant" that you want to separate, you can do:
SELECT * FROM crosstab
(
'SELECT content.id,name,array_to_string(array_agg(value),'','')::character varying FROM content inner join
(
select * from fields where fields.name = ''subtitle''
union all
select * from fields where fields.name <> ''subtitle''
) fields_ordered
on fields_ordered.id_content = content.id group by content.id,name'
)
AS
(
id integer,
content_name character varying,
tags character varying
);

Remove duplicates when joining tables

I have a news table as follows
News:
| id | title | description
| 1 | Breaking news | bla bla bla
| 2 | Heavy snowfall in london | bla bla bla
a Type table as follows:
| id | type_name | type_code
| 1 | weather | 0567
| 2 | city | 0653
and a NewsType table as follows
|id | news_id | type_id | created_by |
| 1 | 2 | 1 | "John" |
| 2 | 2 | 2 | "Alex" |
As you can see from the NewsType table that a single news can fall into two or more types.
I need to display news corresponding to types. A user might say give me all the news about cities and weather. To display this I am doing something like:
select distinct n.* , nt.created_at
from news n, newstype nt, type t where
n.id = nt.news_id and
t.id = nt.type_id
order by nt.created_at
limit 25
The problem is that this query returns the same news item twice (I think it's because of the inner join I am doing). What should I change in the query so that, if a news item is classified under two types and the user has requested both of those types, I get a single news item instead of two?
simple solution:
select * from news where id in (
select news_id
from NewsType
where type_id in (the types you want)
)
Most people would say that you should add a DISTINCT on news_id in the inner query. You can try that, but I'm quite sure it will decrease performance.
Overall, if you think this solution doesn't perform well, you can make the inner query a CTE, which usually behaves better:
with my_CTE as(
select news_id
from NewsType
where type_id in (the types you want)
)
select *
from news
where id in (select news_id from my_CTE)
A group by is another approach to this:
select n.id, n.title, n.description, max(nt.created_at)
from news n, newstype nt, type t where
n.id = nt.news_id and
t.id = nt.type_id
group by n.id, n.title, n.description
order by max(nt.created_at)
limit 25
Try
select distinct n.id, n.title, n.description
but, as @Jan Dvorak stated,
select distinct n.*
shouldn't select the same news twice
You want to select all of the stories that have an entry in the NewsType table for a particular type. Therefore you want to select the news items where a relationship to the type exists:
SELECT
News.ID,
News.Title,
News.Description
FROM
News
WHERE
EXISTS
(SELECT
NULL
FROM
NewsType
INNER JOIN Type ON NewsType.Type_ID = Type.ID
WHERE
News.ID = NewsType.News_ID
AND Type.Type_Code = @typeCode)
The last line of the WHERE clause may need to be changed to Type.Type_Name = @typeName if you are using the type name as the parameter.
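To see why the join duplicates rows while a semi-join (EXISTS) does not, here is a small runnable comparison using Python's sqlite3 module and the sample rows from the question. The type ids are filtered directly on NewsType to keep the sketch short, which is a simplification of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE News (id INTEGER PRIMARY KEY, title TEXT, description TEXT);
CREATE TABLE NewsType (id INTEGER PRIMARY KEY, news_id INTEGER, type_id INTEGER, created_by TEXT);
INSERT INTO News VALUES
  (1, 'Breaking news', 'bla bla bla'),
  (2, 'Heavy snowfall in london', 'bla bla bla');
INSERT INTO NewsType VALUES (1, 2, 1, 'John'), (2, 2, 2, 'Alex');
""")

wanted_types = (1, 2)  # the user asked for both weather and city news

# Plain join: one output row per matching NewsType row
joined = conn.execute("""
    SELECT n.id, n.title
    FROM News n
    JOIN NewsType nt ON nt.news_id = n.id
    WHERE nt.type_id IN (?, ?)
""", wanted_types).fetchall()

# Semi-join: each news row is tested once, however many types match
deduplicated = conn.execute("""
    SELECT n.id, n.title
    FROM News n
    WHERE EXISTS (
        SELECT 1
        FROM NewsType nt
        WHERE nt.news_id = n.id
          AND nt.type_id IN (?, ?)
    )
""", wanted_types).fetchall()

print(len(joined), len(deduplicated))
```

The join returns the snowfall story twice because it matches two NewsType rows, while the EXISTS form returns it once. Columns taken from NewsType, such as a creation timestamp, are not available in the EXISTS form and would need an aggregate like max() in a GROUP BY variant.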
You need to decide what to do with the "duplicate" types: Do you want to display just one type for a news item associated with multiple types, or do you want to list them all?
If the latter, you could investigate using the string_agg function, see http://www.postgresql.org/docs/9.2/static/functions-aggregate.html
select distinct n.id, n.title, n.description, string_agg(t.type_name, ',')
from news n, newstype nt, type t where
n.id = nt.news_id and
t.id = nt.type_id
group by n.id, n.title, n.description
limit 25