In Postgres: Select columns from a set of arrays of columns and check a condition on all of them - sql

I have a table like this:
I want to perform count on different set of columns (all subsets where there is at least one element from X and one element from Y). How can I do that in Postgres?
For example, I may have {x1,x2,y3}, {x4,y1,y2,y3},etc. I want to count number of "id"s having 1 in each set. So for the first set:
SELECT COUNT(id) FROM table WHERE x1=1 AND x2=1 AND x3=1;
and for the second set does the same:
SELECT COUNT(id) FROM table WHERE x4=1 AND y1=1 AND y2=1 AND y3=1;
Is it possible to write a loop that goes over all these sets and query the table accordingly? The array will have more than 10000 sets, so it cannot be done manually.

You should be able convert the table columns to an array using ARRAY[col1, col2,...], then use the array_positions function, setting the second parameter to be the value you're checking for. So, given your example above, this query:
SELECT id, array_positions(array[x1,x2,x3,x4,y1,y2,y3,y4], 1)
FROM tbl
ORDER BY id;
Will yield this result:
+----+-------------------+
| id | array_positions |
+----+-------------------+
| a | {1,4,5} |
| b | {1,2,4,7} |
| c | {1,2,3,4,6,7,8} |
+----+-------------------+
Here's a SQL Fiddle.

Related

Get total count and first 3 columns

I have the following SQL query:
SELECT TOP 3 accounts.username
,COUNT(accounts.username) AS count
FROM relationships
JOIN accounts ON relationships.account = accounts.id
WHERE relationships.following = 4
AND relationships.account IN (
SELECT relationships.following
FROM relationships
WHERE relationships.account = 8
);
I want to return the total count of accounts.username and the first 3 accounts.username (in no particular order). Unfortunately accounts.username and COUNT(accounts.username) cannot coexist. The query works fine removing one of the them. I don't want to send the request twice with different select bodies. The count column could span to 1000+ so I would prefer to calculate it in SQL rather in code.
The current query returns the error Column 'accounts.username' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. which has not led me anywhere and this is different to other questions as I do not want to use the 'group by' clause. Is there a way to do this with FOR JSON AUTO?
The desired output could be:
+-------+----------+
| count | username |
+-------+----------+
| 1551 | simon1 |
| 1551 | simon2 |
| 1551 | simon3 |
+-------+----------+
or
+----------------------------------------------------------------+
| JSON_F52E2B61-18A1-11d1-B105-00805F49916B |
+----------------------------------------------------------------+
| [{"count": 1551, "usernames": ["simon1", "simon2", "simon3"]}] |
+----------------------------------------------------------------+
If you want to display the total count of rows that satisfy the filter conditions (and where username is not null) in an additional column in your resultset, then you could use window functions:
SELECT TOP 3
a.username,
COUNT(a.username) OVER() AS cnt
FROM relationships r
JOIN accounts a ON r.account = a.id
WHERE
r.following = 4
AND EXISTS (
SELECT 1 FROM relationships t1 WHERE r1.account = 8 AND r1.following = r.account
)
;
Side notes:
if username is not nullable, use COUNT(*) rather than COUNT(a.username): this is more efficient since it does not require the database to check every value for nullity
table aliases make the query easier to write, read and maintain
I usually prefer EXISTS over IN (but here this is mostly a matter of taste, as both techniques should work fine for your use case)

Why postgres returns unordered data in select query, after updation of row?

I am bit confused over default ordering of the rows returned by postgres.
postgres=# select * from check_user;
id | name
----+------
1 | x
2 | y
3 | z
4 | a
5 | c1\
6 | c2
7 | c3
(7 rows)
postgres=# update check_user set name = 'c1' where name = 'c1\';
UPDATE 1
postgres=# select * from check_user;
id | name
----+------
1 | x
2 | y
3 | z
4 | a
6 | c2
7 | c3
5 | c1
(7 rows)
Before any updation, it was returning rows ordered by id, but after updation, the order has changed. So my question is that if order by is not specified, what default ordering does postgres uses ?
Thanks in advance.
Put very simply the "default order" is whatever it happens to read from the disk. Updating a row will not change the row in place... Usually it marks the old row as deleted and writes a new one.
When postgres reads rows from pages of memory, it will (probably) read them in the order they are stored on the page. It will read pages in whatever order it thinks is quickest (that may or may not be how they appear on disk). It can change based on whether or not it decides to use an index. So it can suddenly change without your app asking for anything different.
If you don't specify an order by it will not take any action to re-order them.
NEVER rely on the default order. It is undefined behaviour.
SQL tables represent unordered sets.
SQL results sets are unordered unless you explicitly include an order by.
Your select has no order by. Hence, the rows can come back in any order. Even running the same query twice can produce different orders.

For Sql performances, several equals or one between

For a new developement, I will have a big SQL table (~100M rows).
4 fields will be used to query the data.
Is it better to query one concatenated field with between or several equals ?
Exemple :
MainTable
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4
1 | test | 1 | 4 | 3 | 1
Datas in Fk tables are static, example :
FkTable1
Id | Value
1 | a
2 | b
3 | c
To query the datas, the classic sql query is :
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1=1 and FkId2=2 and FkId3 in(2, 3)
The idea to optimize performance is to add one field "UniqueId" calculated backend before the insert :
UniqueId = FkId1*1000000 + FkId2*10000 + FkId3*100 + FkId4
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4 | UniqueId
1 | test | 1 | 4 | 3 | 1 | 1040301
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where UniqueId between 1020200 and 1040000
Moreover, with the UniqueId field, an index on this field only will be sufficient.
What do you think ?
Thanks
For this query:
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1 = 1 and FkId2 = 2 and FkId3 in (2, 3)
The optimal index is on MainTable(FkID1, FkId2, FkId3). You can also add Label and FkId4 to the index if you want a covering index (so the index can handle the entire query without referring to the original data pages).
There is no need for a computed field for the example you provided.
Since you will have 100M rows, thinking about optimisations from the start seems sensible to me.
However, your proposed solution will not work in this way:
Your formula above has two times the SAME factor 10000. You have to use different factors, i.e. different powers of 10.
Your select example has a "IN" clause (FkId3 in(2, 3)). This will only work, if only one of the FKs is queried this way. This fk should be the one with no factor in the formula for computing UniqueId (i.e. gives the least significant Digits of UniqueId).
Now seeing Gordons answer, I agree with him, i.e. using a combined index may be good enough for you (though your solution would probably slightly better). However, also the combined index has a similar problem: The FK field beeing queried with the IN clause should be the last field in the index.

The record represented by the ID with the Highest aggregate value

I already have the code to display the highest aggregate value for a ID.
select max(fk3_job_role_id),max(sum(no_of_placements))
from fact_accounts
group by fk3_job_role_id
the result looks like:
[max(fk3_job_role_id)] | [max(sum(no_of_placements))]
-----------------------|-----------------------------
5 | 25
However, i want to display the job_role_desc instead of fk3_job_role_id represented by the same id.
The table for it looks like:
[job_role_id] | [job_role_desc]
--------------------------------
1 | job1
2 | job2
3 | job3
4 | job4
5 | job5
select job_role_desc,T.total_sum from fact_accounts where job_role_id in (select max(fk3_job_role_id),max(sum(no_of_placements)) as total_sum from fact_accounts group by fk3_job_role_id) T
You need to query for the job description by using a subquery. The above query first fetches the data according to the query inside the brackets( also known popularly as a subquery ). The result returned from this query is used to compare with all the other id's in the main table by a simple "in" clause.
Edit
If you also need the sum of placements you can get it by using a reference to the table created during the execution of the subquery

PostgreSQL calculate the top places per group and other statistics

I have a table with the following structure
|user_id | place | type_of_place | money_earned| time |
|--------+-------+---------------+-------------+------|
| | | | | |
The table is very large, several millions of rows. The data is in a PostgreSQL 9.1 database.
I want to calculate, per user_id and type_of_place: the mean, the standard deviation, and the top 5 of places (ordered by counts), and the most used hour of time (mode).
The resulting data must be in this form:
| user_id | type_of_place | avg | stddev | top5_places | mode |
+---------+---------------+-----+--------+------------------+------+
| 1 | tp1 | 10 | 1 | {p1,p2,p3,p4,p5} | 8 |
| 2 | tp1 | 3 | 2 | {p3,p4} | 23 |
| 1 | tp3 | 1 | 1 | {p1} | 4 |
etc.
Is there a for of doing this with window functions efficiently?
What if I want to grouping by week? (i.e. another column that represents the number of week)
Thank you!
A standard GROUP BY query will get you most of the way:
SELECT
user_id,
type_of_place,
avg(money_earned) AS avg,
stddev(money_earned) AS stddev
FROM
earnings -- I'm not sure what your data table is called...
GROUP BY
user_id,
type_of_place
This leaves the top5_places and mode columns. These are both also aggregates, but not ones which are defined in the standard PostgreSQL installation. Luckily, you can add them.
Here's a page discussing how to define a mode aggregate function: http://wiki.postgresql.org/wiki/Aggregate_Mode
Once you have a mode aggregate function, assuming time is a timestamp of some kind, the expression you will add to the select list will be:
SELECT
...
mode(extract(hour FROM time)) AS mode -- Add this expression
FROM
...
Assuming order by money
For top5_places, there are several approaches, but the quickest is probably to use PostgreSQL's builtin array_agg function, and take the first 5 elements:
SELECT
...
(array_agg(place ORDER BY money_earned DESC))[1:5] AS top5_places -- Add this expression
FROM
...
One alternative is to define another aggregate called (for instance) top5, which performs the same function. This could be more efficient if there are many distinct places for each user/type of place combination, since it can stop accumulating after the first 5, whereas the above expression will generally build a complete array of all places, and then truncate to the first 5.
This assumes that a place has a unique earnings entry for each user/type combination. If a place can occur more than once, and you want to sort by sum(money_earned) for each place, then you need to use a subquery like in the examples below...
Order by counts
Ok, so the places should be ordered by how often they occur. Here's a quick way, which uses a couple of subqueries -- add this as an expression to the select-clause of the above query:
(SELECT
(array_agg(place ORDER BY cnt DESC))[1:5]
FROM
(SELECT place, count(*) FROM earnings AS t2
WHERE t2.user_id = earnings.user_id AND t2.type_of_place = earnings.type_of_place
GROUP BY place) AS s (place, cnt)
) AS top5_places
The inner subquery called s evaluates to a table of each place for that user/type combination, and the number of times it occurs (which I've called cnt). These are then fed to array_agg in descending order of that count.
I suspect there could be much neater (and probably more efficient) ways of writing it. If not, then I would recommend trying to move this complicated expression into a function or aggregate, if you can...
Histrogram of places in each hour
We'll use a similar expression, which will return the array of counts, ordered by hour:
(SELECT
array_agg(cnt ORDER BY hour DESC)
FROM
(SELECT extract(hour FROM time), count(*) FROM earnings AS t2
WHERE t2.user_id = earnings.user_id AND t2.type_of_place = earnings.type_of_place
GROUP BY 1) AS s (hour, cnt)
) AS hourly_histogram
(Add that to the select-clause of the original query.)