Joining a large number of tables so that all dates are kept - sql

I have around 50-70 tables that look very similar, say:
Table 1:
id | date       | count_A | count_B
1  | 12.05.2021 | 12      | 15
Table 2:
id | date       | count_A | count_B
1  | 15.05.2021 | 8       | 24
The main table looks like the following:
id | label
1  | X
In the end, what I would like to get is:
id | date       | count_A | count_B | label
1  | 12.05.2021 | 12      | 15      | X
1  | 15.05.2021 | 8       | 24      | X
One intuitive approach is a full outer join on id, but that would produce strange rows containing several date values.
Joining on (id, date) doesn't seem to be a great option either.
What can be a possible solution here? Thanks!

You can use a CTE (the WITH statement). Inside it, use UNION ALL across all the tables that share the same schema.
Then join the CTE (in this case tablaC) to the main table, which has a different schema.
You can see this example:
WITH tablaC AS (
    SELECT id, date, count_C, count_D FROM Table_C
    UNION ALL
    SELECT id, date, count_C, count_D FROM Table_D
)
SELECT c.id, c.date, c.count_C, c.count_D, m.label
FROM tablaC AS c
JOIN table_main AS m ON c.id = m.id;
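As a runnable sanity check, here is the same UNION ALL + join pattern exercised through Python's sqlite3 module, using the question's sample data (the table names table_1, table_2, table_main are my own; dates rewritten in ISO form):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table_1 (id INT, date TEXT, count_A INT, count_B INT);
CREATE TABLE table_2 (id INT, date TEXT, count_A INT, count_B INT);
CREATE TABLE table_main (id INT, label TEXT);
INSERT INTO table_1 VALUES (1, '2021-05-12', 12, 15);
INSERT INTO table_2 VALUES (1, '2021-05-15', 8, 24);
INSERT INTO table_main VALUES (1, 'X');
""")

rows = con.execute("""
WITH stacked AS (
    -- one SELECT per similar table; repeat for all 50-70 tables
    SELECT id, date, count_A, count_B FROM table_1
    UNION ALL
    SELECT id, date, count_A, count_B FROM table_2
)
SELECT s.id, s.date, s.count_A, s.count_B, m.label
FROM stacked AS s
JOIN table_main AS m ON s.id = m.id
ORDER BY s.date
""").fetchall()

print(rows)
# [(1, '2021-05-12', 12, 15, 'X'), (1, '2021-05-15', 8, 24, 'X')]
```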

Related

How to join tables on a partly-overlapping key column while retaining all data and not creating duplicate columns?

I have three tables that all have a “date” column and another column with counts of different variables - let’s call the tables T1, T2, and T3 and each of their columns are counts of dogs, cats, and birds spotted that day.
Not every table has the same set of dates. Example:
T1: Dogs spotted by day
date | dogs
------------------
2020-08-26 | 1
2020-08-27 | 4
T2: Cats spotted by day
date | cats
---------------------
2020-08-25 | 2
2020-08-26 | 5
T3: Birds spotted by day
date | birds
---------------------
2020-08-26 | 8
2020-08-27 | 3
2020-08-28 | 5
I’m trying to join them together on date while keeping all column data, but I’m having trouble doing so without getting a table that has 3 date columns. There’s no table that has all of the dates, so if I just select one of the date columns (e.g. select t1.date, t1.dogs, t2.cats, t3.birds) then I lose some of the date data. What I’m seeking is a table like this:
Desired Output: All Animals Spotted by Day
date | dogs | cats | birds |
----------------------------------------------------------
2020-08-25 | 0 (or null) | 2 | 0 (or null) |
2020-08-26 | 1 | 5 | 8 |
2020-08-27 | 4 | 0 (or null) | 3 |
2020-08-28 | 0 (or null) | 0 (or null) | 5 |
I’ve read about every stack overflow post on this I could find but maybe I’m not putting in the correct keywords because I’m not finding this. I’m working specifically in Postgres. Thank you!!
Use generate_series to construct a table of dates and use outer joins with the other tables:
SELECT d.d::date,
       t1.dogs,
       t2.cats,
       t3.birds
FROM generate_series('2020-08-25'::timestamp, '2020-08-28'::timestamp, '1 day'::interval) AS d(d)
LEFT JOIN t1 ON t1.date = d.d::date
LEFT JOIN t2 ON t2.date = d.d::date
LEFT JOIN t3 ON t3.date = d.d::date;
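generate_series is Postgres-specific; to make the pattern runnable here, the sketch below stands in a recursive CTE (SQLite, via Python's sqlite3) for the date series and applies the same three LEFT JOINs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (date TEXT, dogs INT);
CREATE TABLE t2 (date TEXT, cats INT);
CREATE TABLE t3 (date TEXT, birds INT);
INSERT INTO t1 VALUES ('2020-08-26', 1), ('2020-08-27', 4);
INSERT INTO t2 VALUES ('2020-08-25', 2), ('2020-08-26', 5);
INSERT INTO t3 VALUES ('2020-08-26', 8), ('2020-08-27', 3), ('2020-08-28', 5);
""")

rows = con.execute("""
WITH RECURSIVE d(d) AS (
    -- emulates generate_series('2020-08-25', '2020-08-28', '1 day')
    SELECT '2020-08-25'
    UNION ALL
    SELECT date(d, '+1 day') FROM d WHERE d < '2020-08-28'
)
SELECT d.d, t1.dogs, t2.cats, t3.birds
FROM d
LEFT JOIN t1 ON t1.date = d.d
LEFT JOIN t2 ON t2.date = d.d
LEFT JOIN t3 ON t3.date = d.d
ORDER BY d.d
""").fetchall()

print(rows)
# missing dates appear with None, matching the desired output
```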
Setting aside why you need this design or whether you could change it, using union and aggregation could be one option:
select date
, max(dogs) as dogs
, max(cats) as cats
, max(birds) as birds
from
(
select date,dogs,0 cats,0 birds from t1
union all
select date,0,cats,0 from t2
union all
select date,0,0,birds from t3
) t
group by date
order by date;
Note: I don't know if multiple entries are possible for a single date; if so, you need to use sum instead of max.
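A runnable sketch of this union-and-aggregate option, using SQLite through Python's sqlite3 with the question's sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (date TEXT, dogs INT);
CREATE TABLE t2 (date TEXT, cats INT);
CREATE TABLE t3 (date TEXT, birds INT);
INSERT INTO t1 VALUES ('2020-08-26', 1), ('2020-08-27', 4);
INSERT INTO t2 VALUES ('2020-08-25', 2), ('2020-08-26', 5);
INSERT INTO t3 VALUES ('2020-08-26', 8), ('2020-08-27', 3), ('2020-08-28', 5);
""")

rows = con.execute("""
SELECT date,
       MAX(dogs)  AS dogs,
       MAX(cats)  AS cats,
       MAX(birds) AS birds
FROM (
    -- pad each table out to the full column list with zeros
    SELECT date, dogs, 0 AS cats, 0 AS birds FROM t1
    UNION ALL
    SELECT date, 0, cats, 0 FROM t2
    UNION ALL
    SELECT date, 0, 0, birds FROM t3
) AS u
GROUP BY date
ORDER BY date
""").fetchall()

print(rows)
# zeros (rather than NULLs) fill the missing dates, as in the desired output
```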
You can also use full join. For your example select * does what you want:
select *
from cats c full join
dogs d
using (date) full join
birds b
using (date);
I might recommend, however, that you put all the counts into a single table, with an additional column specifying "cat", "dog" and so on. If you had that, then simple aggregation would work:
select date,
count(*) filter (where type = 'cat'),
count(*) filter (where type = 'dog'),
count(*) filter (where type = 'bird')
from t
group by date;
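FILTER is the Postgres spelling; the portable equivalent is conditional aggregation with CASE. A minimal sketch in SQLite via Python's sqlite3 (the single sightings table and its sample rows are invented for illustration, one row per animal spotted):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sightings (date TEXT, type TEXT);
INSERT INTO sightings VALUES
  ('2020-08-25', 'cat'),
  ('2020-08-26', 'dog'), ('2020-08-26', 'cat'), ('2020-08-26', 'bird');
""")

rows = con.execute("""
SELECT date,
       -- COUNT ignores NULLs, so each CASE counts only its own type
       COUNT(CASE WHEN type = 'dog'  THEN 1 END) AS dogs,
       COUNT(CASE WHEN type = 'cat'  THEN 1 END) AS cats,
       COUNT(CASE WHEN type = 'bird' THEN 1 END) AS birds
FROM sightings
GROUP BY date
ORDER BY date
""").fetchall()

print(rows)
# [('2020-08-25', 0, 1, 0), ('2020-08-26', 1, 1, 1)]
```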

Joining two tables where one contains multiple rows that refer to one row in the other table PSQL

So I have two tables that look like this:
nalog:
prod_num
id
qa_stavka_kontrola:
status
id_nalog
id
redak --question id
The second table stores yes or no answers (in the boolean column status) about books contained in the first table. Multiple rows of the second table refer to one row in the first one. I want to make a report that looks something like this:
|prod_num | question 1 | question 2 | question 3 |
|52 | 1 | 0 | 1 |
|53 | 0 | 1 | 1 |
This is my query but it is very very slow:
select nalog.prod_num
, r1.status as question1
, r2.status as question2
, r3.status as question3
from nalog
left join qa_stavka_kontrola as r1
on nalog.id=r1.id_nalog and r1.redak=1 and (r1.status=1 or r1.status=0)
left join qa_stavka_kontrola as r2
on nalog.id=r2.id_nalog and r2.redak=2 and (r2.status=1 or r2.status=0)
left join qa_stavka_kontrola as r3
on nalog.id=r3.id_nalog and r3.redak=3 and (r3.status=1 or r3.status=0)
where nalog.date BETWEEN '2017-09-01' and '2018-01-11'
group by nalog.prod_num, r1.status, r2.status, r3.status
I might be off, but you can try to use PIVOT; check the documentation for more info.
with something as
(select t1.prod_num, t2.redak
from nalog t1
left outer join qa_stavka_kontrola t2
on t1.id=t2.id_nalog
where t2.status in (0,1)
and t1.date BETWEEN '2017-09-01' and '2018-01-11' -- is there a date column?
)
select *
from something
pivot (count(redak) for redak in (1 as question_1, 2 as question_2, 3 as question_3));
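For what it's worth, PIVOT is Oracle/SQL Server syntax and Postgres does not support it; in Postgres the same report is usually written with conditional aggregation. A sketch of that alternative in SQLite via Python's sqlite3, with invented sample data shaped to match the question's desired output:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nalog (id INT, prod_num INT);
CREATE TABLE qa_stavka_kontrola (id INT, id_nalog INT, redak INT, status INT);
INSERT INTO nalog VALUES (10, 52), (11, 53);
INSERT INTO qa_stavka_kontrola VALUES
  (1, 10, 1, 1), (2, 10, 2, 0), (3, 10, 3, 1),
  (4, 11, 1, 0), (5, 11, 2, 1), (6, 11, 3, 1);
""")

rows = con.execute("""
SELECT n.prod_num,
       -- one CASE per question id (redak); a single pass over the table
       MAX(CASE WHEN q.redak = 1 THEN q.status END) AS question1,
       MAX(CASE WHEN q.redak = 2 THEN q.status END) AS question2,
       MAX(CASE WHEN q.redak = 3 THEN q.status END) AS question3
FROM nalog AS n
LEFT JOIN qa_stavka_kontrola AS q ON q.id_nalog = n.id
GROUP BY n.prod_num
ORDER BY n.prod_num
""").fetchall()

print(rows)
# [(52, 1, 0, 1), (53, 0, 1, 1)]
```

Joining the answer table once and pivoting with CASE avoids the three self-joins of the original slow query.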

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I have been performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery and count as an aggregate function:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months time (somebody usually does) then it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.
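A runnable check of this three-way solution, transcribed into SQLite via Python's sqlite3 (the only change is that "on true" becomes "ON 1", which SQLite accepts):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE my_table (id INT, my_field TEXT, id_reference INT);
INSERT INTO my_table VALUES
  (1, 'a', 1), (1, 'b', 2), (2, 'a', 3), (2, 'c', 4), (3, 'x', 5);
""")

rows = con.execute("""
SELECT c.id_count AS total, a.id, a.my_field, b.max_id_reference
FROM my_table AS a
JOIN (
    -- latest id_reference per id
    SELECT id, MAX(id_reference) AS max_id_reference
    FROM my_table
    GROUP BY id
) AS b
  ON a.id = b.id AND a.id_reference = b.max_id_reference
JOIN (
    -- single-row subquery: deliberate cartesian product
    SELECT COUNT(DISTINCT id) AS id_count FROM my_table
) AS c
  ON 1
ORDER BY a.id
""").fetchall()

print(rows)
# [(3, 1, 'b', 2), (3, 2, 'c', 4), (3, 3, 'x', 5)]
```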

How to fill in empty date rows multiple times?

I am trying to fill in dates with empty data, so that my query returned has every date and does not skip any.
My application needs to count bookings for activities by date in a report, and I cannot have skipped dates in what is returned by my SQL
I am trying to use a date table (I have a table with every date from 1/1/2000 to 12/31/2030) to accomplish this by doing a RIGHT OUTER JOIN on the date table, which works when dealing with one set of activities. But I have multiple sets of activities, each of which needs its own full range of dates regardless of whether there were bookings on that date.
I also have a function (DateRange) I found that allows for this:
SELECT IndividualDate FROM DateRange('d', '11/01/2017', '11/10/2018')
Let me give an example of what I am getting and what I want to get:
BAD: Without empty date rows:
date | activity_id | bookings
-----------------------------
1/2 | 1 | 5
1/4 | 1 | 4
1/3 | 2 | 6
1/4 | 2 | 2
GOOD: With empty date rows:
date | activity_id | bookings
-----------------------------
1/2 | 1 | 5
1/3 | 1 | NULL
1/4 | 1 | 4
1/2 | 2 | NULL
1/3 | 2 | 6
1/4 | 2 | 2
I hope this makes sense. I get the whole point of joining to a table of just a list of dates OR using the DateRange table function. But neither get me the "GOOD" result above.
Use a cross join to generate the rows and then left join to fill in the values:
select d.date, a.activity_id, t.bookings
from DateRange('d', '2017-11-01', '2018-11-10') d cross join
     (select distinct activity_id from t) a left join
     t
     on t.date = d.date and t.activity_id = a.activity_id;
It is a bit hard to follow what your data is and what comes from the function. But the idea is the same, wherever the data comes from.
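The same cross-join-then-left-join idea can be checked in SQLite via Python's sqlite3, with a recursive CTE standing in for the DateRange function and the question's sample data (table name t is my own):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (date TEXT, activity_id INT, bookings INT);
INSERT INTO t VALUES
  ('2018-01-02', 1, 5), ('2018-01-04', 1, 4),
  ('2018-01-03', 2, 6), ('2018-01-04', 2, 2);
""")

rows = con.execute("""
WITH RECURSIVE d(date) AS (
    -- stand-in for DateRange('d', ...): one row per calendar day
    SELECT '2018-01-02'
    UNION ALL
    SELECT date(date, '+1 day') FROM d WHERE date < '2018-01-04'
)
SELECT d.date, a.activity_id, t.bookings
FROM d
CROSS JOIN (SELECT DISTINCT activity_id FROM t) AS a
LEFT JOIN t ON t.date = d.date AND t.activity_id = a.activity_id
ORDER BY a.activity_id, d.date
""").fetchall()

print(rows)
# every (date, activity_id) pair appears; missing bookings are None
```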
I figured it out:
SELECT TOP 100 PERCENT masterlist.dt, masterlist.activity_id, count(r_activity_sales_bymonth.bookings) AS totalbookings
FROM (SELECT c.activity_id, dateadd(d, b.incr, '2016-12-31') AS dt
FROM (SELECT TOP 365 incr = row_number() OVER (ORDER BY object_id, column_id), *
FROM (SELECT a.object_id, a.column_id
FROM sys.all_columns a CROSS JOIN
sys.all_columns b) AS a) AS b CROSS JOIN
(SELECT DISTINCT activity_id
FROM r_activity_sales_bymonth) AS c) AS masterlist LEFT OUTER JOIN
r_activity_sales_bymonth ON masterlist.dt = r_activity_sales_bymonth.purchase_date AND masterlist.activity_id = r_activity_sales_bymonth.activity_id
GROUP BY masterlist.dt, masterlist.activity_id
ORDER BY masterlist.dt, masterlist.activity_id

Deleting similar columns in SQL

In PostgreSQL 8.3, let's say I have a table called widgets with the following:
id | type | count
--------------------
1 | A | 21
2 | A | 29
3 | C | 4
4 | B | 1
5 | C | 4
6 | C | 3
7 | B | 14
I want to remove duplicates based upon the type column, leaving only those with the highest count column value in the table. The final data would look like this:
id | type | count
--------------------
2 | A | 29
3 | C | 4 /* `id` for this record might be '5' depending on your query */
7 | B | 14
I feel like I'm close, but I can't seem to wrap my head around a query that works to get rid of the duplicate rows.
count is a SQL reserved word, so it will have to be escaped; in Postgres that is done with double quotes. In any case, the following should theoretically work (but I didn't actually test it):
delete from widgets where id not in (
    select max(w2.id) from widgets as w2 inner join
    (select max(w1."count") as "count", type from widgets as w1 group by w1.type) as sq
    on sq."count" = w2."count" and sq.type = w2.type group by w2.type
);
There is a slightly simpler answer than Asaph's, using the EXISTS SQL operator:
DELETE FROM widgets AS a
WHERE EXISTS
(SELECT * FROM widgets AS b
WHERE (a.type = b.type AND b.count > a.count)
OR (b.id > a.id AND a.type = b.type AND b.count = a.count))
The EXISTS operator returns TRUE if its subquery returns at least one record.
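A runnable check of the EXISTS delete in SQLite via Python's sqlite3 (count is not reserved in SQLite, so no quoting is needed; the correlated reference uses the bare table name rather than an alias):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE widgets (id INT, type TEXT, count INT);
INSERT INTO widgets VALUES
  (1, 'A', 21), (2, 'A', 29), (3, 'C', 4), (4, 'B', 1),
  (5, 'C', 4), (6, 'C', 3), (7, 'B', 14);
""")

con.execute("""
DELETE FROM widgets
WHERE EXISTS (
    SELECT 1 FROM widgets AS b
    -- delete the row if some other row of the same type beats it on count,
    -- or ties on count but has a higher id (the tie-breaker)
    WHERE (widgets.type = b.type AND b.count > widgets.count)
       OR (b.id > widgets.id AND widgets.type = b.type
           AND b.count = widgets.count)
)
""")

rows = con.execute("SELECT id, type, count FROM widgets ORDER BY id").fetchall()
print(rows)
# [(2, 'A', 29), (5, 'C', 4), (7, 'B', 14)]
```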
According to your requirements, seems to me that this should work (note the row-value comparison; a plain type NOT IN (...) cannot be matched against a two-column subquery):
DELETE
FROM widgets
WHERE (type, count) NOT IN
(
    SELECT type, MAX(count)
    FROM widgets
    GROUP BY type
)
One caveat: if two rows of the same type tie for the highest count, this keeps both of them.