How to query sum total of transitively linked child transactions from database? - sql

I got this one assignment which has a lot of weird stuff to do. I need to create an API for storing transaction details and do some operations. One such operation involves retrieving a sum of all transactions that are transitively linked by their parent_id to $transaction_id.
If A is the parent of B and C, and C is the parent of D and E, then
sum(A) = A + B + C + D + E
note: not just immediate child transactions.
I have this sample data in the SQL database as given below.
MariaDB [test_db]> SELECT * FROM transactions;
+------+-------+----------+---------+
| t_id | t_pid | t_amount | t_type |
+------+-------+----------+---------+
| 1 | NULL | 10000.00 | default |
| 2 | NULL | 25000.00 | cars |
| 3 | 1 | 30000.00 | bikes |
| 4 | NULL | 10000.00 | bikes |
| 5 | 3 | 15000.00 | bikes |
+------+-------+----------+---------+
5 rows in set (0.000 sec)
MariaDB [test_db]>
where t_id is a unique transaction_id and t_pid is a parent_id which is either null or an existing t_id.
so, when I say sum(t_amount) where t_id=1, I want the result to be
sum(1+3+5) -> sum(10000 + 30000 + 15000) = 55000.
I know I can achieve this in a programmatic way with some recursion which will do repeated query operations and add the sum. But, that will give me poor performance if the data is very large say, millions of records.
I want to know if there is any possibility of achieving this with a complex query. And if yes, then how to do it?
I have very little knowledge and experience with databases. I tried with what I know and I couldn't do it. I tried searching for any similar queries available here and I didn't find any.
With what I have researched, I guess I can achieve this with stored procedures and using the HAVING clause. Let me know if I am right there and help me do this.
So, any sort of help will be appreciated.
Thanks in advance.

You need a recursive CTE:
with recursive cte as (
select t_id as ultimate_id, t_id, t_amount
from transactions t
where t_id = 1
union all
select cte.ultimate_id, tc.t_id, tc.t_amount
from cte join
transactions tc
on tc.t_pid = cte.t_id
)
select ultimate_id, sum(t_amount)
from cte
group by ultimate_id;
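To try the recursive CTE without a MariaDB server, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database and runs the same kind of query. SQLite is an assumption made only so the example is self-contained; the helper names make_transactions_db and transitive_sum are hypothetical and not part of any API.

```python
import sqlite3

# Sample rows from the question: (t_id, t_pid, t_amount, t_type)
ROWS = [
    (1, None, 10000.00, "default"),
    (2, None, 25000.00, "cars"),
    (3, 1, 30000.00, "bikes"),
    (4, None, 10000.00, "bikes"),
    (5, 3, 15000.00, "bikes"),
]

# Anchor: the requested transaction. Recursive step: every row whose
# parent (t_pid) is already in the CTE, at any depth.
SUM_QUERY = """
WITH RECURSIVE cte AS (
    SELECT t_id, t_amount FROM transactions WHERE t_id = ?
    UNION ALL
    SELECT tc.t_id, tc.t_amount
    FROM cte JOIN transactions tc ON tc.t_pid = cte.t_id
)
SELECT SUM(t_amount) FROM cte
"""


def make_transactions_db():
    """Hypothetical helper: build the sample table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "t_id INTEGER PRIMARY KEY, t_pid INTEGER, t_amount REAL, t_type TEXT)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", ROWS)
    return conn


def transitive_sum(conn, t_id):
    """Hypothetical helper: sum of t_id and all of its descendants."""
    return conn.execute(SUM_QUERY, (t_id,)).fetchone()[0]


if __name__ == "__main__":
    conn = make_transactions_db()
    print(transitive_sum(conn, 1))  # 10000 + 30000 + 15000 = 55000.0
```

The whole traversal happens inside one statement, so the application issues a single query instead of one query per level. On a large table, an index on t_pid is likely what keeps the recursive join cheap.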

Related

Recursive Function appropriate?

Hi guys, I'm wondering if you could help me with a recursive query in SQL, or tell me whether a recursive query is even the right choice.
I have columns like so lets say
ID | CUS | CASHIERID | RECEIPTID | PAYMENTNUM | ORIGINALRECEIPT
Now assume there is data like so:
+----------+--------+-------------+-------------+--------------+------------------+
| ID | CUS | CASHIERID | RECEIPTID | PAYMENTNUM | ORIGINALRECEIPT |
+----------+--------+-------------+-------------+--------------+------------------+
| 1 | jeff | 2 | 123 | 00005 | NULL |
| 4 | jeff | 2 | 548 | 00005 | 123 |
| 16 | jeff | 2 | 897 | 00005 | 123 |
| 151 | jeff | 2 | 1095 | 00005 | 123 |
+----------+--------+-------------+-------------+--------------+------------------+
Now say the database is huge and there can be any number of related receipts. As we see above, ID 1 is the original and all the others are related to it (refunds or something similar). Say I am given the RECEIPTID of any one of these. What is the best route to get all of its parent/child rows? My first thought is a simple IF/ELSE: if ORIGINALRECEIPT is not empty, run a WHERE clause on whatever it contains. But for the sake of argument, could a recursive query of some sort take any RECEIPTID and return all 4 records?
EDIT
Hi guys, a bit of a change. I got a recursive query working, but the database is HUGE. The user inputs a receipt ID, and the recursive query then finds all reissued (newer) receipts by following the 'prevRecep' column, which holds the previous receipt ID, so the rows form a chain as mentioned in the comments. It works with no problem on the small test database, but on the huge DB it is extremely slow: it has been running for 40 minutes and still has not finished. There is an index on CUS, cashierid, receiptid, but unfortunately for now I can't have an index on any other column. I know that already slows my query down, since I'm using the prevRecep column in it, but is there any way to speed it up, or a better approach? Below is the recursive query:
with cte as (
select *
from receipts
where cus='jeff' and cashierid='2' and receiptid='548'
union all
select cur.*
from receipts cur, cte
where cur.prevRecep = cte.receiptid
)
select * from cte
Yes, a recursive query should be fine:
declare @ReceiptId int = 123;
with cte as (
--These are the anchor (the parents)
select *
from Receipts
where ReceiptId = @ReceiptId and OriginalReceipt is null
union all
--These are the recursive children. Could be multiple levels: parent, child, subchild, ...
select Receipts.*
from Receipts
inner join cte on cte.ReceiptId = Receipts.OriginalReceipt
)
select * from cte;
By the way, if your parent-child relations don't have more than one level, then the query doesn't need to be recursive, a simple UNION would be enough:
declare @ReceiptId int = 123;
select *
from Receipts
where ReceiptId = @ReceiptId
union all
select Receipts.*
from Receipts
where OriginalReceipt = @ReceiptId
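The question also asks for all related rows when given any receipt in the group, not only the original. One way to sketch that is with two recursive CTEs: the first walks up to the root (the row whose OriginalReceipt is null) and the second walks back down from it. This is only an illustration on an in-memory SQLite database, which is an assumption made for runnability. The names make_receipts_db and related_receipts are hypothetical, and only the two columns needed for the traversal are modelled.

```python
import sqlite3

# (ReceiptId, OriginalReceipt) pairs taken from the sample data.
RECEIPTS = [(123, None), (548, 123), (897, 123), (1095, 123)]

FAMILY_QUERY = """
WITH RECURSIVE up(ReceiptId, OriginalReceipt) AS (
    -- start at the receipt the user typed in
    SELECT ReceiptId, OriginalReceipt FROM Receipts WHERE ReceiptId = :rid
    UNION ALL
    -- climb to the parent until OriginalReceipt is NULL
    SELECT r.ReceiptId, r.OriginalReceipt
    FROM Receipts r JOIN up ON r.ReceiptId = up.OriginalReceipt
),
down(ReceiptId) AS (
    -- the root found by the upward walk
    SELECT ReceiptId FROM up WHERE OriginalReceipt IS NULL
    UNION ALL
    -- every receipt that points at a row already collected
    SELECT r.ReceiptId
    FROM Receipts r JOIN down ON r.OriginalReceipt = down.ReceiptId
)
SELECT ReceiptId FROM down ORDER BY ReceiptId
"""


def make_receipts_db():
    """Hypothetical helper: build a two-column Receipts table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Receipts (ReceiptId INTEGER PRIMARY KEY, OriginalReceipt INTEGER)"
    )
    conn.executemany("INSERT INTO Receipts VALUES (?, ?)", RECEIPTS)
    return conn


def related_receipts(conn, receipt_id):
    """Hypothetical helper: all receipt ids in the same parent/child group."""
    rows = conn.execute(FAMILY_QUERY, {"rid": receipt_id}).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_receipts_db()
    print(related_receipts(conn, 548))  # [123, 548, 897, 1095]
```

The downward step joins on OriginalReceipt, so without an index on that column every recursion level probably scans the whole table, which would be consistent with the slowness described in the edit to the question.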

Calculate overall percentage of Access Query

I have an MS Access Query which returns the following sample data:
+-----+------+------+
| Ref | ANS1 | ANS2 |
+-----+------+------+
| 123 | A | A |
| 234 | B | B |
| 345 | C | C |
| 456 | D | E |
| 567 | F | G |
| 678 | H | I |
+-----+------+------+
Is it possible to have Access return the overall percentage where ANS1 = ANS2?
So my new query would return:
50
I know how to get a count of the records returned by the original query, but not how to calculate the percentage.
Since you're looking for a percentage of some condition being met across the entire dataset, the task can be reduced to having a function return either 1 (when the condition is validated), or 0 (when the condition is not validated), and then calculating an average across all records.
This could be achieved in a number of ways, one example might be to use a basic iif statement:
select avg(iif(t.ans1=t.ans2,1,0)) from YourTable t
Or, using the knowledge that a boolean value in MS Access is represented using -1 (True) or 0 (False), the expression can be reduced to:
select -avg(t.ans1=t.ans2) from YourTable t
In each of the above, change YourTable to the name of your table.
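The averaging idea is not specific to Access. Below is a small sketch of it on an in-memory SQLite database with the sample data from the question. SQLite is an assumption made for runnability, so CASE stands in for Access's iif, and the table name answers and the helper names are hypothetical. The average is a fraction between 0 and 1, so it is multiplied by 100 to get the 50 the question asks for.

```python
import sqlite3

# (Ref, ANS1, ANS2) rows from the question.
ANSWERS = [
    (123, "A", "A"),
    (234, "B", "B"),
    (345, "C", "C"),
    (456, "D", "E"),
    (567, "F", "G"),
    (678, "H", "I"),
]

# Each row contributes 1 when the answers match and 0 otherwise;
# the average of those flags is the matching fraction.
PCT_QUERY = """
SELECT 100.0 * AVG(CASE WHEN ans1 = ans2 THEN 1 ELSE 0 END)
FROM answers
"""


def make_answers_db():
    """Hypothetical helper: build the sample table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE answers (ref INTEGER, ans1 TEXT, ans2 TEXT)")
    conn.executemany("INSERT INTO answers VALUES (?, ?, ?)", ANSWERS)
    return conn


def match_percentage(conn):
    """Hypothetical helper: percentage of rows where ans1 equals ans2."""
    return conn.execute(PCT_QUERY).fetchone()[0]


if __name__ == "__main__":
    print(match_percentage(make_answers_db()))  # 3 of 6 rows match: 50.0
```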
If you know how to get a count, then apply that same knowledge twice:
SELECT Count([ANS1]) As MatchCount FROM [Data]
WHERE [ANS1] = [ANS2]
divided by the total count
SELECT Count([ANS1]) As AllCount FROM [Data]
To combine both of these in a basic SQL query, one needs a "dummy" query since Access doesn't allow selection of only raw data:
SELECT TOP 1
((SELECT Count([ANS1]) As MatchCount FROM [Data] WHERE [ANS1] = [ANS2])
/
(SELECT Count([ANS1]) As AllCount FROM [Data]))
AS MatchPercent
FROM [Data]
This of course assumes that there is at least one row... so it doesn't divide by zero.

Is it possible to do complex SQL queries using Django?

I have the following Script to get a list of calculated index for each day after specific date:
with test_reqs as (
select id_test, date_request, sum(n_requests) as n_req from cdr_test_stats
where
id_test in (2,4) and -- List of Ids included in index calc
date_request >= 20170823 -- Start date (end date -> Last in DB -> Today)
group by id_test, date_request
),
date_reqs as (
select date_request, sum(n_req) as n_req
from test_reqs
group by date_request
),
test_reqs_ratio as (
select H.id_test, H.date_request,
case when D.n_req = 0 then null else H.n_req/D.n_req end as ratio_req
from test_reqs H
inner join date_reqs D
on H.date_request = D.date_request
),
test_reqs_index as (
select HR.*, least(nullif(HA.n_dates_hbalert, 0), 10) as index_hb
from test_reqs_ratio HR
left join cdr_test_alerts_stats HA
on HR.id_test = HA.id_test and HR.date_request = HA.date_request
)
select date_request, 10-sum(ratio_req*index_hb) as index_hb
from test_reqs_index
group by date_request
Result:
---------------------------
| date_request | index_hb |
---------------------------
| 20170904 | 7.5508 |
| 20170905 | 7.6870 |
| 20170825 | 7.4335 |
| 20170901 | 7.7116 |
| 20170824 | 1.6568 |
| 20170823 | 0.0000 |
| 20170903 | 5.1850 |
| 20170830 | 0.0000 |
| 20170828 | 0.0000 |
---------------------------
The problem is that I want to get the same result in Django and avoid executing the raw query through the cursor.
Many thanks for any suggestion.
Without going deep into the specifics of your query, I'd say the Django ORM is expressive enough to handle most problems, but it would generally require you to redesign the query from the ground up. You would have to use subqueries and joins instead of the CTEs, and you might end up with a solution that does some of the work in Python land instead of the DB.
Taking this into account, the answer is: it depends. Your requirements, such as performance and data size, play a role.
Another solution worth considering is declaring your SQL query as a view, and at least in the case of Postgres, use something like django-pgviews to query it with Django ORM almost as if it were a model.

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored on a server running PostgreSQL 9.3.0. However, I find myself blocked on the last query. The database models a reservation system for an opera house. The query is about associating each spectator with the other spectators who attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback is that, since your ids are ints, placing a comma between the values requires a workaround (they need to be cast to strings). Below is the method I can think of, using that workaround.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of the shows the given user has seen and compare it with the same array for each of the others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
@> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
@> is the "contains" operator for arrays (used in query 1), so we get all spectators that have seen at least the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
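For readers without a PostgreSQL instance at hand, the counting idea behind query 2 can be sketched on an in-memory SQLite database. This is an assumption-laden illustration rather than the answer's exact query: SQLite has no array_agg, so the result is returned as a plain list of ids, the reservation rows are made up, and the helper names are hypothetical. Like the queries above, it assumes (id_spectator, id_show) is unique.

```python
import sqlite3

# Made-up (id_spectator, id_show) pairs; each pair is assumed unique.
RESERVATIONS = [
    (1, 1), (1, 14),          # prime spectator saw shows 1 and 14
    (2, 1), (2, 14), (2, 3),  # saw both, plus one more: qualifies
    (3, 1),                   # saw only one of them: does not qualify
    (9, 1), (9, 14),          # saw exactly the same shows: qualifies
]

# Join every other spectator to the prime spectator's shows and keep
# those whose number of matches equals the number of those shows.
DIVISION_QUERY = """
WITH shows AS (
    SELECT id_show FROM reservations WHERE id_spectator = :prime
)
SELECT r.id_spectator
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> :prime
GROUP BY r.id_spectator
HAVING count(*) = (SELECT count(*) FROM shows)
ORDER BY r.id_spectator
"""


def make_reservations_db():
    """Hypothetical helper: build a two-column reservations table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reservations ("
        "id_spectator INTEGER, id_show INTEGER, UNIQUE (id_spectator, id_show))"
    )
    conn.executemany("INSERT INTO reservations VALUES (?, ?)", RESERVATIONS)
    return conn


def fellow_spectators(conn, prime):
    """Hypothetical helper: spectators who saw every show that prime saw."""
    rows = conn.execute(DIVISION_QUERY, {"prime": prime}).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(fellow_spectators(make_reservations_db(), 1))  # [2, 9]
```

The count comparison is only valid because of the uniqueness assumption. If a spectator could hold two reservations for the same show, count(DISTINCT id_show) would be the safer comparison.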
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I got, looks like this :
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

prevent from double/triple SUMing when JOINing

I am joining two tables: accn_demographics and accn_payments. The relationship between the two tables is one to many between accn_demographics.accn_id and accn_payments.accn_id.
My question is: when I sum PAID_AMT and COPAY_AMT, I get double/triple/quadruple the number that I should be getting.
Is there an obvious problem with my join condition?
select sum(p.paid_amt) as SumPaidAmount
, sum(p.copay_amt) as SumCoPay
, p.pmt_date
, d.load_Date
, p.ACCN_ID
from accn_payments p
join
(
select distinct load_date, accn_id
from accn_demographics
) d
on p.ACCN_ID=d.ACCN_ID
where p.POSTED='Y'
and p.pmt_date between '20120701' and '20120731'
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
thanks so much for your guidance.
You need to do the summation in a subquery:
select sum(p.SumPaidAmount) as SumPaidAmount, sum(p.SumCoPay) as SumCoPay,
p.pmt_date, d.load_Date, p.ACCN_ID
from (select accn_id, p.pmt_date, sum(paid_amt) as SumPaidAmount,
sum(copay_amt) as SumCoPay
from accn_payments p
where p.POSTED='Y' and
p.pmt_date between '20120701' and '20120731'
group by accn_id, pmt_date
) p join
(select distinct load_date, accn_id from accn_demographics) d
on p.ACCN_ID=d.ACCN_ID
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
Question: do you really intend for pmt_date to be in the final results? It looks like you want to remove it from both the outer SELECT and the subquery.
The only thing I can see is that (select distinct load_date, accn_id from accn_demographics) might return several matches. Look at your data and run a separate query:
select distinct load_date, accn_id from accn_demographics WHERE accn_id=SomeID
where SomeID is one of the result accounts that is returning double/triple values. That should pinpoint your problem.
Yes, but it's not so obvious for beginners. What happens is that for every accn_payments record, you're matching on ONLY the accn_id, which means if there are multiple records in accn_demographics for that particular accn_id, then you will get duplicate accn_payment records due to the join. Is there another limiting field on accn_demographics to join back to the payments?
Ultimately, think of it this way:
accn_payments (p):
accn_id | paid_amt | copay_amt | ...
----------------------------------------------------
1 | 100.00 | 20.00 | ...
accn_demographics (d):
accn_id | load_date | ...
------------------------------------
1 | 2012/01/01 | ...
1 | 2012/03/05 | ...
1 | 2012/06/23 | ...
After joining, your results will look like this:
p.accn_id | p.paid_amt | p.copay_amt | p... | d.accn_id | d.load_date | d...
----------------------------------------------------------------------------
1 | 100.00 | 20.00 | .... | 1 | 2012/01/01 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/03/05 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/06/23 | ....
As you can see, the same row from accn_payments gets replicated for every matching accn_demographics record, since you specified only the accn_id column as the join criterion. The DB engine can't limit the results any further, so it says "Hey, look, this p record matches all these d records, this must be what he was asking for!" That is obviously not what was intended: when you sum p.paid_amt and p.copay_amt, the sum runs over ALL ROWS (even though they are duplicated).
Ultimately, see if you can limit the join criteria for accn_demographics even further (by some date, perhaps), that way you limit the number of duplicate payment records during the join.
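The fan-out can be reproduced with the one payment row and three demographics rows from the illustration above. The sketch below uses an in-memory SQLite database (an assumption, chosen only so the example runs anywhere) and compares a plain join with a join against one row per accn_id. It assumes load_date is not needed in the result, and the helper names are hypothetical.

```python
import sqlite3

# Plain join: the single payment row is repeated once per demographics row.
NAIVE_QUERY = """
SELECT SUM(p.paid_amt)
FROM accn_payments p
JOIN accn_demographics d ON p.accn_id = d.accn_id
"""

# Collapse demographics to one row per accn_id before joining,
# so each payment row is counted exactly once.
DEDUP_QUERY = """
SELECT SUM(p.paid_amt)
FROM accn_payments p
JOIN (SELECT DISTINCT accn_id FROM accn_demographics) d
  ON p.accn_id = d.accn_id
"""


def make_accn_db():
    """Hypothetical helper: one payment, three demographics rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE accn_payments (accn_id INTEGER, paid_amt REAL, copay_amt REAL);
        CREATE TABLE accn_demographics (accn_id INTEGER, load_date TEXT);
        INSERT INTO accn_payments VALUES (1, 100.00, 20.00);
        INSERT INTO accn_demographics VALUES
            (1, '2012-01-01'), (1, '2012-03-05'), (1, '2012-06-23');
        """
    )
    return conn


def total_paid(conn, query):
    """Hypothetical helper: run one of the queries above, return the sum."""
    return conn.execute(query).fetchone()[0]


if __name__ == "__main__":
    conn = make_accn_db()
    print(total_paid(conn, NAIVE_QUERY))  # 300.0, the payment counted three times
    print(total_paid(conn, DEDUP_QUERY))  # 100.0, the payment counted once
```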