Teradata-replacing self join - sql

I have a table in Teradata with about a trillion records.
The table is temp, with cat_nbr as the primary index (PI):
cat_nbr | brand_nbr | card_nbr
1 | 10 | 100
1 | 10 | 101
1 | 20 | 100
1 | 20 | 102
2 | 10 | 100
2 | 10 | 103
2 | 30 | 100
2 | 30 | 105
3 | 40 | 106
3 | 30 | 107
For a particular brand, I need the total number of customers across every category that contains that brand.
For example, for brand no. 10:
First we check which categories contain brand 10; here categories 1 and 2 do.
Then, over all customers in categories 1 and 2, we need count(distinct card_nbr).
The result should look like:
brand_nbr|total_cust
10 | 5
I have written the query below to achieve this:
select k.brand_nbr,count(distinct l.card_nbr)
from temp k join temp l on k.cat_nbr=l.cat_nbr
group by 1;
It gives the correct result, but the table holds a trillion records and the query runs for more than 2 hours.
I need to improve the performance so that it finishes within 30 minutes at most.
I have checked the AMPs; there are 16 AMPs for my database.
Please help me out if you have a solution for this.
Thanks in advance.

The only other approach I can think of is using two steps:
-- This will remove duplicates
CREATE VOLATILE SET TABLE vt AS
(
SELECT k.brand_nbr,l.card_nbr
FROM temp k JOIN temp l ON k.cat_nbr=l.cat_nbr
)
WITH DATA
PRIMARY INDEX(brand_nbr)
ON COMMIT PRESERVE ROWS;
-- Now you can simply count without distinct
SELECT brand_nbr, COUNT(*)
FROM vt
GROUP BY 1;
Depending on your data (number of rows per cat_nbr/brand_nbr) this might be faster. Or slower and totally skewed :-)
By the way, I doubt you store 1 trillion rows on a 16-AMP system; that is at least 30 TB. Maybe you mean 16 nodes.
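A minimal sketch of the two-step idea on the sample rows from the question, run through Python's built-in sqlite3 rather than Teradata. SQLite has no SET tables or primary indexes, so SELECT DISTINCT into a temporary table stands in for the duplicate removal. The table name purchases (standing in for the question's temp table) and the helper count_customers are illustrative assumptions only, and this says nothing about how the plan behaves at Teradata scale.

```python
import sqlite3

# Sample rows from the question: (cat_nbr, brand_nbr, card_nbr)
ROWS = [
    (1, 10, 100), (1, 10, 101), (1, 20, 100), (1, 20, 102),
    (2, 10, 100), (2, 10, 103), (2, 30, 100), (2, 30, 105),
    (3, 40, 106), (3, 30, 107),
]


def count_customers(rows):
    """Return {brand_nbr: number of distinct cards} using dedupe-then-count."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE purchases (cat_nbr INT, brand_nbr INT, card_nbr INT)"
    )
    con.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    # Step 1: materialise each (brand, card) pair exactly once.
    con.execute("""
        CREATE TEMP TABLE vt AS
        SELECT DISTINCT k.brand_nbr, l.card_nbr
        FROM purchases k JOIN purchases l ON k.cat_nbr = l.cat_nbr
    """)
    # Step 2: a plain COUNT(*) is now enough, no DISTINCT needed.
    result = dict(
        con.execute("SELECT brand_nbr, COUNT(*) FROM vt GROUP BY brand_nbr")
    )
    con.close()
    return result


print(count_customers(ROWS))
```

For brand 10 this gives 5, matching the expected result in the question.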

If you don't want to create volatile table as a set (as dnoeth suggested), try using an ordered analytical function:
SELECT DISTINCT
k.brand_Nbr,
COUNT(l.card_nbr) OVER(PARTITION BY k.brand_Nbr) AS cnt
FROM temp k JOIN temp l ON k.cat_nbr=l.cat_nbr
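Ordered analytical functions don't need a GROUP BY clause. Note that COUNT(...) OVER counts every joined row rather than distinct card_nbr values, so the join output has to be deduplicated first for the numbers to match the COUNT(DISTINCT ...) query. I am not sure whether this would actually perform better than a volatile table (the volatile table in dnoeth's solution also has a primary index, which in theory should suit Teradata better), but you can give it a try.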
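A small check of what the windowed count returns on the question's sample rows, again using Python's sqlite3 as a stand-in for Teradata (window functions need SQLite 3.25 or newer). The table name purchases and the helper window_vs_distinct are hypothetical names for this sketch.

```python
import sqlite3

# Sample rows from the question: (cat_nbr, brand_nbr, card_nbr)
ROWS = [
    (1, 10, 100), (1, 10, 101), (1, 20, 100), (1, 20, 102),
    (2, 10, 100), (2, 10, 103), (2, 30, 100), (2, 30, 105),
    (3, 40, 106), (3, 30, 107),
]


def window_vs_distinct(rows, brand):
    """Return (windowed count, distinct count) for one brand."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE purchases (cat_nbr INT, brand_nbr INT, card_nbr INT)"
    )
    con.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    # COUNT(...) OVER counts every joined row in the partition.
    windowed = con.execute("""
        SELECT DISTINCT k.brand_nbr,
               COUNT(l.card_nbr) OVER (PARTITION BY k.brand_nbr) AS cnt
        FROM purchases k JOIN purchases l ON k.cat_nbr = l.cat_nbr
        WHERE k.brand_nbr = ?
    """, (brand,)).fetchone()[1]
    # COUNT(DISTINCT ...) counts each card once.
    distinct = con.execute("""
        SELECT COUNT(DISTINCT l.card_nbr)
        FROM purchases k JOIN purchases l ON k.cat_nbr = l.cat_nbr
        WHERE k.brand_nbr = ?
    """, (brand,)).fetchone()[0]
    con.close()
    return windowed, distinct


print(window_vs_distinct(ROWS, 10))
```

On this data brand 10 yields 16 from the windowed count (four brand-10 rows, each joined to four rows of its category) against 5 distinct cards, so the two queries only agree once duplicates are removed before counting.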

Related

How to query sum total of transitively linked child transactions from database?

I got this one assignment which has a lot of weird stuff to do. I need to create an API for storing transaction details and do some operations. One such operation involves retrieving a sum of all transactions that are transitively linked by their parent_id to $transaction_id.
If A is the parent of B and C, and C is the parent of D and E, then
sum(A) = A + B + C + D + E
note: not just immediate child transactions.
I have this sample data in the SQL database as given below.
MariaDB [test_db]> SELECT * FROM transactions;
+------+-------+----------+---------+
| t_id | t_pid | t_amount | t_type |
+------+-------+----------+---------+
| 1 | NULL | 10000.00 | default |
| 2 | NULL | 25000.00 | cars |
| 3 | 1 | 30000.00 | bikes |
| 4 | NULL | 10000.00 | bikes |
| 5 | 3 | 15000.00 | bikes |
+------+-------+----------+---------+
5 rows in set (0.000 sec)
MariaDB [test_db]>
where t_id is a unique transaction_id and t_pid is a parent_id which is either null or an existing t_id.
so, when I say sum(t_amount) where t_id=1, I want the result to be
sum(1+3+5) -> sum(10000 + 30000 + 15000) = 55000.
I know I can achieve this in a programmatic way with some recursion which will do repeated query operations and add the sum. But, that will give me poor performance if the data is very large say, millions of records.
I want to know if there is any possibility of achieving this with a complex query. And if yes, then how to do it?
I have very little knowledge and experience with databases. I tried with what I know and I couldn't do it. I tried searching for any similar queries available here and I didn't find any.
With what I have researched, I guess I can achieve this with stored procedures and using the HAVING clause. Let me know if I am right there and help me do this.
So, any sort of help will be appreciated.
Thanks in advance.
You need a recursive CTE:
with recursive cte as (
select t_id as ultimate_id, t_id, t_amount
from transactions t
where t_id = 1
union all
select cte.ultimate_id, tc.t_id, tc.t_amount
from cte join
transactions tc
on tc.t_pid = cte.t_id
)
select ultimate_id, sum(t_amount)
from cte
group by ultimate_id;
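To try the recursive CTE without a MariaDB server, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 (SQLite also supports WITH RECURSIVE). The helper name subtree_sum is an assumption for illustration, and the column names follow the sample table (t_id, t_pid, t_amount).

```python
import sqlite3


def subtree_sum(con, root_id):
    """Sum t_amount over root_id and all of its transitive children."""
    row = con.execute("""
        WITH RECURSIVE cte(t_id, t_amount) AS (
            -- anchor: the transaction we start from
            SELECT t_id, t_amount FROM transactions WHERE t_id = ?
            UNION ALL
            -- step: children of anything already collected
            SELECT tc.t_id, tc.t_amount
            FROM cte JOIN transactions tc ON tc.t_pid = cte.t_id
        )
        SELECT SUM(t_amount) FROM cte
    """, (root_id,)).fetchone()
    return row[0]  # None if root_id does not exist


con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE transactions (
        t_id INTEGER PRIMARY KEY, t_pid INTEGER, t_amount REAL, t_type TEXT
    )
""")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (1, None, 10000.0, "default"),
        (2, None, 25000.0, "cars"),
        (3, 1, 30000.0, "bikes"),
        (4, None, 10000.0, "bikes"),
        (5, 3, 15000.0, "bikes"),
    ],
)

print(subtree_sum(con, 1))
```

For t_id = 1 this walks 1 -> 3 -> 5 and returns 55000.0, the figure asked for in the question. The whole traversal happens in one statement, so the application does not issue a query per level.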

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked on the last query. The database models a reservation system for an opera house. The query is about associating a spectator with the other spectators who attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending; the query looks like this:
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand the problem better and point me towards a solution? Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: this answer may be of limited use, as it was written in the context of SQL Server (the tag was present at the time).
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback is that, since your ids are ints, placing a comma between values requires a workaround (they need to be cast to strings). Below is the method I can think of using that workaround.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of the shows the given spectator has seen and compare it with the corresponding array of every other spectator:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
@> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine the relevant shows, then only consider those:
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
In query 1, @> is the "contains" operator for arrays, so we get all spectators that have seen at least the same shows.
Faster than 1. because only relevant shows are considered.
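The counting idea behind query 2 can be tried outside PostgreSQL. The sketch below uses Python's sqlite3 with a small made-up reservations table (the rows are hypothetical, not from the question) and the same assumption as above, that (id_spectator, id_show) is unique. The helper name co_spectators is illustrative, and a Python list replaces array_agg, which SQLite does not have.

```python
import sqlite3


def co_spectators(con, prime):
    """Spectators who attended every show that `prime` attended."""
    rows = con.execute("""
        WITH shows AS (            -- all shows of the prime spectator
            SELECT id_show FROM reservations WHERE id_spectator = :prime
        )
        SELECT r.id_spectator
        FROM shows s
        JOIN reservations r ON r.id_show = s.id_show
        WHERE r.id_spectator <> :prime
        GROUP BY r.id_spectator
        -- qualifies only if matched on every one of the prime's shows
        HAVING COUNT(*) = (SELECT COUNT(*) FROM shows)
        ORDER BY r.id_spectator
    """, {"prime": prime}).fetchall()
    return [r[0] for r in rows]


con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE reservations (
        id_spectator INTEGER, id_show INTEGER,
        UNIQUE (id_spectator, id_show)
    )
""")
# Hypothetical data: spectator 1 saw shows 1 and 2.
con.executemany(
    "INSERT INTO reservations VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1)],
)

print(co_spectators(con, 1))
```

With this data spectators 2 and 3 qualify for spectator 1, while spectator 4 (who only saw show 1) does not. The COUNT(*) comparison is only valid because of the uniqueness assumption; duplicate reservations would inflate the count.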
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final query I ended up with looks like this:
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

GROUP BY and SUMS in MS ACCESS

I'm trying to get a report of how many articles have been sold, and in particular which one sold the most, both in number of units and in revenue.
I'm trying the query below, thinking that using [PRICE]*[total] in the GROUP BY expression would work. Unfortunately it does not. I've also tried putting the alias in the GROUP BY expression, but with no luck: it only says that I need to use a grouping expression for the column [PRICE]*[total], which is what I thought I had done.
SELECT TOP 20 ARTIC, Sum(TOTGIA) AS total, [PRICE]*[total] AS a
FROM Car
GROUP BY ARTIC, [PRICE]*[total]
ORDER BY Sum(TOTGIA) DESC;
Could anyone point me in the right direction?
The error is:
"You tried to execute a query that does not include the specified expression '[PRICE]*[total]' as part of an aggregate function."
the table is something like this:
|artic|totgia|price
+++++++++++++++++++
|aaa | 1 | 10
|aaa | 4 | 10
|bbb | 1 | 200
I would like to have:
|aaa| 5 | 50
|bbb| 1 | 200
So aaa is first by number of sales, but bbb is first by revenue.
The problem here is that you are trying to use the total alias in the SELECT list and in the GROUP BY. The alias is not available at that point. Instead, you need to refer to the underlying expression, Sum(TOTGIA), in place of total. In other cases you can create a subselect and use the alias in the outer query, but that does not apply to your query as it is written.
SELECT TOP 20 ARTIC, Sum(TOTGIA) AS total, PRICE*Sum(TOTGIA) AS a
FROM Car
GROUP BY ARTIC, PRICE
ORDER BY Sum(TOTGIA) DESC;
If you have an article listed with several different prices, this query will return several rows. So, this data:
|artic|totgia|price
+++++++++++++++++++
|aaa | 1 | 10
|aaa | 4 | 20
|bbb | 1 | 200
Would return these results:
|aaa| 1 | 10
|aaa| 4 | 80
|bbb| 1 | 200
This would happen because we have specifically told sql that we want the unique articles and prices as their own rows. However, this is probably a good thing because in the above scenario, you wouldn't want to return that aaa has a quantity of 5 with a value of 50, since the total value is 90. If this is a possible scenario for your data, you would make this query into a subselect and group all the data for the unique articles together.
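A minimal sketch of that subselect, run in SQLite through Python's sqlite3 rather than in Access, so treat the exact syntax as an assumption when porting it (Access would also keep TOP 20 in place of any LIMIT). The helper article_totals and the aliases qty, value, total_qty and total_value are hypothetical names.

```python
import sqlite3


def article_totals(rows):
    """Return (artic, total quantity, total value) per article."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Car (ARTIC TEXT, TOTGIA INTEGER, PRICE INTEGER)")
    con.executemany("INSERT INTO Car VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT ARTIC, SUM(qty) AS total_qty, SUM(value) AS total_value
        FROM (
            -- inner query: one row per article and price
            SELECT ARTIC, PRICE,
                   SUM(TOTGIA) AS qty,
                   PRICE * SUM(TOTGIA) AS value
            FROM Car
            GROUP BY ARTIC, PRICE
        ) AS per_price
        GROUP BY ARTIC            -- outer query: one row per article
        ORDER BY total_qty DESC
    """).fetchall()
    con.close()
    return result


print(article_totals([("aaa", 1, 10), ("aaa", 4, 20), ("bbb", 1, 200)]))
```

With the mixed-price data above, aaa comes out as quantity 5 with value 90, and bbb as quantity 1 with value 200.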

MS Access sum of 2 table in one query

I have 2 tables:
name "mfr"
name "pomfr"
Both have many columns, and some columns are the same in both. I want to sum those shared columns in one query, grouped by another shared column.
Sample data:
Table 1: mfr
rfno | ppic | pcrt
101 | 10 | .30
102 | 15 | .50
103 | 18 | .68
Table 2: pomfr
rfno | ppic | pcrt
101 | 100 | 1.15
102 | 50 | 1.50
103 | 0 | 0
The query result should be:
mfrquery
rfno | ppic | pcrt
101 | 110 | 1.45
102 | 65 | 2.00
103 | 18 | .68
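I'll be somewhat nice. This probably isn't the most efficient method, but it'll work. Note that the #temp table is SQL Server syntax, so in Access you would need a saved query or a regular table instead.
-- Stack the rows of both tables; UNION ALL keeps rows that happen to be identical
select rfno, ppic, pcrt into #temp from mfr
union all
select rfno, ppic, pcrt from pomfr;
-- Sum the stacked rows per rfno
select rfno, sum(ppic) as ppic, sum(pcrt) as pcrt from #temp group by rfno;
What this says is: select the rows from mfr, use UNION ALL to append the rows from pomfr, and place the result in a temporary table called #temp. Restrict this to the columns and ranges you need.
Then the second part says: take the sum of ppic and the sum of pcrt from the #temp table and group by rfno.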
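The same stack-then-sum idea can be written as a single statement with a derived table in place of the temporary table. Here is a minimal sketch on the question's sample data, run in SQLite through Python's sqlite3 rather than in Access; the helper name combined_totals and the aliases are assumptions for illustration.

```python
import sqlite3


def combined_totals(mfr_rows, pomfr_rows):
    """Sum ppic and pcrt per rfno across both tables."""
    con = sqlite3.connect(":memory:")
    for name in ("mfr", "pomfr"):
        con.execute(
            f"CREATE TABLE {name} (rfno INTEGER, ppic INTEGER, pcrt REAL)"
        )
    con.executemany("INSERT INTO mfr VALUES (?, ?, ?)", mfr_rows)
    con.executemany("INSERT INTO pomfr VALUES (?, ?, ?)", pomfr_rows)
    result = con.execute("""
        SELECT rfno,
               SUM(ppic) AS total_ppic,
               ROUND(SUM(pcrt), 2) AS total_pcrt
        FROM (
            SELECT rfno, ppic, pcrt FROM mfr
            UNION ALL              -- keep every row, even exact duplicates
            SELECT rfno, ppic, pcrt FROM pomfr
        ) AS both_tables
        GROUP BY rfno
        ORDER BY rfno
    """).fetchall()
    con.close()
    return result


MFR = [(101, 10, 0.30), (102, 15, 0.50), (103, 18, 0.68)]
POMFR = [(101, 100, 1.15), (102, 50, 1.50), (103, 0, 0.0)]
print(combined_totals(MFR, POMFR))
```

UNION ALL matters here: a plain UNION removes duplicate rows, so a row that appears identically in both tables would be counted only once in the sums.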
Since you're new to SO, for future reference, SO people aren't mean, they just want to see you put forth some sort of effort into the problem, I've gotten help SEVERAL times here. Very helpful community! Best of luck to you!

How to get row count in all rows?

select id from table;
+------+
| id |
+------+
| 774 |
| 2775 |
+------+
returns 2 rows
select count(id) as count, id from table;
+-------+-----+
| count | id |
+-------+-----+
| 2 | 774 |
+-------+-----+
but this returns only 1 row
How can I return all rows, with the total count in each record?
SQL ???
+-------+------+
| count | id |
+-------+------+
| 2 | 774 |
| 2 | 2775 |
+-------+------+
SELECT id, (select count(*) from table) AS TotalRows
FROM table;
Although this seems unnecessary, as the total count will not change per row.
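A quick way to check the scalar-subquery form is an in-memory SQLite database driven from Python's sqlite3. Since table is a reserved word, the sketch uses a hypothetical table named items, and the helper name ids_with_total is likewise made up.

```python
import sqlite3


def ids_with_total(ids):
    """Return (total row count, id) for every row."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE items (id INTEGER)")
    con.executemany("INSERT INTO items VALUES (?)", [(i,) for i in ids])
    rows = con.execute("""
        SELECT (SELECT COUNT(*) FROM items) AS total_rows, id
        FROM items
        ORDER BY id
    """).fetchall()
    con.close()
    return rows


print(ids_with_total([774, 2775]))
```

For the two ids in the question this returns the pairs (2, 774) and (2, 2775), which is the shape asked for.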
Use a group by
select id, count(id)
from table
group by id;
(BTW, your SQL in question does not work, at least in oracle and AFAIK in MySql)
I'm not sure what you're trying to do, but if you're trying to fetch the rows and get the total count in the same query because it's resource-intensive and you don't want to repeat your joins/conditions/whatever in two queries, under MySQL you can do:
# Returns a regular results set
SELECT SQL_CALC_FOUND_ROWS foo, bar FROM baz WHERE qux = 'corge' LIMIT 2;
# Returns the total count of found rows (without the LIMIT)
SELECT FOUND_ROWS();
If you want the total number of rows after the LIMIT, or don't have a LIMIT at all, you can skip the SQL_CALC_FOUND_ROWS.
However, generally speaking, counting the total number of rows doesn't scale very well. If you can, find an alternative that doesn't require it. For example, if it's for paging, consider showing only 'next' / 'prev' buttons, without displaying the total number of pages. If you have 30 rows in a page, you can LIMIT 31 instead of 30, display only the first 30 rows, and check whether the 31st row exists to know if a 'next' button should be displayed.
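The LIMIT 31 trick can be sketched as follows, using Python's sqlite3 and a hypothetical items table with 65 rows. The function name fetch_page and its parameters are assumptions for illustration; the same pattern applies to MySQL's LIMIT ... OFFSET.

```python
import sqlite3


def fetch_page(con, page, page_size=30):
    """Return (rows for this page, whether a next page exists)."""
    rows = con.execute(
        "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size + 1, page * page_size),  # ask for one row more than shown
    ).fetchall()
    has_next = len(rows) > page_size        # the extra row signals a next page
    return [r[0] for r in rows[:page_size]], has_next


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 66)])

ids, has_next = fetch_page(con, 2)
print(len(ids), has_next)
```

With 65 rows, pages 0 and 1 are full and report a next page, while page 2 holds the last 5 rows and reports none, all without a COUNT(*) over the table.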
If you are using an Oracle database, you can also use the COUNT analytic function to achieve this, as follows:
SELECT COUNT(*) OVER (PARTITION BY 1) AS COUNT, ID FROM TABLE