SQL cartesian product (summing with group by)

I am trying to calculate the sum of volume for the last thirty days for a set of stocks on particular days in the table important_stock_dates. The table all_stock_dates contains the same stocks but with trading volume for all dates, not just the particular days.
Sample data
all_stock_dates
stockid, date, volume
0231245, 20060314, 153
0231245, 20060315, 154
2135411, 20060314, 23
important_stock_dates
stockid, date, thirtydaysprior
0231245, 20060314, 20060130
0231245, 20060315, 20060201
2135411, 20060314, 20060130
My code
create table sum_trading_volume as
select a.stockid, a.date, sum(b.volume) as thirty_day_volume
from important_stock_dates a, all_stock_dates b
where b.date<a.date AND b.date ge a.thirtydaysprior
group by a.stockid, a.date;
Desired outcome
A table with all the observations from important_stock_dates that also has the sum of the volume from the previous 30 days based on matching stockid and dates in all_stock_dates.
Problem
The problem I'm running into is that important_stock_dates has 15 million observations and all_stock_dates has 350 million. It uses up a few hundred gigabytes of swap file running this code (maxes out the hard drive) then aborts. I can't see how to optimize the code. I couldn't find a similar problem on StackOverflow or Google.

Presumably, the query that you want joins on stockid:
create table sum_trading_volume as
    select isd.stockid, isd.date, sum(asd.volume) as thirty_day_volume
    from important_stock_dates isd join
         all_stock_dates asd
         on isd.stockid = asd.stockid and
            asd.date < isd.date and asd.date >= isd.thirtydaysprior
    group by isd.stockid, isd.date;
If this is what you intend, it will probably run to completion: with the join on stockid in place, each row of important_stock_dates only matches that stock's rows in all_stock_dates rather than every stock's, so the intermediate result no longer blows up into a near-cartesian product.
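Not part of the original answer: on a conventional RDBMS, a composite index covering the join key and the date range is usually what makes a join of 15 million against 350 million rows feasible. A sketch (the index name is illustrative; if the original code is SAS proc sql, as the ge operator suggests, indexing is configured differently):

create index idx_all_stock_dates_stockid_date
    on all_stock_dates (stockid, date);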

Related

Getting rows sorted by date (only first unique instance on one column) in Postgresql

So I have a table called "reports" organized as:
id | stock_id | type | time_period | report_date
type can either be "Cash_Flow", "Income_Statement" or "Balance_Sheet", time_period can either be "quarterly" or "yearly", report_date is a date. What I'd want to do is essentially get only the most recent (sorted by report_date) reports for each stock_id.
So if a stock has "Cash_Flow", "Income_Statement", and "Balance_Sheet" quarterly reports for both 2019-04-30 and 2019-01-30, and yearly reports for 2019-04-30 as well, I'd only want to get the most recent quarterly reports for each stock and not return any yearly reports or any older reports. So let's say there are 100 stocks with a total of 8 quarters for each report type in the table (2,400 rows total for quarterly reports) and 2 yearly reports for each type (600 rows of yearly reports).
So I'm currently running postgresql 10.8 on Ubuntu 18.04. I don't usually write raw sql (usually use an ORM), so sorry if the answer is really simple.
So I've tried the following for each report type, but it returns all rows (as expected), whereas I'd only want the most recent. I think the solution would likely require a DISTINCT, but I can't seem to get that working with the ORDER BY.
SELECT *
FROM public.reports where time_period='quarterly' and type='Cash_Flow' group by id, stock_id order by report_date desc;
I'd want a select query that would only return 300 rows, containing only the most recent 3 (Income_Statement, Balance_Sheet, Cash_Flow) reports for each of the 100 stocks or if its easier 3 queries for each of those 3 report types returning 100 rows each.
Use window functions. Assuming you have 100 stocks:
select r.*
from (select r.*,
             row_number() over (partition by stock_id, type order by report_date desc) as seqnum
      from public.reports r
      where time_period = 'quarterly' and
            type in ('Cash_Flow', 'Income_Statement', 'Balance_Sheet')
     ) r
where seqnum = 1;
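Since this is PostgreSQL, an alternative sketch (not from the original answer) is DISTINCT ON, which keeps only the first row per (stock_id, type) according to the ORDER BY:

SELECT DISTINCT ON (stock_id, type) *
FROM public.reports
WHERE time_period = 'quarterly'
  AND type IN ('Cash_Flow', 'Income_Statement', 'Balance_Sheet')
ORDER BY stock_id, type, report_date DESC;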

Find the timestamp of a unique ticket number

I have a table that looks like this:
ActivityNumber -- TimeStamp -- PreviousActivityNumber -- Team
1234-4 -- 01/01/2017 14:12 -- 1234-3 -- Team A
There are 400,000 rows.
The ActivityNumber is a unique ticket number with the activity count attached. There are 4 teams.
Each activitynumber is in the table.
I need to calculate the average time taken between updates for each team, for each month (to see how each team is improving over time).
I produced a query which counts the number of activities per team per month - so I'm part way there.
I'm unable to find the timestamp of the PreviousActivityNumber so I can subtract it from the current activity's timestamp. If I could get this, I could run an average on it.
Conceptually:
select a1.Team,
       a1.ActivityNumber,
       a1.TimeStamp,
       a2.TimeStamp as PrevTime,
       datediff('n', a2.TimeStamp, a1.TimeStamp) as WorkMinutes
from MyTable a1
left join MyTable a2
  on a1.Team = a2.Team
 and a1.PreviousActivityNumber = a2.ActivityNumber;
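Building on the conceptual query, a sketch of the monthly roll-up the question asks for (a sketch only: it assumes the same Access-style datediff and that year()/month() are available in this environment; an inner join is used so activities with no previous update drop out of the average):

select a1.Team,
       year(a1.TimeStamp)  as Yr,
       month(a1.TimeStamp) as Mth,
       avg(datediff('n', a2.TimeStamp, a1.TimeStamp)) as AvgMinutesBetweenUpdates
from MyTable a1
inner join MyTable a2
   on a1.Team = a2.Team
  and a1.PreviousActivityNumber = a2.ActivityNumber
group by a1.Team, year(a1.TimeStamp), month(a1.TimeStamp);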

SQL statement to match dates that are the closest?

I have the following table, let's call it Names:
Name Id Date
Dirk 1 27-01-2015
Jan 2 31-01-2015
Thomas 3 21-02-2015
Next I have the another table called Consumption:
Id Date Consumption
1 26-01-2015 30
1 01-01-2015 20
2 01-01-2015 10
2 05-05-2015 20
Now, I think that doing this using SQL is the fastest approach, since the table contains about 1.5 million rows.
The problem is as follows: I would like to match each Id from the Names table with the Consumption table such that the difference between the dates is the smallest, so we have: Dirk consumes about 30 on 27-01-2015. In case two dates have the same difference, I would like to calculate the average consumption on those two dates.
While I know how to join, I do not know how to code the difference part.
Thanks.
DBMS is Microsoft SQL Server 2012.
I believe that my question differs from the one mentioned in the comments, because it is much more complicated since it involves comparison of dates between two tables rather than having one date and comparing it with the rest of the dates in the table.
This is how you could do it in SQL Server:
SELECT Id, Name, AVG(Consumption)
FROM (
    SELECT n.Id, Name, Consumption,
           RANK() OVER (PARTITION BY n.Id
                        ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date]))) AS rnk
    FROM Names AS n
    INNER JOIN Consumption AS c ON n.Id = c.Id ) t
WHERE t.rnk = 1
GROUP BY Id, Name
Using RANK with PARTITION BY n.Id and ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date])) you can locate all matching records per Id: all records with the smallest difference in days are going to have rnk = 1.
Then, using AVG in the outer query, you are calculating the average value of Consumption between all matching records.
SQL Fiddle Demo
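With the sample data above, the ranking works out as follows (a hand-worked check, not part of the original answer): for Dirk (Id 1, 27-01-2015) the closest Consumption date is 26-01-2015 (1 day away), so the result is 30; for Jan (Id 2, 31-01-2015) it is 01-01-2015 (30 days away), so the result is 10; Thomas (Id 3) has no Consumption rows and is dropped by the inner join. If two Consumption dates tied for the smallest difference, both would get rnk = 1 and the outer AVG would average them.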

oracle sql: efficient way to calculate business days in a month

I have a pretty huge table with columns date, account, amount, etc., e.g.
date account amount
4/1/2014 XXXXX1 80
4/1/2014 XXXXX1 20
4/2/2014 XXXXX1 840
4/3/2014 XXXXX1 120
4/1/2014 XXXXX2 130
4/3/2014 XXXXX2 300
...........
(I have 40 months' worth of daily data and multiple accounts.)
The final output I want is the average amount of each account each month. Since there may or may not be a record for any account on a single day, and I have a separate table of holidays from 2011 to 2014, I am summing up the amount of each account within a month and dividing it by the number of business days of that month. Notice that there are very likely to be record(s) on weekends/holidays, so I need to exclude them from the calculation. Also, I want to have a record for each of the dates available in the original table, e.g.
date account amount
4/1/2014 XXXXX1 48 ((80+20+840+120)/22)
4/2/2014 XXXXX1 48
4/3/2014 XXXXX1 48
4/1/2014 XXXXX2 19 ((130+300)/22)
4/3/2014 XXXXX2 19
...........
(Suppose the above is the only data I have for Apr-2014.)
I am able to do this in a hacky and slow way, but as I need to join this process with other subqueries, I really need to optimize this query. My current code looks like:
select
    date,
    account,
    sum(amount / days_mon) over (partition by account, last_day(date))
from (
    select
        date,
        -- there are more calculations to get the account numbers,
        -- so this subquery is necessary
        account,
        amount,
        -- this is a list of month-end dates for which the number of
        -- business days in that month is 19. similar below.
        case when last_day(date) in ('','',...,'') then 19
             when last_day(date) in ('','',...,'') then 20
             when last_day(date) in ('','',...,'') then 21
             when last_day(date) in ('','',...,'') then 22
             when last_day(date) in ('','',...,'') then 23
        end as days_mon
    from mytable tb
    inner join lookup_businessday_list busi
        on tb.date = busi.date)
So how can I perform the above purpose efficiently? Thank you!
This approach uses sub-query factoring - what other RDBMS flavours call common table expressions. The attraction here is that we can pass the output from one CTE as input to another.
The first CTE generates a list of dates in a given month (you can extend this over any range you like).
The second CTE uses an anti-join on the first to filter out dates which are holidays and also dates which aren't weekdays. Note that the day number returned by to_char(date, 'D') varies according to the NLS_TERRITORY setting; in my realm the weekend is days 6 and 7, but SQL Fiddle is American, so there it is days 1 and 7.
with dates as ( select date '2014-04-01' + ( level - 1) as d
from dual
connect by level <= 30 )
, bdays as ( select d
, count(d) over () tot_d
from dates
left join holidays
on dates.d = holidays.hol_date
where holidays.hol_date is null
and to_number(to_char(dates.d, 'D')) between 2 and 6
)
select yt.account
, yt.txn_date
, sum(yt.amount) over (partition by yt.account, trunc(yt.txn_date,'MM'))
/tot_d as avg_amt
from your_table yt
join bdays
on bdays.d = yt.txn_date
order by yt.account
, yt.txn_date
/
I haven't rounded the average amount.
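The answer notes that the first CTE can be extended over any range; here is a sketch of that extension (not part of the original answer; the names your_table, txn_date and holidays.hol_date are carried over from it), deriving the range from the data and counting business days per month rather than over the whole window:

with bounds as (
  select trunc(min(txn_date), 'MM') as start_d
       , last_day(max(txn_date))    as end_d
  from your_table )
, dates as (
  select start_d + (level - 1) as d
  from bounds
  connect by level <= end_d - start_d + 1 )
, bdays as (
  select d
       , count(*) over (partition by trunc(d, 'MM')) as mon_bdays
  from dates
  left join holidays
    on dates.d = holidays.hol_date
  where holidays.hol_date is null
    -- 'D' depends on NLS_TERRITORY, as noted above
    and to_number(to_char(dates.d, 'D')) between 2 and 6 )
select yt.account
     , yt.txn_date
     , sum(yt.amount) over (partition by yt.account, trunc(yt.txn_date, 'MM'))
         / bdays.mon_bdays as avg_amt
from your_table yt
join bdays
  on bdays.d = yt.txn_date
order by yt.account
       , yt.txn_date;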
You have 40 months of data, and most of it should be very stable.
I will assume that you have a cold body (a big, stable, easily definable range of data) and a hot tail (a small, active part).
Next, I would like to define a minimal period: the smallest data range that is interesting for the business.
It might be a year, month, day, hour, etc. Do you expect to get questions like "what was the average for that account between 19:00 and 12am yesterday?"
I will assume that the answer is DAY.
Then,
I will calculate sum(amount) and count() for every account for every DAY of the cold body.
I will not create dummy records if a particular account had no activity on some day,
and I will save day, account, total amount, and count in a TABLE.
If there are modifications to the cold body later, you delete and reload the affected days in that table.
For the hot tail there might be multiple strategies:
1. Do the same as above (same process, clear to support).
2. Always calculate on the fly.
3. Use a materialized view as a compromise between 1 and 2.
The cold-body table totalc could also be implemented as a materialized view, but if the data never changes there is no need to rebuild it.
With this you go from (number of accounts) x (number of transactions per day) x (number of days) records to (number of accounts) x (number of active days) records.
That should speed up all following calculations.
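A minimal sketch of the cold-body summary table described above (totalc is the name this answer mentions; the names your_table, txn_date, account, amount follow the earlier answer, and the cut-off date between cold body and hot tail is an assumption):

create table totalc as
select account
     , trunc(txn_date) as d
     , sum(amount)     as total_amount
     , count(*)        as txn_count
from your_table
where txn_date < date '2014-04-01'   -- assumed cold-body / hot-tail boundary
group by account, trunc(txn_date);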

SQL for statement of accounts

I would like to thank in advance for any help.
My problem relates to two tables in MySQL (now switching to PostgreSQL). The tables belong to a ticketing database.
a) booking: it has four columns: ccode, date, time, amount
b) account: it has three columns: ccode, date, amount
The booking table has ticket bookings and the account table has advances and payments received.
I have to prepare a statement of account based on ccode (customer code).
The statement shows the columns below:
Ccode Type Date Time Amount Balance
- the report is sorted on ccode and then on date (account table rows appear first)
- the Type column displays B or A depending on the record type
- the Time column is present only for booking rows
- the report has a running balance for each row
- at the end, for each customer code, the totals of amount and balance are displayed
I have had success so far in creating a join as below (and, after the discussion below, have been able to generate the TYPE column using IF).
SELECT booking.cname, booking.bdate, booking.btime, booking.rate, booking.ID,
IF(booking.btime IS NOT NULL, "B", "A") AS type, account.cname, account.date,
account.amount, account.ID
FROM booking
LEFT JOIN account ON booking.bdate = account.date AND booking.cname=account.cname AND
booking.rate = account.amount
UNION
SELECT booking.cname, booking.bdate, booking.btime, booking.rate, booking.ID,
IF(booking.btime IS NOT NULL, "B", "A") AS type, account.cname, account.date,
account.amount, account.ID
FROM booking
RIGHT JOIN account ON booking.bdate = account.date AND booking.cname=account.cname AND
booking.rate = account.amount
It displays all the records. A report can be generated using this table.
But is there a way to display the formatted report just with SQL?
I can change the order of columns and even add or remove existing ones as long as Record type is known and a running balance is displayed against each record.
A SAMPLE REPORT ---- REPORT A
CODE DATE TYPE AMOUNT BALANCE TIME
A1 02/19/2011 A 50 50
A1 02/20/2011 B 35 15 1230
A1 02/21/2011 A 40 55
A1 02/21/2011 B 20 35 1830
optional > TOTAL Account = 90 Booking = 55 Balance = 35
A SAMPLE REPORT ---- REPORT B
CODE AMOUNT BOOKED AMOUNT PAID BALANCE
A1 50 50 0
A1 35 15 20
A1 40 55 -15
A1 20 35 -15
This is a weekly statement version of REPORT A;
the reason is that I can add WHERE and BETWEEN to get only records in a given week,
and since it is a weekly report, the running balance is just omitted.
It is a report grouped over all entries with customer code A1 present
in the booking and account tables.
Thanks.
Since your data is not normalized for this, you pay the price in query complexity, as shown below:
-- note: for the correlated subquery to read `numbered` a second time, `numbered`
-- would in practice need to be materialized first (e.g. as a temporary table);
-- it is shown inline here to keep the whole idea in one statement
SELECT ccode, date, time, type, amount,
       (SELECT SUM(n.amount)
        FROM numbered AS n
        WHERE n.rownum <= numbered.rownum AND n.ccode = numbered.ccode) AS balance
FROM (SELECT sorted.*, @rownum := @rownum + 1 AS rownum
      FROM (SELECT *
            FROM (SELECT ccode, date, time, 'B' AS type, amount
                  FROM booking
                  UNION ALL
                  SELECT ccode, date, '0000' AS time, 'A' AS type, -amount
                  FROM account) AS unsorted
            ORDER BY ccode, date, time, type) AS sorted
      CROSS JOIN (SELECT @rownum := 0) AS init) AS numbered
The idea here is that you first need to get your booking (debits) and account (credits) lined up as in the "unsorted" statement above. Then, you need to sort them by date and time as in the "sorted" statement. Next, add row numbers to the results as in the "numbered" statement. Finally, select all that data along with a sum of amounts with row number less than or equal to the current row number that matches your ccode.
In the future, please consider using a transaction table of some sort which holds all account balance changes in a single table.
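The suggested single transaction table might look roughly like this (a sketch only; names and types are assumptions, MySQL syntax):

create table account_transactions (
    id      int auto_increment primary key,
    ccode   varchar(10)   not null,    -- customer code
    tdate   date          not null,
    ttime   time          null,        -- only bookings carry a time
    ttype   char(1)       not null,    -- 'B' = booking (debit), 'A' = payment (credit)
    amount  decimal(10,2) not null
);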
I have found the answer.
It is to use a cumulative SQL statement to find a running balance.
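For reference, a hedged sketch of such a cumulative/running balance using a window function (valid in PostgreSQL, which the question mentions switching to; column names follow the stated schema of ccode, date, time, amount; the time '00:00' for account rows is an assumption so that they sort before bookings on the same date, and the signs follow the sample report, where payments increase the balance and bookings reduce it):

select ccode, date, time, type, amount,
       sum(case when type = 'A' then amount else -amount end)
           over (partition by ccode
                 order by date, time, type
                 rows between unbounded preceding and current row) as balance
from (select ccode, date, time, 'B' as type, amount from booking
      union all
      select ccode, date, time '00:00' as time, 'A' as type, amount from account) t
order by ccode, date, time;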