Odd behavior doing join and - sql

create table umd2
as select a.permno, a.date, a.realdate, exp(sum(log(1+b.ret))) - 1 as cum_return
from msex2 (obs=50 keep=permno date realdate) as a, msex2 (obs=50 keep=permno date ret) as b
where a.permno=b.permno and 0<=intck('month', b.date, a.date)<3
group by a.permno, a.date
having count(b.ret)=3;
This query calculates momentum (a stock's cumulative return over the past 3 months). However, it gives me duplicate rows. I thought group by would not return duplicate rows?
When I added the realdate column to my group by statement,
create table umd2
as select a.permno, a.date, a.realdate, exp(sum(log(1+b.ret))) - 1 as cum_return
from msex2 (obs=50 keep=permno date realdate) as a, msex2 (obs=50 keep=permno date ret) as b
where a.permno=b.permno and 0<=intck('month', b.date, a.date)<3
group by a.permno, a.date, a.realdate
having count(b.ret)=3;
those duplicate rows disappear. Why is this?

This is the way that SAS behaves. SAS recognizes the following query:
select a.permno, a.date, a.realdate, count(*)
from <whatever>
group by a.permno, a.date, a.realdate;
as being an aggregation query. That means that the rows are aggregated and reduced, with one result row per combination of the three columns. In particular, the non-aggregated columns in the select match (or are a subset of) the columns in the group by.
When you do this:
select a.permno, a.date, a.realdate, count(*)
from <whatever>
group by a.permno, a.date;
You are now using non-standard SQL. Most databases would generate an error. MySQL would accept this syntax (when ONLY_FULL_GROUP_BY is not enabled) and assign an arbitrary value to a.realdate from the matching values. SAS does something different. SAS says: "Well, you clearly don't intend for this to be an aggregation query." So, it doesn't aggregate the rows, but it appends the aggregated values onto each row. In other databases, you would do this using window functions.
Technically, SAS calls this remerging summary data, which is covered in the SAS documentation.
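To make the difference concrete, here is a minimal sketch using Python's sqlite3 module instead of SAS. The table `joined` and its three rows are hypothetical stand-ins for the output of the self-join, and a window function (SQLite 3.25 or later) is used to imitate the remerge; it is an illustration of the idea, not of how SAS implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the joined rows: three monthly returns
# that all belong to the same (permno, date, realdate).
conn.execute(
    "CREATE TABLE joined (permno INTEGER, date TEXT, realdate TEXT, ret REAL)"
)
conn.executemany(
    "INSERT INTO joined VALUES (?, ?, ?, ?)",
    [
        (10001, "2020-03", "2020-03-31", 0.01),
        (10001, "2020-03", "2020-03-31", 0.02),
        (10001, "2020-03", "2020-03-31", -0.01),
    ],
)

# Window function: no rows are collapsed; the summary value is
# attached to every input row, which is what produces the duplicates.
remerged = conn.execute(
    """
    SELECT permno, date, realdate,
           COUNT(ret) OVER (PARTITION BY permno, date) AS n
    FROM joined
    """
).fetchall()

# True aggregation: every non-aggregated column is in the GROUP BY,
# so each group is reduced to a single row.
aggregated = conn.execute(
    """
    SELECT permno, date, realdate, COUNT(ret) AS n
    FROM joined
    GROUP BY permno, date, realdate
    """
).fetchall()

print(len(remerged), len(aggregated))  # 3 1
```

The first query returns three identical rows and the second returns one, which mirrors the duplicates seen before realdate was added to the group by.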

SQL Query not working

I seem to be getting this error while trying to run the below query:
SELECT
to_char(EFFECTIVE_DT,'YYYY-MM') as YYYYMM,
--EFFECTIVE_DT,
AH01_PAYMENT_STATUS_CTD,
TSYS_ACCT_ID
FROM OIS_TSYS.AH_CYCLE_HIST
WHERE 1=1
AND EFFECTIVE_DT BETWEEN '01-MAY-2017' AND '31-MAY-2017'
GROUP BY 2
ORDER BY 1
error: ORA-00979: not a GROUP BY expression
I am trying to group by date, as at the moment I get daily results for each individual account.
Result set:
65589 N 03-MAY-17
65590 S 03-MAY-17
65591 M 03-MAY-17
65592 F 03-MAY-17
65617 G 03-MAY-17
Any help would be amazing.
Best,
Saad
When you "group by 2", all other columns must have an aggregate function (sum, avg, min, max, ...).
The "1=1" is pretty useless.
When you apply a GROUP BY clause, every column in the SELECT list that is not inside an aggregate function (sum, count, min, max, etc.) must also appear in the GROUP BY clause. You cannot group by just one column when the SELECT list contains other non-aggregated columns. So in your case you have to put all three selected columns in the GROUP BY.
To get the desired result, use the query below:
SELECT
TSYS_ACCT_ID,
AH01_PAYMENT_STATUS_CTD,
to_char(EFFECTIVE_DT,'YYYY-MM') as YYYYMM
FROM OIS_TSYS.AH_CYCLE_HIST
WHERE EFFECTIVE_DT BETWEEN '01-MAY-2017' AND '31-MAY-2017'
GROUP BY
TSYS_ACCT_ID,
AH01_PAYMENT_STATUS_CTD,
to_char(EFFECTIVE_DT,'YYYY-MM')
ORDER BY 1
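As a rough check of the corrected shape, the sketch below runs an equivalent query in SQLite through Python. Several things here are assumptions made for illustration: the sample rows, ISO date strings in place of Oracle dates, `strftime` in place of `to_char`, and an extra `COUNT(*)` column that is not in the original query. SQLite is also more permissive than Oracle and would not raise ORA-00979, so this only shows what the grouped result looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ah_cycle_hist (
        tsys_acct_id INTEGER,
        ah01_payment_status_ctd TEXT,
        effective_dt TEXT  -- ISO date string (assumption; Oracle uses DATE)
    )
    """
)
# Hypothetical daily rows for two accounts.
conn.executemany(
    "INSERT INTO ah_cycle_hist VALUES (?, ?, ?)",
    [
        (65589, "N", "2017-05-03"),
        (65589, "N", "2017-05-04"),
        (65590, "S", "2017-05-03"),
        (65590, "S", "2017-05-04"),
        (65590, "M", "2017-05-05"),
    ],
)

# Every non-aggregated SELECT column is repeated in the GROUP BY,
# so daily rows collapse to one row per account, status and month.
monthly = conn.execute(
    """
    SELECT tsys_acct_id,
           ah01_payment_status_ctd,
           strftime('%Y-%m', effective_dt) AS yyyymm,
           COUNT(*) AS days
    FROM ah_cycle_hist
    WHERE effective_dt BETWEEN '2017-05-01' AND '2017-05-31'
    GROUP BY tsys_acct_id,
             ah01_payment_status_ctd,
             strftime('%Y-%m', effective_dt)
    ORDER BY 1, 2
    """
).fetchall()

for row in monthly:
    print(row)
```

Five daily rows become three monthly rows, one per distinct account and status combination.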

SELECT columns OVER (PARTITION BY column)

Suppose I want to retrieve the swimmer and their time at the 75th Percentile for each day.
This is what I was trying to do:
SELECT tableA.DATE, tableA.SWIMMER, tableA.TIME
OVER (PARTITION BY tableA.DATE)
FROM tableA
WHERE RANK = CEIL(0.75 * NUM_OF_SWIMMERS);
But this errors at the OVER statement.
What's the best way to get the data I need?
Thanks!
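Your error is that OVER can only follow a window (analytic) function call; it cannot be attached to a plain column such as tableA.TIME. Ranking functions additionally require an ORDER BY inside the OVER clause.
But assuming that RANK and NUM_OF_SWIMMERS are already columns in tableA, why not just return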
select
date,
swimmer,
time
from tablea
where
RANK = CEIL(0.75 * NUM_OF_SWIMMERS)
?
The WHERE clause will ensure the only rows returned are those at the 75th percentile for a given day.
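If RANK and NUM_OF_SWIMMERS are not stored columns, they can be derived with window functions. The sketch below is one possible way to do that, run in SQLite (3.25 or later) through Python. The table `swims`, its column names and the sample rows are hypothetical. It uses ROW_NUMBER rather than RANK so that ties cannot skip the target position, and it replaces CEIL(0.75 * n) with the integer expression (3 * n + 3) / 4, which gives the same value for positive n without relying on SQLite's optional math functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swims (swim_date TEXT, swimmer TEXT, swim_time REAL)")
# Hypothetical results: four swimmers on day one, two on day two.
conn.executemany(
    "INSERT INTO swims VALUES (?, ?, ?)",
    [
        ("2024-01-01", "Ann", 50.0),
        ("2024-01-01", "Bob", 52.0),
        ("2024-01-01", "Cy", 54.0),
        ("2024-01-01", "Dee", 56.0),
        ("2024-01-02", "Ann", 51.0),
        ("2024-01-02", "Bob", 53.0),
    ],
)

p75 = conn.execute(
    """
    SELECT swim_date, swimmer, swim_time
    FROM (
        SELECT swim_date, swimmer, swim_time,
               -- position of each swimmer within the day, fastest first
               ROW_NUMBER() OVER (PARTITION BY swim_date
                                  ORDER BY swim_time) AS rn,
               -- number of swimmers on that day
               COUNT(*) OVER (PARTITION BY swim_date) AS n
        FROM swims
    ) AS ranked
    WHERE rn = (3 * n + 3) / 4  -- integer form of CEIL(0.75 * n)
    ORDER BY swim_date
    """
).fetchall()

for row in p75:
    print(row)
```

The window functions have to live in an inner query because their results cannot be referenced in the WHERE clause of the same SELECT.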

Subqueries: What am I doing fundamentally wrong?

I thought that selecting values from a subquery in SQL would only yield values from that subset until I found a very nasty bug in code. Here is an example of my problem.
I'm selecting the rows that contain the latest(max) function by date. This correctly returns 4 rows with the latest check in of each function.
select *, max(date) from cm where file_id == 5933 group by function_id;
file_id  function_id  date        value  max(date)
5933     64807        1407941297  1      1407941297
5933     64808        1407941297  11     1407941297
5933     895175       1306072348         1306072348
5933     895178       1363182349         1363182349
When selecting only the value from the subset above, it returns function values from previous dates, i.e. rows that don't belong in the subset above. You can see the result below where the dates are older than in the first subset.
select temp.function_id, temp.date, temp.value
from (select *, max(date)
from cm
where file_id == 5933
group by function_id) as temp;
function_id  date        value
64807        1306072348  1      <- outdated row, not in first subset
64808        1306072348  17     <- outdated row, not in first subset
895175       1306072348
895178       1363182349
What am I doing fundamentally wrong? Shouldn't selects performed on subqueries only return possible results from those subqueries?
SQLite allows you to use MAX() to select the row to be returned by a GROUP BY, but this works only if the MAX() is actually computed.
When you throw the max(date) column away, this no longer works.
In this case, you actually want to use the date value, so you can just keep the MAX():
SELECT function_id,
max(date) AS date,
value
FROM cm
WHERE file_id = 5933
GROUP BY function_id
You seem to be missing the fact that your subquery is returning ALL rows for the given file_id. If you want to restrict your subquery to records with the most recent date, then you need to restrict it with a WHERE NOT EXISTS clause to check that no more recent records exist for the given condition.
Perhaps my question was not formulated correctly, but this post had the solutions I was essentially looking for:
https://stackoverflow.com/a/123481/2966951
https://stackoverflow.com/a/121435/2966951
Filtering out the most recent row was my problem. I was surprised that selecting from a subquery with a max value could yield anything other than that value.
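For reference, the WHERE NOT EXISTS approach mentioned above can be sketched as follows, again in SQLite through Python. The rows are hypothetical and only loosely modelled on the output in the question. Unlike the bare-column behaviour of MAX(), this form does not depend on SQLite-specific rules, so it keeps working when the query is wrapped in a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cm (file_id INTEGER, function_id INTEGER, date INTEGER, value INTEGER)"
)
# Hypothetical check-ins: two functions have an older and a newer row.
conn.executemany(
    "INSERT INTO cm VALUES (?, ?, ?, ?)",
    [
        (5933, 64807, 1306072348, 1),
        (5933, 64807, 1407941297, 1),
        (5933, 64808, 1306072348, 17),
        (5933, 64808, 1407941297, 11),
        (5933, 895175, 1306072348, None),
        (5933, 895178, 1363182349, None),
    ],
)

# Keep a row only if no newer row exists for the same file and function.
latest = conn.execute(
    """
    SELECT c.function_id, c.date, c.value
    FROM cm AS c
    WHERE c.file_id = 5933
      AND NOT EXISTS (
          SELECT 1
          FROM cm AS newer
          WHERE newer.file_id = c.file_id
            AND newer.function_id = c.function_id
            AND newer.date > c.date
      )
    ORDER BY c.function_id
    """
).fetchall()

for row in latest:
    print(row)
```

Each function_id appears once, with the date and value taken from the same (most recent) row.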

20 Day moving average with joins alone

There are questions like this all over the place so let me specify where I specifically need help.
I have seen moving averages in SQL with Oracle analytic functions, MSSQL apply, or a variety of other methods. I have also seen this done with self joins (one join for each day of the average, as in "How do you create a Moving Average Method in SQL?").
I am curious whether there is a way (only using self joins) to do this in SQL (preferably Oracle, but since my question is geared towards joins alone this should be possible for any RDBMS). The approach would have to be scalable (for a 20 or 100 day moving average, in contrast to the link I researched above, which required a join for each day in the moving average).
My thoughts are
select customer, a.tradedate, a.shares, avg(b.shares)
from trades a, trades b
where b.tradedate between a.tradedate-20 and a.tradedate
group by customer, a.tradedate
But when I tried it in the past it hadn't worked. To be more specific, I am trying a smaller but similar example (5 day avg instead of 20 day) with this fiddle demo and can't find out where I am going wrong. http://sqlfiddle.com/#!6/ed008/41
select a.ticker, a.dt_date, a.volume, avg(b.volume)
from yourtable a, yourtable b
where b.dt_date between a.dt_date-5 and a.dt_date
and a.ticker=b.ticker
group by a.ticker, a.dt_date, a.volume
I don't see anything wrong with your second query. I think the only reason it's not what you're expecting is that the volume field is an integer data type, so the calculated average is also returned as an integer. For an average you have to cast it, because the result won't necessarily be a whole number:
select a.ticker, a.dt_date, a.volume, avg(cast(b.volume as float))
from yourtable a
join yourtable b
on a.ticker = b.ticker
where b.dt_date between a.dt_date - 5 and a.dt_date
group by a.ticker, a.dt_date, a.volume
Fiddle:
http://sqlfiddle.com/#!6/ed008/48/0 (thanks to @DaleM for DDL)
I don't know why you would ever do this vs. an analytic function though, especially since you mention wanting to do this in Oracle (which has analytic functions). It would be different if your preferred database were MySQL or a database without analytic functions.
Just to add to the answer, this is how you would achieve the same result in Oracle using analytic functions. Notice how the PARTITION BY acts as the join you're using on ticker. That splits up the results so that the same date shared across multiple tickers doesn't interfere.
select ticker,
dt_date,
volume,
avg(cast(volume as decimal)) over( partition by ticker
order by dt_date
rows between 5 preceding
and current row ) as mov_avg
from yourtable
order by ticker, dt_date, volume
Fiddle:
http://sqlfiddle.com/#!4/0d06b/4/0
Analytic functions will likely run much faster.
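One caveat when switching between the two forms: the self-join filters on a date range, while ROWS BETWEEN counts physical rows. They agree when every ticker has a row for every day, but can diverge when days are missing. The sketch below illustrates this in SQLite (3.25 or later) through Python, with a hypothetical integer `day_no` column standing in for dt_date and a 3-day window to keep the numbers small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (ticker TEXT, day_no INTEGER, volume INTEGER)")
# Hypothetical data: ticker A has no row for day 3.
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [("A", 1, 10), ("A", 2, 20), ("A", 4, 40)],
)

# Self-join: average over rows whose day falls in [day_no - 2, day_no].
by_join = conn.execute(
    """
    SELECT a.day_no, AVG(b.volume)
    FROM yourtable a
    JOIN yourtable b
      ON a.ticker = b.ticker
     AND b.day_no BETWEEN a.day_no - 2 AND a.day_no
    GROUP BY a.ticker, a.day_no
    ORDER BY a.day_no
    """
).fetchall()

# Window function: average over the current row and the 2 rows before it,
# regardless of how many days apart those rows are.
by_rows = conn.execute(
    """
    SELECT day_no,
           AVG(volume) OVER (PARTITION BY ticker
                             ORDER BY day_no
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM yourtable
    ORDER BY day_no
    """
).fetchall()

print(by_join)
print(by_rows)
```

For day 4 the self-join averages days 2 and 4 only, while the ROWS frame also pulls in day 1 because it is one of the two preceding rows. Which behaviour is wanted depends on whether the average is meant to be over calendar days or over trading records.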
http://sqlfiddle.com/#!6/ed008/45 would appear to be what you need.
select a.ticker,
a.dt_date,
a.volume,
(select avg(cast(b.volume as float))
from yourtable b
where b.dt_date between a.dt_date-5 and a.dt_date
and a.ticker=b.ticker)
from yourtable a
order by a.ticker, a.dt_date
Not a join, but a correlated subquery.

Subtracting 2 values from a query and sub-query using CROSS JOIN in SQL

I have a question that I'm having trouble answering.
Find out what is the difference in number of invoices and total of invoiced products between May and June.
One way of doing it is to use sub-queries: one for June and the other one for May, and to subtract the results of the two queries. Since each of the two subqueries will return one row you can (should) use CROSS JOIN, which does not require the "on" clause since you join "all" the rows from one table (i.e. subquery) to all the rows from the other one.
To find the month of a certain date, you can use MONTH function.
Here is the Erwin document
This is what I have so far. I have no idea how to use CROSS JOIN in this situation.
select COUNT(*) TotalInv, SUM(ILP.ProductCount) TotalInvoicedProducts
from Invoice I, (select Count(distinct ProductId) ProductCount from InvoiceLine) AS ILP
where MONTH(inv_date) = 5
select COUNT(*) TotalInv, SUM(ILP.ProductCount) TotalInvoicedProducts
from Invoice I, (select Count(distinct ProductId) ProductCount from InvoiceLine) AS ILP
where MONTH(inv_date) = 6
If you guys can help that would be great.
Thanks
The problem statement suggests you use the following steps:
1. Construct a query, with a single result row giving the values for June.
2. Construct a query, with a single result row giving the values for May.
3. Compare the results of the two queries.
The issue is that, in SQL, it's not super easy to do that third step. One way to do it is by doing a cross join, which yields a row containing all the values from both subqueries; it's then easy to use SELECT (b - a) ... to get the differences you're looking for. This isn't the only way to do the third step, but what you have definitely doesn't work.
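The three steps can be sketched as below, using SQLite through Python. The schema is a guess, since the Erwin document is not reproduced here: the tables `invoice` and `invoice_line`, their columns, and the sample rows are all hypothetical, "total of invoiced products" is interpreted as the number of invoice lines, and `strftime('%m', ...)` stands in for the MONTH function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data.
conn.execute("CREATE TABLE invoice (inv_id INTEGER PRIMARY KEY, inv_date TEXT)")
conn.execute("CREATE TABLE invoice_line (inv_id INTEGER, product_id INTEGER)")
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?)",
    [
        (1, "2023-05-02"), (2, "2023-05-20"),
        (3, "2023-06-01"), (4, "2023-06-15"), (5, "2023-06-30"),
    ],
)
conn.executemany(
    "INSERT INTO invoice_line VALUES (?, ?)",
    [
        (1, 100), (1, 101), (2, 100),            # May: 3 lines
        (3, 102), (4, 100), (4, 101),
        (5, 102), (5, 103),                      # June: 5 lines
    ],
)

# Each subquery returns exactly one row, so the CROSS JOIN yields one row
# holding both months side by side; the outer SELECT subtracts them.
diff = conn.execute(
    """
    SELECT june.invoices - may.invoices AS invoice_diff,
           june.products - may.products AS product_diff
    FROM (
        SELECT COUNT(DISTINCT i.inv_id) AS invoices,
               COUNT(l.product_id)      AS products
        FROM invoice i
        LEFT JOIN invoice_line l ON l.inv_id = i.inv_id
        WHERE strftime('%m', i.inv_date) = '06'
    ) AS june
    CROSS JOIN (
        SELECT COUNT(DISTINCT i.inv_id) AS invoices,
               COUNT(l.product_id)      AS products
        FROM invoice i
        LEFT JOIN invoice_line l ON l.inv_id = i.inv_id
        WHERE strftime('%m', i.inv_date) = '05'
    ) AS may
    """
).fetchone()

print(diff)  # (1, 2)
```

With this data May has 2 invoices and 3 lines, June has 3 invoices and 5 lines, so the differences are 1 and 2.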
Can't you do something with subqueries? I haven't tested this, but something like the below should give you 4 columns: invoices and products for May and June.
select may.may_invoices, may.may_products, june.june_invoices, june.june_products
from (
select 'stuff' a, count(*) as june_invoices, sum(products) as june_products from invoices
where month = 'june'
) june , (
select 'stuff' a, count(*) as may_invoices, sum(products) as may_products from invoices
where month = 'may'
) may
where june.a = may.a