How to loop through a specified range - SQL

I have a database of movies where one field is the year in which the movie was released.
I want to create a query which will loop through each decade and will calculate the sum of a particular field for that decade. I have no idea how I can get a loop for every decade. Can anyone help?

If you want the decades without any movies as well as those with movies, you can use generate_series to build your list of decades and then do a left outer join to your table; generate_series is the standard way to build numeric and time lists on the fly in PostgreSQL. Something like this should get you started:
select decade.d, count(t.year)
from generate_series(1900, 2100, 10) as decade(d)
left outer join your_table t on decade.d = floor(t.year / 10) * 10
group by decade.d
order by decade.d
That will produce output like this:
d | count
------+-------
1900 | 1
1910 | 0
1920 | 1
1930 | 3
1940 | 0
1950 | 0
1960 | 1
1970 | 0
1980 | 3
-- ...
2100 | 0
You could adjust the first and last values for the generate_series call to match your data if desired.
The floor(t.year / 10) * 10 bit gives you the decade for a given year; it will convert 1942 to 1940, 2000 to 2000, etc.
You can set up a decade table (a one-column table with one entry for each decade) if you move to a database that doesn't have something like generate_series. The SQL would be pretty much the same; just replace the generate_series call with your decade table, as sketched below.
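For example, a minimal sketch of that approach (the decades table and its population are assumed, not from the original answer):
create table decades (d integer primary key);
insert into decades values (1900), (1910), (1920); -- ...and so on, one row per decade
select decades.d, count(t.year)
from decades
left outer join your_table t on decades.d = floor(t.year / 10) * 10
group by decades.d
order by decades.d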

Try something like this (I don't know how your tables look, so I'm guessing):
SELECT decade, sum(column_x)
FROM (
SELECT date_trunc('decade', movie_year)::date as decade
, column_x
FROM movies) as movies_with_decades
GROUP BY decade
ORDER BY decade;

Related

Hive QL to populate a sequence of numbers between limits

Not sure how to put this in a straightforward manner, but I'm trying to make something work in Hive SQL. I need to create a sequence of numbers from a lower limit to an upper limit.
Ex:
select min(year) from table
Let's assume it results in 2010
select max(year) from table
Let's assume it results in 2015
I need to publish each year from 2010 to 2015 in a select query.
And I'm trying to put the min calculation & max calculation inside the same SQL which will/should create sequential years in the output.
Any ideas?
Well, I have an idea, but in order to use it you will have to define the lowest and the highest possible values for the years that might be present in your table.
Let's say the smallest possible year is 1900 and the largest possible year is 2200.
Since the largest possible difference in this case is 2200-1900=300, you will have to use the following string: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ... ... 298 299 300.
In the query, you split this string using space as a delimiter thus getting an array, and then you explode that array.
Have a look:
SELECT
minval + delta
FROM
(
SELECT
min(year) minval,
max(year) maxval,
split('0 1 2 3 4 5 6 7 8 9 10 11 12 13 ... ... ... 298 299 300', ' ') delta_list
FROM
table
) t
LATERAL VIEW explode(delta_list) dlist AS delta
WHERE (maxval-minval) >= delta
;
So you end up with 301 rows, but you only need the rows whose delta values do not exceed the difference between the max year and the min year, which is what the WHERE clause enforces.
set hivevar:end_year=2019;
set hivevar:start_year=2010;
select ${hivevar:start_year}+i as year
from
(
select posexplode(split(space((${hivevar:end_year}-${hivevar:start_year})),' ')) as (i,x)
)s;
Result:
year
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
Have a look also at this answer about generating missing dates.

Count Distinct values in one column based on other columns

I have a table that looks like the following:
app_id supplier_reached creation_date platform
10001 1 9/11/2018 iOS
10001 2 9/18/2018 iOS
10002 1 5/16/2018 android
10003 1 5/6/2018 android
10004 1 10/1/2018 android
10004 1 2/3/2018 android
10004 2 2/2/2018 web
10005 4 1/5/2018 web
10005 2 5/1/2018 android
10006 3 10/1/2018 iOS
10005 4 1/1/2018 iOS
The objective is to find the unique number of app_id submitted per month.
If I just do a count(distinct app_id) I will get the following results:
Group by month | count(app_id)
Jan            | 1
Feb            | 1
May            | 3
September      | 1
October        | 2
However, an application is considered unique based on a combination of other fields as well. For example, for the month of January the app_id is the same; however, the combination of app_id, supplier_reached and platform shows different values, and hence the app_id should be counted twice.
Following the same pattern, the desired result should be:
Group by month | Desired answer
Jan            | 2
Feb            | 2
May            | 3
September      | 2
October        | 2
Lastly, there can be many other columns in the table which may or may not contribute to the uniqueness of an application.
Is there a way to do this type of count in SQL?
I am using Redshift.
As pointed out above, in Redshift count(distinct ...) does not work with multiple fields.
You can first group by the columns that you want to be unique and then count the records like this:
select month, count(1) as app_number
from (
select month, app_id, supplier_reached, platform
from your_table
group by 1, 2, 3, 4
) t
group by 1
I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:
count(distinct app_id || ':' || supplier_reached || ':' || platform)
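Spelled out as a full query against the table above (a sketch; the numeric columns are cast to text first, since Redshift will not concatenate numbers directly):
select month,
count(distinct app_id::varchar || ':' || supplier_reached::varchar || ':' || platform) as app_number
from your_table
group by month;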
The way your objective is stated is wrong.
You don't want
to find the unique number of app_id submitted per month
you want
to find the unique number of app_id + supplier_reached + platform combinations submitted per month.
And so, you need to use either a) a combination of columns, like count(distinct col1||col2||col3), or b)
select t1.month, count(*)
from (select distinct
app_id,
supplier_reached,
platform,
month
from sometable) t1
group by t1.month
Actually, you can count distinct ROW values conveniently in Postgres:
SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM tbl
GROUP BY 1;
The ROW keyword would be just noise here:
count(DISTINCT ROW(app_id, supplier_reached, platform))
I would discourage concatenating columns for the purpose. This is comparatively expensive, error prone (think of distinct data types and locale-dependent text representation) and introduces corner-case errors if the used separator can be contained in column values.
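A tiny demonstration of that corner case: two distinct rows can concatenate to the same string when the separator also appears in the data.
-- both expressions yield 'a:b:c', so ('a', 'b:c') and ('a:b', 'c') would wrongly count as one
select 'a' || ':' || 'b:c', 'a:b' || ':' || 'c';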
Alas, not supported by Redshift. From its list of unsupported PostgreSQL features:
...
Value expressions:
  Subscripted expressions
  Array constructors
  Row constructors
...

SQL Percentile function missing?

This textbook example is a mystery to me. We have simply:
Transaction_ID (primary key) | Client_ID | Transaction_Amount | Month
1                            | 1         | 500                | 1
2                            | 1         | 1000               | 1
3                            | 1         | 10                 | 2
4                            | 2         | 11                 | 2
5                            | 3         | 300                | 2
6                            | 3         | 10                 | 2
...                          | ...       | ...                | ...
I want to calculate in SQL the mean(Transaction_Amount), std(Transaction_Amount) and some percentile(Transaction_Amount) grouped by Client_ID. But it seems that, even though a percentile is a very similar calculation to the standard deviation, SQL cannot do it with a simple statement such as:
SELECT
mean(Transaction_Amount),
std(Transaction_Amount),
percentile(Transaction_Amount)
FROM
myTable
GROUP BY
Client_ID, Month
Or can it?
It gets worse because I also need to GROUP BY Month in addition to Client_ID.
Thanks a lot!
Sven
I'm sure Oracle can do the calculations you want. I just don't know what they are. You specify that you want something grouped by ClientId. Yet, your sample query has two keys in the GROUP BY.
Some functions that you want to look at are:
AVG()
STDDEV()
PERCENT_RANK()
Without sample data and desired results (or a very clear explanation of what you are trying to calculate), I can't put together a query.
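Purely to illustrate the syntax (a sketch assuming the median is the percentile wanted), the three aggregates can sit side by side in Oracle:
SELECT Client_ID, Month,
       AVG(Transaction_Amount),
       STDDEV(Transaction_Amount),
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Transaction_Amount)
FROM myTable
GROUP BY Client_ID, Month;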

Is there an established pattern for SQL queries which group by a range?

I've seen a lot of questions on SO concerning how to group data by a range in a SQL query.
The exact scenarios vary, but the general underlying problem in each is to group by a range of values rather than each discrete value in the GROUP BY column. In other words, to group by a less precise granularity than you're storing in the database table.
This crops up often in the real world when producing things like histograms, calendar representations, pivot tables and other bespoke reporting outputs.
Some example data (tables unrelated):
| OrderHistory | | Staff |
--------------------------- ------------------------
| Date | Quantity | | Age | Name |
--------------------------- ------------------------
|01-Jul-2012 | 2 | | 19 | Barry |
|02-Jul-2012 | 5 | | 53 | Nigel |
|08-Jul-2012 | 1 | | 29 | Donna |
|10-Jul-2012 | 3 | | 26 | James |
|14-Jul-2012 | 4 | | 44 | Helen |
|17-Jul-2012 | 2 | | 49 | Wendy |
|28-Jul-2012 | 6 | | 62 | Terry |
--------------------------- ------------------------
Now let's say we want to use the Date column of the OrderHistory table to group by weeks, i.e. 7-day ranges. Or perhaps group the Staff into 10-year age ranges:
| Week | QtyCount | | AgeGroup | NameCount |
-------------------------------- -------------------------
|01-Jul to 07-Jul | 7 | | 10-19 | 1 |
|08-Jul to 14-Jul | 8 | | 20-29 | 2 |
|15-Jul to 21-Jul | 2 | | 30-39 | 0 |
|22-Jul to 28-Jul | 6 | | 40-49 | 2 |
-------------------------------- | 50-59 | 1 |
| 60-69 | 1 |
-------------------------
GROUP BY Date and GROUP BY Age on their own won't do it.
The most common answers I see (none of which are consistently voted "correct") are to use one or more of:
a bunch of CASE statements, one per grouping
a bunch of UNION queries, with a different WHERE clause per grouping
as I'm working with SQL Server, PIVOT() and UNPIVOT()
a two-stage query using a sub-select, temp table or View construct
Is there an established generic pattern for dealing with such queries?
You can use some of the dimensional modeling techniques, such as fact tables and dimension tables. OrderHistory can act as a fact table with a DateKey foreign key relation to a Date dimension.
Date dimension can have a schema such as below:
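A minimal sketch of such a dimension (column names assumed; a real date dimension would carry many more attributes):
CREATE TABLE DimDate (
    DateKey       INT PRIMARY KEY,  -- e.g. 20120701
    FullDate      DATE NOT NULL,
    CalendarWeek  INT NOT NULL,     -- used by the grouping query below
    CalendarMonth INT NOT NULL,
    CalendarYear  INT NOT NULL
);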
Note that the Date table is pre-filled with data for up to N years.
Using an example above, here is a sample query to get the result:
select CalendarWeek, sum(Quantity)
from OrderHistory a
join DimDate b
on a.DateKey = b.DateKey
group by CalendarWeek
For the Staff table, you can store a birthday key instead of the age and let the query calculate the age and the ranges.
As is often the case, this SQL problem requires using more than one pattern in composition.
In this case the two you can use are
NTILE
Numbers Table
You can use NTILE to create a set number of groups. However, since you don't have every member of each group represented, you also need to use a numbers table. Since you're using SQL Server you have it easy, as you don't have to simulate either.
Here's an example for the Staff problem
WITH g as (
SELECT
NTILE(6) OVER (ORDER BY number) grp,
NUMBER
FROM
master..spt_values
WHERE
TYPE = 'P'
and number >=10 and number <=69
)
SELECT
CAST(min(g.number) as varchar) + ' - ' +
CAST(max(g.number) as varchar) AgeGroup ,
COUNT(s.age) NameCount
FROM
g
LEFT JOIN Staff s
ON g.NUMBER = s.Age
GROUP BY
grp
You can apply this to dates as well; it just requires some date-to-day manipulation.
Take a look at the OVER clause and its associated clauses: PARTITION BY, ORDER BY, ROWS/RANGE...
Determines the partitioning and ordering of a rowset before the
associated window function is applied. That is, the OVER clause
defines a window or user-specified set of rows within a query result
set. A window function then computes a value for each row in the
window. You can use the OVER clause with functions to compute
aggregated values such as moving averages, cumulative aggregates,
running totals, or a top N per group results.
My favorite case in this genre is where transactions must be grouped by fiscal quarter or fiscal year. The fiscal quarter or fiscal year boundaries of various enterprises can border on the bizarre.
My favorite way to implement this is to create a separate table for the attributes of a date. Let's call the table "Almanac". One of the columns in this table is the fiscal quarter, and another one is the fiscal year. The key to this table is of course the date. Ten years' worth of data fills up 3,650 rows, plus a few for leap years. You then need a program that can populate this table from scratch. All the enterprise calendar rules are built into this one program.
When you need to group transaction data by fiscal quarter, you just join with this table over date, and then group by fiscal quarter.
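A sketch of that join (the Almanac and transaction column names are assumed):
SELECT a.FiscalQuarter, SUM(t.Amount) AS Total
FROM Transactions t
JOIN Almanac a ON a.CalendarDate = t.TransactionDate
GROUP BY a.FiscalQuarter;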
I figure this pattern could be extended to groupings by other kinds of ranges, but I've never done it myself.
In your first example your intervals are regular, so you can achieve the desired result simply by using functions. Below is an example that gets the data as you require it. The first query keeps the first column in date format (how I would prefer to deal with it, doing any formatting outside of SQL); the second does the string conversion for you.
DECLARE @OrderHistory TABLE (Date DATE, Quantity INT)
INSERT @OrderHistory VALUES
('20120701', 2), ('20120702', 5), ('20120708', 1), ('20120710', 3),
('20120714', 4), ('20120717', 2), ('20120728', 6)
SET DATEFIRST 7
SELECT DATEADD(DAY, 1 - DATEPART(WEEKDAY, Date), Date) AS WeekStart,
SUM(Quantity) AS Quantity
FROM @OrderHistory
GROUP BY DATEADD(DAY, 1 - DATEPART(WEEKDAY, Date), Date)
SELECT WeekStart,
SUM(Quantity) AS Quantity
FROM @OrderHistory
CROSS APPLY
( SELECT CONVERT(VARCHAR(6), DATEADD(DAY, 1 - DATEPART(WEEKDAY, Date), Date), 6) + ' to ' +
CONVERT(VARCHAR(6), DATEADD(DAY, 7 - DATEPART(WEEKDAY, Date), Date), 6) AS WeekStart
) ws
GROUP BY WeekStart
Something similar can be done for your age grouping using:
SELECT CAST(FLOOR(Age / 10.0) * 10 AS INT)
However this fails for 30-39 because there is no data for this group.
My stance on the matter would be: if you are doing the query as a one-off, a temp table, CTE or CASE statement should work just fine. This should also extend to reusing the same query on small sets of data.
If you are likely to reuse the group, however, or you are referring to significant amounts of data, then create a permanent table with the ranges defined and indices applied to any columns required. This is the basis of creating dimensions in OLAP.
Couldn't you treat the age (or date) as a foreign key in a new, tiny table that is just ages (or dates) and their corresponding ranges? A join statement could provide a new table with a column that contains AgeGroups. With the new table you could use the standard group-by method.
It does seem reckless to make a new table for grouping, but it would be easy to make programmatically and I think it would be easier to maintain (or drop and recreate) than a CASE statement or a WHERE clause. If the result of this query is a one-off, a throwaway SQL statement would probably work best, but I think my method makes the most sense for long-term use.
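A sketch of that idea against the Staff example (the lookup table and its contents are assumed):
CREATE TABLE AgeGroups (Age INT PRIMARY KEY, AgeGroup VARCHAR(10));
-- populated programmatically: (10, '10-19'), (11, '10-19'), ..., (69, '60-69')
SELECT g.AgeGroup, COUNT(s.Name) AS NameCount
FROM AgeGroups g
LEFT JOIN Staff s ON s.Age = g.Age
GROUP BY g.AgeGroup;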
Well, some years ago with Oracle DB we did it the following way:
We had two tables: Sessions and Ranges. Ranges had a foreign key that referenced Sessions.
When we needed to perform SQL, we created a new record in Sessions and several new records in Ranges that referred to that session.
Our SQL joined Ranges with filter by Session:
select sum(t.Value), r.Name
from DataTable t
join Ranges r on (r.Session = ? and r.Start <= t.MyDate and r.End > t.MyDate)
group by r.Name
After we got the results, we deleted that record from Sessions, and the records in Ranges were deleted by cascade.
We had a daemon job that purged from Sessions any junk records that leaked in extraordinary situations (killed processes, etc.).
This worked perfectly. Since that time Oracle has added new SQL clauses, and maybe they could be used instead. But on other RDBMSes this is still a valid way.
Another approach is to create a number of functions such as GET_YEAR_BY_DATE, GET_QUARTER_BY_DATE or GET_WEEK_BY_DATE (each would return the start date of the corresponding period; for example, for any date, GET_YEAR_BY_DATE returns the start date of that date's year). And then group by them:
select sum(Value), GET_YEAR_BY_DATE(MyDate) from DataTable
group by GET_YEAR_BY_DATE(MyDate)
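A minimal sketch of one such function in Oracle, where TRUNC does the real work (function body assumed, not from the original answer):
CREATE OR REPLACE FUNCTION GET_YEAR_BY_DATE(p_date DATE) RETURN DATE IS
BEGIN
  RETURN TRUNC(p_date, 'YYYY'); -- first day of the year containing p_date
END;
/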

SQL complex view for virtually showing data

I have a table with the following data.
----------------------------------
Hour Location Stock
----------------------------------
6 2000 20
9 2000 24
----------------------------------
So this shows stock for only those hours in which there is a change in the quantity.
Now my requirement is to create a view on this table which will virtually show the data (even if stock is not there for a particular hour). So the data that should be shown is:
----------------------------------
Hour Location Stock
----------------------------------
6 2000 20
7 2000 20 -- same as hour 6 stock
8 2000 20 -- same as hour 6 stock
9 2000 24
----------------------------------
That means even if the data is not there for a particular hour, we should show the stock from the last hour that has stock. And I have another table with all the available hours from 1-23 in a column.
I have tried the PARTITION BY method as given below, but I think I am missing something to get my requirement done.
SELECT
HOUR_NUMBER,
CASE WHEN TOTAL_STOCK IS NULL
THEN SUM(TOTAL_STOCK)
OVER (
PARTITION BY LOCATION
ORDER BY CURRENT_HOUR ROWS 1 PRECEDING
)
ELSE
TOTAL_STOCK
END AS FULL_STOCK
FROM
(
SELECT HOUR_NUMBER AS HOUR_NUMBER
FROM HOURS_TABLE -- REFERENCE TABLE WITH HOURS FROM 1-23
GROUP BY 1
) HOURS_REF
LEFT OUTER JOIN
(
SEL CURRENT_HOUR AS CURRENT_HOUR
, STOCK AS TOTAL_STOCK
,LOCATION AS LOCATION
FROM STOCK_TABLE
WHERE STOCK<>0
) STOCKS
ON HOURS_REF.HOUR_NUMBER = STOCKS.CURRENT_HOUR
This query is giving all the hours, but with stock as null for the hours without data.
We are looking for an ANSI SQL solution so that it can be used on databases like Teradata.
I am thinking that I am using PARTITION BY wrongly, or maybe there is some other way. We tried CASE WHEN, but that needs some kind of looping to check back for an hour with some stock.
I've run into similar problems before. It's often simpler to make sure that the data you need somehow gets into the database in the first place. You might be able to automate it with a stored procedure that runs periodically.
Having said that, did you consider trying COALESCE() with a scalar subquery? (Or whatever similar function your dbms supports.) I'd try it myself and post the SQL, but I'm leaving for work in two minutes.
Haven't tried, but along the lines of what Mike said:
SELECT a.hour
, COALESCE( a.stock
, ( select b.stock
from tbl b
where b.hour=a.hour-1 )
) "stock"
FROM tbl a
Note: this will impact performance greatly.
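Note also that the subquery only reaches back a single hour, so it still returns NULL when the gap is wider (hour 8 in the example). A variant that scans back to the most recent hour with stock might look like this (same assumed names, and still expensive):
SELECT a.hour
, COALESCE( a.stock
, ( select b.stock
    from tbl b
    where b.hour = ( select max(c.hour)
                     from tbl c
                     where c.hour < a.hour
                       and c.stock is not null ) )
) "stock"
FROM tbl a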
Thanks for your responses. I have tried out a RECURSIVE VIEW for the above requirement and it is giving correct results (I am worried about the CPU usage for big tables, as it is recursive). So here is the stock table:
----------------------------------
Hour Location Stock
----------------------------------
6 2000 20
9 2000 24
----------------------------------
Then we have a view on this table which gives all 12 hours' data using a left outer join.
----------------------------------
Hour Location Stock
----------------------------------
6 2000 20
7 2000 NULL
8 2000 NULL
9 2000 24
----------------------------------
Then we have a recursive view which joins the table recursively with the same view, shifting each hour's stock one hour up at each step and recording the level at which the data arrives.
REPLACE RECURSIVE VIEW HOURLY_STOCK_VIEW
(HOUR_NUMBER,LOCATION, STOCK, LVL)
AS
(
SELECT
HOUR_NUMBER,
LOCATION,
STOCK,
1 AS LVL
FROM STOCK_VIEW_WITH_LEFT_OUTER_JOIN
UNION ALL
SELECT
STK.HOUR_NUMBER,
THE_VIEW.LOCATION,
THE_VIEW.STOCK,
LVL+1 AS LVL
FROM STOCK_VIEW_WITH_LEFT_OUTER_JOIN STK
JOIN
HOURLY_STOCK_VIEW THE_VIEW
ON THE_VIEW.HOUR_NUMBER = STK.HOUR_NUMBER -1
WHERE LVL <=12
)
;
You can observe that first we select from the left-outer-joined view, then we union it with that view joined onto the recursive view we are creating, recording the level at which each row's data arrives.
Then we select the data from this view with the minimum level.
SEL * FROM HOURLY_STOCK_VIEW
WHERE
(
HOUR_NUMBER,
LVL
)
IN
(
SEL
HOUR_NUMBER,
MIN(LVL)
FROM HOURLY_STOCK_VIEW
WHERE STOCK IS NOT NULL
GROUP BY 1
)
;
This is working fine and giving the result as
----------------------------------
Hour Location Stock
----------------------------------
6 2000 20
7 2000 20 -- same as hour 6 stock
8 2000 20 -- same as hour 6 stock
9 2000 24
10 2000 24
11 2000 24
12 2000 24
----------------------------------
I know this is going to take huge CPU on large tables to make the recursion work (we limit the recursion to 12 levels, as only 12 hours' data is needed and to stop it going into an infinite loop). But I thought somebody could use this for some kind of hierarchy building. I will look for some more responses from you guys on any other approaches. Thanks. You can have a look at recursive views for Teradata at the link below:
http://forums.teradata.com/forum/database/recursion-in-a-stored-procedure
The most common use of a view is the removal of complexity.
For example:
CREATE VIEW FEESTUDENT
AS
SELECT S.NAME,F.AMOUNT FROM STUDENT AS S
INNER JOIN FEEPAID AS F ON S.TKNO=F.TKNO
Now do a SELECT:
SELECT * FROM FEESTUDENT