Base HSQLDB Query Nested Sub-Query/Sub-Set Problems

I'm relatively new to databases and LibreOffice Base/HSQLDB. I have the latest Base installed.
My biking database has 1 table "BikeDate" with 4 fields as follows:
| RideID | RideDate | Bike | Miles |
|--------|----------|------|-------|
| 1      | 10/2/97  | Y22  | 15.6  |
... with say 620 entries
What I am trying to obtain is a comparison of the monthly rides over the last 17 years. My metrics are: the sum of the miles for each month across all the years, the average ride for each month across all the years, and then the same sum and average for each month of 2014 alone.
This is what I have so far:
SELECT MONTH( "RideDate" ) AS "MONTH", SUM( "Miles" ) AS "SUM of Miles", AVG( "Miles" ) AS "AVG of Miles",
( SELECT SUM( "Miles" ) FROM "BikeDate" WHERE ( YEAR( "RideDate" ) = '2014' ) ORDER BY MONTH( "RideDate" ) ) AS "Sum 2014",
( SELECT AVG( "Miles" ) FROM "BikeDate" WHERE ( YEAR( "RideDate" ) = '2014' ) ORDER BY MONTH( "RideDate" ) ) AS "Avg 2014"
FROM "BikeDate" AS "BikeDate"
GROUP BY MONTH( "RideDate" )
ORDER BY MONTH( "RideDate" ) ASC
With output of:
| MONTH | Sum of Miles | AVG of Miles | Sum 2014 | AVG 2014 |
|-------|--------------|--------------|----------|----------|
| 2     | 12.2         | 6.1          | 29       | 14.5     |
| 3     | 217.9        | 10.38        | 29       | 14.5     |
| 4     | 744.3        | 12.2         | 29       | 14.5     |
| 5     | 1316.3       | 17.55        | 29       | 14.5     |
| ...   |              |              |          |          |
| 12    | 70.2         | 11.7         | 29       | 14.5     |
First: Can this be done? Both 2014 columns for month 2 should be zero (we had 48" of snowfall for a current total of 338"). How can I get this to work? I'd like to stay in LO Base because it's free and currently installed.
Thanks, Dave.

This looks quite difficult to achieve with the HSQLDB 1.8 database that is bundled with LO Base. You can use the latest HSQLDB 2.3.x as an external database with LO. That version allows you to use the expression below for the 2014 columns:
AVG ("Miles") FILTER (WHERE YEAR("RideDate") = 2014)
You can use this and a similar expression for SUM alongside the SUM and AVG expressions in the main select. There won't be a need to use additional sub-select queries.
SELECT MONTH( "RideDate" ) AS "MONTH",
SUM( "Miles" ) AS "SUM of Miles",
AVG( "Miles" ) AS "AVG of Miles",
SUM( "Miles" ) FILTER (WHERE YEAR("RideDate") = 2014) AS "Sum 2014",
AVG( "Miles" ) FILTER (WHERE YEAR("RideDate") = 2014) AS "Avg 2014"
FROM "BikeDate" AS "BikeDate"
GROUP BY MONTH( "RideDate" )
ORDER BY MONTH( "RideDate" ) ASC
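If you would rather stay on the bundled HSQLDB 1.8, one commonly used substitute for FILTER is conditional aggregation with CASE. A sketch (not tested against the 1.8 dialect):
SELECT MONTH( "RideDate" ) AS "MONTH",
SUM( "Miles" ) AS "SUM of Miles",
AVG( "Miles" ) AS "AVG of Miles",
-- months with no 2014 rides yield 0 for the SUM and NULL for the AVG
SUM( CASE WHEN YEAR( "RideDate" ) = 2014 THEN "Miles" ELSE 0 END ) AS "Sum 2014",
AVG( CASE WHEN YEAR( "RideDate" ) = 2014 THEN "Miles" END ) AS "Avg 2014"
FROM "BikeDate" AS "BikeDate"
GROUP BY MONTH( "RideDate" )
ORDER BY MONTH( "RideDate" ) ASC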
Check http://www.oooforum.org and other resources for the "split database" approach to using the latest HSQLDB with LO.

Related

Filling null values in timeseries data (for weekends) with previous day values

I have a view with dates, stock name and daily stock prices for weekdays. This excludes data for Saturdays and Sundays.
I want to fill the data on Saturdays and Sundays with all the stock names and the corresponding stock prices from the previous day (Friday).
How can I run a SQL query to get the desired output?
Thank you for your help in resolving this query.
E.g.
Original data
Date       | Stock-Name | Stock-Price
-----------|------------|------------
2019/06/30 | null       | null
2019/06/29 | null       | null
2019/06/28 | Appl       | $200
2019/06/28 | Goog       | $1100
2019/06/28 | Tsla       | $300
2019/06/27 | Appl       | $210
2019/06/27 | Goog       | $1200
2019/06/27 | Tsla       | $200
Expected Output
Date | Stock Name | Stock Price
--------------------------------------
2019/06/30 | Appl | $200
2019/06/30 | Goog | $1100
2019/06/30 | Tsla | $300
2019/06/29 | Appl | $200
2019/06/29 | Goog | $1100
2019/06/29 | Tsla | $300
2019/06/28 | Appl | $200
2019/06/28 | Goog | $1100
2019/06/28 | Tsla | $300
2019/06/27 | Appl | $210
2019/06/27 | Goog | $1200
2019/06/27 | Tsla | $200
Use a cross join to generate the rows and a left join to bring in values.
To get the preceding value, you can use lag() with ignore nulls or some similar mechanism (not all databases support this functionality).
So:
select d.date, s.stockname,
       coalesce(t.price,
                lag(t.price ignore nulls) over (partition by s.stockname order by d.date)
               ) as price
from (select distinct date from t
     ) d cross join
     (select distinct stockname from t where stockname is not null
     ) s left join
     t
     on t.date = d.date and t.stockname = s.stockname
order by d.date desc, s.stockname;
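Postgres, for example, does not support ignore nulls in lag(). A commonly used workaround there is a two-step carry-forward: count the non-null prices seen so far to form groups, then take the single non-null price within each group. A sketch along the lines of the query above:
select date, stockname,
       max(price) over (partition by stockname, grp) as price
from (select d.date, s.stockname, t.price,
             -- count(t.price) increments only on non-null prices, so each
             -- group starts at a real price and spans the nulls after it
             count(t.price) over (partition by s.stockname order by d.date) as grp
      from (select distinct date from t
           ) d cross join
           (select distinct stockname from t where stockname is not null
           ) s left join
           t
           on t.date = d.date and t.stockname = s.stockname
     ) x
order by date desc, stockname;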
For Postgres:
The lead() window function lets you fetch values from the records that follow:
SELECT
    mydate,
    CASE
        WHEN date_part('dow', mydate) = 6 THEN lead(stock_name) OVER w
        WHEN date_part('dow', mydate) = 0 THEN lead(stock_name, 2) OVER w
        ELSE stock_name
    END AS stock_name,
    CASE
        WHEN date_part('dow', mydate) = 6 THEN lead(stock_price) OVER w
        WHEN date_part('dow', mydate) = 0 THEN lead(stock_price, 2) OVER w
        ELSE stock_price
    END AS stock_price
FROM
    stock
WINDOW w AS (ORDER BY mydate DESC)
This query checks the current weekday using the date_part() function. The parameter dow (day of week) returns 0 for Sundays and 6 for Saturdays.
If it is Saturday, lead() looks one record further; if it is Sunday, two records. Otherwise the original data is taken.
Note: "Date" is not a really good column name because it is a reserved keyword in Postgres. To avoid problems you should adjust the names a little bit.

Postgres - How would I generate a daily budget by day from a given overall budget and start date/end date

For example, if I had the following table:
campaign | budget | start_date | end_date
-----------|------------|-------------|------------
Microsoft | 25400 | 2018-04-01 | 2018-06-30
VMWare | 12340 | 2018-04-01 | 2018-06-01
How would I use the start date and end date to get the number of days the campaign will be active, so that I may divide the overall budget by that number to get the daily budget? I then want to fit that into a date series between 2018-04-01 and 2018-06-30.
I would hope to get a table like this:
date | daily_budget
-----------|--------------
2018-04-01 | 486
2018-04-02 | 486
2018-04-03 | 486
(...)
2018-06-29 | 282
2018-06-30 | 282
To get all dates in a range, you would probably use a recursive CTE. To get days in an interval, you can just subtract, and to sum the different campaigns, just group by date. This SQL Fiddle seems to be what you're looking for (code below).
WITH RECURSIVE calendar(
campaign
, budget_date
, end_date
) AS
(
SELECT
campaign
, start_date budget_date
, end_date
FROM
budgetTotal
UNION ALL
SELECT
c.campaign
, c.budget_date + 1
, c.end_date
FROM
calendar c
WHERE
c.budget_date < c.end_date
)
SELECT
c.budget_date
, ROUND(
SUM(bt.budget / (bt.end_date - bt.start_date))
, 2
) dailyBudget
FROM
budgetTotal bt
JOIN
calendar c
ON
c.budget_date BETWEEN bt.start_date AND bt.end_date
AND
c.campaign = bt.campaign
GROUP BY
c.budget_date
ORDER BY
c.budget_date
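In Postgres itself, generate_series() is an alternative to the recursive CTE for producing one row per campaign and day. A sketch against the same budgetTotal table:
SELECT
d::date budget_date
-- same per-day budget and rounding as above
, ROUND(
SUM(bt.budget / (bt.end_date - bt.start_date))
, 2
) dailyBudget
FROM
budgetTotal bt
CROSS JOIN LATERAL
generate_series(bt.start_date, bt.end_date, interval '1 day') d
GROUP BY
d::date
ORDER BY
d::date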

How do I group by month when I have data in a time range, accurate up to the second?

I'd like to ask if there's a way to group my data by months in this case:
I have table of orders, with order Ids in a column and the dates the orders were created in another.
For example,
orderId | creationDate
58111 | 2017-01-01 00:00:00
58111 | 2017-01-12 00:00:00
58232 | 2017-01-31 00:00:00
62882 | 2017-02-21 00:00:00
90299 | 2017-03-20 00:00:00
I need to find the number of unique orderIds, grouped by month. Normally this would be simple, but with my creationDates accurate to the second, I have no idea how to segment them into months. Ideally, this is what I'd obtain:
creationMonth | count_orderId
January | 2
February | 1
March | 1
Try this:
select count( distinct orderId ), year( creationDate ), month( creationDate )
from my_table
group by year( creationDate ), month( creationDate )
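The expected output shows month names rather than numbers. Assuming MySQL (which the year()/month() syntax suggests), monthname() can supply them:
select monthname( creationDate ) as creationMonth,
       count( distinct orderId ) as count_orderId
from my_table
group by year( creationDate ), month( creationDate ), monthname( creationDate )
order by year( creationDate ), month( creationDate )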

Max and Min group by 2 columns

I have the following table that shows me every time a car has its tank filled. It returns the date, the car id, the mileage at that time and the liters filled:
| Date | Vehicle_ID | Mileage | Liters |
| 2016-10-20 | 234 | 123456 | 100 |
| 2016-10-20 | 345 | 458456 | 215 |
| 2016-10-20 | 323 | 756456 | 265 |
| 2016-10-25 | 234 | 123800 | 32 |
| 2016-10-26 | 345 | 459000 | 15 |
| 2016-10-26 | 323 | 756796 | 46 |
The idea is to calculate the average consumption by month (I can't do it by day because not every car fills the tank every day).
To get that, I tried to get (max(mileage)-min(mileage))/sum(liters) grouped by month. But this will only work for 1 specific car and 1 specific month.
If I try 1 specific car over several months, the max and min are not returned properly. If I add all the cars, it is even worse, as the max and min are taken as if every car were the same.
select convert(char(7), Date, 127) as year_month,
sum("Liters tanked")/(max("Mileage")-min("Mileage"))*100 as Litres_per_100KM
from Tanking
where convert(varchar(10),"Date",23) >= DATEADD(mm, -5, GETDATE())
group by convert(char(7), Date, 127)
This will not work as it takes the max and min over all the cars.
The "workflow" should be this:
- For each month, get the max and min mileage for each car; max - min gives the mileage it rode that month.
- Sum that mileage for each car to get the total mileage driven by all the cars.
- Sum the liters tanked.
- Divide the total liters by the total mileage.
How can I get the result:
| YearMonth | Average |
| 2016-06 | 30 |
| 2016-07 | 32 |
| 2016-08 | 46 |
| 2016-09 | 34 |
This is a more complicated problem than it seems. The problem is that you don't want to lose miles between months. It is tempting to do something like this:
select year(date), month(date),
sum(liters) / (max(mileage) - min(mileage))
from Tanking
where Date >= dateadd(month, -5, getdate())
group by year(date), month(date);
However, this misses miles and liters that span month boundaries. In addition, the liters on the first record of the month belong to the previous mileage difference. Oops! That is not correct.
One way to fix this is to look up the next values. The query looks something like this:
select year(date), month(date),
sum(next_liters) / (max(next_mileage) - min(mileage))
from (select t.*,
lead(date) over (partition by vehicle_id order by date) as next_date,
lead(mileage) over (partition by vehicle_id order by date) as next_mileage,
lead(liters) over (partition by vehicle_id order by date) as next_liters
from Tanking t
) t
where Date >= dateadd(month, -5, getdate())
group by year(date), month(date);
These queries use simplified column names, so escape characters don't interfere with the logic.
EDIT:
Oh, you have multiple cars (probably what vehicle_Id is there for). You want two levels of aggregation. The first query would look like:
select yyyy, mm, sum(liters) as liters, sum(mileage_diff) as mileage_diff,
sum(mileage_diff) / sum(liters) as mileage_per_liter
from (select vehicle_id, year(date) as yyyy, month(date) as mm,
sum(liters) as liters,
(max(mileage) - min(mileage)) as mileage_diff
from Tanking
where Date >= dateadd(month, -5, getdate())
group by vehicle_id, year(date), month(date)
) t
group by yyyy, mm;
Similar changes to the second query (with vehicle_id in the partition by clauses) would work for the second version.
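For completeness, a sketch of those changes applied to the second query (per-vehicle sums first, then the monthly totals):
select yyyy, mm, sum(mileage_diff) / sum(liters) as mileage_per_liter
from (select vehicle_id, year(date) as yyyy, month(date) as mm,
             sum(next_liters) as liters,
             max(next_mileage) - min(mileage) as mileage_diff
      from (select t.*,
                   lead(mileage) over (partition by vehicle_id order by date) as next_mileage,
                   lead(liters) over (partition by vehicle_id order by date) as next_liters
            from Tanking t
           ) t
      where date >= dateadd(month, -5, getdate())
      group by vehicle_id, year(date), month(date)
     ) t
group by yyyy, mm;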
Try to get the sums per car per month in a subquery. Then calculate the average per month in an outer query using the values of the subquery:
select year_month,
(1.0*sum(liters_per_car)/sum(mileage_per_car))*100.0 as Litres_per_100KM
from (
select convert(char(7), [Date], 127) as year_month,
sum(Liters) as liters_per_car,
max(Mileage)-min(Mileage) as mileage_per_car
from Tanking
group by convert(char(7), [Date], 127), Vehicle_ID) as t
group by year_month
You can use a CTE to get the mileage difference between consecutive fill-ups and then calculate the consumption:
You can check it here: http://rextester.com/OKZO55169
with cte (car, datec, difm, liters)
as
(
select
car,
datec,
mileage - lag(mileage,1,mileage) over(partition by car order by car, mileage) as difm,
liters
from #consum
)
select
car,
year(datec) as [year],
month(datec) as [month],
((cast(sum(liters) as float)/cast(sum(difm) as float)) * 100.0) as [l_100km]
from
cte
group by
car, year(datec), month(datec)

Rolling counts based on rolling cohorts

Using Postgres 9.5. Test data:
create temp table rental (
customer_id smallint
,rental_date timestamp without time zone
,customer_name text
);
insert into rental values
(1, '2006-05-01', 'james'),
(1, '2006-06-01', 'james'),
(1, '2006-07-01', 'james'),
(1, '2006-07-02', 'james'),
(2, '2006-05-02', 'jacinta'),
(2, '2006-05-03', 'jacinta'),
(3, '2006-05-04', 'juliet'),
(3, '2006-07-01', 'juliet'),
(4, '2006-05-03', 'julia'),
(4, '2006-06-01', 'julia'),
(5, '2006-05-05', 'john'),
(5, '2006-06-01', 'john'),
(5, '2006-07-01', 'john'),
(6, '2006-07-01', 'jacob'),
(7, '2006-07-02', 'jasmine'),
(7, '2006-07-04', 'jasmine');
I am trying to understand the behaviour of existing customers. I am trying to answer this question:
What is the likelihood of a customer ordering again, based on when their last order was (current month, previous month (m-1), ... down to m-12)?
Likelihood is calculated as:
distinct count of people who ordered in current month /
distinct count of people in their cohort.
Thus, I need to generate a table that lists a count of the people who ordered in the current month, who belong in a given cohort.
Thus, what are the rules for being in a cohort?
- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc
I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/
Here is an example of cohort rules and aggregations, based on July being the
"month's orders being analysed" (please notice: the "month's orders being analysed" column is the first column in the 'Desired output' table below):
customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james | 1 1 | 1 | 1 | <- member of jul cohort, made order in jul
jasmine | 1 1 | | | <- member of jul cohort, made order in jul
jacob | 1 | | | <- member of jul cohort, did NOT make order in jul
john | 1 | 1 | 1 | <- member of jun cohort, made order in jul
julia | | 1 | 1 | <- member of jun cohort, did NOT make order in jul
juliet | 1 | | 1 | <- member of may cohort, made order in jul
jacinta | | | 1 1 | <- member of may cohort, did NOT make order in jul
This data would output the following table:
--where m = month's orders being analysed
month's orders |how many people |how many people from |how many people |how many people from |how many people |how many people from |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16 |5 |1 | | | | |
jun-16 | | |5 |3 | | |
jul-16 |3 |2 |2 |1 |2 |1 |
My attempts so far have been on variations of:
generate_series()
and
row_number() over (partition by customer_id order by rental_id desc)
I haven't been able to get everything to come together yet (I've tried for many hours and haven't yet solved it).
For readability, I think posting my work in parts is better (if anyone wants me to post the sql query in its entirety please comment - and I'll add it).
series query:
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
rental) as series
rank query:
(select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked
I want to do something like: run the orders_ranked query for every row returned by the series query, and then base aggregations on each return of orders_ranked.
Something like:
(--this query counts the customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occured <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,
(--this query counts the customers in cohort m-1 who ordered in month m
select
count(distinct customer_id)
from
(--this query returns the orders by customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occured <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
rental) as series
This query does everything. It operates on the whole table and works for any time range.
Based on some assumptions and assuming current Postgres version 9.5. Should work with pg 9.1 at least. Since your definition of "cohort" is unclear to me, I skipped the "how many people in cohort" columns.
I would expect it to be faster than anything you tried so far. By orders of magnitude.
SELECT *
FROM crosstab (
$$
SELECT mon
, sum(count(*)) OVER (PARTITION BY mon)::int AS m0
, gap -- count of months since last order
, count(*) AS gap_ct
FROM (
SELECT mon
, mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
FROM (
SELECT DISTINCT ON (1,2)
date_trunc('month', rental_date)::date AS mon
, customer_id AS c_id
, extract(YEAR FROM rental_date)::int * 12
+ extract(MONTH FROM rental_date)::int AS mon_int
FROM rental
) dist_customer
) gap_to_last_month
GROUP BY mon, gap
ORDER BY mon, gap
$$
, 'SELECT generate_series(1,12)'
) ct (mon date, m0 int
, m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
, m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);
Result:
mon | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
2015-01-01 | 63 | 36 | 15 | 5 | 3 | 3 | | | | | | |
2015-02-01 | 56 | 35 | 9 | 9 | 2 | | 1 | | | | | |
...
m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before and no order in between
etc.
How?
In subquery dist_customer reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON:
Select first row in each GROUP BY group?
To simplify later calculations add a count of months for the date (mon_int). Related:
How do you do date math that ignores the year?
If there are many orders per (month, customer), there are faster query techniques for the first step:
Optimize GROUP BY query to retrieve latest record per user
In subquery gap_to_last_month add the column gap indicating the time gap between this month and the last month with any orders of the same customer. Using the window function lag() for this. Related:
PostgreSQL window function: partition by comparison
In the outer SELECT aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for this month m0.
Feed this query to crosstab() to pivot the result into the desired tabular form for the result. Basics:
PostgreSQL Crosstab Query
About the "extra" column m0:
Pivot on Multiple Columns using Tablefunc
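Note that crosstab() is not part of core Postgres; it ships with the additional module tablefunc, which has to be installed once per database:
CREATE EXTENSION IF NOT EXISTS tablefunc;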