How to calculate the percentage of the number of entries in one table w.r.t. the number of entries in another table (PostgreSQL)

I have two tables (A and B) in my database (C). Both tables contain two columns, as shown below:
column name  | datatype
-------------+-----------------------------
eventtime    | timestamp without time zone
serialnumber | numeric
Table A contains the total number of products (good and downgraded, although in this table they are not flagged as affected/downgraded) produced on each day, and table B contains only the downgraded products.
I want to make a quality process control chart using the percentage of downgraded products w.r.t. the total number of products produced (using serialnumber to join, for example).
Could someone tell me how I can get the percentage value for each day (and also for each hour)?

Use date_trunc() to group rows by the desired period, e.g.:
select
    a.r_date::date as date,
    downgraded,
    total,
    round(downgraded::numeric / total * 100, 2) as percentage
from (
    select date_trunc('day', eventtime) as r_date, count(*) as downgraded
    from b
    group by 1
) b
join (
    select date_trunc('day', eventtime) as r_date, count(*) as total
    from a
    group by 1
) a
using (r_date)
order by 1;
    date    | downgraded | total | percentage
------------+------------+-------+------------
 2015-05-05 |          3 |     4 |      75.00
 2015-05-06 |          1 |     4 |      25.00
(2 rows)
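For the per-hour numbers, the same query works with the truncation unit changed from 'day' to 'hour'; a minimal sketch along the lines of the answer above:
select
    a.r_date as hour_start,
    round(downgraded::numeric / total * 100, 2) as percentage
from (
    select date_trunc('hour', eventtime) as r_date, count(*) as downgraded
    from b
    group by 1
) b
join (
    select date_trunc('hour', eventtime) as r_date, count(*) as total
    from a
    group by 1
) a
using (r_date)
order by 1;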

Related

Monthly average without aggregating on monthly level

The table below aggregates order_value using data from the last 6 months. Without creating an additional subquery, I would like to calculate the monthly average of all orders over the last 6 months (see the desired output table below).
Table
customer_id | order_timestamp     | order_value
A           | 2020-01-01 10:30:51 | 100
Desired output table
customer_id | sum_orders | monthly_avg_orders
A           | 100.000    | 16.666
This is what I already have
select
customer_id,
sum(order_value)
from table
group by customer_id
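Given the desired output (16.666 = 100 / 6), the monthly average can be computed in the same select list without an additional subquery. A minimal sketch, assuming the 6-month window is fixed:
select
    customer_id,
    sum(order_value) as sum_orders,
    sum(order_value) / 6.0 as monthly_avg_orders   -- fixed 6-month window
    -- or, to divide by the number of months actually present in the data:
    -- sum(order_value) / count(distinct date_trunc('month', order_timestamp)) as monthly_avg_orders
from table
group by customer_id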

Apply SUM( where date between date1 and date2)

My table is currently looking like this:
+---------+---------------+------------+------------------+
| Segment | Product | Pre_Date | ON_Prepaid |
+---------+---------------+------------+------------------+
| RB | 01. Auto Loan | 2020-01-01 | 10645976180.0000 |
| RB | 01. Auto Loan | 2020-01-02 | 4489547174.0000 |
| RB | 01. Auto Loan | 2020-01-03 | 1853117000.0000 |
| RB | 01. Auto Loan | 2020-01-04 | 9350258448.0000 |
+---------+---------------+------------+------------------+
I'm trying to sum values of 'ON_Prepaid' over the course of 7 days, let's say from '2020-01-01' to '2020-01-07'.
Here is what I've tried
drop table if exists ##Prepay_summary_cash
select *,
[1W_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 1 following and 7 following),
[2W_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 8 following and 14 following),
[3W_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 15 following and 21 following),
[1M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 22 following and 30 following),
[1.5M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 31 following and 45 following),
[2M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 46 following and 60 following),
[3M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 61 following and 90 following),
[6M_Prepaid] = sum(ON_Prepaid) over (partition by SEGMENT, PRODUCT order by PRE_DATE rows between 91 following and 181 following)
into ##Prepay_summary_cash
from ##Prepay1
Things should be fine if the dates are continuous; however, there are some missing days in 'Pre_Date' (you know banks don't work on Sundays, etc.).
So I'm trying to work on something like
[1W] = SUM(ON_Prepaid) over (where Pre_date between dateadd(d,1,Pre_date) and dateadd(d,7,Pre_date))
something like that. So if, say, there's no record on 2020-01-05, the result should only sum the dates on the 1st, 2nd, 3rd, 4th, 6th, and 7th of Jan 2020, instead of 1,2,3,4,6,7,8 (the 8th creeps in because of "rows 7 following"). Or, for example, if records are missing over a span of 30 days, those 30 days should be counted as 0s, so a 45-day window should return only the value of the 15 days that exist.
I've tried looking all over the forum and the answers did not suffice. Can you please help me out, or link me to a thread where the problem has already been solved?
Thank you so much.
Things should be fine if the dates are continuous
Then make them continuous. Left join your real data (grouped up so it is one row per day) onto your calendar table (make one, or use a recursive CTE to generate a list of 360 dates from X hence) and your query will work out:
WITH d as
(
    SELECT *
    FROM
    (
        SELECT *
        FROM cal
        CROSS JOIN
        (SELECT DISTINCT segment s, product p FROM ##Prepay1) x
    ) c
    LEFT JOIN ##Prepay1 p
    ON
        c.d = p.pre_date AND
        c.s = p.segment AND
        c.p = p.product
    WHERE
        c.d BETWEEN '2020-01-01' AND '2021-01-01' -- date range on c.d, not p.pre_date
)
--use d.d/s/p, not d.pre_date/segment/product, in your query (sometimes the latter are null)
select *,
    [1W_Prepaid] = sum(ON_Prepaid) over (partition by s, p order by d.d rows between 1 following and 7 following),
    ...
CAL is just a table with a single column of dates, one per day, no time, extending for n thousand days into the past/future
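If you don't have such a table, a recursive CTE can build one; a minimal sketch in T-SQL, assuming the date column is called d as used above and the range bounds are hypothetical:
WITH cal AS
(
    SELECT CAST('2020-01-01' AS date) AS d   -- range start (pick your own)
    UNION ALL
    SELECT DATEADD(d, 1, d)
    FROM cal
    WHERE d < '2021-01-01'                   -- range end
)
SELECT d
INTO #cal
FROM cal
OPTION (MAXRECURSION 0);                     -- lift the default 100-level recursion cap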
I wish to note that months have a variable number of days, so 6M is a bit of a misnomer... it might be better to call the month-based ones 180D, 90D, etc.
Also want to point out that your query performs a per-row division of your data into groups. If you want to perform sums up to 180 days after the date of a row, you need to pull a year's worth of data, so that on a June row you have the December data available to sum (December being 6 months from June).
If you then want to restrict your query to only showing up to June (but including data summed from 6 months after June) you need to wrap it all again in a sub query. You cannot "where between jan and jun" in the query that does the sum over because where clauses are done before window clauses (doing so will remove the dec data before it is summed)
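A sketch of that wrapping, with hypothetical bounds, using the CTE d from above:
SELECT *
FROM
(
    SELECT *,
        [1W_Prepaid] = SUM(ON_Prepaid) OVER (PARTITION BY s, p ORDER BY d.d
                                             ROWS BETWEEN 1 FOLLOWING AND 7 FOLLOWING)
    FROM d
) q
WHERE q.d <= '2020-06-30'   -- restricts the rows shown, after the sums were computed over the full range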
Some other databases make this easier; Oracle and Postgres spring to mind: they can perform a sum over a range where the other rows' values are within some distance of the current row's value. SQL Server only usefully supports distancing based on a row's index rather than its values (the distancing-based-on-values support is limited to "rows that have the same value", rather than "rows that have values n higher or lower than the current row"). I suppose the requirement could be met with a cross apply, or a correlated sub in the select, though I'd be careful to check the performance:
SELECT *,
    (SELECT SUM(tt.a)
     FROM x tt
     WHERE tt.x = t.x AND tt.y = t.y
       AND tt.z BETWEEN DATEADD(d, 1, t.z) AND DATEADD(d, 7, t.z)) AS [1W]
FROM x t
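The cross apply route mentioned above might look like the following; a sketch against the same hypothetical table x with key columns x, y and date column z, not tested for performance:
SELECT t.*, w.[1W]
FROM x t
CROSS APPLY
(
    -- the aggregate subquery always returns one row (SUM over an empty set is NULL),
    -- so CROSS APPLY keeps every row of t
    SELECT SUM(tt.a) AS [1W]
    FROM x tt
    WHERE tt.x = t.x AND tt.y = t.y
      AND tt.z BETWEEN DATEADD(d, 1, t.z) AND DATEADD(d, 7, t.z)
) w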

SQL Joining One Table to a Selection of Rows from Second Table that Contains a Max Value per Group

I have a table of Cases with info like the following -
ID | CaseName | Date       | Occupation
11 | John     | 2020-01-01 | Joiner
12 | Mark     | 2019-10-10 | Mechanic
And a table of Financial information like the following -
ID | CaseID | Date       | Value
1  | 11     | 2020-01-01 | 1,000
2  | 11     | 2020-02-03 | 2,000
3  | 12     | 2019-10-10 | 3,000
4  | 12     | 2019-12-25 | 4,000
What I need to produce is a list of Cases including details of the most recent Financial value, for example -
ID | CaseName | Occupation | Latest Value
11 | John     | Joiner     | 2,000
12 | Mark     | Mechanic   | 4,000
Now I can join my tables easily enough with -
SELECT *
FROM Cases AS c
LEFT JOIN Financial AS f ON f.CaseID = c.ID
And I can find the most recent date per case from the financial table with -
SELECT CaseID, MAX(Date) AS LastDate
FROM Financial
GROUP BY CaseID
But I am struggling to find a way to bring these two together to produce the required results as per the table set out above.
A simple method is window functions:
SELECT *
FROM Cases c
LEFT JOIN
     (SELECT f.*, MAX(date) OVER (PARTITION BY CaseId) AS max_date
      FROM Financial f
     ) f
     ON f.CaseID = c.ID AND f.max_date = f.date;
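If ties on the max date could ever produce two rows per case, a ROW_NUMBER() variant of the same idea picks exactly one; a sketch, not from the original answer:
SELECT c.ID, c.CaseName, c.Occupation, f.Value AS LatestValue
FROM Cases c
LEFT JOIN
     (SELECT f.*, ROW_NUMBER() OVER (PARTITION BY CaseID ORDER BY Date DESC) AS rn
      FROM Financial f
     ) f
     ON f.CaseID = c.ID AND f.rn = 1;   -- rn = 1 is the most recent row per case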

How to get the count of distinct values until a time period Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there way to easily do this in a single query?
Not 100% sure if this will work on Impala, but if you have a days table, or a way to create a derived table on the fly in Impala:
CREATE TABLE days ("DayC" int);
INSERT INTO days
("DayC")
VALUES (1), (2), (3);
OR
CREATE TABLE days AS
SELECT DISTINCT "Day"
FROM sales
You can use this query (SqlFiddleDemo in PostgreSQL):
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
OUTPUT
| DayC | count |
|------|-------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
UPDATED VERSION
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
CROSS JOIN (SELECT DISTINCT "Day" AS "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"
try this one (note: it gives the distinct count per day, not the cumulative count up to each day):
select day, count(distinct id) from yourtable group by day
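For the running total the question actually asks for, another option is to count each ID only on its first day and take a running sum; a sketch assuming the same sales table (analytic functions are available in both Impala and PostgreSQL):
SELECT first_day AS day,
       SUM(COUNT(*)) OVER (ORDER BY first_day) AS distinct_ids_so_far
FROM (
    -- each distinct ID contributes once, on the first day it appears
    SELECT id, MIN(day) AS first_day
    FROM sales
    GROUP BY id
) t
GROUP BY first_day;
-- note: days on which no new IDs appear are absent from this output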

How to create a pivot table by product by month in SQL

I have 3 tables:
users (id, account_balance)
grocery (user_id, date, amount_paid)
fishmarket (user_id, date, amount_paid)
Both fishmarket and grocery tables may have multiple occurrences for the same user_id with different dates and amounts paid or have nothing at all for any given user. I am trying to develop a pivot table of the following structure:
id | grocery_amount_paid_January | fishmarket_amount_paid_January
1  | 10                          | NULL
2  | 40                          | 71
The only idea I can come up with is to create multiple left joins, but this seems wrong, since there would be 24 joins (one per month, per product). Is there a better way?
I have provided a lot of answers on crosstab queries in PostgreSQL lately. Sometimes a "plain" query like the following does the job:
WITH x AS (
    SELECT '2012-01-01'::date AS _from
          ,'2012-12-01'::date AS _to   -- provide date range once in CTE
)
SELECT u.id
      ,to_char(m.mon, 'MM.YYYY') AS month_year
      ,g.amount_paid AS grocery_amount_paid
      ,f.amount_paid AS fishmarket_amount_paid
FROM   users u
CROSS  JOIN (SELECT generate_series(_from, _to, '1 month') AS mon FROM x) m
LEFT   JOIN (
       SELECT user_id
             ,date_trunc('month', date) AS mon
             ,sum(amount_paid) AS amount_paid
       FROM   x, grocery   -- CROSS JOIN with a single row
       WHERE  date >= _from
       AND    date <  (_to + interval '1 month')
       GROUP  BY 1, 2
       ) g ON g.user_id = u.id AND m.mon = g.mon
LEFT   JOIN (
       SELECT user_id
             ,date_trunc('month', date) AS mon
             ,sum(amount_paid) AS amount_paid
       FROM   x, fishmarket
       WHERE  date >= _from
       AND    date <  (_to + interval '1 month')
       GROUP  BY 1, 2
       ) f ON f.user_id = u.id AND m.mon = f.mon
ORDER  BY u.id, m.mon;
produces this output:
 id | month_year | grocery_amount_paid | fishmarket_amount_paid
----+------------+---------------------+------------------------
  1 | 01.2012    |                  10 |                   NULL
  1 | 02.2012    |                NULL |                     65
  1 | 03.2012    |                  98 |                     13
 ...
  2 | 01.2012    |                  40 |                     71
  2 | 02.2012    |                NULL |                   NULL
Major points
The first CTE is for convenience only, so you have to type your date range only once. You can use any date range, as long as the dates fall on the first of a month (the rest of the month will be included!). You could add date_trunc() to it, but I guess you can keep the urge to use invalid dates in check.
First CROSS JOIN users to the result of generate_series() (m) which provides one row per month in your date range. You have learned in your last question how that results in multiple rows per user.
The two subqueries are identical twins. Use WHERE clauses that operate on the base column, so it can utilize an index - which you should have if your table runs over many years (no use for only one or two years, a sequential scan will be faster):
CREATE INDEX grocery_date ON grocery (date);
Then reduce all dates to the first of the month with date_trunc() and sum amount_paid per user_id and the resulting mon.
LEFT JOIN the result to the base table, again by user_id and the resulting mon. This way, rows are neither multiplied nor dropped. You get one row per user_id and month. Voilà.
BTW, I'd never use a column name id. Call it user_id in the table users as well.
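If the wide, one-row-per-id layout from the question is required, the long result above can be folded with conditional aggregation; a sketch for the January columns only, with a single stand-in row in the CTE where the full query above would go (this part is not from the original answer):
WITH long_form AS (
    -- replace this stand-in row with the full query above
    SELECT 1 AS id, '01.2012' AS month_year,
           10 AS grocery_amount_paid, NULL::numeric AS fishmarket_amount_paid
)
SELECT id
      ,sum(CASE WHEN month_year = '01.2012' THEN grocery_amount_paid    END) AS grocery_amount_paid_january
      ,sum(CASE WHEN month_year = '01.2012' THEN fishmarket_amount_paid END) AS fishmarket_amount_paid_january
FROM   long_form
GROUP  BY id
ORDER  BY id;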