Postgres: need to get count of rows by uniqueness - SQL

I have a simple table that has lat, long, and time. Basically, I want the result of my query to give me something like this:
lat,long,hourwindow,count
I can't seem to figure out how to do this. I've tried so many things I can't keep them straight. Here's what I've got so far:
WITH all_lat_long_by_time AS (
    SELECT
        trunc(cast(lat AS NUMERIC), 4) AS lat,
        trunc(cast(long AS NUMERIC), 4) AS long,
        date_trunc('hour', time :: TIMESTAMP WITHOUT TIME ZONE) AS hourWindow
    FROM my_table
),
unique_lat_long_by_time AS (
    SELECT DISTINCT * FROM all_lat_long_by_time
),
all_with_counts AS (
    -- what do I do here?
)
SELECT * FROM all_with_counts;

I think this is a pretty basic aggregation query:
SELECT date_trunc('hour', time :: TIMESTAMP WITHOUT TIME ZONE) AS hourWindow,
       trunc(cast(lat AS NUMERIC), 4) AS lat,
       trunc(cast(long AS NUMERIC), 4) AS long,
       COUNT(*)
FROM my_table
GROUP BY hourWindow, trunc(cast(lat AS NUMERIC), 4), trunc(cast(long AS NUMERIC), 4)
ORDER BY hourWindow

If "count of rows by uniqueness" is meant to count distinct coordinates per hour (after truncating the numbers), count(DISTINCT (lat,long)) does the job:
SELECT date_trunc('hour', time::timestamp) AS hour_window
, count(DISTINCT (trunc( lat::numeric, 4)
, trunc(long::numeric, 4))) AS count_distinct_coordinates
FROM tbl
GROUP BY 1
ORDER BY 1;
Details in the manual here.
(lat,long) is a ROW value and short for ROW(lat,long). More here.
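As a quick standalone illustration (not tied to the table above), ROW values compare field by field:

```sql
-- These two comparisons are equivalent; both yield true.
SELECT (1, 2) = ROW(1, 2) AS same_pair;
-- So DISTINCT (lat, long) treats each coordinate pair as one composite value.
```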
But count(DISTINCT ...) is typically slow; a subquery should be faster for your case:
SELECT hour_window, count(*) AS count_distinct_coordinates
FROM (
SELECT date_trunc('hour', time::timestamp) AS hour_window
, trunc( lat::numeric, 4) AS lat
, trunc(long::numeric, 4) AS long
FROM tbl
GROUP BY 1, 2, 3
) sub
GROUP BY 1
ORDER BY 1;
Or:
SELECT hour_window, count(*) AS count_distinct_coordinates
FROM (
SELECT DISTINCT
date_trunc('hour', time::timestamp) AS hour_window
, trunc( lat::numeric, 4) AS lat
, trunc(long::numeric, 4) AS long
FROM tbl
) sub
GROUP BY 1
ORDER BY 1;
After the subquery folds duplicates, the outer SELECT can use a plain count(*).

Related

greenplum string_agg conversion into hivesql supported

We are migrating a Greenplum SQL query to HiveQL, and it uses string_agg (see the statement below). How do we migrate it? Kindly help us. Below is the sample Greenplum code that needs to be migrated to Hive.
select string_agg(Display_String, ';' order by data_day )
from
(
select data_day,
sum(revenue)/1000000.00 as revenue,
data_day||' '||trim(to_char(sum(revenue),'9,999,999,999')) as Display_String
from(
select case when data_date = current_date then 'D:'
when data_date = current_date - 1 then ' D-01:'
when data_date = current_date - 2 then ' D-02:'
when data_date = current_date - 7 then ' D-07:'
when data_date = current_date - 28 then ' D-28:'
end data_day, revenue/1000000.00 revenue
from test.testable
where data_date between current_date - 28 and current_date
and hour <= (select hour
             from (select row_number() over (order by hour desc) iRowsID, hour
                   from test.testable
                   where data_date = current_date and type = 'UVC') tbl1
             where iRowsID = 2)
and type in ('UVC')
order by 1 desc) a
group by 1)aa;
There is nothing like this in Hive. However, you can use collect_list() with partition by/order by to calculate it.
select concat_ws(';', max(concat_str))
from (
  select collect_list(Display_String) over (order by data_day) as concat_str
  from (your above SQL) s
) concat_qry;
Explanation:
collect_list concatenates the strings, and the order by in the window orders the data on the day column.
The outermost max() picks up the longest (complete) concatenated string.
Please note this is a very slow operation. Test performance before implementing it.
Here is a sample SQL and result to help you.
select id, concat_ws(';', max(concat_str))
from (
  select s.id,
         collect_list(s.c) over (partition by s.id order by s.c) as concat_str
  from (
    select 1 id, 'ax' c union
    select 1, 'b'        union
    select 2, 'f'        union
    select 2, 'g'        union all
    select 1, 'b'        union all
    select 1, 'b'
  ) s
) gs
group by id

Pro Rata in BigQuery [duplicate]

This question already has answers here:
SQL equally distribute value by rows
(2 answers)
Do loop in BigQuery
(1 answer)
Closed 1 year ago.
I want to pro rate a table with date_start, date_end, and spend columns into a table with one row per day: essentially I want to create rows for the days between date_start and date_end and then divide spend by how many days there are.
I am currently using the query below to do this, using BigQuery scripting - I know this is probably a horrible way of querying, but I'm not sure how else to do it. It takes about 30 seconds to run for just 3 rows.
DECLARE i INT64 DEFAULT 1;
DECLARE n int64;
SET n = (SELECT COUNT(*) FROM `pro_rata_test.data`);
DELETE FROM `pro_rata_test.pro_rata` WHERE TRUE;
WHILE i <= n DO
INSERT INTO
pro_rata_test.pro_rata
SELECT
day,
country,
campaign,
other,
SUM(spend)/(
SELECT
DATETIME_DIFF(DATETIME(TIMESTAMP(date_end)),
DATETIME(TIMESTAMP(date_start)),
DAY) + 1
FROM (
SELECT *, ROW_NUMBER() OVER(ORDER BY date_start) AS rn FROM `pro_rata_test.data`)
WHERE
rn = i) AS spend
FROM (
SELECT *, ROW_NUMBER() OVER(ORDER BY date_start) AS rn FROM `pro_rata_test.data`),
UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) day
WHERE
rn = i
GROUP BY
day,
country,
campaign,
other
ORDER BY
day;
SET
i = i + 1;
END WHILE;
Try generate_date_array and unnest:
with mytable as (
select date '2021-01-01' as date_start, date '2021-01-10' as date_end, 100 as spend, 'FR' as country, 'Campaign1' as campaign, 'test1' as Other union all
select date '2021-01-11', date '2021-02-27', 150, 'UK', 'Campaign1', 'test2' union all
select date '2021-03-20', date '2021-04-20', 500, 'UK', 'Campaign2', 'test2'
)
select
day,
country,
campaign,
other,
spend/(date_diff(date_end, date_start, day)+1) as spend
from mytable, unnest(generate_date_array(date_start, date_end)) as day
order by day

Multiple SELECTS and a CTE table

I have this statement, which returns values for the dates that exist in the table; the CTE then just fills in the half-hourly intervals.
with cte (reading_date) as (
select date '2020-11-17' from dual
union all
select reading_date + interval '30' minute
from cte
where reading_date + interval '30' minute < date '2020-11-19'
)
select c.reading_date, d.reading_value
from cte c
left join dcm_reading d on d.reading_date = c.reading_date
order by c.reading_date
However, later on I needed to use a SELECT within a SELECT, like this:
SELECT serial_number,
register,
reading_date,
reading_value,
ABS(A_plus)
FROM
(
SELECT
serial_number,
register,
TO_DATE(reading_date, 'DD-MON-YYYY HH24:MI:SS') AS reading_date,
reading_value,
LAG(reading_value,1, 0) OVER(ORDER BY reading_date) AS previous_read,
LAG(reading_value, 1, 0) OVER (ORDER BY reading_date) - reading_value AS A_plus,
reading_id
FROM DCM_READING
WHERE device_id = 'KXTE4501'
AND device_type = 'E'
AND serial_number = 'A171804699'
AND reading_date BETWEEN TO_DATE('17-NOV-2019' || ' 000000', 'DD-MON-YYYY HH24MISS') AND TO_DATE('19-NOV-2019' || ' 235959', 'DD-MON-YYYY HH24MISS')
ORDER BY reading_date)
ORDER BY serial_number, reading_date;
For extra information:
I am selecting data from a table that exists, and using the lag function to work out the difference in reading_value from the previous record. However, later on I needed to insert dummy data where there are missing half-hour reads. The CTE brings back a list of all half-hour intervals between the two dates I am querying on.
Ultimately I want a result that has all the reading_dates in half-hour intervals, the reading_value (if there is one), and the difference between the reading_values that do exist. For the half-hourly reads that have no data in DCM_READING I want to just return NULL.
Is it possible to use a CTE table with multiple selects?
Not sure what you would like to achieve, but you can have multiple CTEs or even nest them:
with
cte_1 as
(
select username
from dba_users
where oracle_maintained = 'N'
),
cte_2 as
(
select owner, round(sum(bytes)/1024/1024) as megabytes
from dba_segments
group by owner
),
cte_3 as
(
select username, megabytes
from cte_1
join cte_2 on cte_1.username = cte_2.owner
)
select *
from cte_3
order by username;

Aggregates for today and the previous day depending on data

Having trouble putting together a query to pull the aggregate values of a given timestamp and the timestamp before it. Given the following schema:
name TEXT,
ts TIMESTAMP,
X NUMERIC,
Y NUMERIC
where there are gaps in the ts column due to gaps in data, I'm trying to construct a query to produce:
name,
date_trunc('day', q1.ts),
avg(q1.X),
sum(q1.Y),
date_trunc('day', q2.ts),
avg(q2.X),
sum(q2.Y)
The first half is straightforward:
SELECT q1.name, date_trunc('day', q1.ts), avg(q1.X), sum(q1.Y)
FROM data as q1
GROUP BY 1, 2
ORDER BY 1, 2;
But not sure how to generate the relation to find the "day" before for each row. I'm trying to work an inner join like this:
SELECT q1.name, q1.day, q1.avg, q1.sum, q2.day, q2.avg, q2.sum
FROM (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q1 INNER JOIN (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q2 ON (
q1.name = q2.name
AND q2.day = q1.day - interval '1 day'
);
The problem with this is that it doesn't cover the cases where the previous "day" is more than 1 day before the current day.
The special difficulty here is that you need to number days after aggregating rows. You can do this in a single query level with the window function row_number(), since window functions are applied after aggregation by GROUP BY.
Also, use a CTE to avoid executing the same subquery multiple times:
WITH q AS (
SELECT name, ts::date AS day
,avg(x) AS avg_x, sum(y) AS sum_y
,row_number() OVER (PARTITION BY name ORDER BY ts::date) AS rn
FROM data
GROUP BY 1,2
)
SELECT q1.name, q1.day, q1.avg_x, q1.sum_y
,q2.day AS day2, q2.avg_x AS avg_x2, q2.sum_y AS sum_y2
FROM q q1
LEFT JOIN q q2 ON q1.name = q2.name
AND q1.rn = q2.rn + 1
ORDER BY 1,2;
Using the simpler cast to date (ts::date) instead of date_trunc('day', ts) to get "days".
LEFT [OUTER] JOIN (as opposed to [INNER] JOIN) is instrumental to preserve the corner case of the first row, where there is no previous day.
And ORDER BY should be applied to the outer query.
The question isn't crystal clear, but it sounds like you're actually trying to fill gaps while keeping track of leading/lagging rows.
To fill the gaps, look into generate_series() and left join it with your table:
select d
from generate_series(timestamp '2013-12-01', timestamp '2013-12-31', interval '1 day') d;
http://www.postgresql.org/docs/current/static/functions-srf.html
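Putting the two together, a left join from the generated series to a per-day aggregate of the table might look like this (a sketch against the data table from the question; avg_x and sum_y are just illustrative names):

```sql
SELECT d.day::date AS day, q.avg_x, q.sum_y  -- NULLs appear for days without data
FROM   generate_series(timestamp '2013-12-01'
                     , timestamp '2013-12-31'
                     , interval '1 day') AS d(day)
LEFT   JOIN (
   SELECT ts::date AS day, avg(x) AS avg_x, sum(y) AS sum_y
   FROM   data
   GROUP  BY 1
   ) q ON q.day = d.day::date
ORDER  BY 1;
```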
For previous and next row values, look into lead() and lag() window functions:
select date_trunc('day', ts) as curr_row_day,
lag(date_trunc('day', ts)) over w as prev_row_day
from data
window w as (order by ts)
http://www.postgresql.org/docs/current/static/tutorial-window.html

Get another column from sum-sub-select

I'm selecting something from a sub-select, which in turn gives me a list of sums. Now I want to select the base_unit column, which contains the unit of measurement. I can't seem to add base_unit to the sub-select because then it doesn't work with the GROUP BY statement.
SELECT to_char(a.pow * f_unit_converter(base_unit, '[W]'), '000.00')
FROM (
SELECT sum (time_value) AS pow
FROM v_value_quarter_hour
WHERE
mp_id IN (SELECT mp_id FROM t_mp WHERE mp_name = 'AC') AND
(now() - time_stamp < '5 day')
GROUP BY time_stamp
ORDER BY time_stamp DESC
) a
LIMIT 1
Where/how can I additionally select the base_unit from the t_mp Table for each of those sums, so that I can pass it to the f_unit_converter function?
Thanks a lot,
MrB
SELECT to_char(a.pow * f_unit_converter(a.base_unit, '[W]'), '000.00')
FROM (
SELECT sum (time_value) AS pow, t_mp.base_unit
FROM v_value_quarter_hour
inner join t_mp on (v_value_quarter_hour.mp_id = t_mp.mp_id)
WHERE
t_mp.mp_name = 'AC' AND
(now() - time_stamp < '5 day')
GROUP BY time_stamp, base_unit
ORDER BY time_stamp DESC
) a
LIMIT 1
Assuming that all your selected rows have the same base_unit, you should be able to add it both to the SELECT and the GROUP BY of your sub-query.
Use an INNER JOIN instead of an IN. Something like this:
SELECT to_char(a.pow * f_unit_converter(base_unit, '[W]'), '000.00')
FROM (
  SELECT sum(time_value) AS pow, base_unit
  FROM v_value_quarter_hour
  INNER JOIN t_mp ON v_value_quarter_hour.mp_id = t_mp.mp_id
  WHERE mp_name = 'AC' AND
        now() - time_stamp < '5 day'
  GROUP BY time_stamp, base_unit
  ORDER BY time_stamp DESC
) a
LIMIT 1