I am trying to find the average distance over a query that returns 600,000 rows. I have created a CTE to calculate distances from longitude and latitude. Is it possible to apply an aggregate function to this, given that calculated_distance is computed rather than an existing column?
Thanks in advance!
WITH name AS (
SELECT
id,
latitude,
longitude,
name,
docks
FROM
santander_stations
),
ride_data AS (
SELECT
startstationid,
endstationid
FROM
public.santander_2016
UNION
SELECT
startstationid,
endstationid
FROM
public.santander_2017
UNION
SELECT
startstationid,
endstationid
FROM
public.santander_2018
)
SELECT
calculate_distance( a.latitude, a.longitude, b.latitude, b.longitude, 'K' ),
a.name AS Start_Station,
b.name AS End_Station
FROM
name AS a,
name AS b
WHERE
a.id IN ( SELECT startstationid FROM ride_data )
AND
b.id IN ( SELECT endstationid FROM ride_data )
ORDER BY
1 DESC
The number of rides is much larger than the number of possible start and end stations. Therefore, we can calculate the distances for all combinations of (start station, end station) first, and then use the result as a lookup table for the 600,000 rides.
Here's a way to calculate the average ride distance for 2016, 2017 and 2018. Please replace (1) the distance_km formula with your calculate_distance() UDF and (2) santander_rides with your ride_data subquery. Note that ride_data should use UNION ALL rather than UNION, because UNION removes duplicate (start station, end station) pairs and the average would then no longer be weighted by the number of rides.
with cte_santander_station_distance as (
select x.id as start_station_id,
y.id as end_station_id,
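-- spherical law of cosines; assumes latitude/longitude are stored in degrees, 6371 = mean Earth radius in km
6371 * acos(sin(radians(x.latitude))*sin(radians(y.latitude))+cos(radians(x.latitude))*cos(radians(y.latitude))*cos(radians(y.longitude-x.longitude))) as distance_km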
from santander_stations x, santander_stations y
where x.id <> y.id)
select avg(ssd.distance_km) as average_distance_km,
count(*) as rides
from santander_rides sr
join cte_santander_station_distance ssd
on sr.start_station_id = ssd.start_station_id
and sr.end_station_id = ssd.end_station_id;
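To try the lookup-table idea without a database server, here is a minimal sketch in Python using the standard-library sqlite3 module. The table names (stations, rides), the sample rows and the haversine_km function are assumptions made for illustration; haversine_km only stands in for the calculate_distance() UDF from the question. Unlike the query above, this sketch keeps same-station pairs, so round trips count as 0 km instead of being dropped by the join.

```python
import math
import sqlite3


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def average_ride_distance(stations, rides):
    """Return (average_distance_km, ride_count) via a station-pair lookup CTE."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL can call it (hypothetical stand-in for the UDF).
    conn.create_function("haversine_km", 4, haversine_km)
    conn.execute("CREATE TABLE stations (id INTEGER PRIMARY KEY, latitude REAL, longitude REAL)")
    conn.execute("CREATE TABLE rides (start_station_id INTEGER, end_station_id INTEGER)")
    conn.executemany("INSERT INTO stations VALUES (?, ?, ?)", stations)
    conn.executemany("INSERT INTO rides VALUES (?, ?)", rides)
    row = conn.execute(
        """
        WITH station_distance AS (
            -- one row per (start, end) pair: the lookup table
            SELECT x.id AS start_station_id,
                   y.id AS end_station_id,
                   haversine_km(x.latitude, x.longitude, y.latitude, y.longitude) AS distance_km
            FROM stations x CROSS JOIN stations y
        )
        SELECT AVG(d.distance_km), COUNT(*)
        FROM rides r
        JOIN station_distance d
          ON r.start_station_id = d.start_station_id
         AND r.end_station_id = d.end_station_id
        """
    ).fetchone()
    conn.close()
    return row


# Made-up data: three stations on the equator, one degree of longitude apart.
sample_stations = [(1, 0.0, 0.0), (2, 0.0, 1.0), (3, 0.0, 2.0)]
sample_rides = [(1, 2), (2, 3), (1, 3), (1, 1)]
average_km, ride_count = average_ride_distance(sample_stations, sample_rides)
print(average_km, ride_count)
```

The aggregate is applied to the computed column simply by giving it a name inside the CTE and averaging it in the outer query.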
If there are many combinations of start station and end station, you can materialize the lookup table (for example, as a regular table built from the CTE above) so that the distances are computed only once.
I have two related tables, trips and receipts, and I want to group them by time.
I want to select a receipt as a column based on published_at, which must fall between pickup_time and drop_time.
I tried a JOIN, but it seems to select only the rows where drop_time is NULL:
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG(STRUCT(r.source_id, r.receipt_id, r.published_at) ORDER BY r.published_at LIMIT 1)[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` AS t
JOIN `my-project-gcp.data_source.receipts` AS r
ON
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
GROUP BY source_id, pickup_time, drop_time
I also tried a sub-query, but got this error:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG((
SELECT
STRUCT(r.source_id, r.receipt_id, r.published_at)
FROM `my-project-gcp.data_source.receipts` as r
WHERE
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
LIMIT 1
))[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` as t
GROUP BY source_id, pickup_time, drop_time
Each source_id is a car, and only one driver can drive a car at a time, so we can partition by source_id.
Your approach works for small tables. Since there is no unique join key, the cross join fails on large tables.
Here is a solution that uses UNION ALL and a look-back technique. It is quite fast and works for medium-sized tables in the range of a few GB. It avoids the cross join, but the script is fairly long.
The trips table lists all drives by the drivers. The receipts table lists all fines.
We need a unique row identifier for each trip to match on later. We use the row number for this; see the table trips_with_rowid.
The table summery_tmp unions three tables. First, we load the trips table and add an empty column for the fines. Then we load the trips table again to mark the times when no one was driving the car. Finally, we add the receipts table so that only the columns source_id, pickup_time and fines are filled.
In the table summery, these rows are sorted by pickup_time within each source_id, so the fine entries come right after the entry of the driver who took the car. For the fine entries, the column row_id_new is filled with the row_id of that driver's trip.
Grouping by row_id_new and filtering out unneeded entries does the job.
I changed the seconds of the entered times (laziness), so the output differs a bit from your result.
With trips as
(Select 1 source_id ,timestamp("2022-7-19 9:37:47") pickup_time, timestamp("2022-07-19 9:40:00") as drop_time, "jhon" driver_name
Union all Select 1 ,timestamp("2022-7-19 12:00:01"),timestamp("2022-7-19 13:05:11"),"doe"
Union all Select 1 ,timestamp("2022-7-19 14:30:01"),null,"foo"
Union all Select 3 ,timestamp("2022-7-24 08:35:01"),timestamp("2022-7-24 09:15:01"),"bar"
Union all Select 4 ,timestamp("2022-7-25 10:24:01"),timestamp("2022-7-25 11:14:01"),"jhon"
),
receipts as
(Select 1 source_id, 101 receipt_id, timestamp("2022-07-19 9:37:47") published_at,40 price
Union all Select 1,102, timestamp("2022-07-19 13:04:47"),45
Union all Select 1,103, timestamp("2022-07-19 15:23:00"),32
Union all Select 3,301, timestamp("2022-07-24 09:15:47"),45
Union all Select 4,401, timestamp("2022-07-25 11:13:47"),45
Union all Select 5,501, timestamp("2022-07-18 07:12:47"),45
),
trips_with_rowid as
(
SELECT 2*row_number() over (order by source_id,pickup_time) as row_id, * from trips
),
summery_tmp as
(
Select *, null as fines from trips_with_rowid
union all Select row_id+1,source_id,drop_time,null,concat("no driver, last one ",driver_name),null from trips_with_rowid
union all select null,source_id, published_at, null,null, R from receipts R
),
summery as
(
SELECT last_value(row_id ignore nulls) over (partition by source_id order by pickup_time ) row_id_new
,*
from summery_tmp
order by 1,2
)
select source_id,min(pickup_time) pickup_time, min(drop_time) drop_time,
any_value(driver_name) driver_name, array_agg(fines IGNORE NULLS) as fines_Sum
from summery
group by row_id_new,source_id
having fines_sum is not null or (pickup_time is not null and driver_name not like "no driver%")
order by 1,2
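The look-back idea is independent of BigQuery, so here is a minimal sketch of it in plain Python, using the sample rows from the script above. It is an illustration under assumptions, not a translation of the query: the function name assign_receipts and the event encoding are made up, and receipts published while nobody is driving are collected in a separate list instead of being attached to "no driver" rows.

```python
from datetime import datetime

PICKUP, RECEIPT, DROP = 0, 1, 2  # tie-break order for events with equal timestamps


def assign_receipts(trips, receipts):
    """Assign each receipt to the trip that was active when it was published.

    trips:    list of (source_id, pickup_time, drop_time or None, driver_name)
    receipts: list of (source_id, receipt_id, published_at)
    Returns (per_trip, unassigned); per_trip[i] holds the receipt ids of trips[i].
    """
    events = []
    for idx, (source_id, pickup, drop, _driver) in enumerate(trips):
        events.append((source_id, pickup, PICKUP, idx))
        if drop is not None:
            events.append((source_id, drop, DROP, idx))
    for source_id, receipt_id, published in receipts:
        events.append((source_id, published, RECEIPT, receipt_id))

    # One sorted pass per car replaces the range join.
    events.sort(key=lambda e: e[:3])

    per_trip = [[] for _ in trips]
    unassigned = []
    current_source, current_trip = None, None
    for source_id, _time, kind, payload in events:
        if source_id != current_source:
            current_source, current_trip = source_id, None  # new car: forget the last trip
        if kind == PICKUP:
            current_trip = payload  # remember the most recent trip (the "look back" value)
        elif kind == DROP:
            current_trip = None
        elif current_trip is None:
            unassigned.append(payload)
        else:
            per_trip[current_trip].append(payload)
    return per_trip, unassigned


trips = [
    (1, datetime(2022, 7, 19, 9, 37, 47), datetime(2022, 7, 19, 9, 40, 0), "jhon"),
    (1, datetime(2022, 7, 19, 12, 0, 1), datetime(2022, 7, 19, 13, 5, 11), "doe"),
    (1, datetime(2022, 7, 19, 14, 30, 1), None, "foo"),
    (3, datetime(2022, 7, 24, 8, 35, 1), datetime(2022, 7, 24, 9, 15, 1), "bar"),
    (4, datetime(2022, 7, 25, 10, 24, 1), datetime(2022, 7, 25, 11, 14, 1), "jhon"),
]
receipts = [
    (1, 101, datetime(2022, 7, 19, 9, 37, 47)),
    (1, 102, datetime(2022, 7, 19, 13, 4, 47)),
    (1, 103, datetime(2022, 7, 19, 15, 23, 0)),
    (3, 301, datetime(2022, 7, 24, 9, 15, 47)),
    (4, 401, datetime(2022, 7, 25, 11, 13, 47)),
    (5, 501, datetime(2022, 7, 18, 7, 12, 47)),
]
per_trip, unassigned = assign_receipts(trips, receipts)
print(per_trip, unassigned)
```

Sorting all events once and carrying the last seen trip forward plays the same role as last_value(row_id ignore nulls) in the SQL version.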
I have one dataset and am trying to list all combinations of its grouping values. However, I am unable to figure out how to include the combinations that have no rows. For example, Longitudinal? can be no and cohort can be 11-20, but for Region 1 there were no patients of that age in that region. How can I show a 0 for the count?
Here is the code:
SELECT "s_safe_005prod"."ig_eligi_group1"."site_name" AS "Site Name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong" AS "Longitudinal?",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort" AS "Cohort",
count(*) AS "count"
FROM "s_safe_005prod"."ig_eligi_group1"
GROUP BY "s_safe_005prod"."ig_eligi_group1"."site_name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort"
ORDER BY "s_safe_005prod"."ig_eligi_group1"."site_name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong" ASC,
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort" ASC
Create a cross join across the unique values from each of the three grouping fields to create a set of all possible combinations. Then left join that to the counts you have originally and coalesce null values to zero.
WITH groups AS
(
SELECT a.site_name, b.longitudinal, c.cohort
FROM (SELECT DISTINCT site_name FROM s_safe_005prod.ig_eligi_group1) a,
(SELECT DISTINCT il_eligi_ellong AS longitudinal FROM s_safe_005prod.ig_eligi_group1) b,
(SELECT DISTINCT il_eligi_elcohort AS cohort FROM s_safe_005prod.ig_eligi_group1) c
),
dat AS
(
SELECT site_name,
il_eligi_ellong AS longitudinal,
il_eligi_elcohort AS cohort,
count(*) AS "count"
FROM s_safe_005prod.ig_eligi_group1
GROUP BY site_name,
il_eligi_ellong,
il_eligi_elcohort
)
SELECT groups.site_name,
groups.longitudinal,
groups.cohort,
COALESCE(dat."count",0) AS "count"
FROM groups
LEFT JOIN dat ON groups.site_name = dat.site_name
AND groups.longitudinal = dat.longitudinal
AND groups.cohort = dat.cohort;
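The cross join plus left join pattern can be checked on a tiny in-memory table. This is a sketch in Python with the standard-library sqlite3 module; the table name patients, the column names, the CTE name combos and the three sample rows are assumptions for illustration only.

```python
import sqlite3


def counts_with_zeros(rows):
    """Count rows per (site, longitudinal, cohort), including combinations with no rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (site_name TEXT, longitudinal TEXT, cohort TEXT)")
    conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH combos AS (
            -- every possible combination of the three grouping values
            SELECT a.site_name, b.longitudinal, c.cohort
            FROM (SELECT DISTINCT site_name FROM patients) a
            CROSS JOIN (SELECT DISTINCT longitudinal FROM patients) b
            CROSS JOIN (SELECT DISTINCT cohort FROM patients) c
        ),
        dat AS (
            -- the counts that actually exist
            SELECT site_name, longitudinal, cohort, COUNT(*) AS n
            FROM patients
            GROUP BY site_name, longitudinal, cohort
        )
        SELECT combos.site_name, combos.longitudinal, combos.cohort,
               COALESCE(dat.n, 0) AS n
        FROM combos
        LEFT JOIN dat
          ON combos.site_name = dat.site_name
         AND combos.longitudinal = dat.longitudinal
         AND combos.cohort = dat.cohort
        ORDER BY 1, 2, 3
        """
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("Region 1", "No", "0-10"),
    ("Region 1", "Yes", "11-20"),
    ("Region 2", "No", "11-20"),
]
for row in counts_with_zeros(sample_rows):
    print(row)
```

With two sites, two Longitudinal? values and two cohorts, the result has eight rows, and the five combinations without patients show a count of 0.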
DATE WindDirection
1/1/2000 SW
1/2/2000 SW
1/3/2000 SW
1/4/2000 NW
1/5/2000 NW
Question below
Every day is unique, and wind direction is not unique, so now we are trying to get the COUNT of the most COMMON wind direction.
My query was
SELECT Wind_Direction,COUNT(Wind_Direction) FROM Weather
GROUP BY DISTINCT(Wind_Direction);
The logic is to find the DISTINCT wind directions (there are about 7), then
group by WindDirection and apply a count.
Group by wind direction, count the occurrences of each direction, order by that count in descending order, and limit the result to 1 row to get the most common direction together with its count:
select wind_direction as most_common_wd,
count(*) as cnt
from weather
group by wind_direction
order by cnt desc
limit 1;
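To check the group, order and limit approach against the five sample rows from the question, here is a small sketch in Python using the standard-library sqlite3 module instead of Hive. The column names and the tie-break on wind_direction are assumptions; the question does not say what should happen when two directions are equally common.

```python
import sqlite3


def most_common_wind_direction(observations):
    """Return (wind_direction, count) for the most frequent direction."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (date TEXT, wind_direction TEXT)")
    conn.executemany("INSERT INTO weather VALUES (?, ?)", observations)
    row = conn.execute(
        """
        SELECT wind_direction, COUNT(*) AS cnt
        FROM weather
        GROUP BY wind_direction
        ORDER BY cnt DESC, wind_direction  -- second key only makes ties deterministic
        LIMIT 1
        """
    ).fetchone()
    conn.close()
    return row


sample_weather = [
    ("1/1/2000", "SW"),
    ("1/2/2000", "SW"),
    ("1/3/2000", "SW"),
    ("1/4/2000", "NW"),
    ("1/5/2000", "NW"),
]
print(most_common_wind_direction(sample_weather))
```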
You could also implement your logic with Hive analytic functions. This returns every distinct wind direction together with its count:
with q1 as (
select wind_direction,
count(wind_direction) over (partition by wind_direction) as total_counts
from weather
)
select distinct wind_direction, total_counts
from q1;
I'm new to Postgres and I have a troubling issue.
Suppose the output of my SQL query is
123456789;"2014-11-20 12:30:35.454875";500;200;"2014-11-16 16:16:26.976258";300
123456789;"2014-11-20 12:30:35.454875";500;200;"2014-11-16 16:16:27.173523";100
What I want is to sum up the 4th column across all rows, so that the first row contains the sum of the 4th column:
123456789;"2014-11-20 12:30:35.454875";500;400;"2014-11-16 16:16:26.976258";300
My query is
select l.phone_no, l.loan_time, l.cents_loaned/100, r.cents_deducted/100, r.event_time,
r.cents_balance/100
from tbl_table1 l
LEFT JOIN tbl_table2 r
ON l.tb1_id = r.tbl2_id
where l.phone_no=123456789
order by r.event_time desc
Any help will be appreciated.
Maybe this helps. It returns all rows, with the 4th column of the first row (rn = 1) replaced by the sum of the 4th column over all rows.
WITH query AS (
SELECT l.phone_no, l.loan_time, l.cents_loaned/100 AS cents_loaned,
r.cents_deducted/100 AS cents_deducted, r.event_time,
r.cents_balance/100 AS cents_balance,
ROW_NUMBER() OVER (ORDER BY r.event_time DESC) rn,
SUM(cents_deducted/100) OVER () AS sum_cents_deducted
FROM tbl_table1 l
LEFT JOIN tbl_table2 r
ON l.tb1_id = r.tbl2_id
WHERE l.phone_no=123456789
)
SELECT phone_no, loan_time, cents_loaned, cents_deducted, event_time, cents_balance
FROM query
WHERE rn > 1
UNION ALL
SELECT phone_no, loan_time, cents_loaned, sum_cents_deducted, event_time, cents_balance
FROM query
WHERE rn = 1
Use a window function over the whole set (OVER ()) as frame:
select l.phone_no, l.loan_time, l.cents_loaned/100
, sum(r.cents_deducted) OVER () / 100 AS total_cents_deducted
, r.event_time, r.cents_balance/100
FROM tbl_table1 l
LEFT JOIN tbl_table2 r ON l.tb1_id = r.tbl2_id
WHERE l.phone_no = 123456789
ORDER BY r.event_time desc
This will return all rows, not just the first. Your question is unclear as to that.
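As a quick illustration of SUM(...) OVER (), here is a sketch in Python using the standard-library sqlite3 module rather than Postgres. The single table deductions and its sample values are assumptions standing in for the joined result, and window functions need SQLite 3.25 or newer, which is assumed to be what your Python build ships with.

```python
import sqlite3


def rows_with_total(deductions):
    """Return every row together with the total of cents_deducted over the whole set."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE deductions (event_time TEXT, cents_deducted INTEGER)")
    conn.executemany("INSERT INTO deductions VALUES (?, ?)", deductions)
    rows = conn.execute(
        """
        SELECT event_time,
               cents_deducted,
               SUM(cents_deducted) OVER () AS total_cents_deducted  -- empty OVER (): all rows
        FROM deductions
        ORDER BY event_time DESC
        """
    ).fetchall()
    conn.close()
    return rows


sample_deductions = [
    ("2014-11-16 16:16:26.976258", 20000),
    ("2014-11-16 16:16:27.173523", 20000),
]
for row in rows_with_total(sample_deductions):
    print(row)
```

Every row carries the same total, so keeping only the first row (for example with LIMIT 1) gives the single summed row shown in the question.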
I'm trying to select max(count of rows).
Here are my 2 variants of the SELECT:
SELECT MAX(COUNT_OF_ENROLEES_BY_SPEC) FROM
(SELECT D.SPECCODE, COUNT(D.ENROLEECODE) AS COUNT_OF_ENROLEES_BY_SPEC
FROM DECLARER D
GROUP BY D.SPECCODE
);
SELECT S.NAME, MAX(D.ENROLEECODE)
FROM SPECIALIZATION S
CROSS JOIN DECLARER D WHERE S.SPECCODE = D.SPECCODE
GROUP BY S.NAME
HAVING MAX(D.ENROLEECODE) =
( SELECT MAX(COUNT_OF_ENROLEES_BY_SPEC) FROM
( SELECT D.SPECCODE, COUNT(D.ENROLEECODE) AS COUNT_OF_ENROLEES_BY_SPEC
FROM DECLARER D
GROUP BY D.SPECCODE
)
);
The first one works OK, but I want to rewrite it using HAVING, as in my second variant, and add one more column. However, the 2nd variant doesn't return any data, just empty columns.
How can I fix it? Thank you!
This query is based on the description given in the comments and on some suggestions, so it may be wrong:
select -- 4. Join selected codes with specializations
S.Name,
selected_codes.spec_code,
selected_codes.count_of_enrolees_by_spec
from
specialization S,
(
select -- 3. Filter records with maximum popularity only
spec_code,
count_of_enrolees_by_spec
from (
select -- 2. Count maximum popularity in separate column
spec_code,
count_of_enrolees_by_spec,
max(count_of_enrolees_by_spec) over (partition by null) max_count
from (
SELECT -- 1. Get list of declarations and count popularity
D.SPECCODE AS SPEC_CODE,
COUNT(D.ENROLEECODE) AS COUNT_OF_ENROLEES_BY_SPEC
FROM DECLARER D
GROUP BY D.SPECCODE
)
)
where count_of_enrolees_by_spec = max_count
)
selected_codes
where
S.SPECCODE = selected_codes.spec_code
Also, the query is not tested, so some syntax errors are possible.
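Since the question asked for a HAVING-based version, here is a sketch of that alternative, run through Python's standard-library sqlite3 module so it can be tested. It is not the nested query above: it compares COUNT(enroleecode) per specialization with the maximum count in HAVING. The sample rows and the lower-case table definitions are assumptions, and the syntax may need small changes for your database.

```python
import sqlite3


def most_popular_specializations(specializations, declarers):
    """Return (name, speccode, count) for the specialization(s) with the most enrolees."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE specialization (speccode INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE declarer (enroleecode INTEGER, speccode INTEGER)")
    conn.executemany("INSERT INTO specialization VALUES (?, ?)", specializations)
    conn.executemany("INSERT INTO declarer VALUES (?, ?)", declarers)
    rows = conn.execute(
        """
        SELECT s.name, d.speccode, COUNT(d.enroleecode) AS count_of_enrolees_by_spec
        FROM specialization s
        JOIN declarer d ON s.speccode = d.speccode
        GROUP BY s.name, d.speccode
        HAVING COUNT(d.enroleecode) = (
            -- the maximum of the per-specialization counts
            SELECT MAX(cnt)
            FROM (SELECT COUNT(enroleecode) AS cnt
                  FROM declarer
                  GROUP BY speccode) AS per_spec
        )
        ORDER BY s.name
        """
    ).fetchall()
    conn.close()
    return rows


sample_specs = [(1, "Math"), (2, "Physics"), (3, "History")]
sample_declarers = [(101, 1), (102, 1), (103, 2), (104, 2), (105, 3)]
print(most_popular_specializations(sample_specs, sample_declarers))
```

The second variant in the question returns nothing because it compares MAX(D.ENROLEECODE), the largest enrolee code, with a count; comparing COUNT with the maximum count is what the HAVING clause needs. Ties are all returned, as in the sample data where two specializations share the top count.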