As the title says, I am running a query on bike-sharing data stored in BigQuery.
I am able to extract the data and arrange it in the correct order to be displayed in a path chart. In the data there are routes with coordinates for only the start and end (longitude and latitude), or sometimes only the start. How do I remove anything with fewer than 4 points?
This is the code; I am also limited to SELECT-only queries:
SELECT
  routeID,
  json_extract(st_asgeojson(st_makeline(array_agg(st_geogpoint(locs.lon, locs.lat) order by locs.date))), '$.coordinates') as geo
FROM
  howardcounty.routebatches
cross join UNNEST(locations) as locs
where unlockedAt between {{start_date}} and {{end_date}}
GROUP BY routeID
order by routeID
limit 10
I have also included a screenshot for clarity.
To apply a condition after a GROUP BY, use a HAVING clause. For a simple condition -- are there at least two data points for the route? -- this query can be used:
With dummy as (
  Select 1 as routeID, [struct(current_timestamp() as date, 1 as lon, 2 as lat), struct(current_timestamp() as date, 3 as lon, 4 as lat)] as locations
  Union all select 2 as routeID, [struct(current_timestamp() as date, 10 as lon, 20 as lat)]
)
SELECT
  routeID, count(locs.date) as amountcoord,
  json_extract(st_asgeojson(st_makeline(array_agg(st_geogpoint(locs.lon, locs.lat) order by locs.date))), '$.coordinates') as geo
FROM
  #howardcounty.routebatches
  dummy
cross join UNNEST(locations) as locs
#where unlockedAt between {{start_date}} and {{end_date}}
GROUP BY routeID
having count(locs.date) > 1
order by routeID
limit 10
For more complex ones, a nested select may do the job:
Select *
from (
--- your code ---
) where length(geo) - length(replace(geo, "]", "")) >= 1+4
The JSON is transformed to a string in your code. If you count the ] characters and subtract one for the closing bracket of the outer JSON array, the inner arrays - i.e. the points - are counted.
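Putting this together with the original query, a sketch could look like the one below (my assumption: "fewer than 4 points" means fewer than 4 coordinate pairs in the line, and the {{start_date}}/{{end_date}} template parameters stay in place):
SELECT *
FROM (
  SELECT
    routeID,
    json_extract(st_asgeojson(st_makeline(array_agg(st_geogpoint(locs.lon, locs.lat) order by locs.date))), '$.coordinates') as geo
  FROM
    howardcounty.routebatches
  cross join UNNEST(locations) as locs
  where unlockedAt between {{start_date}} and {{end_date}}
  GROUP BY routeID
)
-- keep only routes whose GeoJSON string contains at least 4 coordinate pairs
where length(geo) - length(replace(geo, "]", "")) - 1 >= 4
order by routeID
limit 10
The same filter could of course also be written as having count(locs.date) >= 4 inside the grouped query, as in the HAVING example above.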
Related
I have a table with about 100,000 names/rows that look something like the below. There are about 3,000 different Refnrs. The names are clustered around the Refnr geographically. The problem is that some names have the wrong location. I need to find the rows that don't fit in with the others. I figured I would do this by finding a Latitude OR Longitude that is too far away from the Longitude and Latitude of the rest of the same Refnr. If you look at the first Refnr, two of its rows are located at Latitude 10.67xxx, and one is located at Latitude 10.34xxx.
So I want to compare all the names within each Refnr and sort out the ones where the 2nd decimal place differs from the rest of the names.
Is there any way to do this so that I don't have to manually run a query 3000 times?
Refnr  Latitude  Longitude  Name
123    10.67643  50.67523   bob
123    10.67143  50.67737   joe
123    10.34133  50.67848   al
234    11.56892  50.12324   berny
234    11.56123  50.12432   bonny
234    11.98135  50.12223   arby
567    10.22892  50.67143   nilly
567    10.22123  50.67236   tilly
567    10.22148  50.22422   billy
I need a select to give me this.
Refnr  Latitude  Longitude  Name
123    10.34133  50.67848   al
234    11.98135  50.12223   arby
567    10.22148  50.22422   billy
Thanks for the help.
Here's what is hopefully a working solution - it gives the 3 outliers from your sample data; it will be interesting to see if it works on your larger data set.
Create a CTE each for longitude and latitude, count the number of matching values based on the first 2 decimal places only, and pick the value with the fewest matches - that's the group's outlier.
Join the results with the main table and filter to only rows matching the outlier lat or long.
with outlierLat as (
select top (1) with ties refnr, Round(latitude,2,1) latitude
from t
group by refnr, Round(latitude,2,1)
order by Count(*)
), outlierLong as (
select top (1) with ties refnr, Round(Longitude,2,1) Longitude
from t
group by refnr, Round(Longitude,2,1)
order by Count(*)
)
select t.*
from t
left join outlierLat lt on lt.refnr=t.refnr and Round(t.latitude,2,1)=lt.latitude
left join outlierLong lo on lo.refnr=t.refnr and Round(t.Longitude,2,1)=lo.Longitude
where lt.latitude is not null or lo.Longitude is not null
See demo Fiddle
This got overly complex, and may not be that useful. Still, it was interesting to work on.
First, set up the test data
DROP TABLE #Test
GO
CREATE TABLE #Test
(
Refnr int not null
,Latitude decimal(7,5) not null
,Longitude decimal(7,5) not null
,Name varchar(100) not null
)
INSERT #Test VALUES
(123, 10.67643, 50.67523, 'bob')
,(123, 10.67143, 50.67737, 'joe')
,(123, 10.34133, 50.67848, 'al')
,(234, 11.56892, 50.12324, 'berny')
,(234, 11.56123, 50.12432, 'bonny')
,(234, 11.98135, 50.12223, 'arby')
,(567, 10.22892, 50.67143, 'nilly')
,(567, 10.22123, 50.67236, 'tilly')
,(567, 10.22148, 50.22422, 'billy')
SELECT *
from #Test
As requirements are a tad imprecise, use this to round lat, lon to the desired precision. Adjust as necessary.
DECLARE @Precision TINYINT = 1
--SELECT
--    Latitude
--    ,round(Latitude, @Precision)
--    from #Test
Then it gets messy. Problems will crop up if there are multiple "outliers", by EITHER latitude OR longitude. I think this will account for all of them, and remove duplicates, but further review and testing is called for.
;WITH cteGroups as (
    -- Set up groups by lat/lon proximity
    SELECT
        Refnr
        ,'Latitude' Type
        ,round(Latitude, @Precision) Proximity
        ,count(*) HowMany
    from #Test
    group by
        Refnr
        ,round(Latitude, @Precision)
    UNION ALL SELECT
        Refnr
        ,'Longitude' Type
        ,round(Longitude, @Precision) Proximity
        ,count(*) HowMany
    from #Test
    group by
        Refnr
        ,round(Longitude, @Precision)
)
,cteOutliers as (
-- Identify outliers
select
Type
,Refnr
,Proximity
,row_number() over (partition by Type, Refnr order by HowMany desc) Ranking
from cteGroups
)
-- Pull out all items that match with outliers
select te.*
from cteOutliers cte
inner join #Test te
on te.Refnr = cte.Refnr
and ( (cte.Type = 'Latitude' and round(te.Latitude, @Precision) = Proximity)
   or (cte.Type = 'Longitude' and round(te.Longitude, @Precision) = Proximity) )
where cte.Ranking > 1 -- Not in the larger groups
This averages out the center of the locations and looks for ones far from it
SELECT *
, ABS((SELECT Sum(Latitude) / COUNT(*) FROM #Test) - Latitude)
+ ABS((SELECT Sum(Longitude) / COUNT(*) FROM #Test) - Longitude) as Awayfromhome
from #Test
Order by Awayfromhome desc
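A variation of the same idea scoped per Refnr (a sketch of my own on the #Test sample table above, not part of the original answer) compares each row to the average position of its own Refnr group using window functions:
SELECT *
FROM (
    SELECT *,
           -- distance of each row from the average position of its own Refnr group
           ABS(AVG(Latitude)  OVER (PARTITION BY Refnr) - Latitude)
         + ABS(AVG(Longitude) OVER (PARTITION BY Refnr) - Longitude) as Awayfromhome
    FROM #Test
) t
ORDER BY Awayfromhome DESC
The rows with the largest Awayfromhome within each Refnr are then the outlier candidates.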
I am trying to unnest the below table.
I am using the below unnest query to flatten the table:
SELECT
  id,
  name, keyword
FROM `project_id.dataset_id.table_id`,
  unnest(`groups`) as `groups`
where id = 204358
The problem is, this duplicates the rows (except name), as is the case with flattening the table.
How can I modify the query to put the names in two different columns rather than rows?
Expected output below -
That's because the comma is a cross join - in combination with an unnested array it is a lateral cross join. You repeat the parent row for every row in the array.
One problem with pivoting arrays is that arrays can have a variable number of elements, but a table must have a fixed number of columns.
So you need a way to decide which array element becomes which column.
E.g. with
SELECT
id,
name,
groups[ordinal(1)] as firstArrayEntry,
groups[ordinal(2)] as secondArrayEntry,
keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
If your array had a key-value pair you could decide using the key. E.g.
SELECT
id,
name,
(select value from unnest(groups) where key='key1') as key1,
keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
But that doesn't seem to be the case with your table ...
A third option could be PIVOT in combination with your cross-join solution, but this one has restrictions too, and I'm not sure how computation-heavy it is.
Consider the below simple solution:
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (1, 2))
If applied to the sample data in your question, the output is:
Note: when you apply this to your real case, you just need to know how many such name_NNN columns to expect and extend the list accordingly - for example for offset + 1 in (1, 2, 3, 4, 5) if you expect 5 such columns.
In case, for whatever reason, you want to improve on this, use the below, where everything is built dynamically for you, so you don't need to know in advance how many columns there will be in the output:
execute immediate (select '''
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (''' || string_agg('' || pos, ', ') || '''))
'''
from (select pos from (
select max(array_length(`groups`)) cnt
from `project_id.dataset_id.table_id`
), unnest(generate_array(1, cnt)) pos
))
Your question is a little unclear, because it does not specify what to do with other keywords or other columns. If you specifically want the first two values in the array for keyword "OVG", you can unnest the array and pull out the appropriate names:
SELECT id,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1
) as name_1,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1 OFFSET 1
) as name_2,
'OVG' as keyword
FROM `project_id.dataset_id.table_id` t
WHERE id = 204358;
I have 2 different characters ('|' and ',') in one column in BigQuery. Using standard SQL, how do I split a column containing the string below into multiple columns, splitting on both '|' and ','?
Inbr | Evermore | In Banner Video, Canary Island | 702B6
The code I have so far is below.
Thank you - here is the code scenario; how do I apply that with the other columns I need in the table?
SELECT CAST(Date AS DATE) Date,
  Data_Source_type,
  Data_Source_id,
  Campaign,
  Data_Source,
  Data_Source_name,
  Data_Source_type_name,
  Ad_legacy__AdWords,
  Ad_Group_Name__AdWords,
  Ad_Type__AdWords,
  SPLIT(Campaign, '|')[safe_ordinal(1)] as Media,
  SPLIT(Campaign, '|')[safe_ordinal(2)] as Client,
  SPLIT(Campaign, '|')[safe_ordinal(3)] as Market_Type,
  SPLIT(Campaign, '|')[safe_ordinal(4)] as Market,
  SPLIT(Campaign, '|')[safe_ordinal(5)] as Market_ID,
  City__AdWords
FROM `data.aud_summary`
Consider the below (as this is Campaign info, I assume the structure of the string in the column is consistent across rows and has the same number of parts to be extracted):
select * except(key) from (
select to_json_string(t) key, offset, value
from `project.dataset.table` t,
unnest(regexp_extract_all(Campaign, r'[^,|]+')) value with offset
)
pivot(max(value) for offset in (0 as Media, 1 as Client, 2 as Market_Type, 3 as Market, 4 as Code))
If applied to the sample data in your question, the output is:
how do I apply that with the other columns I need in the table?
Just add t.* as in the below example:
select * from (
select t.*, offset, value
from `project.dataset.table` t,
unnest(regexp_extract_all(Campaign, r'[^,|]+')) value with offset
)
pivot(max(value) for offset in (0 as Media, 1 as Client, 2 as Market_Type, 3 as Market, 4 as Code))
Use REPLACE to change the ',' to '|' before splitting the column.
WITH
  SampleData AS (
    SELECT
      "Inbr | Evermore | In Banner Video, Canary Island | 702B6" AS DATA )
SELECT
  a[SAFE_ORDINAL(1)] AS Media,
  a[SAFE_ORDINAL(2)] AS Client,
  a[SAFE_ORDINAL(3)] AS Market_Type,
  a[SAFE_ORDINAL(4)] AS Market,
  a[SAFE_ORDINAL(5)] AS Fifth
FROM (
  SELECT
    SPLIT(REPLACE(DATA, ",", "|"), '|') AS a
  FROM
    SampleData)
Result
Media  Client    Market_Type      Market         Fifth
Inbr   Evermore  In Banner Video  Canary Island  702B6
And finally, applied to your table:
SELECT
  * EXCEPT(a),
  a[SAFE_ORDINAL(1)] AS Media,
  a[SAFE_ORDINAL(2)] AS Client,
  a[SAFE_ORDINAL(3)] AS Market_Type,
  a[SAFE_ORDINAL(4)] AS Market,
  a[SAFE_ORDINAL(5)] AS Fifth
FROM (
  SELECT
    CAST(Date AS DATE) Date,
    * EXCEPT(Date),
    SPLIT(REPLACE(Campaign, ',', '|'), '|') AS a
  FROM
    `data.aud_summary`)
I am trying to find the maximum-mileage corridor segment for each corridor in the dataset. I used this very similar query to find the minimum mileage, i.e. the first segment on a corridor in the increasing direction, and this worked fine:
select distinct t.corridor_code_rb,t.frfpost,t.trfpost
from SEC_SEGMENTS t
where t.dir = 'I' and t.lane = 1
and t.frfpost = (select min(s.frfpost) from SEC_SEGMENTS s)
order by 1,2
However, the problem arises when I try to use a similar query to find the max (last segment in the increasing direction) corridor length with the following query:
select distinct t.corridor_code_rb,t.frfpost,t.trfpost
from SEC_SEGMENTS t
where t.dir = 'I' and t.lane = 1
and t.trfpost = (select max(s.trfpost) from SEC_SEGMENTS s)
group by t.corridor_code_rb,t.frfpost,t.trfpost
What happens when I run this query is that it only outputs the highest-mileage segment for the first corridor, then stops, whereas the lowest-mileage query returns that output for every corridor, which is what I want. The frfpost is the beginning mile for each section and the trfpost is the ending mileage. So frfpost is 'from reference post' and trfpost is 'to reference post'. Each corridor is broken up into segments between 5 and 40 miles in length, usually between junctions with other corridors. I'm trying to find the last segment for each corridor, so that's where the issue lies.
You need to group by corridor_code_rb as well, to get the max value for that column per corridor_code_rb. Then join it to the main table.
select t.corridor_code_rb, t.frfpost, t.trfpost
from SEC_SEGMENTS t
join (select corridor_code_rb, max(trfpost) as trfpost
      from SEC_SEGMENTS
      group by corridor_code_rb) s
  on t.corridor_code_rb = s.corridor_code_rb
 and t.trfpost = s.trfpost
where t.dir = 'I' and t.lane = 1
Based on your comment you seem to use Oracle, which supports Analytical Functions:
Select corridor_code_rb, frfpost, trfpost
from
(
select corridor_code_rb, frfpost, trfpost,
ROW_NUMBER() -- rank the rows
OVER (PARTITION BY corridor_code_rb -- for each corridor
ORDER BY trfpost DESC) AS rn -- by descending direction
from SEC_SEGMENTS t
where t.dir = 'I' and t.lane = 1
) dt
WHERE rn = 1 -- filter the last row
I have a bit of a weird question, given to me by a client.
He has a list of data, with a date between parentheses like so:
Foo (14/08/2012)
Bar (15/08/2012)
Bar (16/09/2012)
Xyz (20/10/2012)
However, he wants the list to be displayed as follows:
Foo (14/08/2012)
Bar (16/09/2012)
Bar (15/08/2012)
Xyz (20/10/2012)
(notice that the second Bar has moved up one position)
So the logic behind it is that the list has to be sorted by date ascending, EXCEPT when two rows have the same name ('Bar'). If they have the same name, they must be sorted with the LATEST date at the top, while staying in the overall sort order.
Is this even remotely possible? I've experimented with a lot of ORDER BY clauses, but couldn't find the right one. Does anyone have an idea?
I should have specified that this data comes from a table in a SQL Server database (the name and the date are in two different columns). So I'm looking for a SQL query that can do the sorting I want.
(I've dumbed this example down quite a bit, so if you need more context, don't hesitate to ask)
This works, I think
declare #t table (data varchar(50), date datetime)
insert #t
values
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
select t.*
from #t t
inner join (select data, COUNT(*) cg, MAX(date) as mg from #t group by data) tc
on t.data = tc.data
order by case when cg>1 then mg else date end, date desc
produces
data date
---------- -----------------------
Foo 2012-08-14 00:00:00.000
Bar 2012-09-16 00:00:00.000
Bar 2012-08-15 00:00:00.000
Xyz 2012-10-20 00:00:00.000
A way with better performance than any of the other posted answers is to just do it entirely with an ORDER BY, rather than with a JOIN or a CTE:
DECLARE #t TABLE (myData varchar(50), myDate datetime)
INSERT INTO #t VALUES
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
SELECT *
FROM #t t1
ORDER BY (SELECT MIN(t2.myDate) FROM #t t2 WHERE t2.myData = t1.myData), T1.myDate DESC
This does exactly what you request, works with any indexes, and scales much better to larger amounts of data than any of the other answers.
Additionally it's much more clear what you're actually trying to do here, rather than masking the real logic with the complexity of a join and checking the count of joined items.
This one uses analytic functions to perform the sort; it only requires one SELECT from your table.
The inner query finds the gaps where the name changes. These gaps are used to identify groups in the next query, and the outer query does the final sorting by these groups.
I have tried it here (SQL Fiddle) with extended test-data.
SELECT name, dat
FROM (
SELECT name, dat, SUM(gap) over(ORDER BY dat, name) AS grp
FROM (
SELECT name, dat,
CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name THEN 0 ELSE 1 END AS gap
FROM t
) x
) y
ORDER BY grp, dat DESC
Extended test-data
('Bar','2012-08-12'),
('Bar','2012-08-11'),
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-08-16'),
('Bar','2012-09-17'),
('Xyz','2012-10-20')
Result
Bar 2012-08-12
Bar 2012-08-11
Foo 2012-08-14
Bar 2012-09-17
Bar 2012-08-16
Bar 2012-08-15
Xyz 2012-10-20
I think that this works, including the case I asked about in the comments:
declare #t table (data varchar(50), [date] datetime)
insert #t
values
('Foo','20120814'),
('Bar','20120815'),
('Bar','20120916'),
('Xyz','20121020')
; With OuterSort as (
select *,ROW_NUMBER() OVER (ORDER BY [date] asc) as rn from #t
)
--Now we need to find contiguous ranges of the same data value, and the min and max row number for such a range
, Islands as (
select data,rn as rnMin,rn as rnMax from OuterSort os where not exists (select * from OuterSort os2 where os2.data = os.data and os2.rn = os.rn - 1)
union all
select i.data,rnMin,os.rn
from
Islands i
inner join
OuterSort os
on
i.data = os.data and
i.rnMax = os.rn-1
), FullIslands as (
select
data,rnMin,MAX(rnMax) as rnMax
from Islands
group by data,rnMin
)
select
*
from
OuterSort os
inner join
FullIslands fi
on
os.rn between fi.rnMin and fi.rnMax
order by
fi.rnMin asc,os.rn desc
It works by first computing the initial ordering in the OuterSort CTE. Then, using two CTEs (Islands and FullIslands), we compute the parts of that ordering in which the same data value appears in adjacent rows. Having done that, we can compute the final ordering by any value that all adjacent values will have (such as the lowest row number of the "island" that they belong to), and then within an "island", we use the reverse of the originally computed sort order.
Note that this may, though, not be too efficient for large data sets. On the sample data it shows up as requiring 4 table scans of the base table, as well as a spool.
Try something like...
ORDER BY CASE date
WHEN '14/08/2012' THEN 1
WHEN '16/09/2012' THEN 2
WHEN '15/08/2012' THEN 3
WHEN '20/10/2012' THEN 4
END
In MySQL, you can do:
ORDER BY FIELD(date, '14/08/2012', '16/09/2012', '15/08/2012', '20/10/2012')
In Postgres, you can create a function FIELD and do:
CREATE OR REPLACE FUNCTION field(anyelement, anyarray) RETURNS numeric AS $$
SELECT
COALESCE((SELECT i
FROM generate_series(1, array_upper($2, 1)) gs(i)
WHERE $2[i] = $1),
0);
$$ LANGUAGE SQL STABLE
If you do not want to use the CASE, you can try to find an implementation of the FIELD function for SQL Server.
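For example, a rough SQL Server sketch of that idea (my own illustration, not a built-in FIELD function, reusing the @t sample table declared in the earlier answers) orders by the position of the formatted date within a delimited list:
SELECT *
FROM @t
-- CONVERT(..., 103) renders the date as dd/mm/yyyy so it can be located in the list;
-- CHARINDEX then returns its position, which drives the sort order
ORDER BY CHARINDEX(
    ',' + CONVERT(varchar(10), [date], 103) + ',',
    ',14/08/2012,16/09/2012,15/08/2012,20/10/2012,')
Like the CASE version above, this hard-codes the desired order, so it only fits a fixed, known list of dates.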