bigquery geojson concat multiple rows from same user to a single row

bigquery geojson concat multiple rows from same user to a single row - sql

How do I merge multiple rows from the same user into a single geojson formatted row, I tried order by with no luck. here is a sample code I am using, I am also limited to select only on this db
SELECT rh.routeid, st_asgeojson(st_geogpoint(locs.lon, locs.lat)
FROM demo.routebatches RB, demo.route R
cross join UNNEST(locations) as locs
where EXTRACT (date FROM TIMESTAMP_MILLIS (CAST(locs.date as INT64))) = "2017-03-10" and rh.cycleID = 'aff9bb7b-3b92-4620-bc50-1152edefe04c'
order by routeID
limit 100
which gives this result, where multiple long and lats from the same routeid are not ordered by routeid. How do I solve this?
Geojson would work but I would also take this format, which is from deck.gl path in superset

There are two issues here:
how to order the locations within route. You write where multiple long and lats from the same routeid are not ordered by routeid - but this is a confusing statement. All these points (same routeid) have identical routeid, so how can they be ordered by routeid? You probably want them to order within the routeid group, but you need a different field to order them by.
you need to combine all the points within the routeid into a linestring, instead of having each point separate.
Also, it is not clear what is the actual input schema, and how routebatches table relates to route table, it would help if you clarify. Does locations field belong to RB or R table? The way you join RB and R tables without any predicate makes a cross join of each RB row with each R row.
It looks like locations record has a date field in milliseconds - if we can order by this field, I would use something like
SELECT
rh.routeid,
st_asgeojson(st_makeline(
array_agg(st_geogpoint(locs.lon, locs.lat) order by locs.date)))
FROM demo.route R cross join UNNEST(locations) as locs
where
EXTRACT (date FROM TIMESTAMP_MILLIS (CAST(locs.date as INT64))) = "2017-03-10"
and rh.cycleID = 'aff9bb7b-3b92-4620-bc50-1152edefe04c'
GROUP BY routeID
order by routeID
limit 100
First, we GROUP BY routeId, we then aggregate (array_agg) all the points in this route, creating an array with all the points ordered by the timestamp. st_makeline builds a linestring from this array, which you can then convert to geojson.
Also, see this article that creates a similar linestring from public NOAA data:
https://mentin.medium.com/longest-hurricane-eeb6844d65ed

Related

Unnesting structs in BigQuery

What is the correct way to flatten a struct of two arrays in BigQuery? I have a dataset like the one pictured here (the struct.destination and struct.visitors arrays are ordered - i.e. the visitor counts correspond specifically to the destinations in the same row):
I want to reorganize the data so that I have a total visitor count for each unique combination of origins and destinations. Ideally, the end result will look like this:
I tried using UNNEST twice in a row - once on struct.destination and then on struct.visitors, but this produces the wrong result (each destination gets mapped to every value in the array of visitor counts when it should only get mapped to the value in the same row):
SELECT
origin,
unnested_destination,
unnested_visitors
FROM
dataset.table,
UNNEST(struct.destination) AS unnested_destination,
UNNEST(struct.visitors) AS unnested_visitors

You have one struct that is repeated. So, I think you want:
SELECT origin,
s.destination,
s.visitors
FROM dataset.table t CROSS JOIN
UNNEST(t.struct) s;
EDIT:
I see, you have a struct of two arrays. You can do:
SELECT origin, d.destination, v.visitors
FROM dataset.table t CROSS JOIN
UNNEST(struct.destination) s WITH OFFSET nd LEFT JOIN
UNNEST(struct.visitors) v WITH OFFSET nv
ON nd = nv

Difficult to test by not having the underlying data to test on, so I created my own query with your dataset. As far as I can tell destination|visitors is not in an ARRAY-format, but rather in a STRUCT-format, so you do not need UNNEST it. Also view this thread please :)
SELECT
origin,
COUNT(struct.destination),
COUNT(struct.visitors)
FROM dataset.table
GROUP BY 1

Splitting table PK values into roughly same-size ranges

I have a table in Postgres with about half a million rows and an integer primary key.
I'd like to split its entire PK space into N ranges of approximately same size for independent processing. How do I best do it?
I apparently can do it by fetching all PK values to a client and remember every N-th value. This does a full scan and a fetch of all the values, while I only want no more than N+1 of them.
I can select min and max values and cut the range, but if the PKs are not distributed quite evenly, it may give me some ranges of seriously different sizes.
I want ranges for index-based access later on, so any modulo-based tricks do mot apply.
Is there any nice SQL-based solution that does not involve fetching all the keys to a client? Writing an N-specific query, e.g. with N clauses, if fine.
An example:
IDs in a range, say, from 1234 to 567890, N = 4.
I'd like to get 4 numbers, say 127123, 254789, 379860, so than there are approximately 125k records in each of the ranges of IDs [1234, 127123], [127123, 254789], [254789, 379860], [379860, 567890].
Update:
I've come up with a solution like this:
select
percentile_disc(0.25) within group (order by c.id) over() as pct_25
,percentile_disc(0.50) within group (order by c.id) over() as pct_50
,percentile_disc(0.75) within group (order by c.id) over() as pct_75
from customer c
limit 1
;
It does a decent job of giving me the exact range boundaries, and runs only a few seconds, which is fine for my purposes.
What bothers me is that I have to add the limit 1 clause to get just one row. Without it, I receive identical rows, one per record in the table. Is there a better way to get just a one row of the percentiles?

I think you can use row_number() for this purpose. Something like this:
select t.*,
floor((seqnum * N) / cnt) as range
from (select t.*,
row_number() over (order by pk) - 1 as seqnum,
count(*) over () as cnt
from t
) t;
This assumes by range that you mean ranges on pk values. You can also move the range expression to a where clause to just select one particular range.

Find related "ordered pairs" in SQL

Let's say I have a table format that looks exactly like this:
I'd like to write a query that locates the maximum station for a given frame and output case (results are grouped by frame & output case) but also return the ordered P (& eventually V2, V3, T, M2 & M3) that would be associated with the maximum station. The desired query is shown below:
I can't for the life of me figure this out. I've posted a copy of the access database to my google drive: https://drive.google.com/folderview?id=0B9VpkDoFQISJOFcwS2RMSGJ5RVk&usp=sharing

select x.*, t.p
from (select frame, outputcase, max(station) as max_station
from tbl
group by frame, outputcase) x
inner join tbl t
on x.frame = t.frame
and x.outputcase = t.outputcase
and x.max_station = t.station
order by x.frame, x.outputcase;
Just as a note to avoid confusion, w/ that second column, t is the table alias, p is the column name.
The subquery, which I've assigned an alias of x, finds the max(station) for each unique combination of (frame, outputcase). That is what you want, but the problem does not stop there, you also want column p. The reason that couldn't be selected in the same query is because you would have had to group by it, and you don't want the max(station) for each combination of (frame, outputcase, p). You want the max(station) for each combination of (frame, outputcase).
Because we couldn't get column p in that first step, we have to join back to the original table using the value we obtained (which I've assigned an alias, max_station), and the obvious join conditions of frame and outputcase. So we join back to the original table on those 3 things, 2 of which are fields on the actual table, one of which was calculated in the subquery (max_station).
Because we've joined back to the original table, we can then select column p from the original table.

Takes a bit to return the query, but the result below provides the desired result:
SELECT t1.*
FROM [Element Forces - Frames] as t1
WHERE t1.Station In (SELECT TOP 1 t2.Station
FROM [Element Forces - Frames] as t2
WHERE t2.Frame = t1.Frame
ORDER BY t2.Station DESC)
ORDER BY t1.Frame ASC, t1.OutputCase ASC;
I still want to thank everyone who posted answers. I'm sure it's just syntax errors on my part that I was struggling with.

How do I use the MAX function over three tables?

So, I have a problem with a SQL Query.
It's about getting weather data for German cities. I have 4 tables: staedte (the cities with primary key loc_id), gehoert_zu (contains the city-key and the key of the weather station that is closest to this city (stations_id)), wettermessung (contains all the weather information and the station's key value) and wetterstation (contains the stations key and location). And I'm using PostgreSQL
Here is how the tables look like:
wetterstation
s_id[PK] standort lon lat hoehe
----------------------------------------
10224 Bremen 53.05 8.8 4
wettermessung
stations_id[PK] datum[PK] max_temp_2m ......
----------------------------------------------------
10224 2013-3-24 -0.4
staedte
loc_id[PK] name lat lon
-------------------------------
15 Asch 48.4 9.8
gehoert_zu
loc_id[PK] stations_id[PK]
-----------------------------
15 10224
What I'm trying to do is to get the name of the city with the (for example) highest temperature at a specified date (could be a whole month, or a day). Since the weather data is bound to a station, I actually need to get the station's ID and then just choose one of the corresponding to this station cities. A possible question would be: "In which city was it hottest in June ?" and, say, the highest measured temperature was in station number 10224. As a result I want to get the city Asch. What I got so far is this
SELECT name, MAX (max_temp_2m)
FROM wettermessung, staedte, gehoert_zu
WHERE wettermessung.stations_id = gehoert_zu.stations_id
AND gehoert_zu.loc_id = staedte.loc_id
AND wettermessung.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX (max_temp_2m) DESC
LIMIT 1
There are two problems with the results:
1) it's taking waaaay too long. The tables are not that big (cities has about 70k entries), but it needs between 1 and 7 minutes to get things done (depending on the time span)
2) it ALWAYS produces the same city and I'm pretty sure it's not the right one either.
I hope I managed to explain my problem clearly enough and I'd be happy for any kind of help. Thanks in advance ! :D

If you want to get the max temperature per city use this statement:
SELECT * FROM (
SELECT gz.loc_id, MAX(max_temp_2m) as temperature
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY gz.loc_id) as subselect
INNER JOIN staedte as std
ON std.loc_id = subselect.loc_id
ORDER BY subselect.temperature DESC
Use this statement to get the city with the highest temperature (only 1 city):
SELECT * FROM(
SELECT name, MAX(max_temp_2m) as temp
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
INNER JOIN staedte as std
ON gz.loc_id = std.loc_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX(max_temp_2m) DESC
LIMIT 1) as subselect
ORDER BY temp desc
LIMIT 1
For performance reasons always use explicit joins as LEFT, RIGHT, INNER JOIN and avoid to use joins with separated table name, so your sql serevr has not to guess your table references.

This is a general example of how to get the item with the highest, lowest, biggest, smallest, whatever value. You can adjust it to your particular situation.
select fred, barney, wilma
from bedrock join
(select fred, max(dino) maxdino
from bedrock
where whatever
group by fred ) flinstone on bedrock.fred = flinstone.fred
where dino = maxdino
and other conditions

I propose you use a consistent naming convention. Singular terms for tables holding a single item per row is a good convention. You only table breaking this is staedte. Should be stadt.
And I suggest to use station_id consistently instead of either s_id and stations_id.
Building on these premises, for your question:
... get the name of the city with the ... highest temperature at a specified date
SELECT s.name, w.max_temp_2m
FROM (
SELECT station_id, max_temp_2m
FROM wettermessung
WHERE datum >= '2012-8-1'::date
AND datum < '2012-12-1'::date -- exclude upper border
ORDER BY max_temp_2m DESC, station_id -- id as tie breaker
LIMIT 1
) w
JOIN gehoert_zu g USING (station_id) -- assuming normalized names
JOIN stadt s USING (loc_id)
Use explicit JOIN conditions for better readability and maintenance.
Use table aliases to simplify your query.
Use x >= a AND x < b to include the lower border and exclude the upper border, which is the common use case.
Aggregate first and pick your station with the highest temperature, before you join to the other tables to retrieve the city name. Much simpler and faster.
You did not specify what to do when multiple "wettermessungen" tie on max_temp_2m in the given time frame. I added station_id as tiebreaker, meaning the station with the lowest id will be picked consistently if there are multiple qualifying stations.

SQL conundrum, how to select latest date for part, but only 1 row per part (unique)

I am trying to wrap my head around this one this morning.
I am trying to show inventory status for parts (for our products) and this query only becomes complex if I try to return all parts.
Let me lay it out:
single table inventoryReport
I have a distinct list of X parts I wish to display, the result of which must be X # of rows (1 row per part showing latest inventory entry).
table is made up of dated entries of inventory changes (so I only need the LATEST date entry per part).
all data contained in this single table, so no joins necessary.
Currently for 1 single part, it is fairly simple and I can accomplish this by doing the following sql (to give you some idea):
SELECT TOP (1) ldDate, ptProdLine, inPart, inSite, inAbc, ptUm, inQtyOh + inQtyNonet AS in_qty_oh, inQtyAvail, inQtyNonet, ldCustConsignQty, inSuppConsignQty
FROM inventoryReport
WHERE (ldPart = 'ABC123')
ORDER BY ldDate DESC
that gets me my TOP 1 row, so simple per part, however I need to show all X (lets say 30 parts). So I need 30 rows, with that result. Of course the simple solution would be to loop X# of sql calls in my code (but it would be costly) and that would suffice, but for this purpose I would love to work this SQL some more to reduce the x# calls back to the db (if not needed) down to just 1 query.
From what I can see here I need to keep track of the latest date per item somehow while looking for my result set.
I would ultimately do a
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
to limit the parts I need. Hopefully I made my question clear enough. Let me know if you have an idea. I cannot do a DISTINCT as the rows are not the same, the date needs to be the latest, and I need a maximum of X rows.
Thoughts? I'm stuck...

SELECT *
FROM (SELECT i.*,
ROW_NUMBER() OVER(PARTITION BY ldPart ORDER BY ldDate DESC) r
FROM inventoryReport i
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
)
WHERE r = 1

EDIT: Be sure to test the performance of each solution. As pointed out in this question, the CTE method may outperform using ROW_NUMBER.
;with cteMaxDate as (
select ldPart, max(ldDate) as MaxDate
from inventoryReport
group by ldPart
)
SELECT md.MaxDate, ir.ptProdLine, ir.inPart, ir.inSite, ir.inAbc, ir.ptUm, ir.inQtyOh + ir.inQtyNonet AS in_qty_oh, ir.inQtyAvail, ir.inQtyNonet, ir.ldCustConsignQty, ir.inSuppConsignQty
FROM cteMaxDate md
INNER JOIN inventoryReport ir
on md.ldPart = ir.ldPart
and md.MaxDate = ir.ldDate

You need to join into a Sub-query:
SELECT i.ldPart, x.LastDate, i.inAbc
FROM inventoryReport i
INNER JOIN (Select ldPart, Max(ldDate) As LastDate FROM inventoryReport GROUP BY ldPart) x
on i.ldPart = x.ldPart and i.ldDate = x.LastDate

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas