Supposing I have a table with columns date | group_id | user_id | text, and I would like to get the first 3 texts (by date) of each group_id/user_id pair.
It seems wasteful to query the whole table, say, every 3 hours, since the results are unlikely to change for a given pair once set. I looked at materialized views, but the examples were about single rows, not sets of rows.
Another issue is that the date column does not correspond to the ingestion date; does this mean I have to add an ingestion date column to be able to use @run_time in scheduled queries?
Alternatively, would it be more sensible to load each batch into a separate table, compare it with / update the "first/materialized" table, and only then merge it into the main table? (So instead of running queries on the main table, I would fill the materialized table preemptively at every load.) This looks hacky/wrong though.
The question links to I want a "materialized view" of the latest records and notes that that answer deals with single rows instead of multiple rows; here we want the 3 latest rows per group instead of only one.
For that, look at the inner query in that answer. Instead of doing this:
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)
# change to CURRENT_TIMESTAMP() or let scheduled queries do it
AND datehour > '2000-01-01' # nag
AND wiki='en' AND title LIKE 'A%'
GROUP BY title
)
Do this instead: aggregate the top 3 rows of each group into an array, then UNNEST the array back into rows (just raising the LIMIT while keeping [OFFSET(0)] would still return a single row per group):
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 3) latest_rows
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)
# change to CURRENT_TIMESTAMP() or let scheduled queries do it
AND datehour > '2000-01-01' # nag
AND wiki='en' AND title LIKE 'A%'
GROUP BY title
), UNNEST(latest_rows) latest_row
Re @run_time: you can compare it to any column; just make sure you have a column that makes sense for the logic you want to implement.
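For example, applied to the question's schema, a scheduled query could scan only the newly ingested rows each run. This is a hypothetical sketch: the ingestion_time column and the `project.dataset.texts` table name are assumptions (not from the question), and you would still need to merge the result into, or de-duplicate against, the materialized table:
# Hypothetical sketch: ingestion_time and the table name are assumptions
SELECT first_row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY date ASC LIMIT 3) first_rows
FROM `project.dataset.texts` t
WHERE ingestion_time > TIMESTAMP_SUB(@run_time, INTERVAL 3 HOUR) # only rows loaded since the last run
GROUP BY group_id, user_id
), UNNEST(first_rows) first_row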
Related
I have 32 years of data that I want to put into a partitioned table. However, BigQuery says that I'm going over the limit (4000 partitions).
For a query like:
CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate
AS
SELECT *
FROM `flights.original`
I'm getting an error like:
Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions
How can I get over this limit?
Instead of partitioning by day, you could partition by week/month/year.
In my case each year of data contains around 3 GB, so I'll get the most benefit from clustering if I partition by year.
For this, I'll create a year date column and partition by it:
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`
Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.
Table stats:
Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (year) as a filter:
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
Predicted cost: 83.4 GB
Actual cost: 3.2 GB
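If you also want explicit partition pruning, you can additionally filter on the generated year column; a minimal sketch against the same table (DATE_TRUNC(FlightDate, YEAR) maps every 2008 date to 2008-01-01):
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = DATE '2008-01-01' # prunes to the 2008 partition
AND FlightDate BETWEEN '2008-01-01' AND '2008-01-10'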
As an alternative example, I created a NOAA GSOD summary table clustered by station name; instead of partitioning by day, I didn't partition it at all.
Let's say I want to find the hottest days since 1980 for all stations with a name like 'SAN FRANC%':
SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
Note that I got the results after processing only 55.2MB of data.
The equivalent query on the source tables (without clustering) processes 4GB instead:
# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC
I also added a geo clustered table, to search by location instead of station name. See details here: https://stackoverflow.com/a/34804655/132438
How can we write a query that gets the set of rows that were last updated, grouped by a field name? Example:
no - siteid - status - created_at
The 'no' number increases continuously, as it's a status report provided from another source; insertions happen 24/7 at some interval. But I want to check the status of the 10 sites at the second I run the query.
There are only 10 sites, and I want to generate a report of the status of these 10,
as in, the last inserted row gives us the current status of each site.
I tried this:
SELECT created_at,siteid FROM [TOC].[dbo].[frame1] GROUP BY siteid ORDER BY created_at DESC
But no luck
I'm not actually sure what you want; can you give example input and output data?
For now, here is something that might be useful to you...
WITH
sequenced_data AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY siteid ORDER BY created_at DESC) AS sequence_id,
*
FROM
[TOC].[dbo].[frame1]
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
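If you later need the last few status rows per site rather than just the latest one, the same pattern works by changing the final filter; a sketch under the same assumed schema, with a hypothetical N of 3:
WITH
sequenced_data AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY siteid ORDER BY created_at DESC) AS sequence_id,
*
FROM
[TOC].[dbo].[frame1]
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id <= 3 -- e.g. the 3 most recent status rows per site
ORDER BY
siteid, created_at DESC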
In SQL Server 2008,
I have a table for tracking the status history of actions (STATUS_HISTORY) that has three columns ([ACTION_ID],[STATUS],[STATUS_DATE]).
Each ACTION_ID can have a variable number of statuses and status dates.
I need to convert these rows into columns that preferably look something like this:
[ACTION_ID], [STATUS_1], [STATUS_2], [STATUS_3], [DATE_1], [DATE_2], [DATE_3]
Where the total number of status columns and date columns is unknown, and - of course - DATE_1 correlates to STATUS_1, etc. And I'd like for the status to be in chronological order (STATUS_1 has the earliest date, etc.)
My reason for doing this is so I can put the 10 most recent Statuses on a report in an Access ADP, along with other information for each action. Using a subreport with each status in a new row would cause the report to be far too large.
Is there a way to do this using PIVOT? I don't want to use the date or the status as a column heading.
Is it possible at all?
I have no idea where to even begin. It's making my head hurt.
Let us suppose for brevity that you only want the 3 most recent statuses for each action_id (as in your example).
Then this query using a CTE should do the job:
WITH rownrs AS
(
SELECT
action_id
,status
,status_date
,ROW_NUMBER() OVER (PARTITION BY action_id ORDER BY status_date DESC) AS rownr
FROM
status_history
)
SELECT
s1.action_id AS action_id
,s1.status AS status_1
,s2.status AS status_2
,s3.status AS status_3
,s1.status_date AS date_1
,s2.status_date AS date_2
,s3.status_date AS date_3
FROM
(SELECT * FROM rownrs WHERE rownr=1) AS s1
LEFT JOIN
(SELECT * FROM rownrs WHERE rownr=2) AS s2
ON s1.action_id = s2.action_id
LEFT JOIN
(SELECT * FROM rownrs WHERE rownr=3) AS s3
ON s1.action_id = s3.action_id
NULL values will appear in the rows where the action_id has fewer than 3 statuses.
I haven't had to do it with two columns, but a PIVOT sounds like what you should try. I've done this in the past with dates in a result set where I needed the date in each row to be turned into columns at the top.
http://msdn.microsoft.com/en-us/library/ms177410.aspx
I sympathize with the headache from trying to design and visualize it, but the best thing to do is try getting it working with one of the columns and then go from there. It helps once you start playing with it.
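Since PIVOT gets awkward with two value columns (status and date), here is a hedged alternative sketch using ROW_NUMBER plus conditional aggregation over the question's STATUS_HISTORY table, ordered ascending so STATUS_1 is the earliest as requested; extend the CASE list up to 10 columns for the report:
-- Sketch: assumes the STATUS_HISTORY(ACTION_ID, STATUS, STATUS_DATE) table from the question
WITH rownrs AS
(
SELECT
action_id
,status
,status_date
,ROW_NUMBER() OVER (PARTITION BY action_id ORDER BY status_date ASC) AS rownr
FROM
status_history
)
SELECT
action_id
,MAX(CASE WHEN rownr = 1 THEN status END) AS status_1
,MAX(CASE WHEN rownr = 2 THEN status END) AS status_2
,MAX(CASE WHEN rownr = 3 THEN status END) AS status_3
,MAX(CASE WHEN rownr = 1 THEN status_date END) AS date_1
,MAX(CASE WHEN rownr = 2 THEN status_date END) AS date_2
,MAX(CASE WHEN rownr = 3 THEN status_date END) AS date_3
FROM
rownrs
WHERE
rownr <= 3
GROUP BY
action_id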
I want to group records by hour or day.
Table A has two columns: ID int, record_time datetime.
For example, two records look like:
id record_time
-----------------------
1 2011-01-24 22:14:50
2 2011-01-24 22:14:50
I want to group by hour. I use command:
select *
from A
group by Hour(record_time);
However, it does not output what I want: it only outputs the first record; the second record does not show.
What you call grouping sounds like it's actually sorting. Change GROUP BY to ORDER BY and see if that gets you what you want. If by "group" you actually mean "I want to group the rows together in the result set", then this is what you need, and it is called ordering.
SELECT *
FROM A
GROUP BY DATE_FORMAT(record_time, '%H')
UPDATE
SELECT *
FROM A
ORDER BY DATE_FORMAT(record_time, '%H%Y%m%d')
Try this... just ORDER BY (no grouping)
You should use DATE_FORMAT, like:
group by date_format(record_time, '%Y%m%d%H');
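Note that GROUP BY collapses each hour into a single result row, which is why only one record per hour shows up. If what you want is one summary row per hour, aggregate explicitly; a minimal sketch against table A:
-- count the records in each hour; the '%Y-%m-%d %H' key keeps hours of different days apart
SELECT DATE_FORMAT(record_time, '%Y-%m-%d %H') AS hour_bucket, COUNT(*) AS records_in_hour
FROM A
GROUP BY hour_bucket;
If instead you want every record back with the hours kept together, use ORDER BY rather than GROUP BY.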
I'm grouping some records by their proximity in time. What I do right now (timestamps are in unixtime):
First off, I do a subselect to grab the records that are of interest to me:
(SELECT timestamp AS target_time FROM table WHERE something = cool) AS subselect
Then I want to look at the records that are close in time to those:
SELECT id FROM table, subselect WHERE ABS(target_time - timestamp) < 1800
But here is where I hit my problem. I only want the records where the time difference between the records around the target_time is > 20 mins. To do this, I group by the target_time and add a HAVING section.
SELECT id FROM table, first WHERE ABS(target_time - timestamp) < 3600
GROUP BY target_time HAVING MAX(timestamp) - MIN(timestamp) > 1200
This is great, and all the records I don't like are gone, but now I only have the first id of each group, when I really want all of the ids. I can use GROUP_CONCAT, but that gives me a big mess I can't do any more queries on. What I would really like is to get all of the ids returned from all of these groups that are created. Do I need another SELECT statement? Or is there just a better way to structure what I've got?
Thank you,
A SQL nub.
Let me see if I have your problem correct:
For a given row in a table, you want to know the set of rows for similar records if the range of timestamps for those records is greater than 20 minutes. You want to do this for all ids in the table.
If you simply want a list of ids which fulfil this criteria, it is fairly straightforward:
given a table like:
create table foo (id bigint(4), section VARCHAR(2), modification datetime);
you can do:
select id, foo.section, min_max.min_modification, min_max.max_modification,
  abs(timestampdiff(second, min_max.min_modification, min_max.max_modification)) as diff
from foo,
  (select section, max(modification) max_modification, min(modification) min_modification
   from foo as inner_foo group by section) as min_max
where foo.section = min_max.section
and abs(timestampdiff(second, min_max.min_modification, min_max.max_modification)) > 1200; -- 1200 seconds = 20 minutes, computed in seconds since modification is a datetime
You're doing a subselect based on the 'similar rows' criteria (in this case the column section) to get the minimum and maximum timestamps for that section. This min and max applies to all ids in that section. Hence, for section 'A', you will have a list of ids, same for section 'B'.
My assumption is you want an output that looks like:
id1, timestamp1, fieldA, fieldB
id1, timestamp2, fieldA, fieldB
id2, timestamp3, fieldA, fieldB
id2, timestamp4, fieldA, fieldB
id3, timestamp5, fieldA, fieldB
id3, timestamp6, fieldA, fieldB
but the timestamp for these records is BETWEEN 1200 and 1800 seconds away from a "target_time" where something = cool?
SELECT data.id, data.timestamp, data.fieldA, data.fieldB, ..., data.fieldX
FROM events
JOIN data
WHERE events.something = cool_event -- Gives the 'target_time' of cool_event
AND ABS(events.timestamp - data.timestamp) BETWEEN 1200 AND 1800 -- gives data records 'near' target time, but at least 20 minutes away.
If the 'data' and 'events' tables are the SAME table, then just use table alias names; you can join a table to itself, aka a 'SELF-JOIN'.
SELECT data.id, data.timestamp, data.fieldA, data.fieldB, ..., data.fieldX
FROM events AS target, events AS data
WHERE target.something = cool_event -- gives the 'target_time' of cool_event
AND ABS(target.timestamp - data.timestamp) BETWEEN 1200 and 1800 -- gives data records 'near' target time, but at least 20 minutes away.
This sounds about right, and no GROUP BY or aggregates are needed.
You can order the resulting data if necessary.
-- J Jorgenson --