Partition by week/month/quarter/year to get over the partition limit? - google-bigquery

I have 32 years of data that I want to put into a partitioned table. However, BigQuery says that I'm going over the limit (4000 partitions).
For a query like:
CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate
AS
SELECT *
FROM `flights.original`
I'm getting an error like:
Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions
How can I get over this limit?

Instead of partitioning by day, you could partition by week, month, or year.
In my case each year of data contains around 3 GB, so I'll get the most benefit from clustering if I partition by year.
For this, I'll create a year date column and partition by it:
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`
Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.
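If weekly or monthly granularity fits your data better, the same pattern applies. As a sketch (the destination table name here is only an illustration), truncating to MONTH instead of YEAR would give roughly 12 partitions per year, so about 384 partitions for 32 years, well under the limit:
CREATE TABLE `fh-bigquery.flights.ontime_201903_monthly`  # hypothetical destination table
PARTITION BY FlightDate_month
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, MONTH) FlightDate_month
FROM `fh-bigquery.flights.raw_load_fixed`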
Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (year) as a filter:
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
Predicted cost: 83.4 GB
Actual cost: 3.2 GB

As an alternative example, I created a NOAA GSOD summary table clustered by station name, and instead of partitioning by day, I didn't partition it at all.
Let's say I want to find the hottest days since 1980 for all stations with a name like 'SAN FRANC%':
SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
Note that I got the results after processing only 55.2MB of data.
The equivalent query on the source tables (without clustering) processes 4GB instead:
# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC
I also added a geo clustered table, to search by location instead of station name. See details here: https://stackoverflow.com/a/34804655/132438
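As a rough sketch of what a location-based query could look like with BigQuery's geography functions (the table and point column names below are assumptions, not the exact schema of that geo table):
SELECT name, state,
  ARRAY_AGG(STRUCT(date, temp) ORDER BY temp DESC LIMIT 5) top_hot
FROM `fh-bigquery.weather_gsod.all_geoclustered`  # hypothetical geo-clustered table
WHERE date > '1980-01-01'
AND ST_DWITHIN(point, ST_GEOGPOINT(-122.42, 37.77), 50000)  # within 50 km of San Francisco
GROUP BY name, state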

Related

Materialized view of first rows

Supposing I have a table with columns date | group_id | user_id | text, and I would like to get the first 3 texts (by date) of each group_id/user_id pair.
It seems wasteful to query the whole table, say, every 3 hours, as the results are unlikely to change for a given pair once set, so I looked at materialized views, but the examples were about single rows, not sets of rows.
Another issue is that the date column does not correspond to the ingestion date; does this mean that I have to add an ingestion date column to be able to use @run_time in scheduled queries?
Alternatively, would it be more sensible to load each batch into a separate table, compare it with / update the "first/materialized" table, and then merge it into the main table? (So instead of running queries on the main table, the materialized table is filled preemptively at every load.) This looks hacky/wrong though.
The question links to I want a "materialized view" of the latest records, which deals with a single row per group; here the goal is the 3 latest rows instead of only one.
For that, look at the inner query in that answer. Instead of doing this:
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)
  # change to CURRENT_TIMESTAMP() or let scheduled queries do it
  AND datehour > '2000-01-01' # nag
  AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
)
Do this:
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 3) latest_rows
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)
  # change to CURRENT_TIMESTAMP() or let scheduled queries do it
  AND datehour > '2000-01-01' # nag
  AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
), UNNEST(latest_rows) latest_row
Re @run_time - you can compare it to any column; just make sure to have a column that makes sense for the logic you want to implement.
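For example, a scheduled query along these lines would only scan recently ingested rows (the dataset, table, and ingestion_time column names are assumptions based on the question, not existing names):
SELECT group_id, user_id,
  ARRAY_AGG(STRUCT(date, text) ORDER BY date ASC LIMIT 3) first_texts
FROM `your_project.your_dataset.main_table`
WHERE ingestion_time > TIMESTAMP_SUB(@run_time, INTERVAL 3 HOUR)
GROUP BY group_id, user_id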

Hive groupby slower with partition

I partition my data in Hive based on a column value (date), so each date has its own directory in /warehouse. Right now I have about 240 dates, and a total of 70 million records evenly distributed across dates.
I also created another table containing the same data, but without partitions.
When I query both tables with the same queries, the partitioned table does not always outperform the unpartitioned one. More specifically, the partitioned table is slower when executing a query with GROUP BY.
select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27';
This took 22.146 seconds, and the count is 7427366.
select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27';
This took 22.723 seconds, and also returns 7427366 for count.
However, when GROUP BY is added, the partitioned table performs worse than the un-partitioned one.
select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;
This took 39.733 seconds and 26,724 rows were returned.
select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;
This took 76.648 seconds and 26,724 rows were returned.
Why is the partitioned table slower in this scenario?
EDIT
This is how I create my partitioned table:
CREATE TABLE all_ads_from_csv_partitioned3(
id STRING,
...
)
PARTITIONED BY(datedecoded STRING)
STORED AS ORC;
Under /warehouse/tablespace/managed/hive/partitioned_table/ there are 240 directories (240 partitions), each of the form /warehouse/tablespace/managed/hive/partitioned_table/dated='the partitioned date', and each partition contains roughly 10 buckets.

Optimize Postgres TOP-n query

Table with two columns (transaction_id, user_id), both with index. Approx 10M records in table.
transaction_id is unique
The transaction_id count per user_id varies from very few to thousands.
What I need is to find, per user, the max(transaction_id) after ignoring that user's top 25 transaction_ids (ordered descending).
E.g. a user_id with 21 transaction_ids will not be selected; a user_id with 47 transactions will return its 26th-highest transaction_id.
I have tried several ways using OFFSET, LIMIT etc., but they all seem to be too slow (very high cost).
You can use a window function, i.e.:
select user_id, nth_value(transaction_id, 26) over (
  partition by user_id order by transaction_id desc
  rows between unbounded preceding and unbounded following
) as transaction_id_26
from your_table;
That should be plenty.
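Note that nth_value() emits one row per transaction; if you want a single row per qualifying user, one alternative is a per-user lateral lookup that skips the top 25 via the index. A sketch, assuming an index on (user_id, transaction_id):
select u.user_id, t.transaction_id as transaction_id_26
from (select distinct user_id from your_table) u
cross join lateral (
  select transaction_id
  from your_table
  where user_id = u.user_id
  order by transaction_id desc
  offset 25 limit 1
) t;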

SQL Huge Read Only Table Performance Filter and Ordering

I have a table with 1 billion rows that holds possible solutions to a goal setting program.
The combination of each column's value creates a successful goal path. I want to filter records to show the top 10 rows that are ordered by the choice of the user. Someone may want the lowest possible retirement age, then lowest deposit amount. Someone else may want the highest possible survival chance, then highest ending balance, ...
Here are my columns:
age tinyint
retirement_age tinyint
retirement_length tinyint
survival smallint
deposit int
balance_start int
balance_end int
SLOW 10 MIN QUERY:
select top(10) age,retirement_age,retirement_length,survival,deposit,balance_start,balance_end
from TABLE
where
age >= 30
and survival >= 8000 --OUT OF 10000
and balance_start <= 20000
and retirement_age >= 60
and retirement_age <= 75
and retirement_length >= 10
and retirement_length <= 25
and deposit >= 1000
and deposit <= 20000
ORDER BY -- (COLUMN ORDER PREFERENCES UNKNOWN)
retirement_age,
deposit,
retirement_length desc,
balance_end desc,
age desc,
survival desc
That query takes 10 min.
All of the records are generated once, so there is no more writing/updating to the database. I was thinking I should index each column, but have not done so. The database is 30GB right now, but space is not an issue.
I have run the Estimated Execution plan:
select: 0%
parallelism: 0%
sort: 23%
table scan: 77%
Have you tried creating an index like
CREATE INDEX IX_TABLE ON [TABLE]
(age,survival,balance_start,retirement_age,retirement_length,deposit)
INCLUDE (balance_end)
The order of the index fields (age,survival,balance_start,retirement_age,retirement_length,deposit) will make a difference if not all the fields are used in the WHERE clause, so make sure to put them in order of most used.
Also, the order of the included columns does not make any difference.
Seeing as the table values will not change, you can create more than one such index to improve the performance of other queries that do not use all the fields in the WHERE clause.
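For instance, a second index leading with the retirement columns might serve queries that filter mostly on those (the index name and column order here are only an illustration):
CREATE INDEX IX_TABLE_RETIREMENT ON [TABLE]
(retirement_age, retirement_length, deposit, age, survival, balance_start)
INCLUDE (balance_end)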
I ended up creating separate indexes on each of the columns in my WHERE and ORDER BY clauses, using the sort directions from the ORDER BY:
CREATE INDEX IX_age ON TABLE (age desc)
CREATE INDEX IX_retirement_age ON TABLE (retirement_age)
CREATE INDEX IX_retirement_length ON TABLE (retirement_length desc)
CREATE INDEX IX_survival ON TABLE (survival desc)
CREATE INDEX IX_deposit ON TABLE (deposit)
CREATE INDEX IX_balance_start ON TABLE (balance_start)
CREATE INDEX IX_balance_end ON TABLE (balance_end desc)

Optimizing SQL Server query / table

I have a database table that receives close to 1 million inserts a day and needs to be searchable for at least a year. Big hard drive, lots of data, and not that great hardware to put it on either.
The table looks like this:
id     | tag_id | value | time
-------|--------|-------|--------------------
279571 | 55     | 0.57  | 2013-06-18 12:43:22
...
tag_id might be something like AmbientTemperature or AmbientHumidity and the time is captured when the reading is taken from the sensor.
I'm querying this table in a reporting format. I want to see all data for tags 1, 55, 72, and 4 between 2013-11-01 and 2013-11-28 at 1-hour intervals.
SELECT time, tag_id, tag_name, value, friendly_name
FROM (
SELECT time, tag_name, tag_id, value,friendly_name,
ROW_NUMBER() over (partition by tag_id,datediff(hour, 0, time)/1 order by time desc) as seqnum
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE (tag_id = 1 OR tag_id = 55 OR tag_id = 72 OR tag_id = 4)
AND time >= '2013-11-1' AND time < '2013-11-28'
) k
WHERE seqnum = 1
ORDER BY time
Can I optimize this table or my query at all? How should I set up my indexes?
It's pretty slow with a table size of 100 million + rows. It can take several minutes to get a data set of 7 days at an hourly interval with 3 tags in the query.
Filtering on the result of the row number function will make the query painfully slow. Also it will prevent optimal index use.
If your primary reporting need is hourly information you might want to consider storing which rows are the first sensor reading for a tag in a specific hour.
ALTER TABLE tag_values ADD IsHourlySensorReading BIT NULL;
In an hourly process, you calculate this column for new rows.
DECLARE @CalculateFrom DATETIME = (SELECT MIN(time) FROM tag_values WHERE IsHourlySensorReading IS NULL);
SET @CalculateFrom = dateadd(hour, datediff(hour, 0, @CalculateFrom), 0);
UPDATE k
SET IsHourlySensorReading = CASE seqnum WHEN 1 THEN 1 ELSE 0 END
FROM (
  SELECT id, IsHourlySensorReading,
         row_number() over (partition by tag_id, datediff(hour, 0, time)/1 order by time desc) as seqnum
  FROM tag_values tv
  WHERE tv.time >= @CalculateFrom
  AND tv.IsHourlySensorReading IS NULL
) as k
Your reporting query then becomes much simpler:
SELECT time, tag_id, tag_name, value, friendly_name
FROM (
SELECT time, tag_name, tag_id, value,friendly_name
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE (tag_id = 1 OR tag_id = 55 OR tag_id = 72 OR tag_id = 4)
AND time >= '2013-11-1' AND time < '2013-11-28'
AND IsHourlySensorReading=1
) k
ORDER BY time;
The following index will help when calculating the IsHourlySensorReading column. But remember, indexes will also cause your million inserts per day to take more time. Test thoroughly!
CREATE NONCLUSTERED INDEX tag_values_ixnc01 ON tag_values (time, IsHourlySensorReading) WHERE (IsHourlySensorReading IS NULL);
Use this index for reporting if you need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc02 ON tag_values (time, tag_id, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Use this index for reporting if you don't need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc02 ON tag_values (tag_id, time, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Some additional things to consider:
Is ORDER BY time really required?
Table partitioning can seriously improve both insert and query performance. Depending on your situation I would partition on either tag_id or date.
Instead of creating a column with an IsHourlySensorReading indicator, you can also create a separate table/database for specific reporting requirements and only load the relevant data into that.
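A minimal sketch of that last idea, assuming a dedicated hourly reporting table (the table name and the @LastLoadedHour watermark variable are hypothetical):
CREATE TABLE tag_values_hourly (
  tag_id INT NOT NULL,
  time DATETIME NOT NULL,
  value FLOAT NOT NULL
);
-- hourly load: keep only the latest reading per tag per hour since the last load
INSERT INTO tag_values_hourly (tag_id, time, value)
SELECT tag_id, time, value
FROM (
  SELECT tag_id, time, value,
         ROW_NUMBER() OVER (PARTITION BY tag_id, datediff(hour, 0, time)
                            ORDER BY time DESC) AS seqnum
  FROM tag_values
  WHERE time >= @LastLoadedHour  -- hypothetical watermark from the previous run
) k
WHERE seqnum = 1;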
I'm not an expert on SQL Server, but I would seriously consider setting this up as a partitioned table. This would also make archiving easier, as partitions could simply be dropped (rather than an expensive DELETE FROM ... WHERE ...).
Also, with a bit of luck, the optimiser will only look in the partitions required for the data.
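A hedged sketch of what table partitioning could look like here (the partition function/scheme names and boundary values are assumptions; note that in SQL Server old partitions are usually switched out and truncated rather than literally dropped):
-- hypothetical monthly partition function and scheme
CREATE PARTITION FUNCTION pf_tag_values_monthly (datetime)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');
CREATE PARTITION SCHEME ps_tag_values_monthly
AS PARTITION pf_tag_values_monthly ALL TO ([PRIMARY]);
-- move the table onto the scheme via its clustered index
-- (add WITH (DROP_EXISTING = ON) if a clustered index already exists)
CREATE CLUSTERED INDEX cix_tag_values_time ON tag_values (time)
ON ps_tag_values_monthly (time);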