BigQuery full table to partition - google-bigquery

I have a 340 GB of data in one table (270 days worth of data). Now planning move this data to partition table.
That means I will have 270 partitions. What is the best way to move this data to partition table.
I dont want to run 270 queries which is very costly operation. So looking for optimized solution.
I have multiple tables like this. I need to migrate all these tables to partition tables.
Thanks,

I see three options
Direct Extraction out of original table:
Actions (how many queries to run) = Days [to extract] = 270
Full Scans (how much data scanned measured in full scans of original table) = Days = 270
Cost, $ = $5 x Table Size, TB xFull Scans = $5 x 0.34 x 270 = $459.00
Hierarchical(recursive) Extraction: (described in Mosha’s answer)
Actions = 2^log2(Days) – 2 = 510
Full Scans = 2*log2(Days) = 18
Cost, $ = $5 x Table Size, TB xFull Scans = $5 x 0.34 x 18 = $30.60
Clustered Extraction: (I will describe it in a sec)
Actions = Days + 1 = 271
Full Scans = [always]2 = 2
Cost, $ = $5 x Table Size, TB xFull Scans = $5 x 0.34 x 2 = $3.40
Summary
Method Actions Total Full Scans Total Cost
Direct Extraction 270 270 $459.00
Hierarchical(recursive) Extraction 510 18 $30.60
Clustered Extraction 271 2 $3.40
Definitely, for most practical purposes Mosha’s solution is way to go (I use it in most such cases)
It is relatively simple and straightforward
Even though you need to run query 510 times – the query is "relatively" simple and orchestration logic is simple to implement with whatever client you usually use
And cost save is quite visible!
From $460 down to $31!
Almost 15 times down!
In case if you -
a) want to lower cost even further for yet another 9 times (so it will be total x135 times lower)
b) and like having fun and more challenges
- take a look at third option
“Clustered Extraction” Explanation
Idea / Goal:
Step 1
We want to transform original table into another [single] table with 270 columns – one column for one day
Each column will hold one serialized row for respective day from original table
Total number of rows in this new table will be equal to number of rows for most "heavy" day
This will require just one query (see example below) with one full scan
Step 2
After this new table is ready – we will be extracting day-by-day querying ONLY respective column and write into final daily table (schema of daily table are the very same as original table’s schema and all those tables could be pre-created)
This will require 270 queries to be run with scans approximately equivalent (this really depends on how complex your schema, so can vary) to one full size of original table
While querying column – we will need to de-serialize row’s value and parse it back to original scheme
Very simplified example: (using BigQuery Standard SQL here)
The purpose of this example is just to give direction if you will find idea interesting for you
Serialization / de-serialization is extremely simplified to keep focus on idea and less on particular implementation which can be different from case to case (mostly depends on schema)
So, assume original table (theTable) looks somehow like below
SELECT 1 AS id, "101" AS x, 1 AS ts UNION ALL
SELECT 2 AS id, "102" AS x, 1 AS ts UNION ALL
SELECT 3 AS id, "103" AS x, 1 AS ts UNION ALL
SELECT 4 AS id, "104" AS x, 1 AS ts UNION ALL
SELECT 5 AS id, "105" AS x, 1 AS ts UNION ALL
SELECT 6 AS id, "106" AS x, 2 AS ts UNION ALL
SELECT 7 AS id, "107" AS x, 2 AS ts UNION ALL
SELECT 8 AS id, "108" AS x, 2 AS ts UNION ALL
SELECT 9 AS id, "109" AS x, 2 AS ts UNION ALL
SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
SELECT 20 AS id, "120" AS x, 4 AS ts
Step 1 – transform table and write result into tempTable
SELECT
num,
MAX(IF(ts=1, ser, NULL)) AS ts_1,
MAX(IF(ts=2, ser, NULL)) AS ts_2,
MAX(IF(ts=3, ser, NULL)) AS ts_3,
MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
SELECT
ts,
CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,
ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
FROM theTable
)
GROUP BY num
tempTable will look like below:
num ts_1 ts_2 ts_3 ts_4
1 1|101|1 6|106|2 10|110|3 19|119|4
2 2|102|1 7|107|2 11|111|3 20|120|4
3 3|103|1 8|108|2 12|112|3 null
4 4|104|1 9|109|2 13|113|3 null
5 5|105|1 null 14|114|3 null
6 null null 15|115|3 null
7 null null 16|116|3 null
8 null null 17|117|3 null
9 null null 18|118|3 null
Here, I am using simple concatenation for serialization
Step 2 – extracting rows for specific day and write output to respective daily table
Please note: In below example - we extracting rows for ts = 2 : this corresponds to column ts_2
SELECT
r[OFFSET(0)] AS id,
r[OFFSET(1)] AS x,
r[OFFSET(2)] AS ts
FROM (
SELECT SPLIT(ts_2, "|") AS r
FROM tempTable
WHERE NOT ts_2 IS NULL
)
The result will look like below (which is expected):
id x ts
6 106 2
7 107 2
8 108 2
9 109 2
I wish I had more time for this to write down, so don’t judge to heavy if something missing – this is more directional answer - but at the same time example is pretty reasonable and if you have plain simple schema – almost no extra thinking is required. Of course with records, nested stuff in schema - most challenging part is serialization / de-serialization – but that’s where fun is – along with extra $saving

I will add another fourth option to #Mikhail's answer
DML QUERY
Action = 1 query to run
Full scans = 1
Cost = $5 x 0.34 = 1.7$ (x270 times cheaper than solution #1 \o/)
With the new DML feature of BiQuery you can convert a none partitioned table to a partitioned one while doing only one full scan of the source table
To illustrate my solution I will use one of BQ's public tables, namely bigquery-public-data:hacker_news.comments. below is the tables schema
name | type | description
_________________________________
id | INTGER | ...
_________________________________
by | STRING | ...
_________________________________
author | STRING | ...
_________________________________
... | |
_________________________________
time_ts | TIMESTAMP | human readable timestamp in UTC YYYY-MM-DD hh:mm:ss /!\ /!\ /!\
_________________________________
... | |
_________________________________
We are going to partition the comments table based on time_ts
#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS
SELECT *
FROM `bigquery-public-data:hacker_news.comments`
I hope it helps :)

If your data was in sharded tables (i.e. with YYYYmmdd suffix), you could've used "bq partition" command. But with data in a single table - you will have to scan it multiple times applying different WHERE clauses on your partition key column.
The only optimization I can think of is to do it hierarchically, i.e. instead of 270 queries which will do 270 full table scans - first split table in half, then each half in half etc. This way you will need to pay for 2*log_2(270) = 2*9 = 18 full scans.
Once the conversion is done - all the temporary tables can be deleted to eliminate extra storage costs.

Related

Group data series into variable width windows based on first event

I have computational task which can be reduced to the follow problem:
I have a large set of pairs of integers (key, val) which I want to group into windows. The first window starts with the first pair p ordered by key attribute and spans all the pairs where p[i].key belongs to [p[0].key; p[0].key + N), with some arbitrary integer N, positive and common to all windows.
The next window starts with the first pair ordered by key not included in the previous windows and again spans all the pairs from its key to key + N, and so on for the following windows.
The last step is to sum second attribute for each window and display it together with the first key of the window.
For example, given list of records with values:
key
val
1
3
2
7
5
1
6
4
7
1
10
3
13
5
and N=3, the windows would be:
{(1,3),(2,7)},
{(5,1),(6,4),(7,1)},
{(10,3)}
{(13,5)}
The final result:
key
sum_of_values
1
10
5
6
10
3
13
5
This is easy to program with a standard programming language but I have no clue how to solve this with SQL.
Note: If clickhouse doesn't support the RECURSIVE keyword, just remove that keyword from the expression.
Clickhouse seems to use non-standard syntax for the WITH clause. The below uses standard SQL. Adjust as needed.
Sorry. clickhouse may not support this approach. If not, we would need to find another method of walking through the data.
Standard SQL:
There are a few ways. Here's one approach. First assign row numbers to allow recursively stepping through the rows. We could use LEAD as well.
Assign a group (key value) to each row based on the current key and the last group/key value and whether they are within some distance (N = 3, in this case).
The last step is to just SUM these values per group start_key and to use the start_key value as the starting key in each group.
WITH RECURSIVE nrows (xkey, val, n) AS (
SELECT xkey, val, ROW_NUMBER() OVER (ORDER BY xkey) FROM test
)
, cte (xkey, val, n, start_key) AS (
SELECT xkey, val, n, xkey FROM nrows WHERE n = 1
UNION ALL
SELECT t1.xkey, t1.val, t1.n
, CASE WHEN t1.xkey <= t2.start_key + (3-1) THEN t2.start_key ELSE t1.xkey END
FROM nrows AS t1
JOIN cte AS t2
ON t2.n = t1.n-1
)
SELECT start_key
, SUM(val) AS sum_values
FROM cte
GROUP BY start_key
ORDER BY start_key
;
Result:
+-----------+------------+
| start_key | sum_values |
+-----------+------------+
| 1 | 10 |
| 5 | 6 |
| 10 | 3 |
| 13 | 5 |
+-----------+------------+

Seperating a Oracle Query with 1.8 million rows into 40,000 row blocks

I have a project where I am taking Documents from one system and importing them into another.
The first system has the documents and associated keywords stored. I have a query that will return the results which will then be used as the index file to import them into the new system. There are about 1.8 million documents involved so this means 1.8 million rows (One per document).
I need to divide the returned results into blocks of 40,000 to make importing them in batches of 40,000 at a time, rather than one long import.
I have the query to return the results I need. Just need to know how to take that and break it up for easier import. My apologies if I have included to little information. This is my first time here asking for help.
Use the built-in function ORA_HASH to divide the rows into 45 buckets of roughly the same number of rows. For example:
select * from some_table where ora_hash(id, 44) = 0;
select * from some_table where ora_hash(id, 44) = 1;
...
select * from some_table where ora_hash(id, 44) = 44;
The function is deterministic and will always return the same result for the same input. The resulting number starts with 0 - which is normal for a hash, but unusual for Oracle, so the query may look off-by-one at first. The hash works better with more distinct values, so pass in the primary key or another unique value if possible. Don't use a low-cardinality column, like a status column, or the buckets will be lopsided.
This process is in some ways inefficient, since you're re-reading the same table 45 times. But since you're dealing with documents, I assume the table scanning won't be the bottleneck here.
A prefered way to bucketing the ID is to use the NTILE analytic function.
I'll demonstrate this on a simplified example with a table with 18 rows that should be divided in four chunks.
select listagg(id,',') within group (order by id) from tab;
1,2,3,7,8,9,10,15,16,17,18,19,20,21,23,24,25,26
Note, that the IDs are not consecutive, so no arithmetic can be used - the NTILE gets the parameter of the requested number of buckets (4) and calculates the chunk_id
select id,
ntile(4) over (order by ID) as chunk_id
from tab
order by id;
ID CHUNK_ID
---------- ----------
1 1
2 1
3 1
7 1
8 1
9 2
10 2
15 2
16 2
17 2
18 3
19 3
20 3
21 3
23 4
24 4
25 4
26 4
18 rows selected.
All but the last bucket are of the same size, the last one can be smaller.
If you want to calculate the ranges - use simple aggregation
with chunk as (
select id,
ntile(4) over (order by ID) as chunk_id
from tab)
select chunk_id, min(id) ID_from, max(id) id_to
from chunk
group by chunk_id
order by 1;
CHUNK_ID ID_FROM ID_TO
---------- ---------- ----------
1 1 8
2 9 17
3 18 21
4 23 26

Create SQL query with dynamic WHERE statement

I'm using Postgresql as my database, in case that's helpful, although I'd like to find a pure SQL approach instead of a Postgresql specific implementation.
I have a large set of test data obtained from manufacturing a piece of electronics and I'd like to take that set of data and extract from it which units met certain criteria during test, ideally using a separate table that contains the test criteria from each step of manufacturing.
As a simple example, let's say I check the temperature readback from the unit in two different steps of the test. In step 1, the temperature should be in the range of 20C-30C while step 2 should be in the range of 50C-60C.
Let's assume the following table structure with a set of example data (table name 'test_data'):
temperature step serial_number
25 1 1
55 2 1
19 1 2
20 2 2
and let's assume the following table that contains the above mentioned pass criteria (table name 'criteria'):
temperature_upper temperature_lower step
20 30 1
50 60 2
At the moment, using a static approach, I can just use the following query:
SELECT * FROM test_data WHERE
( test_data.step = 1 AND test_data.temperature > 20 AND test_data.temperature < 30 ) OR
( test_data.step = 2 AND test_data.temperature > 50 AND test_data.temperature < 60 );
which would effectively yield the following table:
temperature step serial_number
25 1 1
55 2 1
I'd like to make my select query more dynamic and instead of begin statically defined, make it construct itself off of a list of results from the test_criteria table. The hope is to grow this into a complex query where temperature, voltage and current might be checked in step 1 but only current in step 2, for example.
Thanks for any insight!
You can solve using a join between the tables
SELECT t.*
FROM test_data t
INNER JOIN criteria c ON t.step = c.step
AND t.temperature > c.temperature_upper
AND t.temperature < c.temperature_lower
OR if you want >= and <=
SELECT t.*
FROM test_data t
INNER JOIN criteria c ON t.step = c.step
AND t.temperature netween c.temperature_upper AND c.temperature_lower

PostgreSQL efficiently find last decendant in linear list

I currently try to retrieve the last decendet efficiently from a linked list like structure.
Essentially there's a table with a data series, with certain criteria I split it up to get a list like this
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(ruid, rdid, rstart_ts, rend_ts) AS ( -- recursive Query to traverse the "linked list" of continuous timestamps
SELECT start_ts, end_ts FROM track_caps tc
UNION
SELECT c.rstart_ts, tc.end_ts AS end_ts0 FROM contract c INNER JOIN track_caps tc ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( --final step, after traversing the "linked list", pick the largest timestamp found as the end_ts and the smallest as the start_ts
SELECT DISTINCT ON(start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
FROM (
SELECT rstart_ts, max(rend_ts) AS rend_ts FROM contract
GROUP BY rstart_ts
) sq
GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4
Just for illustration, the produced result set by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, it's quite a memory hog however which is absolutely unsurprising when looking at the results of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. Now it was actually supposed to process around 2,000,000 rows but with that solution right now it seems very unfeasible.
Did I just improperly use recursive queries? Is there a way to reduce the amount of data it produces?(like removing the sub-steps?)
Or are there better single-query solutions to this problem?
The main problem is that your recursive query doesn't properly filter the root nodes which is caused by the the model you have. So the non-recursive part already selects the entire table and then Postgres needs to recurse for each and every row of the table.
To make that more efficient only select the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least makes sure the recursion doesn't need to process more rows then necessary.
To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
union
select c.current_id, c.next_id, p.root_id
from track_caps c
join contract p on c.current_id = p.next_id
and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823

Aggregation over order-dependent partition?

I have a source data set like this (simplified to be more clear):
Key F1 F2
1 X 4
2 X 5
3 Y 6
4 X 9
5 X 7
6 X 8
7 Y 9
8 X 6
9 X 5
10 Y 3
The data is sorted by the Key field. Now, I want to compute an aggregate of the F2 field over partitions that are defined by the F1 field: A partition starts at the first X value and ends with the first subsequent Y value.
So, for example, I might want wo compute the MIN() over the partitions defined as described above. Then the result set would look like this:
rownum MIN(F2)
1 4
2 7
3 3
I have tried a number of resources (incl. our own intranet community and of course stackoverflow) but found nothing for my case. Usually partitioning only works with a field that can be used to identify the partitions. Here, the partitions are defined by a change in a field's content with respect to a given order.
Although I am aware that I may have to resort to writing a procedural solution I would prefer to solve this in pure SQL.
Any ideas how such a partitioning could be achieved with a SQL select statement?
Thanks and regards
Kai.
A little bit shorter solution: http://sqlfiddle.com/#!12/7390d/24
Query:
select min(f2)
from t t1
group by (select max(key)
from t t2
where t2.f1='Y' and
t1.key > t2.key)
Result:
| MIN |
-------
| 4 |
| 7 |
| 3 |
The idea is to find the key of preceding 'Y' for each row and group by it. Should work with any SQL engine.
You didn't specify engine or dialect or version so I assumed SQL Server 2012.
Example that you can run to see the solution: http://sqlfiddle.com/#!6/f5d38/21
You solve it by creating correct partitions in your set. Code looks like this.
WITH groupLimits as
(
SELECT
[Key] AS groupend
,COALESCE(LAG([Key]) OVER (order by [Key]),0)+1 AS groupstart
FROM sourceData
WHERE F1 = 'Y'
)
SELECT
MIN(sourceData.F2)
FROM groupLimits
INNER JOIN sourceData
ON sourceData.[Key] BETWEEN groupLimits.groupstart and groupLimits.groupend
GROUP BY groupLimits.groupstart
ORDER BY groupLimits.groupstart