What is the max limit of group_concat/string_agg in bigquery output? - google-bigquery

I am using GROUP_CONCAT/STRING_AGG on string (VARCHAR-like) data and want to ensure that BigQuery won't drop any of the concatenated data.

BigQuery will not drop data if a particular query runs out of memory; you will get an error instead. You should try to keep your row sizes below ~100MB, since beyond that you'll start getting errors. You can try creating a large string with an example like this:
#standardSQL
SELECT STRING_AGG(word) AS words FROM `bigquery-public-data.samples.shakespeare`;
There are 164,656 rows in this table, and this query creates a string with 1,168,286 characters (around a megabyte in size). You'll start to see an error if you run a query that requires more than something on the order of hundreds of megabytes on a single node of execution, though:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus)) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
This results in an error:
Resources exceeded during query execution.
If you click on the "Explanation" tab in the UI, you can see that the failure happened during stage 1 while building the results of STRING_AGG. In this case, the string would have been 3,303,599,000 characters long, or approximately 3.3 GB in size.
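If you want a rough idea of how close a STRING_AGG result will get to that limit before running it, one option (a sketch using the same public table; it ignores the default ',' delimiter that STRING_AGG inserts between values) is to sum the input lengths first:
#standardSQL
SELECT SUM(LENGTH(CONCAT(word, corpus))) AS total_chars
FROM `bigquery-public-data.samples.shakespeare`;
Multiplying that total by the 1,000 copies from the CROSS JOIN gives roughly the 3.3 billion characters mentioned above.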

Adding to Elliot's answer - how to fix:
This query (Elliot's) fails:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus)) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
But you can LIMIT the number of strings concatenated to get a working solution:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus) LIMIT 10) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
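To confirm how big the limited result actually is, a quick check (a sketch along the same lines) is to wrap the aggregate in LENGTH:
#standardSQL
SELECT LENGTH(STRING_AGG(CONCAT(word, corpus) LIMIT 10)) AS words_length
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
With the LIMIT in place the result stays tiny regardless of how many rows feed the aggregation.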

Related

How to Generate incredibly large arrays in sql

When trying to generate a large array using the following command
GENERATE_ARRAY(1467331200, 1530403201, 15)
I'm getting the following error:
google.api_core.exceptions.BadRequest: 400 GENERATE_ARRAY(1467331200, 1530403201, 15) produced too many elements
Is there a way to generate an array of said size?
There is a limit on the number of elements GENERATE_ARRAY can produce: up to 1,048,575.
Test: bq query --dry_run --nouse_legacy_sql "[replace query below]"
Query: select GENERATE_ARRAY(1, 1048575) as test_array;
Output: Query successfully validated. Assuming the tables are not modified, running this query will process 0 bytes of data.
Query: select GENERATE_ARRAY(1, 1048576) as test_arr;
Output: GENERATE_ARRAY(1, 1048576, 1) produced too many elements
There's no mention of this limit in the documentation, so I suggest that you either send documentation feedback on that page or file a feature request to increase the limit (or, if possible, remove it).
A possible workaround is to build the array in pieces and concatenate them.
Example: SELECT ARRAY_CONCAT(GENERATE_ARRAY(1,1048575), GENERATE_ARRAY(1,1048575))...
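For the original range (1467331200 to 1530403201 with a step of 15), one way to apply that workaround is to generate the array in chunks of at most 1,000,000 elements and glue them together with ARRAY_CONCAT_AGG. A sketch, assuming the element limit applies per GENERATE_ARRAY call:
#standardSQL
SELECT ARRAY_CONCAT_AGG(
  GENERATE_ARRAY(chunk_start, LEAST(chunk_start + 15 * 999999, 1530403201), 15)
) AS big_array
FROM UNNEST(GENERATE_ARRAY(1467331200, 1530403201, 15 * 1000000)) AS chunk_start;
Each chunk covers 1,000,000 values of the sequence, and consecutive chunk starts are spaced 15 * 1,000,000 apart, so the step of 15 is preserved across chunk boundaries.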

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin and I would like to extract the first transaction of each block that was mined. In other words, I want to replace a nested field by its first row, as it appears in the table preview. There is no field that identifies it, only the order in which it was stored in the table.
I ran the following query:
#standardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple; I must be missing something. Do you have an idea about how to handle such a task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it, so your dev/test work costs less. The query below can help with this - you can run it in the BigQuery UI with a destination table, which you will then use for your dev work. Make sure you set Allow Large Results and unset Flatten Results so you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value of 1529518619028 was taken from the query below (at the time of running). The reason I went four days back is that I know the number of rows in this table at that time was just 912, versus the current 528,858.
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
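If you prefer standard SQL for computing that millisecond timestamp (four days back), something like this should be equivalent; note that the table snapshot decorator itself still requires legacy SQL, so the decorator query above stays as-is:
#standardSQL
SELECT UNIX_MILLIS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 96 HOUR)) AS snapshot_ms;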
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#standardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still contains some nested data, which you might want to flatten to its first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000
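One hedged variation: if any block could have an empty transactions array, OFFSET(0) would raise an error for that row, whereas SAFE_OFFSET(0) returns NULL instead. A sketch:
#standardSQL
SELECT timestamp
  , block_id
  , transactions[SAFE_OFFSET(0)].transaction_id first_transaction_id
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10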

Why does selecting more result fields double the data scanned in BigQuery?

I have a table with 2 integer fields, x and y, and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
It scans more than double of the original data scanned.
I would expect it will first find the relevant rows based on where clause and then bring the extra field without scanning the entire second field.
Any inputs on why it doubles the data scanned and how to avoid it will be appreciated.
In my application I have hundred of possible fields which I need to fetch for a very small number of rows (50) which answer the query.
Does this means I will need to processed all fields data?
* I'm aware how columnar database works, but wasn't aware for the huge price when you want to brings lots of fields based on a very specific where clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery does not have a concept of an index or anything like that. When you query a column, BigQuery scans through all the values of that column and then performs the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ reads through all values of x and y and then finds where x = 1.
This ends up being an amazing feature of BQ: you just load your data there and it just works. It does force you to be aware of how much data each query retrieves, though. Queries of the type select * from table should be used only if you really need all columns.
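One thing that can reduce bytes scanned for this access pattern is to materialize a clustered copy of the table, so BigQuery can prune blocks that cannot match the filter. A sketch (the dataset and table names are placeholders; it's worth checking the current clustering docs, since the dry-run estimate is an upper bound and the savings show up in the bytes actually billed):
#standardSQL
CREATE TABLE `myproject.Test.Test_clustered`
CLUSTER BY x
AS SELECT x, y FROM `myproject.Test.Test`;

SELECT x, y FROM `myproject.Test.Test_clustered` WHERE x = 1 LIMIT 50;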

Hive collect list is not working with millions of records

I have a Hive query that uses collect_list in the outer query. The inner query produces an ordered list of 1.8 million records. Every time I run the query, 500-600 records come out wrong and lose their order in a pattern. I also tried the Brickhouse jar with its collect UDF, and it gives the same result, with 500-600 records off. I don't have any clue how to debug this.
select concat_ws('','',collect_list(host)),
concat_ws('','',collect_list(cast(total_data_volume_host as string))),
concat_ws('','',collect_list(cast(event_duration_host as string))),
concat_ws('','',collect_list(application_name)),
concat_ws('','',collect_list(cast(total_data_volume_app as string))),
concat_ws('','',collect_list(cast(event_duration_app as string))),
Using ORDER BY in a sub-query does not guarantee an ordered array.
We are going to use sort_array for that.
concat_ws works only with arrays of strings, so you are casting your values to string before using collect_list.
The problem now is that the natural order of the elements is changed -
100 > 20 but '20' > '100' (alphabetic order).
The workaround is to lpad the values with spaces, sort, concat, and then remove the spaces.
with t as (select explode(array(100,20,3)) as i)
select translate(concat_ws(',',sort_array(collect_list(lpad(cast(i as string),10,' ')))),' ','')
from t;
3,20,100
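Applied to one of the numeric columns from the question, the same pattern could look like this (a sketch; the table name my_events, the grouping key subscriber_id, and the pad width are assumptions to adjust to your data):
select subscriber_id,
       translate(concat_ws(',', sort_array(collect_list(
                 lpad(cast(total_data_volume_host as string), 20, ' ')))), ' ', '') as volumes
from my_events
group by subscriber_id;
If several collected columns need to stay aligned in the same order, prefix each value with the same padded sort key before sort_array instead of sorting each column independently.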

LIMIT in FoxPro

I am attempting to pull a lot of data from a FoxPro database, work with it, and insert it into a MySQL DB. It is too much to do all at once, so I want to do it in batches of, say, 10,000 records. What is the equivalent of LIMIT 5, 10 in FoxPro SQL? I would like a select statement like
select name, address from people limit 5, 10;
i.e. only get 10 results back, starting at the 5th. I have looked around online and found only mentions of TOP, which is obviously not of much use.
Take a look at the RecNo() function.
FoxPro does not have direct support for a LIMIT clause. It does have "TOP nn" but that only provides the "top-most records" within a given percentage, and even that has a limitation of 32k records returned (maximum).
You might be better off dumping the data as a CSV, or if that isn't practical (due to size issues), writing a small FoxPro script that auto-generates a series of BEGIN-INSERT(x10000)-COMMIT statements that dump to a series of text files. Of course, you would need a FoxPro development environment for this, so this may not apply to your situation...
Visual FoxPro does not support LIMIT directly.
I used the following query to get over the limitation:
SELECT TOP 100 * from PEOPLE WHERE RECNO() > 1000 ORDER BY ID;
where 100 is the limit and 1000 is the offset.
It is very easy to get around the missing LIMIT clause using the TOP clause; if you want to extract from record _start to record _finish from a table named _test, you can do:
* assuming _start <= _finish; if not, you get a TOP clause error
_finish = MIN(RECCOUNT('_test'), _finish)

SELECT * ;
    FROM (SELECT TOP (_finish - _start + 1) * ;
              FROM (SELECT TOP _finish *, RECNO() AS _tempo ;
                        FROM _test ;
                        ORDER BY _tempo) xx ;
          ORDER BY _tempo DESC) yy ;
    ORDER BY _tempo
I had to convert a FoxPro database to MySQL a few years ago. What I did to solve this was add an auto-incrementing id column to the FoxPro table and use that as the row reference.
So then you could do something like:
select name, address from people where id >= 5 and id <= 10;
The FoxPro SQL documentation does not show anything similar to LIMIT.
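With that id column in place, the 10,000-row batches from the question could be pulled with TOP plus a moving lower bound, e.g. (a sketch; curBatch and m.lnLastId are placeholder names for the cursor and the last id loaded so far):
SELECT TOP 10000 name, address, id ;
    FROM people ;
    WHERE id > m.lnLastId ;
    ORDER BY id ;
    INTO CURSOR curBatch
After loading each batch into MySQL, set m.lnLastId to the largest id in curBatch and repeat.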
Here, adapt this to your tables. Took me like 2 minutes; I do this way too often.
N1 - groups by whatever you need, and makes sure you get a Max(id); you can use RECNO() to make one, sorted correctly.
N2 - joins N1 where the ID equals the max id from N1, and exposes the field you want from N2.
Then if you want to join to other tables, put all of that in brackets, give it an alias, and include it in a join.
Select N1.reference, N1.OrderNoteCount, N2.notes_desc LastNote
FROM
(select reference, count(reference) OrderNoteCount, Max(notes_key) MaxNoteId
from custnote
where reference != ''
Group by reference
) N1
JOIN
(
select reference, count(reference) OrderNoteCount, notes_key, notes_desc
from custnote
where reference != ''
Group by reference, notes_key, notes_desc
) N2 ON N1.MaxNoteId = N2.notes_key
To expand on Eyvind's answer, I would create a program that uses the RecNo() function to pull records within a given range, say 10,000 records at a time.
You could then programmatically cycle through the large table in chunks of 10,000 records and perform your data load into your MySQL database.
By using the RecNo() function you can be certain not to insert rows more than once, and you will be able to restart at a known point in the data load process. That by itself can be very handy in the event you need to stop and restart the load.
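For example, batch n could be fetched like this (a sketch; the names are placeholders, and the RECNO() ranges stay stable only as long as the source table is not packed or reordered between runs):
SELECT name, address, RECNO() AS source_recno ;
    FROM people ;
    WHERE RECNO() > (m.lnBatch - 1) * 10000 ;
      AND RECNO() <= m.lnBatch * 10000 ;
    INTO CURSOR curBatch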
Depending on the number of returned rows, and if you are using the .NET Framework, you can offset/limit the resulting DataTable in the following way:
dataTable = dataTable.AsEnumerable().Skip(offset).Take(limit).CopyToDataTable();
Remember to add a reference to the System.Data.DataSetExtensions assembly.