BigQuery Table Data Export - google-bigquery

I am trying to export data from a BigQuery table using the Python API. The table contains 1 to 4 million rows, so I have set the maxResults parameter to its maximum, i.e. 100000, and I am paging through the results. The problem is that I am getting only 2652 rows per page, so the number of pages is too large. Can anyone explain why this happens, or suggest a way to deal with it? The format is JSON.
Alternatively, can I export the data in CSV format without using GCS?
I tried inserting a job with allowLargeResults=true, but the result remained the same.
Below is my query body :
queryData = {'query': query,
             'maxResults': 100000,
             'timeoutMs': '130000'}
Thanks in advance.

You can try to export data from the table without using GCS by using the bq command-line tool (https://cloud.google.com/bigquery/bq-command-line-tool), like this:
bq --format=prettyjson query --n=10000000 "SELECT * from publicdata:samples.shakespeare"
You can use --format=json depending on your needs as well.

The actual page size is determined not by row count but rather by the total size of the rows in a given page; I think it is somewhere around 10 MB.
You can also set maxResults to limit the number of rows per page, in addition to the size criterion above.
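For completeness, here is a rough paging sketch against the REST API with google-api-python-client; the project_id, query, and credential setup are placeholders rather than anything from the question, and the loop simply follows pageToken until it runs out:

from googleapiclient.discovery import build

# Assumes application default credentials; project_id and query are placeholders.
service = build('bigquery', 'v2')

query_data = {'query': query, 'maxResults': 100000, 'timeoutMs': '130000'}
response = service.jobs().query(projectId=project_id, body=query_data).execute()
job_id = response['jobReference']['jobId']

rows = list(response.get('rows', []))
page_token = response.get('pageToken')
while page_token:
    # Each page is capped by response size (roughly 10 MB), not only by maxResults.
    page = service.jobs().getQueryResults(projectId=project_id,
                                          jobId=job_id,
                                          pageToken=page_token,
                                          maxResults=100000).execute()
    rows.extend(page.get('rows', []))
    page_token = page.get('pageToken')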

Related

Is BQ billing different when using BigQuery pagination and when using LIMIT and OFFSET?

I have a BQ query which returns around 25K records as of now, and I want to apply pagination to it to fetch the results in chunks of 1000.
Based on this link, I can add BQ pagination by setting the maxResults property and using pageToken to get the next set of results.
I know that I could add my own logic and use LIMIT and OFFSET, but as per the documentation the billing covers all the rows read, not only the number of rows that are fetched:
Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota.
But I can't find any documentation that clearly states whether using the maxResults property will actually reduce the amount of data I'm billed for.
Can anyone please advise? Thanks in advance.
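For reference, the maxResults/pageToken pattern under discussion looks roughly like this with the google-cloud-bigquery client; the query, page size, and process() handler are placeholders, and this sketch by itself says nothing about billing:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

query_job = client.query("SELECT * FROM `my-project.my_dataset.my_table`")  # placeholder query
# page_size maps to maxResults on the underlying API calls.
row_iterator = query_job.result(page_size=1000)

for page in row_iterator.pages:
    for row in page:
        process(row)  # placeholder for your own handling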

How can I optimize my varchar(max) column?

I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was given advice to store the profile image in the database. This seemed OK and worked fine, but now that I'm dealing with real data and querying more rows, the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images, I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?
It takes time to return data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data if retrieval time is so important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
That said, you should not be using SELECT * unless you specifically want the image column. If you are selecting it in places where it is not necessary, removing it can improve performance. If you want to make this safe for other users, you can add a view without the image column and encourage them to use the view.
If it is still possible to take a step back, drop the idea of storing images in the table. Instead, save the path in the DB and the image in a folder. This is the most efficient approach.
SELECT *
FROM userProfile
ORDER BY id
Do not use *, and why are you using ORDER BY? You can do the ordering in the UI code.

How to iterate through BigQuery query results and write it to file

I need to query a Google BigQuery table and export the results to a gzipped file.
This is my current code. The requirement is that each row of data should be newline (\n) delimited.
import gzip
import json

from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials

# service_account_info and QUERY_STRING are defined elsewhere in the module.

def batch_job_handler(args):
    credentials = Credentials.from_service_account_info(service_account_info)
    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)
    query_job = client.query(QUERY_STRING)
    results = query_job.result()  # Result's total_rows is 1300000 records
    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
        file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for a small number of results, but it gets too slow if the result has 1300000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n", since I am constructing a huge string by concatenating each record with a newline?
As I am running this program in AWS Batch, it is consuming too much time. I need help iterating over the result and writing it to a file in a faster way for millions of records.
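As a point of comparison, a minimal sketch that avoids accumulating the whole payload in one Python string (reusing the results iterator from the code above) streams each serialized row straight into the gzip file:

import gzip
import json

# Open the gzip file in text mode and write one JSON line per row as it arrives.
with gzip.open("data/query_result.json.gz", "wt", encoding="utf-8") as file:
    for res in results:
        file.write(json.dumps(dict(res.items())) + "\n")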
Check out the new BigQuery Storage API for quick reads:
https://cloud.google.com/bigquery/docs/reference/storage
For an example of the API at work, see this project:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector
It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Filtering
The new API allows column and limited predicate filtering to only read the data you are interested in.
Column Filtering
Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
Predicate Filtering
The Storage API supports limited pushdown of predicate filters; it supports a single comparison to a literal. A minimal read sketch follows below.
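Here is a minimal read sketch using the Storage API through the standard Python clients; QUERY_STRING is a placeholder, the exact client class names have shifted between library versions, and to_dataframe() additionally needs pandas and pyarrow installed:

from google.cloud import bigquery
from google.cloud import bigquery_storage

bq_client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

# Run the query as usual, then let the result download go through the
# Storage API, which streams binary blocks instead of paging JSON responses.
query_job = bq_client.query(QUERY_STRING)
df = query_job.result().to_dataframe(bqstorage_client=bqstorage_client)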
You should (in most cases) point the output of your BigQuery query to a temp table and export that temp table to a Google Cloud Storage bucket. From that bucket, you can download the files locally. This is the fastest route to having results available locally. Everything else will be painfully slow, especially iterating over the results, as BQ is not designed for that.
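A rough sketch of that flow with the Python client library; the project, dataset, table, and bucket names are placeholders, not anything from the question:

from google.cloud import bigquery

client = bigquery.Client()

# 1) Route the query results into a destination table.
dest = bigquery.TableReference.from_string("my-project.my_dataset.query_results_tmp")
query_config = bigquery.QueryJobConfig(destination=dest, write_disposition="WRITE_TRUNCATE")
client.query(QUERY_STRING, job_config=query_config).result()

# 2) Export that table to GCS as gzipped newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(dest,
                     "gs://my-bucket/query_result-*.json.gz",
                     job_config=extract_config).result()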

Google Bigquery results

I am getting only part of the result from the BigQuery API.
Earlier, I solved the issue of 100,000 records per result using iterators.
However, now I'm stuck at some other obstacle.
If I take more than 6-7 columns in a result, I do not get the complete set of results.
However, if I take a single column, I get the complete result.
Can there be a size limit as well for results in the BigQuery API?
There are some limits for query jobs.
In particular:
Maximum response size — 128 MB compressed
Of course, it is unlimited when writing large query results to a destination table (and then reading from there), as in the sketch below.
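A hedged sketch of that workaround with the Python client; the table names and handle() function are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Write the wide result set to a destination table so the 128 MB
# limit on the query response itself no longer applies.
dest = bigquery.TableReference.from_string("my-project.my_dataset.wide_results")
job_config = bigquery.QueryJobConfig(destination=dest, write_disposition="WRITE_TRUNCATE")
client.query("SELECT * FROM `my-project.my_dataset.source_table`", job_config=job_config).result()

# Read the table back page by page; reading from a table is not subject
# to the 128 MB query response limit mentioned above.
for row in client.list_rows(dest, page_size=10000):
    handle(row)  # placeholder for your own processing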

Tables truncated when using bigquery api (buffer size issue?)

I am running the following:
bq query --format=csv 'SELECT GKGRECORDID, DATE, SourceCommonName, DocumentIdentifier, V2Persons, V2Tone, TranslationInfo, from [gdelt-bq:gdeltv2.gkg_partitioned] where V2Persons like "%Orban%" and _PARTITIONTIME >= TIMESTAMP("2016-11-09") and _PARTITIONTIME < TIMESTAMP("2016-11-11")' > outputfile.csv
This should return a table with roughly 1000 rows (which I get when I use the normal BigQuery interface in the browser). However, when I run this using the API, it only returns 100.
It seems like an issue with the size of the buffer, but I thought I'd ask if there is something that could be done on the BigQuery side (for example, a way to send the query output in several chunks) to remedy this.
Thanks!
On the command line you can specify how many rows should be returned; it defaults to a maximum of 100.
bq query -n 1500
Please be aware that the maximum response size is 128 MB compressed, regardless of the number of rows requested.