How to iterate through BigQuery query results and write it to file - google-bigquery

I need to query the Google BigQuery table and export the results to gzipped file.
This is my current code. The requirement is that the each row data should be new line (\n) delemited.
def batch_job_handler(args):
credentials = Credentials.from_service_account_info(service_account_info)
client = Client(project=service_account_info.get("project_id"),
credentials=credentials)
query_job = client.query(QUERY_STRING)
results = query_job.result() # Result's total_rows is 1300000 records
with gzip.open("data/query_result.json.gz", "wb") as file:
data = ""
for res in results:
data += json.dumps(dict(list(res.items()))) + "\n"
break
file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for small number of result but, gets too slow if result has 1300000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n" as I am constructing a huge string by concatenating each records by new line.
As I am running this program in AWS batch, it is consuming too much time. I Need help on iterating over the result and writing to a file for millions of records in a faster way.

Check out the new BigQuery Storage API for quick reads:
https://cloud.google.com/bigquery/docs/reference/storage
For an example of the API at work, see this project:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector
It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Filtering
The new API allows column and limited predicate filtering to only read the data you are interested in.
Column Filtering
Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
Predicate Filtering
The Storage API supports limited pushdown of predicate filters. It supports a single comparison to a literal

You should (in most cases) point your output from BigQuery query to a temp table and export that temp table to a Google Cloud Storage Bucket. From that bucket, you can download stuff locally. This is the fastest route to have results available locally. All else will be painfully slow, especially iterating over the results as BQ is not designed for that.

Related

Architectural design clarrification

I built an API in nodejs+express that allows reactjs clients to upload CSV files(maximum size is atmost 1GB) to the server.
I also wrote another API which when given the filename and row numbers in an array (ie array of row numbers ) as input, it selects the rows corresponding to the row numbers, from the previously stored files and writes it to another result file (writeStream).
Then th resultant file is piped back to the client(all via streaming).
Currently as you see I am using files(basically nodejs' read and write streams) to asynchronously manage this.
But I have faced srious latency (only 2 cores are used) and some memory leak (900mb consumption) when I have 15 requests, each supplying about 600 rows to retrieve from files of size approximately 150mb.
I also have planned an alternate design.
Basically, I will store the entire file as a SQL Table with row numbers as primary indexed key.
I will convert the user inputted array of row numbrs to a another table using sql unnest and then join both these tables to get the rows needed.
Then I will supply back the resultant table as a csv file to the client.
Would this architecture be better than the previous architecture?
Any suggestions from devs is highly appreciated.
Thanks.
Use the client to do all the heavy lifting by using the XLSX package for any manipulation of content. Then have API to save information about the transaction. This will remove upload to server and download from the server and help you provide better experience.

How do I use awswrangler to read only the first few N rows of a parquet file stored in S3?

I am trying to use awswrangler to read into a pandas dataframe an arbitrarily-large parquet file stored in S3, but limiting my query to the first N rows due to the file's size (and my poor bandwidth).
I cannot see how to do it, or whether it is even possible without relocating.
Could I use chunked=INTEGER and abort after reading the first chunk, say, and if so how?
I have come across this incomplete solution (last N rows ;) ) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me and the accepted solution doesn't even get to the end of the story (helpful as it is).
Or is there another way without first downloading the file (which I could probably have done by now)?
Thanks!
You can do that with awswrangler using S3 Select. For example:
import awswrangler as wr
df = wr.s3.select_query(
sql="SELECT * FROM s3object s limit 5",
path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
input_serialization="Parquet",
input_serialization_params={},
use_threads=True,
)
would return 5 rows only from the S3 object.
This is not possible with other read methods because the entire object must be pulled locally before reading it. With S3 select, the filtering is done on the server side instead

Add dataset parameters into column to use them in BigQuery later with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in this BigQuery table (output), I have no means of extracting only rows with for instance a parameter month in it.
How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.
In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there's now also a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage.
There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot, join, or unnest . . . but in your scenario, you could easily bring it into a column and do any needed matching or dropping using it.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here:
https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
There is currently no access to data source location or parameter match values within the flow. Only the data in the dataset is available to you. (except SOURCEROWNUMBER())
Partial Solution
One method I have been using to mimic parameter insertion into the eventual table is to have multiple dataset imports by parameter and then union these before running your transformations into a final table.
For each known parameter search dataset, have a recipe that fills a column with that parameter per dataset and then union the results of each of these.
Obviously, this is only so scalable i.e. it works if you know the set of parameter values that will match. once you get to the granularity of time-stamp in the source file there is no way this is feasible.
In this example just the year value is the filtered parameter.
Longer Solution (An aside)
The alternative to this I eventually skated to was to define dataflow jobs using Dataprep, use these as dataflow templates and then run an orchestration function that ran the dataflow job (not dataprep) and amended the parameters for input AND output via the API. Then there was a transformation BigQuery Job that did the roundup append function.
Worth it if the flow is pretty settled, but not for adhoc; all depends on your scale.

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data into Spark EMR from S3 using:
data = sqlContext.read.json("s3n://.....")
Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:
data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")
Context: The data is not too big, takes a long time to load into Spark and also to query from. I think Spark partitions the data into too many nodes. I want to be able to set that manually. I know when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. Also, I have seen repartition(), but I am not sure if it can solve my problem. The variable data is a DataFrame in my example.
Let me define partition more precisely. Definition one: commonly referred to as "partition key" , where a column is selected and indexed to speed up query (that is not what i want). Definition two: (this is where my concern is) suppose you have a data set, Spark decides it is going to distribute it across many nodes so it can run operations on the data in parallel. If the data size is too small, this may further slow down the process. How can i set that value
By default it partitions into 200 sets. You can change it by using set command in sql context sqlContext.sql("set spark.sql.shuffle.partitions=10");. However you need to set it with caution based up on your data characteristics.
You can call repartition() on dataframe for setting partitions. You can even set spark.sql.shuffle.partitions this property after creating hive context or by passing to spark-submit jar:
spark-submit .... --conf spark.sql.shuffle.partitions=100
or
dataframe.repartition(100)
Number of "input" partitions are fixed by the File System configuration.
1 file of 1Go, with a block size of 128M will give you 10 tasks. I am not sure you can change it.
repartition can be very bad, if you have lot of input partitions this will make lot of shuffle (data traffic) between partitions.
There is no magic method, you have to try, and use the webUI to see how many tasks are generated.

BigQuery Table Data Export

I am trying to export data from BigQuery Table using python api. Table contains 1 to 4 million of rows. So I have kept maxResults parameter to maximum i.e. 100000 and then paging through. But problem is that in One page I am getting 2652 rows only so number of paging is too much. Can anyone provide reason for this or solution to deal. Format is JSON.
Or can I export data into CSV format without using GCS?
I tried by inserting job and keeping allowLargeResults =true, but the result remain same.
Below is my query body :
queryData = {'query':query,
'maxResults':100000,
'timeoutMs':'130000'}
Thanks in advance.
You can try to export data from table without using GCS by using bq command line tool https://cloud.google.com/bigquery/bq-command-line-tool like this:
bq --format=prettyjson query --n=10000000 "SELECT * from publicdata:samples.shakespeare"
You can use --format=json depending on your needs as well.
Actual page size is determined not by row count, but rather by size of those rows in given page. I think it is something around 10MB
User can alsoset maxResults to limit rows in page in addition to above criteria