Read BigQuery streaming (real-time) API - google-bigquery

I have a BigQuery data warehouse which gets its data from Google Analytics.
The data is streamed in real time.
Now I want to get this data as it arrives in BigQuery (and not afterwards) using its API.
I have seen the API which lets you query the data after it is saved into BigQuery,
for example:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    SELECT name, SUM(number) as total_people
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name, state
    ORDER BY total_people DESC
    LIMIT 20
"""
query_job = client.query(query)  # Make an API request.

print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    print("name={}, count={}".format(row[0], row["total_people"]))
Is there any way to "listen" to the data as it arrives and store some of it in the cloud,
rather than let it be saved and then query it from BigQuery?
Thanks

There is not currently a streaming read mechanism for accessing managed data in BigQuery; existing mechanisms leverage some form of snapshot-like consistency at a given point in time (tabledata.list, storage API read, etc).
Given that your data is already automatically delivered into BigQuery, the next best thing is likely a delta strategy where you read periodically with a filter (recent data filtered by a timestamp, etc.).
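For illustration, a minimal polling sketch of such a delta strategy, assuming a hypothetical events table with an event_timestamp column (adjust the names to your Google Analytics export schema):

from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and timestamp column; adjust to your export schema.
QUERY = """
    SELECT *
    FROM `my-project.analytics_dataset.events`
    WHERE event_timestamp > @last_seen
    ORDER BY event_timestamp
"""

def fetch_new_rows(last_seen):
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("last_seen", "TIMESTAMP", last_seen)
        ]
    )
    return list(client.query(QUERY, job_config=job_config).result())

# Poll periodically, remembering the high-water mark between runs.
last_seen = datetime.now(timezone.utc) - timedelta(minutes=1)
new_rows = fetch_new_rows(last_seen)

Each run should then advance last_seen to the newest event_timestamp it has processed, so rows are handled only once.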

Related

How to iterate through BigQuery query results and write it to file

I need to query a Google BigQuery table and export the results to a gzipped file.
This is my current code. The requirement is that each row should be newline (\n) delimited.
import gzip
import json

from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials

def batch_job_handler(args):
    credentials = Credentials.from_service_account_info(service_account_info)
    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)

    query_job = client.query(QUERY_STRING)
    results = query_job.result()  # results.total_rows is 1300000 records

    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
        file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for a small number of results, but it gets too slow when the result has 1300000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n", as I am constructing a huge string by concatenating each record with a newline?
As I am running this program in AWS Batch, it is consuming too much time. I need help iterating over the result and writing it to a file for millions of records in a faster way.
Check out the new BigQuery Storage API for quick reads:
https://cloud.google.com/bigquery/docs/reference/storage
For an example of the API at work, see this project:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector
It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Filtering
The new API allows column and limited predicate filtering to only read the data you are interested in.
Column Filtering
Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
Predicate Filtering
The Storage API supports limited pushdown of predicate filters: a single comparison to a literal.
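As a rough sketch of a direct Storage API read with column and predicate filtering, assuming the google-cloud-bigquery-storage client library and a placeholder billing project my-project:

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

# Public table used purely as an example.
table = "projects/bigquery-public-data/datasets/usa_names/tables/usa_1910_2013"

requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["name", "number"],      # column filtering
        row_restriction='state = "TX"',          # predicate filtering
    ),
)

session = client.create_read_session(
    parent="projects/my-project",  # the project billed for the read
    read_session=requested_session,
    max_stream_count=1,
)

# Rows stream back directly in Avro; no temporary files in GCS.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row["name"], row["number"])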
You should (in most cases) point the output of your BigQuery query to a temp table and export that temp table to a Google Cloud Storage bucket. From that bucket, you can download the files locally. This is the fastest route to having results available locally. Everything else will be painfully slow, especially iterating over the results, as BQ is not designed for that.
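A hedged sketch of that route with the Python client (the destination table and bucket names are placeholders, and QUERY_STRING is the query from the question): run the query into a destination table, then export it as gzipped newline-delimited JSON to GCS:

from google.cloud import bigquery

client = bigquery.Client()

dest_table = "my-project.my_dataset.query_result_tmp"  # placeholder

# 1. Run the query into a destination (temp) table.
query_config = bigquery.QueryJobConfig(destination=dest_table,
                                       write_disposition="WRITE_TRUNCATE")
client.query(QUERY_STRING, job_config=query_config).result()

# 2. Export the table to GCS as gzipped newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig()
extract_config.destination_format = "NEWLINE_DELIMITED_JSON"
extract_config.compression = "GZIP"

# The wildcard lets BigQuery shard large exports across multiple files.
client.extract_table(dest_table,
                     "gs://my-bucket/query_result_*.json.gz",
                     job_config=extract_config).result()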

Export Bigquery Logs

I want to analyze the activity on BigQuery during the past month.
I went to the cloud console and the (very inconvenient) log viewer. I set up exports to BigQuery, and now I can run queries on the logs and analyze the activity. There is even a very convenient guide here: https://cloud.google.com/bigquery/audit-logs.
However, all this only helps to look at data collected from now on. I need to analyze the past month.
Is there a way to export existing logs (rather than new ones) to BigQuery (or to a flat file and later load them into BQ)?
Thanks
While you cannot "backstream" BigQuery's logs of the past, there is something you can still do, depending on what kind of information you're looking for. If you need information about query jobs (job stats, config, etc.), you can call the Jobs: list method of the BigQuery API to list all jobs in your project. The data is preserved there for 6 months and, if you're the project owner, you can list the jobs of all users, regardless of who actually ran them.
If you don't want to code anything, you can even use API Explorer to call the method, save the output as a JSON file and then load it back into a BigQuery table.
Sample code to list jobs with the BigQuery API: it requires some modification, but it should be fairly easy to get done.
You can use the Jobs: list API to collect job info and upload it to GBQ.
Since it is in GBQ, you can analyze it any way you want using the power of BigQuery.
You can either flatten the result or use the original - I recommend using the original, as it is less of a headache since there is no transformation before loading to GBQ (you literally upload whatever you got from the API). Of course, all this goes in a simple app/script that you still have to write.
Note: make sure you use the full value for the projection parameter.
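For reference, a minimal sketch of that approach with the Python client (the destination table my-project.audit.jobs and the selected fields are placeholders; pick whichever job fields you need from the objects the client returns):

import json

from google.cloud import bigquery

client = bigquery.Client()

# Collect job metadata for the whole project (listing other users' jobs
# requires sufficient permissions); the API keeps roughly six months of history.
rows = []
for job in client.list_jobs(all_users=True, max_results=10000):
    rows.append({
        "job_id": job.job_id,
        "user_email": job.user_email,
        "job_type": job.job_type,
        "state": job.state,
        "created": job.created.isoformat() if job.created else None,
    })

# Write newline-delimited JSON and load it back into a BigQuery table.
with open("jobs.json", "w") as f:
    for r in rows:
        f.write(json.dumps(r) + "\n")

load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
with open("jobs.json", "rb") as f:
    client.load_table_from_file(f, "my-project.audit.jobs",
                                job_config=load_config).result()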
I was facing the same problem when I found an article which describes how to inspect BigQuery using INFORMATION_SCHEMA, without any script or the Jobs: list method mentioned in the other answers.
I was able to run it and got it working.
# Monitor Query costs in BigQuery; standard-sql; 2020-06-21
# #see http://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
DECLARE timezone STRING DEFAULT "Europe/Berlin";
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
  DATE(creation_time, timezone) creation_date,
  FORMAT_TIMESTAMP("%F %H:%M:%S", creation_time, timezone) AS query_time,
  job_id,
  ROUND(total_bytes_processed / gb_divisor, 2) AS bytes_processed_in_gb,
  IF(cache_hit != true, ROUND(total_bytes_processed * cost_factor, 4), 0) AS cost_in_dollar,
  project_id,
  user_email,
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
  DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
ORDER BY
  bytes_processed_in_gb DESC
Credits: https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/

Data streaming insertAll API usage not equal to actually inserted rows

We are using google-php-client-api in order to stream web sites' page view logs into a table with 9 columns
(composed of basic data types such as
cookieid(string),
domain(string),
site_category(string),
site_subcategory(string),
querystring(string),
connectiontime(timestamp),
flag(boolean),
duration(integer),
remoteip(string))
After 10 hours of running the scripts, we observed that the BigQuery API usage (for insertAll methods) reached 300K, but during that time only 35K rows were recorded in the table...
When we looked at the Google Cloud console, approximately 299K of these 300K API calls returned success codes; what I mean is that the streaming seemed to work well.
What we don't understand is: after 299K successful requests, how can only 35K rows have been inserted into the table?
Is this a problem caused by google-php-client-api, or has BigQuery just not yet saved the sent data to the table?
If the latter, how much time do we need before we see all of the rows sent to BigQuery?
Code used for streaming data:
$rows = array();
$data = json_decode($rawjson);
$row = new Google_Service_Bigquery_TableDataInsertAllRequestRows();
$row->setJson($data);
$row->setInsertId(strtotime('now'));
$rows[0] = $row;
$req = new Google_Service_Bigquery_TableDataInsertAllRequest();
$req->setKind('bigquery#tableDataInsertAllRequest');
$req->setRows($rows);
$this->service->tabledata->insertAll($projectid, $datasetid, $tableid, $req);
Thank you in advance,
Cihan
We resolved this issue.
We saw that it was caused by this line of code:
$row->setInsertId(strtotime('now'));
As we have at least 10-20 requests per second, and this "insertId" sent to BigQuery depends on the current timestamp (with one-second resolution), BigQuery was keeping only 1 request per second and treating all of the other requests as duplicates instead of saving them to the table.
We removed this line, and now the numbers are coherent.
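For comparison, a minimal sketch of the same fix with the Python client, google-cloud-bigquery (the table name and row fields are placeholders); the idea is the same in PHP: give setInsertId a value that is unique per row, such as a UUID, instead of a per-second timestamp:

import uuid

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.pageview_logs"  # placeholder

rows = [
    {"cookieid": "abc", "domain": "example.com", "duration": 12},
]

# One UUID per row: each row gets a distinct insertId, so BigQuery's
# best-effort deduplication no longer collapses rows arriving in the same second.
errors = client.insert_rows_json(table_id, rows,
                                 row_ids=[str(uuid.uuid4()) for _ in rows])
if errors:
    raise RuntimeError(errors)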

Datamodel design for an application using Redis

I am new to Redis and I am trying to figure out how Redis can be used.
So please let me know if this is the right way to build an application.
I am building an application which has only one data source. I am planning to run a job on a nightly basis to get the data into a file.
Now I have a front-end application that needs to render this data in different formats.
Example application use case
Download processed applications by a university on a nightly basis.
Display how many applications got approved or rejected.
Display number of applications by state.
Let user search for an application by application id.
Instead of using a relational database like Postgres/MySQL, I am thinking about using Redis. I am planning to store the data in the following ways.
Application id -> Application details
State -> List of application ids
Approved -> List of application ids (By date ?)
Declined -> List of application ids (By date ?)
Is this the correct way to store data in Redis?
Also, if someone queries for all applications in California on a certain date,
I will be able to pull the application ids in one call, but to get the details for each application, do I need to make another request?
Word of caution:
Instead of using a relational database like Postgres/MySQL, I am thinking about using Redis.
Why? Redis is an amazing database, but don't use the right hammer for the wrong nail. Use Redis if you need real-time performance at scale, but don't try to make it replace an RDBMS if that's what you actually need.
Answer:
Fetching data efficiently from Redis to answer your queries depends on how you'll be storing it. Therefore, to determine the "correct" data model, you first need to define your queries. The data model you proposed is just a description of the data - it doesn't really say how you're planning to store it in Redis. Without more details about the queries, I would store the data as follows:
Store the application details in a Hash (e.g. app:<id>)
Store the per-state application IDs in a Set (e.g. apps:<state>)
Store the approved/rejected applications in two Sorted Sets, the id being the member and the date being the score (see the sketch after this list)
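A minimal redis-py sketch of that model, with hypothetical field names, to show how the Hash and the indexes fit together:

import redis

r = redis.Redis()

# Hypothetical application record.
app = {"id": "42", "state": "CA", "status": "approved", "decided_at": 1658877709}

# Details in a Hash, keyed by application id.
r.hset(f"app:{app['id']}", mapping={"state": app["state"], "status": app["status"]})

# Per-state index as a Set.
r.sadd(f"apps:{app['state']}", app["id"])

# Approved index as a Sorted Set scored by decision date (same idea for declined).
r.zadd("apps:approved", {app["id"]: app["decided_at"]})

# Query: approved applications between two dates, then fetch each one's details.
ids = r.zrangebyscore("apps:approved", 1658800000, 1658900000)
details = [r.hgetall(f"app:{i.decode()}") for i in ids]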
Also, if someone queries for all applications in California on a certain date, I will be able to pull the application ids in one call, but to get the details for each application, do I need to make another request?
Again, that depends on the data model but you can use Lua scripts to embed this logic and execute it in one call to the database.
First of all, you can use a Hash to store structured data. With Sorted Sets (ZSets) and Sets you can create indexes for ordered or unordered access. (Depending on your requirements, of course. Make a list of how you want to access your data.)
It is possible to get all the data of an index as JSON in one go with a simple Redis script (example using an unordered Set):
local bulkToTable = function(bulk)
  local retTable = {};
  for index = 1, #bulk, 2 do
    local key = bulk[index];
    local value = bulk[index+1];
    retTable[key] = value;
  end
  return retTable;
end

local functionSet = redis.call("SMEMBERS", "app:functions")
local returnObj = {};
for index = 1, #functionSet, 1 do
  returnObj[index] = bulkToTable(redis.call("HGETALL", "app:function:" .. functionSet[index]));
  returnObj[index]["functionId"] = functionSet[index];
end
return cjson.encode(returnObj);
More information about Redis scripts can be found here: http://www.redisgreen.net/blog/intro-to-lua-for-redis-programmers/
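To invoke that script from an application, a short redis-py sketch (the file name is hypothetical; you could equally inline the script as a string or call EVAL directly):

import redis

r = redis.Redis()

# Load the Lua script shown above; it takes no KEYS or ARGV.
with open("get_all_functions.lua") as f:
    get_all_functions = r.register_script(f.read())

payload = get_all_functions()  # the cjson-encoded string built by the script
print(payload)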

How to add timestamp column when loading file to table

I'm loading batch files to a table.
I want to add a timestamp column to the table so I can know the insertion time
of each record. I'm loading in append mode, so not all records are inserted at the same time.
Unfortunately, I didn't find a way to do it in BigQuery. When loading a file to a table, I didn't find an option to pad the insertion with additional columns. I just want to calculate the timestamp in my code and put it in as a constant field for the whole insertion process.
The solution I'm using now is to load into a temp table and then query that table plus a new timestamp field into the target table. It works, but it's an extra step; I have multiple loads, and the full process takes too much time due to the latency of that extra step.
Does anyone know of another solution with only one step?
That's a great feature request for https://code.google.com/p/google-bigquery/issues/list. Unfortunately, there is no automated way to do it today. I like the way you are doing it though :)
If you are willing to make a new table to house this information, I recommend creating it with the following setting:
a table with the _PARTITIONTIME field based on insertion time
If you make a table using the default _PARTITIONTIME (ingestion-time) partitioning, it records exactly what you are asking for, based on the time of insertion.
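A hedged sketch with the Python client (the table name and schema are placeholders): create the table with ingestion-time partitioning, then read the insertion time back from the _PARTITIONTIME pseudo-column. Note that _PARTITIONTIME is truncated to the partition boundary (e.g. the day), not the exact insert timestamp.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # placeholder

schema = [bigquery.SchemaField("name", "STRING")]
table = bigquery.Table(table_id, schema=schema)

# Ingestion-time partitioning: BigQuery stamps each row with its load time,
# exposed through the _PARTITIONTIME / _PARTITIONDATE pseudo-columns.
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
client.create_table(table)

# Later, the insertion day is available without adding any column to the files:
query = "SELECT _PARTITIONTIME AS inserted_at, name FROM `{}`".format(table_id)
for row in client.query(query).result():
    print(row["inserted_at"], row["name"])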
You can add a timestamp column/value using a Pandas DataFrame:
from datetime import datetime
import pandas as pd
from google.cloud import bigquery
insertDate = datetime.utcnow()
bigqueryClient = bigquery.Client()
tableRef = bigqueryClient.dataset("dataset-name").table("table-name")
dataFrame = pd.read_json("file.json")
dataFrame['insert_date'] = insertDate
bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, tableRef)
bigqueryJob.result()
You can leverage the "hive partitioning" functionality of BigQuery load jobs to accomplish this. This feature is normally used for "external tables" where the data just sits in GCS in carefully-organized folders, but there's no law against using it to import data into a native table.
When you write your batch files, include your timestamp as part of the path. For example, if your timestamp field is called "added_at" then write your batch files to gs://your-bucket/batch_output/added_at=1658877709/file.json
Load your data with the hive partitioning parameters so that the "added_at" value comes from the path instead of from the contents of your file. Example:
bq load --source_format=NEWLINE_DELIMITED_JSON \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://your-bucket/batch_output/ \
  dataset-name.table-name \
  "gs://your-bucket/batch_output/added_at=1658877709/*"
The Python API has equivalent functionality.
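A hedged sketch of the same load with the Python client, reusing the placeholder bucket and table names from the bq example above (HivePartitioningOptions is part of google-cloud-bigquery):

from google.cloud import bigquery

client = bigquery.Client()

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"
hive_opts.source_uri_prefix = "gs://your-bucket/batch_output/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    hive_partitioning=hive_opts,
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/batch_output/added_at=1658877709/*",
    "dataset-name.table-name",  # destination table from the example above
    job_config=job_config,
)
load_job.result()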