Data streaming: insertAll API usage not equal to actually inserted rows - google-bigquery

We are using google-php-client-api to stream websites' page-view logs into a table with 9 columns
(composed of basic data types:
cookieid(string),
domain(string),
site_category(string),
site_subcategory(string),
querystring(string),
connectiontime(timestamp),
flag(boolean),
duration(integer),
remoteip(string))
After 10 hours of running the scripts, we observed that the BigQuery API usage (for the insertAll method) reached 300K, but during that time only 35K rows were recorded to the table...
When we looked at the Google Cloud console, approximately 299K of these 300K API calls returned success codes; that is, the streaming seemed to work well.
What we didn't understand is: after 299K successful requests, how could only 35K rows have been inserted into the table?
Is this a problem caused by google-php-client-api, or has BigQuery just not yet saved the sent data to the table?
If the latter, how much time do we need before we see all of the rows sent to BigQuery?
Code used for streaming data:
$rows = array();
$data = json_decode($rawjson);

$row = new Google_Service_Bigquery_TableDataInsertAllRequestRows();
$row->setJson($data);
$row->setInsertId(strtotime('now')); // insertId is used by BigQuery for deduplication
$rows[0] = $row;

$req = new Google_Service_Bigquery_TableDataInsertAllRequest();
$req->setKind('bigquery#tableDataInsertAllRequest');
$req->setRows($rows);

$this->service->tabledata->insertAll($projectid, $datasetid, $tableid, $req);
Thank you in advance,
Cihan

We resolved this issue.
We saw that it was caused by this line of code:
$row->setInsertId(strtotime('now'));
Since we have at least 10-20 requests per second, and the insertId we sent to BigQuery was derived from the current timestamp (whole-second resolution), BigQuery treated every request within the same second as a duplicate: it saved only one request per second and rejected all the others without writing them to the table.
We removed this line, and now the numbers are consistent.
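For anyone hitting the same problem: the insertId is used by BigQuery for best-effort deduplication, so it has to be unique per row. A minimal sketch of a safe version, using the Python client (google-cloud-bigquery) rather than the PHP client above; the table id and row payload are placeholders:

import uuid

from google.cloud import bigquery

client = bigquery.Client()

rows = [{"cookieid": "abc", "domain": "example.com"}]  # placeholder payload

# One unique insertId per row: reusing the same id within the
# deduplication window (as a whole-second timestamp does) makes
# BigQuery silently drop the extra rows as duplicates.
errors = client.insert_rows_json(
    "my-project.my_dataset.pageviews",  # placeholder table id
    rows,
    row_ids=[str(uuid.uuid4()) for _ in rows],
)
if errors:
    print("insert errors:", errors)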

Related

How to interpret query process GB in BigQuery?

I am using a free trial of Google BigQuery. This is the query I am using:
select * from `test`.events
where subject_id = 124
  and id = 256064
  and time >= '2166-01-15T14:00:00'
  and time <= '2166-01-15T14:15:00'
  and id_1 in (3655, 223762, 223761, 678, 211, 220045, 8368, 8441, 225310, 8555, 8440)
This query is expected to return at most 300 records.
However, when I run it I see a message reporting a very large number of GB processed. The table this query operates on is really huge; does this figure indicate the table size? I ran this query multiple times a day.
Because of this, I got the error below:
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go away? Is the daily limit 1 TB? If so, then I didn't even use close to 400 GB.
How do I view my daily usage?
If I can edit the quota, can you let me know which option I should be editing?
Can you help me with the above questions?
According to the official documentation,
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output is. What this means is that if you do a count(*) over a 1 TB table, you will supposedly be charged $5, even though the final output is very small.
Note that, due to storage optimizations BigQuery performs internally, the bytes processed might not equal the actual raw size of the table when you created it.
For the error you're seeing, open the Google Cloud Console and go to "IAM & admin" then "Quotas", where you can search for the quotas specific to the BigQuery service.
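If it's useful, you can also check the bytes-processed figure before running (and being billed for) a query by doing a dry run. A minimal sketch with the Python client, reusing the query from the question:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# A dry run validates the query and reports the bytes it would
# process, without actually running it or consuming quota.
job = client.query(
    "select * from `test`.events where subject_id = 124",
    job_config=job_config,
)
print("This query would process {} bytes.".format(job.total_bytes_processed))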
Hope this helps!
Flavien

How to iterate through BigQuery query results and write it to file

I need to query a Google BigQuery table and export the results to a gzipped file.
This is my current code. The requirement is that each row's data should be newline (\n) delimited.
import gzip
import json

from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials

def batch_job_handler(args):
    # service_account_info and QUERY_STRING are defined elsewhere.
    credentials = Credentials.from_service_account_info(service_account_info)
    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)

    query_job = client.query(QUERY_STRING)
    results = query_job.result()  # result's total_rows is 1300000 records

    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
        file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for a small number of results, but it gets too slow when the result has 1,300,000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n", since I am constructing one huge string by concatenating each record with a newline?
As I am running this program in AWS Batch, it is consuming too much time. I need help iterating over the result and writing it to a file in a faster way for millions of records.
Check out the new BigQuery Storage API for quick reads:
https://cloud.google.com/bigquery/docs/reference/storage
For an example of the API at work, see this project:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector
It has a number of advantages over the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Filtering
The new API allows column and limited predicate filtering to only read the data you are interested in.
Column Filtering
Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
Predicate Filtering
The Storage API supports limited pushdown of predicate filters; currently it supports a single comparison to a literal value.
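For illustration, a minimal sketch of a Storage API read session using the Python client (google-cloud-bigquery-storage); the project, dataset, and table names are placeholders:

from google.cloud.bigquery_storage import BigQueryReadClient, types

client = BigQueryReadClient()

session = client.create_read_session(
    parent="projects/my-project",  # placeholder project
    read_session=types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/my_table",
        data_format=types.DataFormat.AVRO,
    ),
    max_stream_count=1,
)

# Rows stream straight from BigQuery servers in Avro format;
# no temporary files are written to Cloud Storage.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)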
You should (in most cases) point the output of your BigQuery query to a temporary table and export that temporary table to a Google Cloud Storage bucket. From that bucket you can download the files locally. This is the fastest route to having the results available locally; everything else will be painfully slow, especially iterating over the results row by row, as BigQuery is not designed for that.
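A sketch of that route with the Python client, under the assumption that QUERY_STRING is the query from the question; the destination table and bucket are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# 1. Run the query into a temporary destination table.
dest = bigquery.TableReference.from_string("my-project.my_dataset.tmp_results")
client.query(QUERY_STRING, job_config=bigquery.QueryJobConfig(destination=dest)).result()

# 2. Export the table to a bucket as gzipped newline-delimited JSON.
client.extract_table(
    dest,
    "gs://my-bucket/query_result-*.json.gz",  # placeholder bucket
    job_config=bigquery.ExtractJobConfig(
        destination_format="NEWLINE_DELIMITED_JSON",
        compression="GZIP",
    ),
).result()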

Listing BigQuery Tables in `huge/big` Datasets - 30K-40K+ tables

The task is to programmatically list all the tables within a given dataset that has more than 30-40K tables.
The initial option we explored was the tables.list API (as we always do for normal datasets with a reasonable number of tables in them).
It looks like this API returns at most 1,000 entries (even if we try to set maxResults to a bigger value).
To get the next 1,000 we need to wait for the response to the previous request, extract its pageToken, repeat the call, and so on. A sketch of this loop follows below.
For datasets with 30-40K+ tables this can take 10-15 seconds or more (under good conditions).
So the timing is the problem we want to address!
In the above-mentioned calls we get back only nextPageToken and tables/tableReference/tableId, so the size of each response is extremely small!
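For reference, this is roughly the loop being described, sketched with the Python client (the dataset name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

table_ids = []
# tables.list is paged: each iteration of .pages is one sequential
# API call, which is why listing 30-40K tables takes so long.
for page in client.list_tables("my_dataset").pages:
    table_ids.extend(t.table_id for t in page)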
Question:
Is there a way to somehow increase maxResults, so as to get all the tables in one (or very few) call(s), assuming that would be much faster than making 30-40 calls?
The workaround we have tried so far is to query __TABLES_SUMMARY__ via the jobs.insert or jobs.query API.
This way the whole result is returned within seconds, but in our particular case using the BigQuery jobs API is not an option for multiple reasons; we want to be able to use the list API.
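For completeness, the workaround looks roughly like this (a sketch; the project and dataset names are placeholders). A single query over __TABLES_SUMMARY__ returns one row per table in the dataset:

from google.cloud import bigquery

client = bigquery.Client()

# One jobs.query call returns every table id at once instead of
# paging through tables.list 1,000 entries at a time.
query = "SELECT table_id FROM `my-project.my_dataset.__TABLES_SUMMARY__`"
table_ids = [row.table_id for row in client.query(query).result()]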

BigQuery streaming data not available instantly

Since a couple of days ago, some of the data I am streaming to BigQuery has not been available instantly (as it normally is) in the BigQuery web UI after being inserted successfully.
My use case consists of inserting thousands of lines using:
bigquery.tabledata().insertAll(...)
The results of the streaming inserts into the table are as follows (I am also checking insertErrors to be sure, as described here):
BigQuery insert status : {"kind":"bigquery#tableDataInsertAllResponse"}
BigQuery insert errors : null
The total number of lines available in the BigQuery web UI is different from the total inserted.
I would be grateful for any help.
BigQuery project details:
Project ID : favorable-beach-87616
Table : mtp_UA_xxxx_1_20150410
Project dependencies on Google libraries:
compile 'com.google.api-client:google-api-client:1.19.0'
compile 'com.google.http-client:google-http-client:1.19.0'
compile 'com.google.http-client:google-http-client-jackson2:1.19.0'
compile 'com.google.oauth-client:google-oauth-client:1.19.0'
compile 'com.google.oauth-client:google-oauth-client-servlet:1.19.0'
compile 'com.google.apis:google-api-services-bigquery:v2-rev171-1.19.0'
compile 'com.google.api-client:google-api-client:1.17.0-rc'
Great thanks in advance for your help!
When you say the total number of lines available in the web UI, do you mean the number of rows that show up in the 'details' pane on the table, or the number of rows that are returned if you do a SELECT COUNT(*) query?
If the former, that is expected, since that counter only reflects the number of rows that have been flushed to long-term storage (as opposed to the short-term storage buffers the streaming data originally gets written to). This is admittedly confusing, and we are working on a fix.
If the latter, i.e. the rows don't show up in a query either, that is more concerning. If that is the case, please let us know and we'll investigate.
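As a side note, the streaming buffer can also be inspected programmatically. A minimal sketch with the Python client; the dataset name is a placeholder (the project and table ids are taken from the question):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("favorable-beach-87616.my_dataset.mtp_UA_xxxx_1_20150410")

# Rows still in the streaming buffer are queryable, but they are not
# yet included in the row count shown in the table details pane.
if table.streaming_buffer:
    print("estimated rows in buffer:", table.streaming_buffer.estimated_rows)
print("rows flushed to long-term storage:", table.num_rows)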

Database schema for HTTP transactions

I have a script that makes an HTTP call to a web service, captures the response, and parses it.
For every transaction, I would like to save the following pieces of data in a relational DB:
HTTP request time
HTTP request headers
HTTP response time
HTTP response code
HTTP response headers
HTTP response content
I am having a tough time visualizing a schema for this.
My initial thoughts were to create 2 tables.
Table 'Transactions':
1. transaction id (not null, not unique)
2. timestamp (not null)
3. type (request or response) (not null)
4. headers (null)
5. content (null)
6. response code (null)
The 'transaction id' will be some sort of checksum derived from combining the timestamp with the header text.
The reason I compute this transaction id is to have an id that can distinguish two transactions but at the same time link a request with its response.
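A sketch of that single-table design as DDL, run through Python's built-in sqlite3 just to keep the example self-contained; the column types are assumptions:

import sqlite3

conn = sqlite3.connect("transactions.db")
conn.execute("""
    CREATE TABLE transactions (
        transaction_id TEXT NOT NULL,      -- checksum of timestamp + headers
        ts             TIMESTAMP NOT NULL,
        type           TEXT NOT NULL,      -- 'request' or 'response'
        headers        TEXT,
        content        TEXT,
        response_code  INTEGER             -- NULL for requests
    )
""")
# transaction_id alone is deliberately not unique: the request row and
# its response row share it, while (transaction_id, type) is unique.
conn.execute("CREATE UNIQUE INDEX idx_txn ON transactions (transaction_id, type)")
conn.commit()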
What will this table be used for?
The script will run every 5 minutes and log all of this into the DB. Each time it runs, the script also checks the last time a successful transaction was made. And at the end of the day, the script generates a summary of all the transactions made that day and emails it.
Any ideas on how I can improve this design? What kind of normalization and/or optimization techniques should I apply to this schema? Should I split it into 2 or more tables?
I decided to use a NoSQL approach for this, and it has worked: I used MongoDB. The flexibility it offers with document structure, and not having to have a fixed number of attributes, really helped.
It is probably not the best solution to the problem, but I was able to optimize performance using compound indexes.
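For what it's worth, a minimal pymongo sketch of that approach; the document structure and field names are assumptions, not the poster's actual schema:

from pymongo import ASCENDING, MongoClient

client = MongoClient()
txns = client.monitoring.transactions

# One document can hold the whole request/response pair; a compound
# index supports "find the latest transaction with a given response code".
txns.create_index([("response_code", ASCENDING), ("timestamp", ASCENDING)])
txns.insert_one({
    "timestamp": "2015-04-10T12:00:00Z",
    "request": {"headers": {"User-Agent": "monitor/1.0"}},
    "response": {"code": 200, "headers": {}, "content": "..."},
})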