BigQuery detailed charges just shows how much data was analyzed - google-bigquery

I'm trying to find out what is causing my BigQuery bill to be so high but when I click View Detailed Charges on Google Cloud I just get how much data was analyzed and how much it costs. Is there a place where I can view a detailed breakdown of what jobs cost so much and what is causing the bill to get so large?

Is there a place where I can view a detailed breakdown of what jobs cost so much and what is causing the bill to get so large?
You should be able to use Jobs.list API to lists all jobs that you started in the specified project. Job information is available for a six month period after creation. The job list is sorted in reverse chronological order, by job creation time. Requires the Can View project role, or the Is Owner project role if you set the allUsers property
You actually can even make it without any coding - https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/list#try-it
Collect all you jobs info and analyse it as you wish
For the long term solution - you can either automate above process or use BigQuery Monitoring using Stackdriver

Related

Log Analytics retention policy and querying on logs

I would like to know how can we address this scenario in Azure Log Analytics where I need to generate Kube-audit logs of different cluster every week and also retain these logs for approx 400 days. Now storing it over Log Analytics will cost me more and its not an optimized architecture as I will not be require that so often. So I would like to know from experts whats the best way to design the architecture, where we get the kube audit logs which can be retained for 400 days and be available for querying when required without incurring too much cost.
PS: I also heard in my team that querying 400 days logs always times out in KQL.
Log analytics offerings:
Log analytics now provides the capability to manage several service tiers at table scope. Setting your data as archive, with no query capabilities at a much lower cost. offering spans for up to 7 years.
when needed, you can choose to elevate a subset of your data into the Analytics offering, providing you the capability to query it. The action of elevating your data is denoted as - "Search jobs"
Another option is to elevate an entire period in time to the Analytic offering, they call it - "Restore logs".
Table's different service tiers -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/data-retention-archive?tabs=api-1%2Capi-2
Search job offering -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/search-jobs?tabs=api-1%2Capi-2%2Capi-3
Restore logs -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/restore?tabs=api-1%2Capi-2
all are under public preview.
both offerings - Search jobs and Restore logs provides you the capability to engage your data on demand, can't comment or suggest regarding the actual cost.
Azure data explorer solution:
Another option is to use Azure storage to hold your data (as an example), Azure data explorer provides the capability to create an external table, that table is a logical view on top of your data, the data itself is kept outside of the ADX cluster. you can query your data by using ADX, expect degradation in query performance.
ADX external table offering -
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/schema-entities/externaltables

How to get query time and space information from BigQuery API

I'm going to build a web app and use BigQuery as a part of backend database, and I want to show the query cost information (ex. 1.8 sec elapsed, 264.9 MB processed) in the app.
I know we can check the BigQuery's query information inside GCP, but how do we I get that information from BigQuery API?
The information you are interested in is present in the job statistics.
See jobs.get for more details: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
The dry-run sample may be of interest as well, though you can get the stats from a real invocation as well (dry run is for estimating costs without executing the query):
https://cloud.google.com/bigquery/docs/samples/bigquery-query-dry-run

Service that does advanced queries on a data set, and automatically returns relevant updated results every time new data is added to the set?

I'm looking for a cloud service that can do advanced statistics calculations on a large amount of votes submitted by users, in "real time".
In our app, users can submit different kind of votes like picking a favorite, rating 1-5, say yes/no etc. on various topics.
We also want to show "live" statistics to the user, showing the popularity of a person etc. This will be generated by a rather complex SQL where we are calculating the average number of times a person was picked as favorite, divided by total number of votes and the number of games in which the person has been participating etc. And the score for the latest X games should count higher than the overall score for all games. This is just an example, there are several other SQL queries with similar complexity.
All our presentable data (including calculated statistics) is served from Firestore documents, and the votes will be saved as Firestore documents.
Ideally, the Firebase-backend (functions, firestore etc) should not need to know about the query logic.
What I wish for is a pay as you go cloud service that does the following:
I define some schemas and set up the queries we need for the statistics we have (15-20 different SQLs). Like setting up views in MySQL
On every vote, we push the vote data to this service, which will store it in a row.
The service should then, based on its knowledge about the defined queries, and the content of the pushed vote data, determine which statistics that are affected by the newly added row, and recalculate these. A specific vote type can affect one or more statistics.
Every time a statistic is recalculated, the result should be automatically pushed back to our Firebase backend (for instance by calling an HTTPS endpoint that hits a cloud function) - so we can update the relevant Firestore documents.
The service should be able to throttle the calculations, like only regenerating new statistics every 1 minute despite having several votes per second on the same topic.
Is there any product like this in the market? Or can it be built by combining available cloud services? And what is the official term for such a product, if I should search for it myself?
I know that I can probably build a solution like this myself, and run it on a cloud hosted database server, which can scale as our need grows - but I believe that I'm not the first developer with a need of this, so I hope that someone has solved it before me :)
You can leverage the existing cloud services available on the Google Cloud Platform.
Google BigQuery, Google Cloud Firestore, Google App Engine (CRON Jobs), Google Cloud Tasks
The services can be used to solve the problems mentioned above:
1) Google BigQuery : Here you can define schema for the data on which you're going to run the SQL queries. BigQuery supports Standard and legacy SQL queries.
2) Every vote can be pushed to the defined BigQuery tables using its streaming insert service.
3) Every vote pushed can trigger the recalculation service which calculates the statistics by executing the defined SQL queries and the query results can be stored as documents in collections in Google Cloud Firestore.
4) Google Cloud Firestore: Here you can store the live statistics of the user. This is a real time database, so you'll be able to configure listeners for the modifications to the statistics and show the modifications as soon as the statistics are recalculated.
5) In the same service which inserts every vote, create a new record with a "syncId" in an another table. The idea is to group a number of votes cast in a particular interval to a its corresponding syncId. The syncId can be suffixed with a timestamp. According to your requirement a particular time interval can be set so that the recalculation can be triggered using CRON jobs service which invokes the recalculation service within the interval. Once the recalculation related to a particular syncId is completed the record corresponding to the syncId should be marked as completed.
We are leveraging the above technologies to build a web application on Google Cloud Platform, where the inputs are recorded on Google Firestore and then stream-inserted to Google BigQuery. The data stored in BigQuery is queried after 30 sec of each update using SQL queries and the query results are stored in Google Cloud Firestore to serve dashboards which are automatically updated using listeners configured for the collection in which the dashboard information is stored.

Removing BigQuery Public Dataset

Does anyone know of any way to remove the public datasets from a BigQuery project?
Though the risk is very low, I don't want my users to be able to run queries against them and rack up costs.
Thanks
Its an old question, but for those who just want to unpin the "bigquery-public-data" to tidy up the resources list, you can click the name on the side, then on the far right of the info pane there is an "unpin project button". Click that.
The whole point of public datasets is that everyone has access to them so they can test BigQuery. Even if a feature request will create the option to disable the listing in the panel of the BigQuery web UI, the users will still have access and could query the public datasets.
It will be more practical to use custom quotas.
So you would create a project with a number of users that share a quota that you consider enough for their activities. When the established quota is reached BigQuery stops and the users receive an error message when trying to run queries.
Another useful tool is creating budget alerts with a desired level that you can set taking into account the previous month's spend. The alert will notify you when the project's bill have reached the amount you set and can save you from bad surprises.
In addition, implementing the Audit Logs in your project will give comprehensive overview on the BigQuery operations. Check this example of an Audit Logs query that will give details on the performed queries. Of course, you will find out about the use of a public dataset after it happens but this will point out who’s the user that performed the query and you can reinforce the administration policy of not inquiring public datasets. To get information on the performed query, including the interrogated dataset, use this field when querying the Audit Logs:
'protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query'
As a last resort, you can create a designated project for your users to query the public datasets and to make sure it will not create additional costs, you can remove the billing account. Though, by doing so you can only query 1 TB of data per month, the BigQuery always free usage tier.
Also keep in mind about this best practices to limit the queries costs.
if you closed current tab , public data set will disappear from google BigQuery page

How can I see my accumulated query costs for BigQuery while in the free trial?

Google Cloud billing is not updating with the free trial (on monthly payments) and I can not change it to a faster update cycle. As per https://cloud.google.com/free-trial/docs/billing-during-free-trial the bill should come every month.
It is therefore not easy to see how much of the 300$ is left.
Is there any way to at least see how many TBs my queries used? This should be by far the biggest item on the bill.
I am concerned that I might get 'stuck' between some important queries that I otherwise could have managed better to have at least partial results available after the trial ends.
BigQuery analysis & storage costs should be listed under your GCP billing transactions:
https://console.cloud.google.com/billing/<INSERT_YOUR_BILLING_ID_HERE>/history?e=13803970,13803205
Another way to see how much you have queried is by enabling audit logging as described here.