Metricbeat gcp module to capture bigquery metrics is not working as expected - google-bigquery

I have added bigquery metrics configuration in modules.d/gcp.yml file
But getting an error below
"2021-12-06T09:51:10.206Z ERROR [gcp.metrics] metrics/metricset.go:294 bigquery.googleapis.com/storage/table_count" metric descriptor is empty, this metric will not be collected
Below is the code which I have used
module: gcp
region: "us-central1"
metricsets:
metrics
project_id: "ProjectName"
credentials_file_path: "/root/xyz.json"
exclude_labels: false
period: 1h
metrics:
aligner: ALIGN_NONE
service: bigquery
service_metric_prefix: bigquery.googleapis.com/
metric_types:
"bigquery.googleapis.com/storage/table_count"
Let me know if there is any changes needed in code.
I have also tried extracting compute and billing metrics . We can see the column names populating in Kibana discovery filter area , but we do not have any data available
Please find the code given below
module: gcp
region: "us-central1"
metricsets:
metrics
project_id: "projectname"
credentials_file_path: "/root/xyz.json"
exclude_labels: false
period: 1
metrics:
aligner: ALIGN_NONE
service: compute
metric_types:
"instance/cpu/reserved_cores"
"instance/cpu/usage_time"
"instance/cpu/utilization"
"instance/uptime"
Thanks

The code below is worked well for me.
- module: gcp
metricsets:
- metrics
project_id: "YOUR_PROJECT_ID"
credentials_file_path: "your JSON credentials file path"
exclude_labels: false
period: 1800s # 1800s is recommended
metrics:
- aligner: ALIGN_NONE
service: bigquery
service_metric_prefix: bigquery.googleapis.com/ # This is not required as already "bigquery.googleapis.com/" is default value
metric_types:
- "storage/table_count" #only "storage/table_count" here
It seems that you should edit your metric_type attribute only.
I recommend the collection period of table_count metric to 1800s as GCP sampled this every 1800 seconds.
Refer here for the information about that.
This works well at my kibana too.

Related

How do I record multidimensional metrics in cloudwatch?

I want to be able to graph two or more metrics that come from the same event e.g. response times from an API versus the size of the response. Is there a way to group two or more metrics as being a multidimensional data point?
I don't see that multiple MetricDatum logged in one PutMetricDataRequest are grouped in any way.
You need to use MetricDatum (withDimensions, setDimensions) functions.
See: https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/javav2/example_code/cloudwatch/src/main/java/com/example/cloudwatch/PutMetricData.java#L62-L82
aws-cli example for multidimensions:
aws cloudwatch get-metric-statistics --metric-name Buffers --namespace MyNameSpace --dimensions Name=InstanceId,Value=1-23456789 Name=InstanceType,Value=m1.small --start-time 2016-10-15T04:00:00Z --end-time 2016-10-19T07:00:00Z --statistics Average --period 60

Elastalert - Text fields are not optimised for operations that require per-document field data - Please use a keyword field instead

I have setup elastalert on a server and managed to run a rule to monitor disk usage. It was running ok until I had to rebuild the server. Now when I run the rule I get error below. I can't find a solution on Internet. Any ideas? Thank you in advanced.
ERROR:root:Error running query: RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [host.name] in order to load field data by uninverting the inverted index. Note that this can use significant memory.')
alert rule:
name: "warning:High Disk Usage - Disk use is over 85% of capacity:warning"
type: metric_aggregation
index: metricbeat-*
metric_agg_key: system.filesystem.used.pct
metric_agg_type: avg
alert_subject: "High Disk Usage"
max_threshold: 0.15
filter:
- term:
metricset.name: filesystem
- term:
system.filesystem.mount_point: "/"
query_key: host.name

ScrapeFrequencyInSecs is not working in metrics-collector module

I have deployed metrics-collector module with a ScrapeFrequencyInSecs value of 60. so it should scrape data every minutes, but when I check data in insightsmetrics, I am still getting data every 5 minutes or so
I am using
mcr.microsoft.com/azureiotedge-metrics-collector:1.0
Version 1.
The mcr.microsoft.com/azureiotedge-metrics-collector:1.0 module has ScrapeFrequencyInSecs Default value: 300.
Note: After updating the configuration parameter, environment variables need a restart of the module.
You can refer to MS Q&A: can we set the interval for sending metrics using Metrics collector module to log analysis workspace

How to set up job dependencies in google bigquery?

I have a few jobs, say one is loading a text file from a google cloud storage bucket to bigquery table, and another one is a scheduled query to copy data from one table to another table with some transformation, I want the second job to depend on the success of the first one, how do we achieve this in bigquery if it is possible to do so at all?
Many thanks.
Best regards,
Right now a developer needs to put together the chain of operations.
It can be done either using Cloud Functions (supports, Node.js, Go, Python) or via Cloud Run container (supports gcloud API, any programming language).
Basically you need to
issue a job
get the job id
poll for the job id
job is finished trigger other steps
If using Cloud Functions
place the file into a dedicated GCS bucket
setup a GCF that monitors that bucket and when a new file is uploaded it will execute a function that imports into GCS - wait until the operations ends
at the end of the GCF you can trigger other functions for next step
another use case with Cloud Functions:
A: a trigger starts the GCF
B: function executes the query (copy data to another table)
C: gets a job id - fires another function with a bit of delay
I: a function gets a jobid
J: polls for job is ready?
K: if not ready, fires himself again with a bit of delay
L: if ready triggers next step - could be a dedicated function or parameterized function
It is possible to address your scenario with either cloud functions(CF) or with a scheduler (airflow). The first approach is event-driven getting your data crunch immediately. With the scheduler, expect data availability delay.
As it has been stated once you submit BigQuery job you get back job ID, that needs to be check till it completes. Then based on the status you can handle on success or failure post actions respectively.
If you were to develop CF, note that there are certain limitations like execution time (max 9min), which you would have to address in case BigQuery job takes more than 9 min to complete. Another challenge with CF is idempotency, making sure that if the same datafile event comes more than once, the processing should not result in data duplicates.
Alternatively, you can consider using some event-driven serverless open source projects like BqTail - Google Cloud Storage BigQuery Loader with post-load transformation.
Here is an example of the bqtail rule.
rule.yaml
When:
Prefix: "/mypath/mysubpath"
Suffix: ".json"
Async: true
Batch:
Window:
DurationInSec: 85
Dest:
Table: bqtail.transactions
Transient:
Dataset: temp
Alias: t
Transform:
charge: (CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)
SideInputs:
- Table: bqtail.fees
Alias: f
'On': t.fee_id = f.id
OnSuccess:
- Action: query
Request:
SQL: SELECT
DATE(timestamp) AS date,
sku_id,
supply_entity_id,
MAX($EventID) AS batch_id,
SUM( payment) payment,
SUM((CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)) charge,
SUM(COALESCE(qty, 1.0)) AS qty
FROM $TempTable t
LEFT JOIN bqtail.fees f ON f.id = t.fee_id
GROUP BY 1, 2, 3
Dest: bqtail.supply_performance
Append: true
OnFailure:
- Action: notify
Request:
Channels:
- "#e2e"
Title: Failed to aggregate data to supply_performance
Message: "$Error"
OnSuccess:
- Action: query
Request:
SQL: SELECT CURRENT_TIMESTAMP() AS timestamp, $EventID AS job_id
Dest: bqtail.supply_performance_batches
Append: true
- Action: delete
You want to use an orchestration tool, especially if you want to set up this tasks as recurring jobs.
We use Google Cloud Composer, which is a managed service based on Airflow, to do workflow orchestration and works great. It comes with automatically retry, monitoring, alerting, and much more.
You might want to give it a try.
Basically you can use Cloud Logging to know almost all kinds of operations in GCP.
BigQuery is no exception. When the query job completed, you can find the corresponding log in the log viewer.
The next question is how to anchor the exact query you want, one way to achieve this is to use labeled query (means attach labels to your query) [1].
For example, you can use below bq command to issue query with foo:bar label
bq query \
--nouse_legacy_sql \
--label foo:bar \
'SELECT COUNT(*) FROM `bigquery-public-data`.samples.shakespeare'
Then, when you go to Logs Viewer and issue below log filter, you will find the exactly log generated by above query.
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels.foo="bar"
The next question is how to emit an event based on this log for the next workload. Then, the Cloud Pub/Sub comes into play.
2 ways to publish an event based on log pattern are:
Log Routers: set Pub/Sub topic as the destination [1]
Log-based Metrics: create alert policy whose notification channel is Pub/Sub [2]
So, the next workload can subscribe to the Pub/Sub topic, and be triggered when the previous query has completed.
Hope this helps ~
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfiguration
[2] https://cloud.google.com/logging/docs/routing/overview
[3] https://cloud.google.com/logging/docs/logs-based-metrics

How to best handle data stored in different locations in Google BigQuery?

My current workflow in BigQuery is as follows:
(1) query data in a public repository (stored in the US), (2) write it to a table in my repository, (3) export a csv to a cloud bucket and (4) download the csv on the server I work on and (5) work with that on the server.
The problem I have now, is that the server I work on is located in EU. Thus, I have to pay quite some fees for transfering data between my US bucket and my EU server. I could now go ahead and locate my bucket in EU, but then I still have the problem that I would transfer data from the US (BigQuery) to EU (bucket). So I could also set my dataset in bq to be located in the EU, but then I cant do any queries anylonger, because the data in the public repository is located in the US, and queries between different locations are not allowed.
Does anyone have an idea of how to approach this?
One way to copy a BigQuery dataset from one region to another is to take advantage of the Storage Data Transfer Service. It doesn't get around the fact that you still have to pay for bucket-to-bucket network traffic, but might save you some CPU time on copying data to a server in the EU.
The flow would be to:
Extract all the BigQuery tables into a bucket in the same region as the tables. (Recommend Avro format for best fidelity in data types and fastest loading speed.)
Run a storage transfer job to copy the extracted files from the starting location bucket to a bucket in the destination location.
Load all the files into a BigQuery dataset located in the destination location.
Python example:
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import sys
import time
import googleapiclient.discovery
from google.cloud import bigquery
import json
import pytz
PROJECT_ID = 'swast-scratch' # TODO: set this to your project name
FROM_LOCATION = 'US' # TODO: set this to the BigQuery location
FROM_DATASET = 'workflow_test_us' # TODO: set to BQ dataset name
FROM_BUCKET = 'swast-scratch-us' # TODO: set to bucket name in same location
TO_LOCATION = 'EU' # TODO: set this to the destination BigQuery location
TO_DATASET = 'workflow_test_eu' # TODO: set to destination dataset name
TO_BUCKET = 'swast-scratch-eu' # TODO: set to bucket name in destination loc
# Construct API clients.
bq_client = bigquery.Client(project=PROJECT_ID)
transfer_client = googleapiclient.discovery.build('storagetransfer', 'v1')
def extract_tables():
# Extract all tables in a dataset to a Cloud Storage bucket.
print('Extracting {}:{} to bucket {}'.format(
PROJECT_ID, FROM_DATASET, FROM_BUCKET))
tables = list(bq_client.list_tables(bq_client.dataset(FROM_DATASET)))
extract_jobs = []
for table in tables:
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.AVRO
extract_job = bq_client.extract_table(
table.reference,
['gs://{}/{}.avro'.format(FROM_BUCKET, table.table_id)],
location=FROM_LOCATION, # Available in 0.32.0 library.
job_config=job_config) # Starts the extract job.
extract_jobs.append(extract_job)
for job in extract_jobs:
job.result()
return tables
def transfer_buckets():
# Transfer files from one region to another using storage transfer service.
print('Transferring bucket {} to {}'.format(FROM_BUCKET, TO_BUCKET))
now = datetime.datetime.now(pytz.utc)
transfer_job = {
'description': '{}-{}-{}_once'.format(
PROJECT_ID, FROM_BUCKET, TO_BUCKET),
'status': 'ENABLED',
'projectId': PROJECT_ID,
'transferSpec': {
'transferOptions': {
'overwriteObjectsAlreadyExistingInSink': True,
},
'gcsDataSource': {
'bucketName': FROM_BUCKET,
},
'gcsDataSink': {
'bucketName': TO_BUCKET,
},
},
# Set start and end date to today (UTC) without a time part to start
# the job immediately.
'schedule': {
'scheduleStartDate': {
'year': now.year,
'month': now.month,
'day': now.day,
},
'scheduleEndDate': {
'year': now.year,
'month': now.month,
'day': now.day,
},
},
}
transfer_job = transfer_client.transferJobs().create(
body=transfer_job).execute()
print('Returned transferJob: {}'.format(
json.dumps(transfer_job, indent=4)))
# Find the operation created for the job.
job_filter = {
'project_id': PROJECT_ID,
'job_names': [transfer_job['name']],
}
# Wait until the operation has started.
response = {}
while ('operations' not in response) or (not response['operations']):
time.sleep(1)
response = transfer_client.transferOperations().list(
name='transferOperations', filter=json.dumps(job_filter)).execute()
operation = response['operations'][0]
print('Returned transferOperation: {}'.format(
json.dumps(operation, indent=4)))
# Wait for the transfer to complete.
print('Waiting ', end='')
while operation['metadata']['status'] == 'IN_PROGRESS':
print('.', end='')
sys.stdout.flush()
time.sleep(5)
operation = transfer_client.transferOperations().get(
name=operation['name']).execute()
print()
print('Finished transferOperation: {}'.format(
json.dumps(operation, indent=4)))
def load_tables(tables):
# Load all tables into the new dataset.
print('Loading tables from bucket {} to {}:{}'.format(
TO_BUCKET, PROJECT_ID, TO_DATASET))
load_jobs = []
for table in tables:
dest_table = bq_client.dataset(TO_DATASET).table(table.table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO
load_job = bq_client.load_table_from_uri(
['gs://{}/{}.avro'.format(TO_BUCKET, table.table_id)],
dest_table,
location=TO_LOCATION, # Available in 0.32.0 library.
job_config=job_config) # Starts the load job.
load_jobs.append(load_job)
for job in load_jobs:
job.result()
# Actually run the script.
tables = extract_tables()
transfer_buckets()
load_tables(tables)
The preceding sample uses google-cloud-bigquery library for BigQuery API and google-api-python-client for Storage Data Transfer API.
Note that this sample does not account for partitioned tables.
No matter what, you have data in the US that you need in the EU, so I think you have two options:
You could continue to pay many smaller fees to move your reduced datasets from the US to the EU as you're doing today.
You could pay the one-off fee to transfer the original public BQ dataset from the US to your own dataset in the EU. From then on, all queries you run stay in the same region, and you have no more trans-continental transfers.
It really depends on how many queries you plan to do. If it's not a lot, then the way you're doing things today seems like it'd be the most efficient. If it's a lot, then moving the data once (paying the up-front fee) might work out cheaper.
Maybe Google has some magical way to make this better, but as far as I can tell, you're dealing with lots of data on one side of the Atlantic that you need on the other side, and moving it across that wire costs money.