How to export BigQuery to Bigtable using Airflow? Schema issue

I'm using Airflow to extract BigQuery rows to Google Cloud Storage in Avro format.
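from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

# default_args, project_id, location, gcs_uri, dataset_id, table_id,
# bt_instance_id and bt_table_id are defined elsewhere.
with models.DAG(
    "bigquery_to_bigtable",
    default_args=default_args,
    schedule_interval=None,
    start_date=datetime.now(),
    catchup=False,
    tags=["test"],
) as dag:
    # Export the BigQuery table to GCS as Avro.
    data_to_gcs = BigQueryInsertJobOperator(
        task_id="data_to_gcs",
        project_id=project_id,
        location=location,
        configuration={
            "extract": {
                "destinationUri": gcs_uri,
                "destinationFormat": "AVRO",
                "sourceTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        },
    )
    # Load the Avro files into Bigtable with the Google-provided Dataflow template.
    gcs_to_bt = DataflowTemplatedJobStartOperator(
        task_id="gcs_to_bt",
        template="gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable",
        location=location,
        parameters={
            "bigtableProjectId": project_id,
            "bigtableInstanceId": bt_instance_id,
            "bigtableTableId": bt_table_id,
            "inputFilePattern": "gs://export/test.avro-*",
        },
    )
    data_to_gcs >> gcs_to_bt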
The BigQuery rows look like this:
row_key | 1_cnt | 2_cnt | 3_cnt
1#2021-08-03 | 1 | 2 | 2
2#2021-08-02 | 5 | 1 | 5
...
I'd like to use the row_key column as the row key in Bigtable and the remaining columns as columns in a specific column family, such as my_cf.
However, I get the following error when the Dataflow job loads the Avro files into Bigtable:
"java.io.IOException: Failed to start reading from source: gs://export/test.avro-"
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
The docs I read say:
The Bigtable table must exist and have the same column families as
exported in the Avro files.
How do I export from BigQuery to Avro with the same column families?

I think you have to transform the Avro files to the proper schema. The documentation you mention also says:
Bigtable expects a specific schema from the input Avro files.
It links to the specific data schema that has to be used.
If I understand correctly, you are exporting the table as-is. The result is valid Avro, but it does not match the required schema, so you need to transform the data into the schema the template expects before loading it into Bigtable.
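As a rough illustration of that transformation, the sketch below reshapes one exported row into a record with a key and a list of cells. The field names (key, cells, family, qualifier, timestamp, value) are my reading of the BigtableRow schema that the template documentation links to, so check them against that schema. The function name to_bigtable_row, the UTF-8 string encoding of values, and the my_cf default are assumptions made for illustration, not part of the template.

```python
import time


def to_bigtable_row(row, family="my_cf", key_column="row_key", timestamp_micros=None):
    """Reshape a flat BigQuery row (dict) into a BigtableRow-like dict.

    Assumption: the target schema has a bytes `key` and an array of `cells`,
    each with `family`, `qualifier`, `timestamp` (microseconds) and `value`.
    """
    if timestamp_micros is None:
        timestamp_micros = int(time.time() * 1_000_000)
    cells = [
        {
            "family": family,
            "qualifier": name.encode("utf-8"),
            "timestamp": timestamp_micros,
            # Assumption: values are stored as UTF-8 strings.
            "value": str(value).encode("utf-8"),
        }
        for name, value in row.items()
        if name != key_column
    ]
    return {"key": str(row[key_column]).encode("utf-8"), "cells": cells}


example = to_bigtable_row(
    {"row_key": "1#2021-08-03", "1_cnt": 1, "2_cnt": 2, "3_cnt": 2},
    timestamp_micros=0,
)
```

Records shaped like this would still have to be written out as Avro with the template's schema (for example in a small Dataflow or batch job) before the GCS_Avro_to_Cloud_Bigtable template can read them.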

Related

Process fields with nested arrays into strings with strcat_array for output in Kusto

I would like to process Azure AD audit logs into HTML tables or CSV files. The data contains nested arrays that I would like to summarise into a comma-separated string.
For example, data that looks like this:
{
  "TargetResources": [{
    "displayName": "Policy",
    "modifiedProperties": [
      {"displayname": "PolicySetting1"},
      {"displayname": "PolicySetting2"}
    ]
  }]
}
would be processed into:
TargetResource | Policy
modifiedProps | PolicySetting1, PolicySetting2
mv-expand doesn't seem to work because some rows do not have modifiedProperties, so those rows get eliminated.
The only solution I have found that gets close to what I am trying to do looks like this:
AuditLogs
| extend TargetResource = tostring(TargetResources[0].displayName)
| extend ModifiedProperty0 = tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[0].displayName)
| extend ModifiedProperty1 = tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[1].displayName)
| extend ModifiedProperty2 = tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[2].displayName)
| extend ModifiedProperties = strcat(ModifiedProperty0,", ",ModifiedProperty1,", ",ModifiedProperty2)
This solution is limited because it cannot handle an arbitrary number of modifiedProperties values (it only works properly for exactly 3), which is a requirement for my purposes. I would like the solution to work when modifiedProperties does not exist and when it holds anywhere from 0 to 15 values.
Thank you for any help you can provide.
If I understood your description correctly, you could use mv-apply (twice) to achieve that:
datatable(d: dynamic)
[
    dynamic({"TargetResources":[{"displayName": "Policy0","someOtherProperty":"hello world"}]}),
    dynamic({"TargetResources":[{"displayName": "Policy1","modifiedProperties":[{"displayname":"PolicySetting1"},{"displayname":"PolicySetting2"}]}]}),
    dynamic({"TargetResources":[{"displayName": "Policy2","modifiedProperties":[{"displayname":"PolicySetting3"},{"displayname":"PolicySetting4"}]}, {"displayName":"Policy3","modifiedProperties":[{"displayname":"PolicySetting5"},{"displayname":"PolicySetting6"}]}]}),
]
| mv-apply tr = d.TargetResources on (
    extend TargetResource = tr.displayName
    | mv-apply mp = tr.modifiedProperties on (
        extend propertyName = mp.displayname
        | summarize modifiedProps = strcat_array(make_set(propertyName), ", ")
    )
)
| project TargetResource, modifiedProps
TargetResource | modifiedProps
Policy0 | 
Policy1 | PolicySetting1, PolicySetting2
Policy2 | PolicySetting3, PolicySetting4
Policy3 | PolicySetting5, PolicySetting6
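For readers who want to see the same flattening logic outside Kusto, here is a small Python sketch of what the two nested mv-apply steps compute. The function name summarize_modified_props is made up for this example, and unlike make_set (which does not promise an order) it keeps the first-seen order of the property names.

```python
def summarize_modified_props(record):
    """Return one (TargetResource, modifiedProps) pair per target resource.

    Resources without modifiedProperties are kept, with an empty string.
    """
    rows = []
    for resource in record.get("TargetResources", []):
        names = [p["displayname"] for p in resource.get("modifiedProperties", [])]
        unique = list(dict.fromkeys(names))  # de-duplicate, keep order
        rows.append((resource["displayName"], ", ".join(unique)))
    return rows


records = [
    {"TargetResources": [{"displayName": "Policy0", "someOtherProperty": "hello world"}]},
    {"TargetResources": [{"displayName": "Policy1", "modifiedProperties": [
        {"displayname": "PolicySetting1"}, {"displayname": "PolicySetting2"}]}]},
]
result = [row for record in records for row in summarize_modified_props(record)]
```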

How to track SLA of VM availability set (or availability zone) through heartbeats with Log Analytics (KQL)

I want to track the SLAs of our VMs in a Monitor Workbook using a Log Analytics query.
For this, I use the 'Heartbeat' table, which gives the heartbeats of each VM.
However, some of our VMs are in an availability set/zone, so the SLA is only broken if both heartbeats are missing within the same 1-minute interval.
As such I need to be able to group the heartbeats by availability set/zone in the query, but there doesn't seem to be such a property on the heartbeat.
I can use a separate Azure Resource Graph query to search for which VMs are in an availability set/zone, but when I merge this query with my Log Analytics query, I can't do any further Kusto Query Language processing on the query (I can only merge the tables).
For information, these are my Log Analytics Heartbeat query and my Resource Graph SLA query:
let timeRangeStart = {TimeRange:start};
let timeRangeEnd = {TimeRange:end};
Heartbeat
| where ResourceType == "virtualMachines"
| extend ResourceGroup = case(ResourceGroup <> "", ResourceGroup, "On-Prem")
| where TimeGenerated > timeRangeStart and TimeGenerated < timeRangeEnd and Computer in ({Servers})
| extend Resource=tolower(iff(isempty(_ResourceId), Resource, _ResourceId))
| summarize heartbeat_tot = count() by Resource,ResourceGroup, SubscriptionId
| extend total_number_of_buckets=round((timeRangeEnd-timeRangeStart)/1m)
| extend availability_rate=round(heartbeat_tot*100/total_number_of_buckets,2)
| extend availability_rate = min_of(availability_rate, 100)
| order by availability_rate asc

Resources // VMs
| where type == 'microsoft.compute/virtualmachines'
| extend AvSet = properties.availabilitySet.id
| extend AvZone = properties.availabilityZone.id
| extend VMname_SLA = iff(isnotempty(AvZone), AvZone, iff(isnotempty(AvSet), AvSet, id))
| extend SLA_VM = iff(isnotnull(AvZone), '99.99%', iff(isnotnull(AvSet), '99.95%', ''))
| extend managedBy = tolower(id)
| join kind = leftouter (
    Resources // Disks
    | where type == 'microsoft.compute/disks'
    | where isnotempty(managedBy)
    | extend managedBy = tolower(managedBy)
    // What do Standard HDD disks have as SKU tag??? I used StandardHDD for the time being
    | extend Tier_disk = sku.tier
    | extend SLA_disk = iff(Tier_disk == 'StandardHDD', '95%', iff(Tier_disk == 'Standard', '99.5%', '99.9%'))
) on managedBy
| extend SLA_tot = iff(isnotempty(SLA_VM), SLA_VM, SLA_disk)
| project managedBy, VMname_SLA, SLA_tot
| order by managedBy asc
How many resources are there?
If it is not a large number of resources, a workaround would be:
Run your ARG query in a text parameter, and format the results of the query to generate a JSON array of objects with the id, location, etc. that you need. Then mark this parameter as hidden.
In your Logs query, reference that parameter's JSON text before the query, and use KQL operators to turn that JSON structure into a table. You can then join/filter on that table in the query.
It isn't optimal, and it won't work well with large numbers of resources, since every time you run your query you're effectively "uploading" a JSON blob and then immediately parsing it apart again.
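Once each VM's availability set/zone is available next to the heartbeats (for instance through the parameter workaround above), the grouping described in the question amounts to counting the one-minute buckets in which at least one VM of the group reported. The Python sketch below only illustrates that logic; the function name group_availability, the input shapes, and the sample VM names are assumptions, and it is not a replacement for the KQL query.

```python
from datetime import datetime, timedelta


def group_availability(heartbeats, vm_to_group, start, end):
    """Percent of one-minute buckets in [start, end) with >= 1 heartbeat per group.

    heartbeats: iterable of (vm_name, timestamp) pairs.
    vm_to_group: maps a VM to its availability set/zone; VMs that are not
    listed are treated as their own group.
    """
    one_minute = timedelta(minutes=1)
    total_buckets = int((end - start) / one_minute)
    if total_buckets <= 0:
        raise ValueError("time range must span at least one minute")
    seen = {}
    for vm, ts in heartbeats:
        if not (start <= ts < end):
            continue
        group = vm_to_group.get(vm, vm)
        seen.setdefault(group, set()).add(int((ts - start) / one_minute))
    return {g: round(100 * len(b) / total_buckets, 2) for g, b in seen.items()}


start = datetime(2024, 1, 1, 0, 0)
rates = group_availability(
    [
        ("vm-a", start + timedelta(seconds=10)),  # minute 0, avset-1
        ("vm-b", start + timedelta(seconds=70)),  # minute 1, avset-1
        ("vm-c", start + timedelta(seconds=5)),   # minute 0, standalone
    ],
    {"vm-a": "avset-1", "vm-b": "avset-1"},
    start,
    start + timedelta(minutes=2),
)
```

In this example the availability set counts as available in both minutes because one of its two VMs reported in each, while the standalone VM only covers one of the two minutes.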

How to s3-select all data within inner array of parquet file

I have Parquet files on S3 that need to be queried using S3 Select. The Parquet files are generated from JSON files with inner arrays. The S3 Select query can get the first array, but if I try to query the records in the inner array, it fails to return the ids, saying it is an invalid data source.
What I tried:
Looking up the Amazon documentation, which was of no use
Multiple formats of the S3 Select query
JSON structure
{
  "Array": [
    {
      "Id": "1"
    },
    {
      "Id": "2"
    }
  ]
}
Query
select s.Array[*].id from s3object s
Expect to get all the ids back from the query so should return Id 1 and 2.
The following query returns the IDs in the array (up to the given limit):
select s.Id from S3Object[*].Array[*] s limit 5
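If the query is run from code, a sketch with boto3 could look like the following. The helper names build_inner_array_query and select_inner_ids are made up for this example, the bucket and key are whatever your object is called, and the snippet assumes the object is Parquet and that JSON output is acceptable. It has not been run against the asker's data.

```python
import json


def build_inner_array_query(array_field="Array", column="Id", limit=None):
    """Build the S3 Select expression that iterates over the inner array."""
    query = f"select s.{column} from S3Object[*].{array_field}[*] s"
    if limit is not None:
        query += f" limit {int(limit)}"
    return query


def select_inner_ids(bucket, key, query):
    """Run the expression with S3 Select and return the decoded JSON records."""
    import boto3  # imported here so the query builder works without boto3

    client = boto3.client("s3")
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=query,
        InputSerialization={"Parquet": {}},
        OutputSerialization={"JSON": {}},
    )
    # The payload is an event stream; only "Records" events carry data.
    chunks = [e["Records"]["Payload"] for e in response["Payload"] if "Records" in e]
    text = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


query = build_inner_array_query(limit=5)
```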

Google Pub/Sub to Dataflow, avoid duplicates with Record ID

I'm trying to build a streaming Dataflow job that reads events from Pub/Sub and writes them into BigQuery.
According to the documentation, Dataflow can detect duplicate message delivery if a Record ID is used (see: https://cloud.google.com/dataflow/model/pubsub-io#using-record-ids)
But even using this Record ID, I still have some duplicates
(around 0.0002%).
Did I miss something?
EDIT:
I use the Spotify Async PubSub Client to publish messages with the following snippet:
Message
  .builder()
  .data(new String(Base64.encodeBase64(json.getBytes())))
  .attributes("myid", id, "mytimestamp", timestamp.toString)
  .build()
Then I use Spotify scio in the Dataflow job to read the messages from Pub/Sub and save them to BigQuery:
val input = sc.withName("ReadFromSubscription")
  .pubsubSubscription(subscriptionName, "myid", "mytimestamp")
input
  .withName("FixedWindow")
  .withFixedWindows(windowSize) // apply windowing logic
  .toWindowed // convert to WindowedSCollection
  //
  .withName("ParseJson")
  .map { wv =>
    wv.copy(value = TableRow(
      "message_id" -> (Json.parse(wv.value) \ "id").as[String],
      "message" -> wv.value)
    )
  }
  //
  .toSCollection // convert back to normal SCollection
  //
  .withName("SaveToBigQuery")
  .saveAsBigQuery(bigQueryTable(opts), BQ_SCHEMA, WriteDisposition.WRITE_APPEND)
The window size is 1 minute.
After only a few seconds of injecting messages, I already have duplicates in BigQuery.
I use this query to count duplicates:
SELECT
COUNT(message_id) AS TOTAL,
COUNT(DISTINCT message_id) AS DISTINCT_TOTAL
FROM my_dataset.my_table
//returning 273666 273564
And this one to look at them:
SELECT *
FROM my_dataset.my_table
WHERE message_id IN (
SELECT message_id
FROM my_dataset.my_table
GROUP BY message_id
HAVING COUNT(*) > 1
) ORDER BY message_id
//returning for instance:
row|id | processed_at | processed_at_epoch
1 00166a5c-9143-3b9e-92c6-aab52601b0be 2017-02-02 14:06:50 UTC 1486044410367 { ...json1... }
2 00166a5c-9143-3b9e-92c6-aab52601b0be 2017-02-02 14:06:50 UTC 1486044410368 { ...json1... }
3 00354cc4-4794-3878-8762-f8784187c843 2017-02-02 13:59:33 UTC 1486043973907 { ...json2... }
4 00354cc4-4794-3878-8762-f8784187c843 2017-02-02 13:59:33 UTC 1486043973741 { ...json2... }
5 0047284e-0e89-3d57-b04d-ebe4c673cc1a 2017-02-02 14:09:10 UTC 1486044550489 { ...json3... }
6 0047284e-0e89-3d57-b04d-ebe4c673cc1a 2017-02-02 14:08:52 UTC 1486044532680 { ...json3... }
The BigQuery documentation states that there may be rare cases where duplicates arrive:
"BigQuery remembers this ID for at least one minute" -- if Dataflow takes more than one minute before retrying the insert BigQuery may allow the duplicate in. You may be able to look at the logs from the pipeline to determine if this is the case.
"In the rare instance of a Google datacenter losing connectivity unexpectedly, automatic deduplication may not be possible."
You may want to try the instructions for manually removing duplicates. This will also allow you to see the insertID that was used with each row to determine if the problem was on the Dataflow side (generating different insertIDs for the same record) or on the BigQuery side (failing to deduplicate rows based on their insertID).
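To make that last check concrete, here is a small Python sketch that takes rows already fetched from the table together with their insert IDs and reports, for each duplicated message, whether the insert IDs differ. The field names message_id and insert_id and the function name classify_duplicates are assumptions for this example; how you obtain the insert ID for each row depends on the manual de-duplication instructions mentioned above.

```python
from collections import defaultdict


def classify_duplicates(rows, id_field="message_id", insert_id_field="insert_id"):
    """Report duplicated messages and whether their insert IDs differ.

    "different insertIds" points at the Dataflow side (a new ID was generated
    for the same record); "same insertId" points at the BigQuery side (the
    row was not de-duplicated despite an identical ID).
    """
    by_message = defaultdict(list)
    for row in rows:
        by_message[row[id_field]].append(row[insert_id_field])
    report = {}
    for message_id, insert_ids in by_message.items():
        if len(insert_ids) > 1:
            differs = len(set(insert_ids)) > 1
            report[message_id] = "different insertIds" if differs else "same insertId"
    return report


sample = [
    {"message_id": "a", "insert_id": "x"},
    {"message_id": "a", "insert_id": "y"},
    {"message_id": "b", "insert_id": "z"},
    {"message_id": "b", "insert_id": "z"},
    {"message_id": "c", "insert_id": "q"},
]
report = classify_duplicates(sample)
```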

How to filter avro records through filtering with a list of another file in pig?

I have a file "fileA" that is an Avro file with the following records:
{itemid:"Carrot"}
{itemid:"Lettuce"}
...
I have another file "fileB" that is an Avro file with multiple records following the same schema:
{item: "Carrot", cost: $2, ...other fields..}
{item: "Lettuce", cost: $2, ...other fields..}
{item: "Rice", cost: $2, ...other fields..}
...
How can I use pig to filter the data such that I can store all the relevant records in file "B" in a new output file?
I tried performing the following:
A = load 'fileA' using AvroStorage();
B = load 'fileB' using AvroStorage();
C = JOIN A by itemid , B by item;
STORE C into 'outputpath' using AvroStorage();
I am getting the error "Pig Schema contains a name that is not allowed in Avro".
I want to avoid having to specify the complete schema of "B" inside of the AvroStorage() or any fields in A, as I only want to use A to filter down the records of B for storage and not add or change any of the schema output of B. Is there a way to do this?