We have a requirement to store data in BigQuery using a Dataflow job that reads from Kafka. A few of the columns contain sensitive data that needs to be encrypted.
For example:
schema: element(id, name, phone_num)
In this example I want to do a streaming insert of these elements with 'phone_num' encrypted using KMS.
Please suggest a possible way to achieve this.
I have tried exploring BigQueryIO, which has table-level encryption; is there any way I can encrypt at the column level?
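Roughly, the approach I have in mind is to encrypt the column inside a DoFn before the write. Below is a minimal sketch with the Beam Python SDK (the key path, Kafka settings, and table name are placeholders, and the same pattern should apply with BigQueryIO in the Java SDK):

import base64
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions


class EncryptPhoneNum(beam.DoFn):
    # Placeholder Cloud KMS key path, not taken from the actual project.
    KEY_NAME = "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key"

    def setup(self):
        # One KMS client per worker; created here so it is not pickled.
        from google.cloud import kms
        self._client = kms.KeyManagementServiceClient()

    def process(self, record):
        # Kafka value assumed to be JSON: {"id": ..., "name": ..., "phone_num": ...}
        element = json.loads(record[1])
        response = self._client.encrypt(
            request={"name": self.KEY_NAME,
                     "plaintext": element["phone_num"].encode("utf-8")})
        # Store the ciphertext as base64 so it fits a STRING column.
        element["phone_num"] = base64.b64encode(response.ciphertext).decode("ascii")
        yield element


with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | ReadFromKafka(consumer_config={"bootstrap.servers": "broker:9092"},
                     topics=["elements"])
     | beam.ParDo(EncryptPhoneNum())
     | beam.io.WriteToBigQuery(
         "my-project:my_dataset.elements",
         schema="id:INTEGER,name:STRING,phone_num:STRING",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS))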
I created a symmetric key in Cloud KMS and used Python to encrypt some of the columns in my CSV data (decoding the ciphertext so it can be stored as a string).
Then I load this CSV file into a BQ table. I have an idea to decrypt the column using KMS via SQL, and I found this document about column-level encryption from this link.
But I am not sure whether that only works for data that was encrypted with the provided SQL functions (it seems like it).
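For context, the encryption step looks roughly like this (a simplified sketch; the key path and the sensitive column are placeholders, and I base64-encode the ciphertext to get a string):

import base64
import csv

from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_name = client.crypto_key_path("my-project", "my-location", "my-keyring", "my-key")

def encrypt_value(plaintext: str) -> str:
    # Encrypt with the Cloud KMS symmetric key and return a base64 string.
    response = client.encrypt(request={"name": key_name,
                                       "plaintext": plaintext.encode("utf-8")})
    return base64.b64encode(response.ciphertext).decode("ascii")

with open("input.csv", newline="") as src, open("encrypted.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["name"] = encrypt_value(row["name"])  # the sensitive column
        writer.writerow(row)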
There are some fields where I don't actually know what to put:
DECLARE KMS_RESOURCE_NAME STRING;
SET KMS_RESOURCE_NAME = "gcp-kms://projects/<project>/locations/<location>/keyRings/<keyring>/cryptoKeys/<key>";

SELECT
  id,
  AEAD.DECRYPT_STRING(
    KEYS.KEYSET_CHAIN(KMS_RESOURCE_NAME, first_level_keyset),
    name,
    additional_authenticated_data)
FROM
  mydataset.mytable
What should I actually put for first_level_keyset and additional_authenticated_data if I only need to decrypt a column that I pre-encrypted in the CSV?
Edit:
The reason I have to encrypt the data beforehand is that I store it in my data lake in GCS first, and I want it encrypted so there is no readable PII in either GCS or BQ.
What I am trying to achieve is this:
1. Access a REST API to download hotel reservation data - the output format is JSON
2. Convert the JSON data into the correct format to be uploaded into a SQL table
3. Upload this data to an existing Google BigQuery table as additional rows
Do let me know if any further information is required and if I have been clear enough
Thanks in advance
1) pretty good REST API tutorial
2) You can use a local SQL DB or Cloud SQL. The process would be the same (parse the JSON and insert it into the DB).
If you decide to use Cloud SQL, you can parse the JSON and save it as a CSV, then follow this tutorial,
or
simply parse the JSON and insert it using one of the following APIs.
3) You can easily load data into any BigQuery table by using the BigQuery API. You can also directly insert the JSON data into BigQuery.
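For example, here is a rough sketch of steps 1 and 3 with the Python client libraries (the API URL, field mapping, and table ID below are placeholders you would replace):

import requests
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.reservations"  # the existing table

# 1) Download the hotel reservation data as JSON.
response = requests.get("https://api.example.com/reservations", timeout=30)
response.raise_for_status()
reservations = response.json()

# 2) Reshape each JSON record so its keys match the BigQuery table schema.
rows = [{"reservation_id": r["id"],
         "guest_name": r["guest"]["name"],
         "check_in": r["checkInDate"]}
        for r in reservations]

# 3) Append the rows to the existing table with the streaming insert API.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"BigQuery insert errors: {errors}")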
But as Tamir mentioned, it would be best to ask questions if you encounter errors/issues. Since there are multiple ways to handle this type of scenario, we cannot provide an exact solution for you.
I am new to AWS Data Pipeline. We have a use case where we copy updated data into Redshift. I wanted to know whether I can use the OVERWRITE_EXISTING insert mode for RedshiftCopyActivity. Also, please explain the internal workings of OVERWRITE_EXISTING.
Data Pipelines are used to move data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or easily merge data into an existing table.
"OVERWRITE_EXISTING", over writes the already existed data in to the destination table but with a constraint of unique identifier (Primary Key) in RedShift cluster.
You can use "TRUNCATE", if you dont want your table structure to be changed due to the addition of PK.
You can find more details here: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
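For reference, this is roughly how the insert mode appears in a pipeline definition pushed with boto3 (the object IDs and references below are placeholders; the relevant part is just the insertMode field):

import boto3

client = boto3.client("datapipeline")

# Fragment of a pipeline definition: the RedshiftCopyActivity object whose
# insertMode controls how incoming rows are applied to the target table.
copy_activity = {
    "id": "RedshiftCopyActivityId_1",
    "name": "CopyToRedshift",
    "fields": [
        {"key": "type", "stringValue": "RedshiftCopyActivity"},
        {"key": "insertMode", "stringValue": "OVERWRITE_EXISTING"},  # or "TRUNCATE"
        {"key": "input", "refValue": "S3InputDataNodeId"},
        {"key": "output", "refValue": "RedshiftDataNodeId"},
        {"key": "runsOn", "refValue": "Ec2ResourceId"},
    ],
}

client.put_pipeline_definition(
    pipelineId="df-EXAMPLEPIPELINEID",
    pipelineObjects=[copy_activity],  # plus the data node, resource, and schedule objects
)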
First, I created an empty table with partitioning and clustering. After that, I wanted to configure the Data Transfer Service to fill my table from Google Cloud Storage. But when I configured the transfer, I didn't see a parameter field that allows choosing the clustering field.
I tried to do the same thing without clustering and I could fill my table easily.
BigQuery error when I ran the transfer:
Failed to start job for table matable$20190701 with error INVALID_ARGUMENT: Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:DAY,field:) clustering(string_field_15), but transfer target partitioning specification is interval(type:DAY,field:). Please retry after updating either the destination table or the transfer partitioning specification.
When you define the table you specify the partitioning and clustering columns. That's everything you need to do.
When you load the data from GCS (via the CLI or the UI), BigQuery automatically partitions and clusters the data.
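For example, with the Python client library the partitioning and clustering live on the table definition, and the load job itself does not mention them (the dataset, table, column, and URI names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.mydataset.matable"

# Define partitioning and clustering on the table itself.
schema = [
    bigquery.SchemaField("string_field_15", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
table.clustering_fields = ["string_field_15"]
client.create_table(table)

# The load job only needs the source format; rows land in the right
# partitions and clusters automatically.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV,
                                    skip_leading_rows=1)
load_job = client.load_table_from_uri("gs://my-bucket/data/*.csv", table_id,
                                      job_config=job_config)
load_job.result()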
If you can give more detail on how you created the table and set up the transfer, it would help me give a more detailed explanation.
Thanks for your time.
Of course:
empty table configuration
transfer configuration
I succeeded in transferring data without clustering, but when I add clustering to my empty table, the transfer fails.
Like the example shown in https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/cookbook/TriggerExample.java
There is a BigQuery table where new data gets appended every 15 minutes, and the table has a Timestamp column. Is it possible to perform streaming analysis with a fixed-window, time-based trigger on the data being added to that BigQuery table, similar to the above example which uses Pub/Sub?
Streaming data out of BigQuery is tricky -- unlike PubSub, BigQuery does not have a "subscribe to notifications" API. Is there a way you can stream upstream from BigQuery -- i.e., can you stream from whoever is pushing the 15-minute updates?
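If the upstream producer can also publish to Pub/Sub, the fixed-window analysis would look roughly like this in the Beam Python SDK (the topic name, JSON parsing, and timestamp field are placeholders; the timestamp is assumed to be epoch seconds):

import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/updates")
     | beam.Map(json.loads)
     # Use the record's own Timestamp column (epoch seconds) as the event time.
     | beam.Map(lambda row: window.TimestampedValue(row, row["timestamp"]))
     | beam.WindowInto(window.FixedWindows(15 * 60))  # 15-minute fixed windows
     | beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults()
     | beam.Map(print))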