Hive partitioning layout for external tables in BigQuery

I have several questions wrapped up in this situation, so here goes:
Has anyone ever written Kafka's output to a Google Cloud Storage (GCS) bucket, such that the data in that bucket is partitioned using the "default hive partitioning layout"?
The intent behind doing that is that this external table needs to be queryable in BigQuery.
Google's documentation on this is here, but I wanted to see if someone has an example ( https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs )
For example, the documentation says "files follow the default layout, with the key/value pairs laid out as directories with an = sign as a separator, and the partition keys are always in the same order."
What's not clear is:
a) Does Kafka create these directories on the fly, or do I have to pre-create them? Let's say I want Kafka to write to directories in GCS based on date:
gs://bucket/table/dt=2020-04-07/
Tonight, after midnight, do I have to pre-create the new directory gs://bucket/table/dt=2020-04-08/, or can Kafka create it for me? And in all this, how does the Hive partitioning layout help me?
b) Does my table's data, which I am writing into these directories every day, need to have "dt" (from gs://bucket/table/dt=2020-04-07/) as a column in it?
The goal in all this is to have BigQuery query this external table, which under the hood references all the data in this bucket, i.e.
gs://bucket/table/dt=2020-04-06/
gs://bucket/table/dt=2020-04-07/
gs://bucket/table/dt=2020-04-08/
Just trying to see if this would be the right approach for it.

Kafka itself is a messaging system that lets you exchange data between processes, applications, and servers, but it requires producers and consumers (here is an example) that move the data. For instance:
The Producer needs to send the data in a format that BigQuery can read.
And the Consumer needs to write the data with a valid Hive Layout.
The Consumer should write to GCS, so you would need to find the proper connector for your application (e.g. this Java connector or the Confluent connector). When writing the messages to GCS, you need to take care to use a valid 'default hive partitioning layout'.
For example, in gs://bucket/table/dt=2020-04-07/, dt is a column the table is partitioned on and 2020-04-07 is one of its values, so take care with this. Once you have a valid Hive layout in GCS, you need to create a table in BigQuery. I recommend creating a native table from the UI, selecting Google Cloud Storage as the source and enabling 'Source Data Partitioned', but you can also use --hive_partitioning_source_uri_prefix and --hive_partitioning_mode to link the GCS data with a BigQuery table.
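If you prefer to set this up programmatically rather than through the UI, here is a minimal sketch using the google-cloud-bigquery Python client to define an external table over a hive-partitioned prefix (the project, dataset, bucket, and file format below are assumptions, not values from the question):
```python
# Minimal sketch, not from the original answer: define an external, hive-partitioned
# table with the google-cloud-bigquery Python client. Project, dataset, bucket and
# file format are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"  # infer the partition key (dt) and its type from the paths
hive_opts.source_uri_prefix = "gs://bucket/table/"

external_config = bigquery.ExternalConfig("PARQUET")  # or "CSV", "NEWLINE_DELIMITED_JSON", ...
external_config.source_uris = ["gs://bucket/table/*"]
external_config.hive_partitioning = hive_opts

table = bigquery.Table("my-project.my_dataset.kafka_events")
table.external_data_configuration = external_config
client.create_table(table)
```
With mode set to "AUTO", the dt key is read from the directory names (gs://bucket/table/dt=2020-04-07/), so it behaves like a column you can filter on in queries.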
As this process involves different layers of development and configuration, if it makes sense for you, I recommend opening new questions for any specific errors you run into.
Last but not least, the Kafka to BigQuery connector and other connectors that ingest from Kafka into GCP can serve you better if a Hive layout is not mandatory for your use case.


Azure Sentinel referencing large sets of data

I've been trying to find the most effective (elegant) solution to achieve what I'm trying to do. I'd like to hear from the community, thank you.
Situation:
I need to geo-enrich IP address records in Sentinel. Example: successful SigninLogs, since the Microsoft enrichment sometimes generates "Unknown" results in the IP enrichment maps.
An external reference file (subnet, country_code, country_name) is available publicly; however, the size and number of records are rather large (~12 MB, 200K+ records).
Issue:
I tried using a storage account blob to host the "reference table", but apparently hit the limit on the maximum blob size in the storage account.
It looks like Workbooks can read at most 30,000 records from external sources using the 'externaldata' command, so only part of the reference data can be read and referred to.
Options considered:
Ingest the reference table into the log analytics workspace, do a join/lookup to this custom reference table for enrichment
Export the IP addresses from the SigninLogs table to blob storage, enrich them using Logic Apps, and put the results back into a 'reference' blob storage; then read that 'reference' blob storage using the 'externaldata' syntax.
Limitation Observed:
I came to the realization that Sentinel can't perform API calls for enrichment from external data (correct me if I'm wrong). I've done similar things with Splunk, where we could enrich the data on the fly by making multiple API calls to an outside database.
Ingest the Data - As you've mentioned, ingest the data and join the tables. You would need to regularly ingest this though to ensure you can lookup the data within the desired time range (e.g. If you have an Analytics Rule, then this only looks up data for a 14 day period).
Use a Playbook - If you want the Geo-IP lookup post incident, you can perform this with a Logic App
Use Jupyter Notebooks - These have the flexibility to perform API calls against external locations and join the results to the data hosted in Sentinel. An example notebook is the IP Explorer Notebook; see "Use Jupyter notebooks to hunt for security threats".
Threat Intelligence - Microsoft enriches all imported threat intelligence indicators with GeoLocation and WhoIs data, which is displayed together with other indicator details.
Since March 2022, you can upload large CSV files into a Sentinel Watchlist. This way, you can upload a complete GeoIP database and perform ipv4_lookups. This blog post explains how to do this: https://cryptsus.com/blog/enrich-geolocation-sentinel-siem.html

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the following 2 points, please:
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> The COPY INTO doc says that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, just slower because of the network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage and not have to worry about defining it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to the stage, making that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for dev-to-prod migrations as well.
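To make that concrete, here is a small sketch of the stage-based flow using the snowflake-connector-python package (the account, credentials, bucket path, file format, and table names are hypothetical):
```python
# A small sketch of the stage-based flow using snowflake-connector-python.
# Account, credentials, bucket path, file format and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Define the file format and stage once; COPY INTO statements can then stay short.
cur.execute("CREATE FILE FORMAT IF NOT EXISTS my_csv_format TYPE = CSV SKIP_HEADER = 1")
cur.execute("""
    CREATE STAGE IF NOT EXISTS my_s3_stage
      URL = 's3://my-bucket/path/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
""")

# Later loads only reference the stage, not the bucket, credentials or format details.
cur.execute("COPY INTO my_table FROM @my_s3_stage")
```
Because the format and credentials live on the stage, moving the bucket later only means recreating the stage, not touching every COPY INTO statement.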
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV).
JOB 2 -- Use the transformed data stored in the delimited file by JOB 1 as the source, then transform it for BigQuery and send it.
I am using a delimited text/CSV file as the intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to scale to millions of rows, what should I use as this intermediary store? Will a relational database help? Or are delimited files good enough? Or is there anything else I can use?
PS: I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a secure cloud architecture.
Please share your views on this.
In data warehousing architecture, it's usually good practice for the staging layer to be persistent. Among other things, this gives you the ability to trace data lineage back to the source, to reload your final model from the staging point when business rules change, and to get a full picture of the transformation steps the data went through all the way from landing to reporting.
I'd also consider changing your design to keep the staging layer persistent under its own dataset in BigQuery rather than just deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not end-user reports, you will be paying only for storage for the most part.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whichever direction you choose, I recommend storing the files you're using to load the table rather than deleting them. Sooner or later there will be questions or failures in your final report, and you'll likely need to trace back to the source for investigation.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
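To illustrate the Cloud Storage -> BigQuery load step above, here is a minimal sketch with the google-cloud-bigquery Python client (bucket, dataset, and table names are made up):
```python
# Sketch of loading a staged CSV file from Cloud Storage into BigQuery.
# All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the staged file
)

load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/transformed/orders_2020-04-07.csv",
    "my-project.staging.orders",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```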
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially lets you reshape the data (convert to arrays), filter out columns/rows, and perform the transformation in a single SQL statement.
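A hedged example of what that single SQL statement could look like, run through the google-cloud-bigquery client (table and column names are hypothetical):
```python
# Sketch of the ELT transform: raw data is already loaded as-is, and one SQL
# statement filters and reshapes it inside BigQuery. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE OR REPLACE TABLE my_dataset.orders_by_customer AS
    SELECT
      customer_id,
      ARRAY_AGG(STRUCT(order_id, order_total) ORDER BY order_date) AS orders
    FROM my_dataset.raw_orders
    WHERE order_total IS NOT NULL   -- filter out rows
    GROUP BY customer_id            -- reshape rows into an array per customer
""").result()
```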

Cloud Dataflow: Generating tables in BigQuery

I have a pipeline which reads streaming data from Cloud Pub/Sub; the data is processed by Dataflow and then saved into one large BigQuery table. Each Pub/Sub message includes an associated account_id. Is there a way to create new tables on the fly when a new account_id is identified, and then populate them with data from that associated account_id?
I know that this can be done by updating the pipeline for each new account. But in an ideal world, Cloud Dataflow would generate these tables within the code programmatically.
I wanted to share a few options I see.
Option 1 - wait for the "partition on non-date field" feature
It is not known when this will be implemented and made available to us, so it might not be what you want right now. But when it goes live, it will be the best option for such scenarios.
Option 2 - hash your account_id into a predefined number of buckets.
In this case you can pre-create all those tables and, in your code, have logic that picks the destination table based on the account hash (sketched below). The same hashing logic then needs to be used in the queries that read that data.
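Here is a hedged sketch of Option 2 as an Apache Beam Python pipeline, assuming the bucket tables events_00 .. events_15 are pre-created (all project, dataset, and subscription names are made up):
```python
# Hedged sketch of Option 2, not code from the answer: account_id is hashed into a
# fixed number of pre-created tables and each element is routed to its bucket's table.
import json
import zlib

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

NUM_BUCKETS = 16  # the predefined number of pre-created tables

def bucket_table(row):
    # Stable hash so the same account_id always lands in the same table.
    bucket = zlib.crc32(row["account_id"].encode()) % NUM_BUCKETS
    return f"my-project:my_dataset.events_{bucket:02d}"

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/events")
     | "Parse" >> beam.Map(json.loads)
     | "Write" >> beam.io.WriteToBigQuery(
         table=bucket_table,  # callable chooses the destination table per element
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```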
The API for creating BigQuery Tables is at https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert.
Nevertheless, it would probably be easier if you store all accounts in one static table that contains account_id as one column.
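For reference, the tables.insert API mentioned above is wrapped by the google-cloud-bigquery client; a minimal sketch with hypothetical names:
```python
# Sketch of creating a BigQuery table programmatically (wraps tables.insert).
# Project, dataset, table and schema are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("account_id", "STRING"),
    bigquery.SchemaField("event_time", "TIMESTAMP"),
    bigquery.SchemaField("payload", "STRING"),
]
client.create_table(bigquery.Table("my-project.my_dataset.events_account_123", schema=schema))
```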

Export BigQuery data into an in-house Hadoop cluster

We have GA data in BigQuery, and some of my users want to join it to in-house data in Hadoop, which we cannot move to BigQuery.
Please let me know the best way to do this.
See BigQuery to Hadoop Cluster - How to transfer data?:
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class, which lets you:
Write a query to select the appropriate BigQuery objects.
Split the results of the query evenly among the Hadoop nodes.
Parse the splits into Java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.
(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes)
You could follow the route of the Hadoop connector as Felipe Hoffa suggested, or build your own application which transfers data from BigQuery to your Hadoop cluster. Either way, you will be able to make the required joins on the Hadoop cluster using Pig, Hive, etc.
In case you want to try the application method, let me take you through a process flow which your application may need to follow:
Query BQ tables (flatten any nested or repeated fields)
If your query response is too large, you can divert this response into a destination table. Your destination table is simply another table in BigQuery.
You can then export this destination table to a GCS bucket. This is a separate export job. You will have options to choose an export format, a compression type, splitting the data into multiple files, etc.
From the GCS bucket, using a tool called gsutil, you can copy the files to your cluster gateway machine.
From your cluster gateway machine, you can use the hadoop command 'copyFromLocal' to copy this data to your HDFS directory.
Once it is in an HDFS directory, you can create a Hive external table pointing to that directory. Your data will now be available in the Hive table, ready to be joined with the in-house data on your cluster.
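As a rough sketch of the destination-table and export steps of that flow with the google-cloud-bigquery Python client (project, dataset, bucket, and query are hypothetical):
```python
# Sketch: materialize a (flattened) query result into a destination table, then
# export that table to a GCS bucket. All names and the query are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Divert a large query response into a destination table.
query_config = bigquery.QueryJobConfig(destination="my-project.exports.ga_sessions_flat")
client.query(
    "SELECT fullVisitorId, visitStartTime FROM `my-project.analytics.ga_sessions_20200407`",
    job_config=query_config,
).result()

# Export the destination table to GCS as compressed, wildcard-split CSV files.
extract_config = bigquery.ExtractJobConfig(destination_format="CSV", compression="GZIP")
client.extract_table(
    "my-project.exports.ga_sessions_flat",
    "gs://my-export-bucket/ga_sessions/part-*.csv.gz",
    job_config=extract_config,
).result()

# The remaining steps happen outside BigQuery, for example:
#   gsutil cp gs://my-export-bucket/ga_sessions/part-*.csv.gz /tmp/ga/
#   hadoop fs -copyFromLocal /tmp/ga/ /data/ga_sessions/
```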
Let me know if you need any more details or clarifications. I went down this route because I found the connector alternative a little too complex, but that is a subjective opinion that varies from person to person.