How to configure dbt sources from BigQuery EXTERNAL_QUERY

In BigQuery, I am using an external connection (a federated query against Cloud SQL), from which I can get data with SELECT * FROM EXTERNAL_QUERY("gcp-name.europe-west3.friendly_name", "SELECT * FROM database_name.external_table;")
My question is: in dbt, how do I define this source in my schema.yml file, and what should my FROM {{source(...,...)}} statement look like?

Given the current state of the dbt-external-tables package (which I don't believe meets your needs), I would say you have two options:
1. Define your external dependencies as static views in a custom schema and then import them as dbt sources.
2. Define your external dependencies within dbt as base models that wrap the EXTERNAL_QUERY call, and then ref() those like normal in your transform / stage layer.
Example of #1
* my-bq-project-id
|
|_ dbt_schema
|
|_ external_db_schema
   |
   |_ external_table_1
   |_ external_table_2
   etc.
And then you'd have:
* my-dbt-project-dir
|
|_ analysis
|_ data
|_ models
| |_ sources
| | |
| | |> my_external_table_1.yml
| | |> my_external_table_2.yml
| |
| |_ transforms
| |_ final
|_ dbt_project.yml
|_ readme.md
Where "my_external_table_1.yml" looks like:
sources:
  - name: external_db_schema
    database: my-bq-project-id
    tables:
      - name: my_external_table_1
        description: "Lorem Ipsum"
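With that source defined, a model can select from it with FROM {{ source('external_db_schema', 'my_external_table_1') }}.
And your static view is defined by running a query like: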
create view if not exists `my-bq-project-id.external_db_schema.my_external_table_1` as
( SELECT * FROM EXTERNAL_QUERY("gcp-name.europe-west3.friendly_name",
"SELECT * FROM database_name.external_table;"))
Example of #2
Just make a base level dbt model that does exactly what you are describing on a 1-1 object mapping:
my_external_table_1.sql
-- dbt wraps the model's SELECT in its own CREATE statement, so the model body is just the query
SELECT * FROM EXTERNAL_QUERY("gcp-name.europe-west3.friendly_name",
"SELECT * FROM database_name.external_table;")
And then from here you'll be able to ref('my_external_table_1') in your transform layer etc.
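If option #1 has to cover many remote tables, the CREATE VIEW statements can be generated instead of written by hand. Below is a minimal Python sketch of that idea. The function name build_view_ddl and its parameters are hypothetical, it only builds the SQL text (running it against BigQuery is left out), and it assumes the remote table name contains no double quotes.

```python
def build_view_ddl(project, dataset, view_name, connection_id, remote_table):
    """Return a CREATE VIEW statement that wraps EXTERNAL_QUERY for one remote table.

    All names are illustrative; the statement has the same shape as the
    hand-written view definition shown above.
    """
    inner_sql = f"SELECT * FROM {remote_table};"
    # The connection id and the inner query are embedded in double-quoted
    # SQL strings, so a double quote inside either one would break the statement.
    if '"' in inner_sql or '"' in connection_id:
        raise ValueError("double quotes are not supported in this sketch")
    return (
        f"create view if not exists `{project}.{dataset}.{view_name}` as\n"
        f'( SELECT * FROM EXTERNAL_QUERY("{connection_id}",\n'
        f'"{inner_sql}"))'
    )


if __name__ == "__main__":
    print(
        build_view_ddl(
            "my-bq-project-id",
            "external_db_schema",
            "my_external_table_1",
            "gcp-name.europe-west3.friendly_name",
            "database_name.external_table",
        )
    )
```

Each generated view then gets its own entry under tables: in the sources file.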

Self-join Kusto Query in Analytics Rule

I am working on Microsoft Sentinel Analytics Rules with the Kusto Query Language (KQL).
I need to work with a table called CrowdstrikeReplicatorLogs_CL, which contains a) data rows that I need to alert on and b) metadata rows that hold information about the subject of the alert.
This means I need to join the table with itself to get the final result.
The column to join on is aid_g.
ThreatIntelligenceIndicator
| where foo == bar
| join kind=innerunique (
CrowdstrikeReplicatorLogs_CL
| where TimeGenerated >= ago(dt_lookBack)
| where event_simpleName_s has_any ("NetworkConnectIP4", "NetworkConnectIP6")
| extend json=parse_json(custom_fields_message_s)
| extend ip4 = json["RemoteAddressIP4"], ip6=json["RemoteAddressIP6"]
| extend CS_ipEntity = tostring(iff(isnotempty(ip4), ip4, ip6))
| extend CommonSecurityLog_TimeGenerated = TimeGenerated
) on $left.TI_ipEntity == $right.CS_ipEntity
| join kind=innerunique (
CrowdstrikeReplicatorLogs_CL
| where custom_fields_message_s has "ComputerName"
| extend customFields=parse_json(custom_fields_message_s)
| project Hostname=customFields['ComputerName'], Platform=event_platform_s, aid_g
) on $left.aid_g == $right.aid_g
;
However, this raises a "Query contains incompatible 'set' commands." error in Sentinel.
Is there a proper way to self-join tables?

Deep dive Azure Log analytics cost using KQL query

I'm running the following Log Analytics Kusto query to see which data is ingested and therefore generates our Log Analytics cost:
Usage
| where IsBillable == true
| summarize BillableDataGB = sum(Quantity) by Solution, DataType
| sort by Solution asc, DataType asc
and the output is the following:
What kind of query should I use if I want to dig deeper into, e.g., ContainerInsights/InfrastructureInsights/ServiceMap/VMInsights/LogManagement, to get more detailed data on which names or namespaces really drive the cost?
The InsightsMetrics table has, e.g., these names and namespaces.
I was maybe able to get something out using the following query, but something is still missing. I'm not sure whether I'm on the right track:
union withsource = tt *
| where _IsBillable == true
| extend Namespace, Name
Here is a Kusto query for getting the name and namespace details:
let startTimestamp = ago(1h);
KubePodInventory
| where TimeGenerated > startTimestamp
| project ContainerID, PodName=Name, Namespace
| where PodName contains "name" and Namespace startswith "namespace"
| distinct ContainerID, PodName
| join
(
ContainerLog
| where TimeGenerated > startTimestamp
)
on ContainerID
// at this point before the next pipe, columns from both tables are available to be "projected". Due to both
// tables having a "Name" column, we assign an alias as PodName to one column which we actually want
| project TimeGenerated, PodName, LogEntry, LogEntrySource
| summarize by TimeGenerated, LogEntry
| order by TimeGenerated desc
For more information, you can go through the Microsoft documentation and the Kusto Query tutorial.

Apache Drill and Apache Kudu - not able to run "select * from <some table>" using Apache Drill, for the table created in Kudu through Apache Impala

I'm able to connect to Kudu through Apache Drill and can list tables fine. But when I try to fetch data from the table "impala::default.customer" below, none of the options I tried works.
The table in Kudu was created through impala-shell as an external table.
Initial connection to Kudu and listing objects:
ubuntu@ubuntu-VirtualBox:~/Downloads/apache-drill-1.19.0/bin$ sudo ./drill-embedded
Apache Drill 1.19.0
"A Drill is a terrible thing to waste."
apache drill> SHOW DATABASES;
+--------------------+
| SCHEMA_NAME |
+--------------------+
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| information_schema |
| kudu |
| sys |
+--------------------+
7 rows selected (24.818 seconds)
apache drill> use kudu;
+------+----------------------------------+
| ok | summary |
+------+----------------------------------+
| true | Default schema changed to [kudu] |
+------+----------------------------------+
1 row selected (0.357 seconds)
apache drill (kudu)> SHOW TABLES;
+--------------+--------------------------------+
| TABLE_SCHEMA | TABLE_NAME |
+--------------+--------------------------------+
| kudu | impala::default.customer |
| kudu | impala::default.my_first_table |
+--------------+--------------------------------+
2 rows selected (9.045 seconds)
Now when I try to run "select * from impala::default.customer", the query fails every time:
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `impala::default`.customer;
Error: VALIDATION ERROR: Schema [[impala::default]] is not valid with respect to either root schema or current default schema.
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `default`.customer;
Error: VALIDATION ERROR: Schema [[default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 8a4ca4da-2488-4775-b2f3-443b8b4b17ef ] (state=,code=0)
Current default schema: kudu
[Error Id: ce96ea13-392f-4910-9f6c-789a6052b5c1 ] (state=,code=0)
apache drill (kudu)>
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `impala`::`default`.customer;
Error: PARSE ERROR: Encountered ":" at line 1, column 23.
SQL Query: SELECT * FROM `impala`::`default`.customer
^
[Error Id: 5aacdd98-db6e-4308-9b33-90118efa3625 ] (state=,code=0)
>>>>>>>>>
apache drill (kudu)> SELECT * FROM `impala::`.`default`.customer;
Error: VALIDATION ERROR: Schema [[impala::, default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 5450bd90-dfcd-4efe-a8d3-b517be85b10a ] (state=,code=0)
>>>>>>>>>>>
In Drill conventions, the first part of the FROM clause is the storage plugin, in this case kudu. When you ran the SHOW TABLES query, you saw that the table name is actually impala::default.my_first_table. If I'm reading that correctly, that whole bit is the table name and the query below is how you should escape it.
Note the back tick before impala and after first_table but nowhere else.
SELECT *
FROM kudu.`impala::default.my_first_table`
Does that work for you?
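The same quoting rule applies to the customer table from the question. Here is a small Python sketch that builds the table reference so that the backticks wrap the whole table name and nothing else. The helper names drill_table_ref and select_all are hypothetical, and the sketch simply rejects table names that themselves contain a backtick rather than guessing how to escape them.

```python
def drill_table_ref(plugin: str, table_name: str) -> str:
    """Return `plugin` followed by the whole table name in one pair of backticks."""
    if "`" in table_name:
        # Escaping embedded backticks is not handled in this sketch.
        raise ValueError("table name contains a backtick")
    return f"{plugin}.`{table_name}`"


def select_all(plugin: str, table_name: str) -> str:
    """Build a SELECT * statement against the quoted table reference."""
    return f"SELECT * FROM {drill_table_ref(plugin, table_name)}"


if __name__ == "__main__":
    print(select_all("kudu", "impala::default.customer"))
```

The resulting statement has the same shape as the my_first_table query above.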

How to store two differents Graphs into one with Cypher?

For later processing with the CAPS project, I need to combine 2 different graphs into one:
Graph3=Graph1+Graph2
I searched for ways to do that and found UNION ALL, but it doesn't work as I expected. Is there another way to do that with Cypher?
Example :
val Graph1=session.cypher("""
| FROM GRAPH mergeGraph
| MATCH (from)-[via]->(to)
|WHERE substring(from.geohash,0,5)=substring(to.geohash,0,5)
| CONSTRUCT
| CREATE (h1:HashNode{geohash:substring(from.geohash,0,5)})-[COPY OF via]->(h1)
| RETURN GRAPH
""".stripMargin).graph
which contains this pattern :
val Graph2=session.cypher("""
| FROM GRAPH mergeGraph
| MATCH (from)-[via]->(to)
|WHERE substring(from.geohash,0,5)<>substring(to.geohash,0,5)
| CONSTRUCT
| CREATE (:HashNode{geohash:substring(from.geohash,0,5)})-[COPY OF via]->(:HashNode{geohash:substring(to.geohash,0,5)})
| RETURN GRAPH
""".stripMargin).graph
which contains this pattern :
With union All :
Graph3=Graph1.unionAll(Graph2)
I get this graph :
As you can see, the green nodes are the nodes of Graph2 without relationships! That's not what I expected.

Firebase query does not allow order on multiple properties?

I'm trying to fetch some data from Firebase. My data is structured like this:
Recent
|_UniversityId
   |_objectiId1
   |  |_ userId:1
   |  |_ timestamp:143242344
   |_objectiId2
      |_ userId:1
      |_ timestamp:143243222
My query path is http://firbasedbname.com/Recent/UniversityId. I need to fetch the entries whose userId is equal to 1 and order that set by timestamp. I tried the following:
FIRDatabaseQuery *query = [[firebase queryOrderedByChild:@"userId"] queryEqualToValue:@"1"];
This fetches the user's entries correctly. But is there a way to order this set by timestamp? I tried adding another queryOrderedByChild, but it says it can be used only once. How can I fix this?
queryOrderedByChild can be used only once.
A workaround would be to store a combined userId_timestamp child on each entry:
|_objectiId1
| |_ userId:1
| |_ timestamp:143242344
| |_ userId_timestamp:1_143242344
|_objectiId2
|_ userId:1
|_ timestamp:143243222
|_ userId_timestamp:1_143243222
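Then try:
FIRDatabaseQuery *query = [[firebase queryOrderedByChild:@"userId_timestamp"] queryEqualToValue:@"1_143242344"];
Note that queryEqualToValue matches a single combined value. To get every entry for user 1 ordered by timestamp, bound the query with queryStartingAtValue:@"1_" and queryEndingAtValue:@"1_\uf8ff" instead.
Check this out: https://youtu.be/sKFLI5FOOHs?t=541
Another way to do it would be: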
|_objectiId1
|_ userId1:
| |_ objectId11:143242344
| |_ objectId12:143243222
|_ userId2:
Then the query path is http://firbasedbname.com/Recent/UniversityId/userId1, and you order by value.
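To see why the combined userId_timestamp child gives "filter by user, ordered by timestamp" from a single ordering, here is a minimal pure-Python model of the idea. It does not call the Firebase SDK. The function names and the sample records are hypothetical, and zero-padding the timestamp is an added assumption: combined keys are compared as strings, so timestamps with different digit counts would otherwise sort incorrectly.

```python
def composite_key(user_id, timestamp, width=13):
    """Combine userId and a zero-padded timestamp into one sortable string."""
    return f"{user_id}_{timestamp:0{width}d}"


def entries_for_user(records, user_id):
    """Model of ordering by the combined child and bounding it to one user.

    The bounds play the role of a start value of "<userId>_" and an end value
    of "<userId>_\\uf8ff" (a very high code point used to close a prefix range).
    """
    start = f"{user_id}_"
    end = f"{user_id}_\uf8ff"
    keyed = [
        (composite_key(rec["userId"], rec["timestamp"]), name)
        for name, rec in records.items()
    ]
    matching = [(key, name) for key, name in keyed if start <= key <= end]
    # Sorting by the combined key orders one user's entries by timestamp.
    return [name for _, name in sorted(matching)]


if __name__ == "__main__":
    sample = {
        "objectiId1": {"userId": 1, "timestamp": 143243222},
        "objectiId2": {"userId": 1, "timestamp": 143242344},
        "objectiId3": {"userId": 2, "timestamp": 143240000},
    }
    print(entries_for_user(sample, 1))
```

The trade-off is that the combined child has to be written and kept in sync whenever userId or timestamp changes.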