dbt partitions bigquery table on wrong field - google-bigquery

I'm trying to add dbt (data build tool) to our ecosystem and have run into the problem in the title. We currently use huge BigQuery tables, so we want them partitioned and extended every day. I don't know if it matters, but everything runs in a Docker container. Here's how you can reproduce it:
BigQuery SQL query to create the source table:
create table `***.dbt_nick_test.partition_test_20210304` (
session_date DATE,
user_id STRING
);
insert into `***.dbt_nick_test.partition_test_20210304` (session_date, user_id)
values ('2021-03-04', '1234'), ('2021-03-04', NULL), ('2021-03-04', '1235');
dbt_project.yml - models definition part:
models:
  ***:
    test:
      +schema: test
profiles.yml - included just to confirm everything is configured correctly:
***-bq:
  target: "{{ env_var('DBT_TARGET', 'dev') }}"
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: ***
      dataset: dbt_nick_test
      threads: 4
      keyfile: /root/.dbt/bq-creds.json
      timeout_seconds: 300
      priority: interactive
      retries: 1
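As a side note, a quick way to confirm that this profile and keyfile are picked up from inside the container (assuming the profile name matches the one referenced in dbt_project.yml) is:
dbt debug --target dev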
cat models/test/test.sql:
{{
  config(
    partition_by={
      "field": "session_date",
      "data_type": "date",
      "granularity": "day"
    },
    partitions=dbt.partition_range(var('dates', default=yesterday())),
    verbose=True
  )
}}

SELECT
  session_date,
  user_id
FROM `***`.`dbt_nick_test`.`{{ date_sharded_table('partition_test_') }}`
The yesterday macro is the default one from the dbt tutorial.
After running dbt -dS run -m test --vars 'dates: "20210304, 20210304"' (which completes without errors), dbt reports that the table was created successfully. Going to BigQuery, I can see the table was indeed created, but it is partitioned on the wrong field: _PARTITIONTIME instead of session_date (screenshot).
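For reference, the partitioning spec can also be inspected from the CLI instead of the UI; a sketch, where <dataset> stands for whichever dataset dbt wrote the test model to:
# print the table metadata and look at its partitioning block
bq show --format=prettyjson ***:<dataset>.test | grep -A 3 -i partitioning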
If I manually create a correctly partitioned table first and then run dbt run, it works as expected; everything is perfect.
Tables created from this table with dbt are also partitioned incorrectly.
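For comparison, this is a minimal sketch of how a column-partitioned model is usually configured in more recent dbt-bigquery versions, without the legacy partitions/dbt.partition_range helper; the table materialization and the hard-coded shard name are my assumptions, not part of the setup above:
{{
  config(
    materialized = "table",
    partition_by = {
      "field": "session_date",
      "data_type": "date",
      "granularity": "day"
    }
  )
}}

SELECT
  session_date,
  user_id
FROM `***`.`dbt_nick_test`.`partition_test_20210304`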

Related

'DBT docs generate' does not populate model column-level comments in the catalog

I use dbt-snowflake 1.1.0 with the corresponding dbt-core 1.1.0.
I added documentation for my models in YAML files, e.g.:
> models/stage/docs.yml
version: 2
models:
  - name: raw_weblogs
    description: Web logs of customer interaction, broken down by attribute (bronze). The breakdown is performed using regular expressions.
    columns:
      - name: ip
        description: IP address from which the request reached the server (might be direct customer IP or the address of a VPN/proxy).
...
Although these details show up correctly in the dbt UI when I run dbt docs generate and then dbt docs serve, they are not listed in target/catalog.json:
cat target/catalog.json | grep identity
(no results)
According to the DBT documentation, I understand that column comments should be part of catalog.json.
LATER EDIT: I tried running dbt --debug docs generate and it seems that all data is retrieved directly from the target environment (in my case, Snowflake). Looking at the columns of my model in Snowflake, they indeed do NOT have any comments on them in Snowflake.
It thus seems to me that the underlying error might be with the fact that dbt run does not correctly persist the column metadata to Snowflake.
After further investigation, I found the reason for the missing comments: dbt docs generate writes comments to catalog.json based on what it retrieves from the database, while dbt docs serve populates the UI by combining the information in catalog.json with metadata (in my case, the documented column comments) from the local dbt models.
The solution to persist such metadata in the database with dbt run was to add the following DBT configuration:
> dbt_project.yml
models:
  <project>:
    ...
    +persist_docs:
      relation: true
      columns: true
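The same setting can also be applied per model in a config() block instead of project-wide; a minimal sketch (the model body is a placeholder, not taken from the project above):
{{
  config(
    persist_docs = {"relation": true, "columns": true}
  )
}}

select 1 as ip  -- placeholder body; only the config() call matters here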

Declaring multiple warehouses in dbt

I am pretty new to dbt. I want to use two kinds of warehouses in one project. Currently I have declared my ClickHouse warehouse, which I am going to create tables in, and I need to add another warehouse, MindsDB, because I want to reference some of the tables in it.
Currently my profiles.yml looks like this:
dbt-project:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: clickhouse_l
      host: 8.77.780.70
      port: 6000
      user: xxx
      password: xxxx
I want to add the warehouse below too:
type: mysql
host: mysql.mindsdb.com
user: mindsdb.user#example.com
password: xxx
port: 3306
dbname: mindsdb
schema: exampl_xxx
threads: 1
Is there a way to do this? Thank you.
This is a bit outside what dbt is designed to do. Is there any reason you can't use multiple projects with their own deployments? Presumably the models have dependencies on each other?
If I had to do this, I would:
- Create two targets (or sets of targets), one for each warehouse (or dev/prod for each warehouse, etc.) -- see the profiles.yml sketch below
- Create a custom materialization that wraps the typical table materialization, but no-ops if target.type does not match a specified adapter
- Run the project on each adapter in series, in a shell script
- Use tags to select the parts of the DAG that are up/downstream from the other adapter's models
I think the core of the problem is that dbt run needs a database connection to compile your project, so you really can't run against two databases simultaneously. What I described above is not really any better than having two standalone projects. You may want to look into using an orchestrator, like Airflow, Dagster, or Prefect.
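For the first point, both connections can live under the same profile as separate targets and be selected with --target; a sketch based on the profile above (the target names and the nesting of the second output are my assumptions):
dbt-project:
  target: dev_clickhouse
  outputs:
    dev_clickhouse:
      type: clickhouse
      schema: clickhouse_l
      host: 8.77.780.70
      port: 6000
      user: xxx
      password: xxxx
    dev_mindsdb:
      type: mysql
      host: mysql.mindsdb.com
      user: mindsdb.user#example.com
      password: xxx
      port: 3306
      dbname: mindsdb
      schema: exampl_xxx
      threads: 1
You would then run dbt run --target dev_clickhouse and dbt run --target dev_mindsdb in series, e.g. from a shell script.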

DBT select Big Query table from different Google Project

I am using DBT to read and write tables in Big Query, all running in my Google project X.
I have one table which I want to read in from a different Google project Y and put in a DBT model (which will then be saved as a table in project X).
Is this possible? And if yes, where do I define the different project in FROM {{ source('dataset_project_y', 'table_to_read') }}?
First, you need to declare the source in a source .yml file.
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#declaring-a-source
For example, create a source_y.yml:
version: 2

sources:
  - name: dataset_project_y
    schema: dataset_y
    database: 'project_y'
    tables:
      - name: table_to_read
        identifier: table_to_read
After that, you can refer to the source table_to_read in any dbt model and select from it in any of the models' SQL statements.
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#selecting-from-a-source
For example, to use table_to_read in dbt_model_x.sql
{{
  config(
    materialized = "view"
  )
}}

SELECT * FROM {{ source('dataset_project_y', 'table_to_read') }}
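When compiled, the source() call should resolve to the fully qualified table in project Y, roughly:
-- approximate compiled output of dbt_model_x.sql
SELECT * FROM `project_y`.`dataset_y`.`table_to_read`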

Why don't decorators work on integer range partitioned bigquery tables?

I created an integer range partitioned BigQuery table similar to the one described in the tutorial:
CREATE TABLE
  mydataset.newtable
PARTITION BY
  RANGE_BUCKET(customer_id, GENERATE_ARRAY(1, 100, 1))
AS SELECT 1 AS customer_id, DATE "2019-10-01" AS date1
However, when I try to extract a single partition into a bucket by running this in bash:
bq extract myproject:mydataset.newtable\$1 gs://mybucket/newtable.csv
I get the error "partition key is not valid". Why? How do I find the valid keys?
Similarly, I cannot use the decorator to select from a specific partition in the query composer:
select from mydataset.newtable$0 or select from mydataset.newtable$1
both give:
Syntax error: Illegal input character "$" at [1:46]
The $ decorator is valid in Legacy SQL, but you can opt for one of these options:
# LegacySQL, legacy sql is used by default in the following command.
# From the UI you need to change it in More -> Query Settings
bq query 'SELECT * from mydataset.newtable$10'
or
# StandardSQL, the option use_legacy_sql=false force to use standard sql
bq query --use_legacy_sql=false 'SELECT * from mydataset.newtable WHERE customer_id BETWEEN 10 AND 20'
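If you also need to discover which partition IDs exist (for example before passing one to a decorator), one option, assuming the INFORMATION_SCHEMA.PARTITIONS view is available in your project, is:
-- list partition ids and row counts for the integer-range partitioned table
SELECT partition_id, total_rows
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'newtable';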
Regarding the bq extract command, I could export after removing the escape character (\):
$ bq extract myproject:mydataset.newtable$1 gs://mybucket/newtable.csv
Waiting on bqjob_..._000001701cb5d260_1 ... (0s) Current status: DONE
$ gsutil cat gs://mybucket/newtable.csv
customer_id,date1
18,2020-10-01
1,2019-10-01
2,2019-10-02
$
Edit:
After checking your comment below, you are correct: the bq extract above returns all the data.
The Exporting table data doc suggests that 'mydataset.table$N' should work, but when the escape character (\) is used, this error is returned: Partition key is invalid: key: "N"
Since there is no documentation indicating this is possible, I have created a FR to add this functionality. You can monitor the request at this link; note that there is no ETA for its resolution.
This issue has now been solved by Google, so the following command works as expected:
bq extract myproject:mydataset.newtable\$1 gs://mybucket/newtable.csv

HIVE query logs location

I am finding it very difficult to locate the Hive query logs; basically, I want to see what queries were executed.
For example, I want to find queries like this one:
select foo, count(*) from table where field=value group by foo;
From the Hive documentation:
hive.exec.scratchdir
Default Value:
/tmp/${user.name} in Hive 0.2.0 through 0.8.0
/tmp/hive-${user.name} in Hive 0.8.1 through 0.14.0
/tmp/hive in Hive 0.14.0 and later
This directory is used by Hive to store the plans for the different map/reduce stages of the query, as well as to store the intermediate outputs of these stages.
hive.start.cleanup.scratchdir
Default Value: false
Execute the query with the command below:
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.dir=./logs --hiveconf hive.log.level=DEBUG -e "select foo, count(*) from table where field=value group by foo"
It will create a log file in the logs folder. Make sure the logs folder exists in the current directory.
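Once the run finishes, you can pull the executed statements back out of that log, for example (hive.log is the usual DRFA default file name and may differ in your setup):
mkdir -p ./logs   # the hive command above expects this directory to exist
grep -i "select foo" ./logs/hive.log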