I am facing some issues when trying to load a partitioned table using incremental mode. Each partition is created based on an execution_date variable I pass as an argument.
For some reason, the new partition is always generated using CurrentDate as the partition value, even though the variable passed as an argument has a different date value.
I have defined a macro to format the variable passed through the command line.
Please find below the code for the macro:
{% macro formatted_date(execution_date) %}
{% set execution_date_obj = modules.datetime.datetime.strptime(execution_date|string, "%Y%m%d") %}
{{ return(execution_date_obj.strftime("%Y-%m-%d")) }}
{% endmacro %}
And below is how I have defined the dbt model for loading the table:
{{ config(
    alias = 'unipr_salesforce',
    materialized = 'incremental',
    partitions = [formatted_date(var('execution_date'))]
) }}
Finally, this is the command used to run the dbt model, where you can see the variable passed as an argument:
unipr-subscription-pipeline toledanof$ dbt run --target dev --profiles-dir ./ --vars 'execution_date : "20210310"'
Every time I run the dbt model, the partition generated corresponds to CurrentDate, regardless of the execution_date value.
Does anybody know a possible reason for this behaviour? Thank you!
It seems there are a couple of issues with what you're trying to do:
You need to tell dbt the name of the column you want to partition by to be able to write to a specific partition. If you don't, dbt treats this as a model that is updated incrementally but has no partitioning.
To be able to specify which partitions you want to replace, you'd need to use the insert_overwrite strategy. When you don't specify a strategy explicitly, dbt defaults to the merge strategy, which scans the entire table to decide what to update and what to insert (and, I believe, requires that you specify a unique_key in your config).
One way to solve this would be to also include the current date as a column in your model and use that as your partition key. Here's what your config might look like:
{{
config(
materialized='incremental',
partition_by={'field': 'current_date', 'data_type': 'date'},
incremental_strategy='insert_overwrite',
partitions=[formatted_date(var('execution_date'))]
)
}}
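For completeness, here is a rough sketch of what the body of the model could then look like, reusing your macro so the column value lines up with the partition listed in partitions. The source reference and the loaded_at_date filter column are made up for the example; the important part is that the model produces a date column whose name matches the field configured in partition_by (backticks because the name collides with BigQuery's CURRENT_DATE function):
select
    src.*,
    date('{{ formatted_date(var("execution_date")) }}') as `current_date`
from {{ source('salesforce', 'unipr_salesforce_raw') }} as src
{% if is_incremental() %}
  -- only select the rows that belong to the partition being replaced
  where src.loaded_at_date = date('{{ formatted_date(var("execution_date")) }}')
{% endif %}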
The main problem I want to discuss is the schema synchronization conflicts that occur when tables already have data and a new required attribute is added, or I rename a required attribute. This question already has some possible solutions, but these are not acceptable in a production environment where you already have user data, since they simply suggest deleting the data. I also want to enforce required fields, so setting the column to {nullable: true} is not an option for me either.
As an example, suppose I had a column named "time" that I renamed to "minutes". When I synchronize the schemas, TypeORM produces the following error:
QueryFailedError: column "minutes" contains null values
Is there a more elegant/automated way to deal with these errors other than just setting the column to {nullable: true}? I can imagine that you could write some custom SQL with the migration script to also modify the row values. That seems like a little too much manual effort for me, though.
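To make that idea concrete, here is a rough, hypothetical sketch of the kind of hand-written migration I'm imagining (the table name "session" and the backfill value are made up for illustration):
import { MigrationInterface, QueryRunner } from "typeorm";

export class RenameTimeToMinutes1620000000000 implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    // Rename instead of drop+add, so existing values are preserved and
    // the NOT NULL constraint is never violated.
    await queryRunner.query(`ALTER TABLE "session" RENAME COLUMN "time" TO "minutes"`);

    // If this were a brand-new required column, backfill first, then enforce NOT NULL:
    // await queryRunner.query(`UPDATE "session" SET "minutes" = 0 WHERE "minutes" IS NULL`);
    // await queryRunner.query(`ALTER TABLE "session" ALTER COLUMN "minutes" SET NOT NULL`);
  }

  public async down(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`ALTER TABLE "session" RENAME COLUMN "minutes" TO "time"`);
  }
}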
I have data in S3 which is partitioned in a YYYY/MM/DD/HH/ structure (not year=YYYY/month=MM/day=DD/hour=HH).
I set up a Glue crawler for this, which creates a table in Athena, but when I query the data in Athena it gives an error because one field has a duplicate name (URL and url, which the SerDe converts to lowercase, causing a name conflict).
To fix this, I manually create another table (using the above table definition from SHOW CREATE TABLE), adding 'case.insensitive'= FALSE to the SERDEPROPERTIES:
WITH SERDEPROPERTIES ('paths'='deviceType,emailId,inactiveDuration,pageData,platform,timeStamp,totalTime,userId','case.insensitive'= FALSE)
I changed the S3 directory structure to the Hive-compatible naming year=/month=/day=/hour=, created the table with 'case.insensitive'= FALSE, and then ran the MSCK REPAIR TABLE command for the new table, which loads all the partitions.
(Complete CREATE TABLE QUERY)
But upon querying, I can only find one data column (platform) and the partition columns; the rest of the columns are not parsed. But I've actually copied the Glue-generated CREATE TABLE query, with the 'case.insensitive'= FALSE condition.
How can I fix this?
I think you have multiple, separate issues: one with the crawler, one with the serde, and one with duplicate keys:
Glue Crawler
If Glue Crawler delivered on what it promises, it would be a fairly good solution for most situations and would save us from writing the same code over and over again. Unfortunately, if you stray outside of the (undocumented) use cases Glue Crawler was designed for, you often end up with various issues, from the strange to the completely broken (see for example this question, this question, this question, this question, this question, or this question).
I recommend that you skip Glue Crawler and instead write the table DDL by hand (you have a good template in what the crawler created, it just isn't good enough). Then you write a Lambda function (or shell script) that you run on a schedule to add new partitions.
Since your partitioning is only on time, this is a fairly simple script: it just needs to run every once in a while and add the partition for the next period.
It looks like your data is from Kinesis Data Firehose which produces a partitioned structure at hour granularity. Unless you have lots of data coming every hour I recommend you create a table that is only partitioned on date, and run the Lambda function or script once per day to add the next day's partition.
A benefit from not using Glue Crawler is that you don't have to have a one-to-one correspondence between path components and partition keys. You can have a single partition key that is typed as date, and add partitions like this: ALTER TABLE foo ADD PARTITION (dt = '2020-05-13') LOCATION 's3://some-bucket/data/2020/05/13/'. This is convenient because it's much easier to do range queries on a full date than when the components are separate.
If you really need hourly granularity you can either have two partition keys, one which is the date and one the hour, or just the one with the full timestamp, e.g. ALTER TABLE foo ADD PARTITION (ts = '2020-05-13 10:00:00') LOCATION 's3://some-bucket/data/2020/05/13/10/'. Then run the Lambda function or script every hour, adding the next hour's partition.
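If it helps, here is a rough sketch of what such a scheduled Lambda function could look like, using Python and boto3. The database, table, bucket, and partition key names are assumptions for the example, not taken from your setup:
import datetime
import boto3

athena = boto3.client("athena")

# Hypothetical names, replace with your own
DATABASE = "my_database"
TABLE = "my_table"
DATA_LOCATION = "s3://some-bucket/data"
RESULT_LOCATION = "s3://some-bucket/athena-results/"

def lambda_handler(event, context):
    # Add tomorrow's partition so it already exists when data starts arriving
    day = datetime.date.today() + datetime.timedelta(days=1)
    query = (
        f"ALTER TABLE {TABLE} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{day.isoformat()}') "
        f"LOCATION '{DATA_LOCATION}/{day:%Y/%m/%d}/'"
    )
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULT_LOCATION},
    )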
Having too granular a partitioning scheme doesn't help with performance, and can instead hurt it (although the performance hit comes mostly from the small files and the directories).
SerDe config
The reason why you're only seeing the value of the platform column is that it's the only column where the name and the JSON property have the same casing.
It's a bit surprising that the DDL you link to doesn't work, but I can confirm that it really doesn't. I tried creating a table from that DDL, but without the pagedata column (I also skipped the partitioning, but that shouldn't make a difference for the test), and indeed only the platform column had any value when I queried the table.
However, when I removed the case.insensitive serde property it worked as expected, which got me thinking that it might not work the way you think it does. I tried setting it to TRUE instead of FALSE, which made the table work as expected again. I think we can conclude from this that the Athena documentation is just wrong when it says "By default, Athena requires that all keys in your JSON dataset use lowercase". In fact, what happens is that Athena lower cases the column names, but it also lower cases the property names when reading the JSON.
With further experimentation it turned out the path property was redundant too. This is a table that worked for me:
CREATE EXTERNAL TABLE `json_case_test` (
`devicetype` string,
`timestamp` string,
`totaltime` string,
`inactiveduration` int,
`emailid` string,
`userid` string,
`platform` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
I'd say that case.insensitive seems to cause more problems than it solves.
Duplicate keys
When I added the pagedata column (as struct<url:string>) and added "pageData":{"URL":"URL","url":"url"} to the data, I got the error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"
And I got the error regardless of whether the pagedata column was involved in the query or not (e.g. SELECT userid FROM json_case_test also errored). I tried the case.insensitive serde property with both TRUE and FALSE, but it had no effect.
Next, I took a look at the source documentation for the serde, which first of all is worded much better, and secondly contains the key piece of information: that you also need to provide mappings for the columns when you turn off case insensitivity.
With the following serde properties I was able to get the duplicate key issue to go away:
WITH SERDEPROPERTIES (
"case.insensitive" = "false",
"mapping.pagedata" = "pageData",
"mapping.pagedata.url" = "pagedata.url",
"mapping.pagedata.url2"= "pagedata.URL"
)
You would have to provide mappings for all the columns except for platform, too.
Alternative: use JSON functions
You mentioned in a comment to this answer that the schema of the pageData property is not constant. This is another case where Glue Crawlers unfortunately don't really work. If you're unlucky you'll end up with a flapping schema that includes some properties some days (see for example this question).
What I realised when I saw your comment is that there is another solution to your problem: set up the table manually (as described above) and use string as the type for the pagedata column. Then you can use functions like JSON_EXTRACT_SCALAR to extract the properties you want during query time.
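For example, with pagedata typed as string, a query could look something like this (the JSON path is just an illustration):
SELECT
  userid,
  json_extract_scalar(pagedata, '$.url') AS page_url
FROM json_case_test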
This solution trades increased complexity of the queries for way fewer headaches trying to keep up with an evolving schema.
I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: When we receive a file, it contains multiple store's data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement I used was this:
// Output to file
OUTPUT #dt
TO #"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently, U-SQL requires that all the file outputs of a script be known at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts:
The first script uses GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and writes that list to a file.
Then, using scripting or a tool written with our SDKs, download the previous output file and programmatically generate a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber.
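A rough sketch of what that first script could look like, assuming @dt is the same rowset as in your example and the output path is illustrative:
// First script: write out the distinct CustomerNumber/StoreNumber pairs
@pairs =
    SELECT CustomerNumber,
           StoreNumber
    FROM @dt
    GROUP BY CustomerNumber, StoreNumber;

OUTPUT @pairs
TO "/Data/CustomerStorePairs.csv"
USING Outputters.Csv();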
I have a Pentaho Kettle job that can load data from x number of tables, and put it into target tables with a different schema.
Assume I have table 1, like so:
I want to load this table into a destination table that looks like this:
The columns have been renamed, the order has been changed, and the data has been transformed. The rename and reorder are easily managed by using the Select Values step, which can be used within an ETL Metadata Injection step, making it dependent on some configuration values loaded at runtime.
But if I need to perform some transformation logic on some of the columns, based on where they go in the target table, this seems to be less straightforward.
In my example, I want the column "CountryName" to be capitalised, and the column "Rating" to be floored (as in changing the real number to the previous integer value).
While I could do this by just manually adding a transformation to accomplish each, I want my solution to be dynamic, so it could just as easily run the "CountryName" column through a checksum component, or perform a ceiling on "Rating" instead.
I can easily wrap these transformations in another transformation so that they can be parameterised and executed when needed:
But, where I'm having trouble is, when I process a row of data, I need a way to be able to say:
Column "CountryName" should be passed through the Capitalisation transform
Column "Rating" should be passed through the Floor transform
Column(s) "AnythingElse" should be passed through the SomeOther transform
Is there a way to dynamically split out the columns in a row, and execute a different transform on each one, based on some configuration metadata that can be supplied?
Logically, it would be something like this, although I suspect there may be a way to handle it as a loop or some form of dynamic transformation, rather than mapping out a path per column:
Kettle is so flexible that it seems like there must be a way to do this, I'm just struggling to know which components to use and how to do it. Any experts out there have some suggestions?
I'm dealing with some biggish data sets here (hundreds of millions of rows), so I'm reluctant to use Row Normaliser/Denormaliser or write to file/DB if I can avoid it.
Have you considered the Modified Java Script Value step? Start with the Data Grid step, then a Select Values step, then the Modified Java Script Value step. In that step you can transform the value of each column into whatever form you want and output the result to a file.
That of course requires some JavaScript knowledge, but given your example it seems the required knowledge is pretty basic.
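For example (a sketch only; the incoming field names come from your example, and the output field names are made up), the script in that step could be as simple as:
// Modified Java Script Value step: CountryName and Rating are the incoming row fields.
var CountryNameOut = CountryName.toUpperCase(); // capitalise
var RatingOut = Math.floor(Rating);             // floor to the previous integer
You would then declare CountryNameOut and RatingOut in the step's output fields grid, and the decision of which transform applies to which column could be driven by a parameter that the script branches on.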
My client has given me a list of vehicles for a project. I need to get them into a table I can use, but they're currently in a .csv file. I've read around and found some info, but nothing that solves my particular problem.
I've generated a model that matches the info in the .csv file (id, year, make, model, trim), run the migration, and now have the table I need. The issue comes up when I try to use psql COPY. Here's what I've read will work:
copy list_vehicles from '/path/to/list.csv' DELIMITERS ',' CSV;
but it gives me
ERROR: missing data for column "created_at"
Fine, so I try this:
copy list_vehicles (id, year, make, model, trim) from '/path/to/list.csv' DELIMITERS ',' CSV;
and I get back
ERROR: null value in column "created_at" violates not-null constraint
Ok, so then this should work:
copy list_vehicles (id, year, make, model, trim) from '/path/to/list.csv' DELIMITERS ',' WITH NULL AS ' ' CSV FORCE NOT NULL created_at;
nope,
ERROR: FORCE NOT NULL column "created_at" not referenced by COPY
I'm not sure where to go from here. I was thinking of taking the created_at column back out for now, then adding it back in another migration? Any guidance would be much appreciated.
Thanks
The created_at column is added automatically by Rails when you run a migration to create a table for a new model, and it's normally populated by the Rails default code when you create a new model object in your application.
You're loading the data directly into the database, though, bypassing all the Rails code. Which is fine, but you also need to do the things Rails would otherwise do for you.
I think the easiest way is going to be to remove the created_at and other timestamp columns from the database directly, load your CSV file, and then add the columns back in.
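In raw SQL that would look roughly like this (table name taken from your COPY command; the defaults are just one way to satisfy the NOT NULL constraint for the rows you've already loaded):
ALTER TABLE list_vehicles DROP COLUMN created_at, DROP COLUMN updated_at;

-- run your COPY here

ALTER TABLE list_vehicles
  ADD COLUMN created_at timestamp NOT NULL DEFAULT now(),
  ADD COLUMN updated_at timestamp NOT NULL DEFAULT now();

-- optionally drop the defaults afterwards so Rails keeps managing the timestamps:
-- ALTER TABLE list_vehicles ALTER COLUMN created_at DROP DEFAULT, ALTER COLUMN updated_at DROP DEFAULT;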
You can also have Postgres read from STDIN, allowing you to modify the data prior to loading it.
I use something like this; it's untested, but it should give you an outline:
connection = ActiveRecord::Base.connection.raw_connection
connection.exec("COPY #{tablename} (#{fields},created_at,updated_at) FROM STDIN WITH (FORMAT csv)")

# Rails would normally fill these columns in, so supply a value for every row
now = Time.now.utc.strftime("%Y-%m-%d %H:%M:%S")

data = File.open(datafile)
data.gets # abandon the header line (if needed)
data.each do |line|
  connection.put_copy_data("#{line.chomp},#{now},#{now}\n")
end
connection.put_copy_end

res = connection.get_result
if res.result_error_message
  puts "Result of COPY is: %s" % [ res.result_error_message ]
end