Airflow metadata database: the end_date column in the dag_run table is null

ENV: Airflow 1.8.2
Airflow metadata database
table: dag_run
column: end_date
99% of the values in the end_date column are null
1% of the values are not null
Q:
Why does this happen, and is there any setting or fix for this situation?

It looks like there are only two spots in the code that set the dag run end date: when a dag run hits its timeout, and when you mark a dag run as failed or success in the UI.
If you want to fix this, I'd look into the update_state method of the DagRun class and set the end date when the run moves to a terminal state. Of course, you should submit a PR to GitHub with your change!
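A minimal sketch of that idea, assuming the Airflow 1.8 DagRun model (the helper name stamp_end_date is hypothetical; in a real patch this logic would live inside DagRun.update_state):

from datetime import datetime

from airflow.utils.state import State

def stamp_end_date(dag_run):
    # Stamp end_date exactly once, when the run reaches a terminal state.
    # SUCCESS and FAILED are the terminal DagRun states in 1.8.
    if dag_run.state in (State.SUCCESS, State.FAILED) and dag_run.end_date is None:
        dag_run.end_date = datetime.utcnow()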

Related

Databricks SQL throws PARSE_DATETIME_BY_NEW_PARSER

I have a column in my Databricks table with a customised datetime format stored as a string.
While trying to convert the string to a datetime, I observe the error below:
PARSE_DATETIME_BY_NEW_PARSER
SQL command:
select to_date(ORDERDATE, 'M/dd/yyyy H:mm') from sales_kaggle_chart limit 10;
The format of the ORDERDATE column is M/dd/yyyy H:mm.
Example ORDERDATE values: 10/10/2003 0:00 and 8/25/2003 0:00.
Complete error message:
Job aborted due to stage failure: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '5/7/2003' in the new parser. You can set "legacy_time_parser_policy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
Note: the same command works for a single value:
SELECT to_date("12/24/2003 0:00", 'M/d/yyyy H:mm') as date;
Have you tried setting the legacy parser, as the error message hints?
SET legacy_time_parser_policy = legacy;
SELECT to_date(ORDERDATE, 'M/dd/yyyy H:mm') FROM sales_kaggle_chart LIMIT 10;
This error is quite common, and adjusting this configuration typically does the job. Note that your single-value test also used 'M/d/yyyy H:mm' (single d), which is why it succeeded: the new parser's 'dd' requires a two-digit day, so values like '5/7/2003' fail.
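If you are working from a notebook rather than a SQL warehouse, a hedged PySpark equivalent (spark.sql.legacy.timeParserPolicy is the Spark-level name for the same setting, and spark is the session object Databricks predefines):

# Restore pre-Spark-3.0 datetime parsing, then run the same query.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.sql(
    "SELECT to_date(ORDERDATE, 'M/dd/yyyy H:mm') AS order_date "
    "FROM sales_kaggle_chart LIMIT 10"
).show()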

Is there a Denodo 8 VQL function or line of VQL for throwing an error in a VDP scheduler job?

My goal is to load a cache when there is new data available. Data is loaded into the source table once a day, but at an unpredictable time.
I've been trying to set up a data-availability trigger VDP scheduler job as described in this Denodo community post:
https://community.denodo.com/answers/question/details?questionId=9060g0000004FOtAAM&title=Run+Scheduler+Job+Based+on+Value+from+a+Query
The post describes creating a scheduler job that fails whenever the condition is not satisfied. The only way I've found to force an error under certain conditions is to use (1/0), and this doesn't always work for some reason. I was wondering if there is a way to do this with a function, as in normal SQL, but I couldn't find anything in the Denodo documentation.
This is what my code currently looks like:
--Trigger job
SELECT CASE
    WHEN (data_in_cache = current_data)
        THEN 1 % 0
    ELSE 1
END
FROM database.table;
The cache job waits for the trigger job to succeed, so the cache will only load when the data in the cache is outdated. This doesn't always work, even though I feel it should.
I'm hoping someone has a function or line of VQL to make a Denodo VDP scheduler job result in an error.
This would be easy by creating a custom function that, when executed, just throws an Exception. It doesn't need to be a plain Exception; you could create your own Exception type so that you see it in the error trace. In any case, it could be something like this:
@CustomElement(type = CustomElementType.VDPFUNCTION, name = "ERROR_SAMPLE_FUNCTION")
public class ErrorSampleVdpFunction {

    @CustomExecutor
    public CustomArrayValue errorSampleFunction() throws Exception {
        throw new Exception("This is an error");
    }
}
So you would use it like:
--Trigger job
SELECT CASE
    WHEN (data_in_cache = current_data)
        THEN errorSampleFunction()
    ELSE 1
END
FROM database.table;

Getting the running job id from BigQueryOperator using XCom

I want to get BigQuery's job id from BigQueryOperator.
I saw the following line in the bigquery_operator.py file:
context['task_instance'].xcom_push(key='job_id', value=job_id)
I don't know whether this is Airflow's job id or the BigQuery job id. If it's the BigQuery job id, how can I get it from a downstream task using XCom?
I tried the following in a downstream PythonOperator:
def write_statistics(**kwargs):
    job_id = kwargs['templates_dict']['job_id']
    print('tamir')
    print(kwargs['ti'].xcom_pull(task_ids='create_tmp_big_query_table', key='job_id'))
    print(kwargs['ti'])
    print(job_id)

t3 = BigQueryOperator(
    task_id='create_tmp_big_query_table',
    bigquery_conn_id='bigquery_default',
    destination_dataset_table=DATASET_TABLE_NAME,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    sql="""
        #standardSQL...
The UI is great for checking whether an XCom was written, which I'd recommend you do even before you try to reference it in a separate task, so you don't need to worry about whether you're fetching it correctly. Click your create_tmp_big_query_table task -> Task Instance Details -> XCom.
In your case, the code looks right to me, but I'm guessing your version of Airflow doesn't have the change that added saving the job id into an XCom. This feature was added in https://github.com/apache/airflow/pull/5195, which is currently only on master and not part of the most recent stable release (1.10.3). See for yourself in the 1.10.3 version of the BigQueryOperator.
Your options are to wait for it to land in a release (which sometimes takes a while), run off a version of master with that change, or temporarily copy the newer version of the operator as a custom operator. In the last case, I'd suggest naming it something like BigQueryOperatorWithXcom, with a note to replace it with the built-in operator once it's released.
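A hedged sketch of that last option, assuming Airflow 1.10.x (the class name is the hypothetical one suggested above; running_job_id is the cursor attribute described in the next answer):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator

class BigQueryOperatorWithXcom(BigQueryOperator):
    # Temporary stand-in until the job-id XCom lands in a release.
    def execute(self, context):
        super(BigQueryOperatorWithXcom, self).execute(context)
        # execute() populates self.bq_cursor; running_job_id holds the BQ job id.
        context['task_instance'].xcom_push(
            key='job_id', value=self.bq_cursor.running_job_id)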
The JOB ID within bigquery_operator.py is the BQ JOB ID. You can understand this by looking at the preceding lines:
if isinstance(self.sql, str):
    job_id = self.bq_cursor.run_query(
        sql=self.sql,
        destination_dataset_table=self.destination_dataset_table,
        write_disposition=self.write_disposition,
        allow_large_results=self.allow_large_results,
        flatten_results=self.flatten_results,
        udf_config=self.udf_config,
        maximum_billing_tier=self.maximum_billing_tier,
        maximum_bytes_billed=self.maximum_bytes_billed,
        create_disposition=self.create_disposition,
        query_params=self.query_params,
        labels=self.labels,
        schema_update_options=self.schema_update_options,
        priority=self.priority,
        time_partitioning=self.time_partitioning,
        api_resource_configs=self.api_resource_configs,
        cluster_fields=self.cluster_fields,
        encryption_configuration=self.encryption_configuration
    )
elif isinstance(self.sql, Iterable):
    job_id = [
        self.bq_cursor.run_query(
            sql=s,
            destination_dataset_table=self.destination_dataset_table,
            write_disposition=self.write_disposition,
            allow_large_results=self.allow_large_results,
            flatten_results=self.flatten_results,
            udf_config=self.udf_config,
            maximum_billing_tier=self.maximum_billing_tier,
            maximum_bytes_billed=self.maximum_bytes_billed,
            create_disposition=self.create_disposition,
            query_params=self.query_params,
            labels=self.labels,
            schema_update_options=self.schema_update_options,
            priority=self.priority,
            time_partitioning=self.time_partitioning,
            api_resource_configs=self.api_resource_configs,
            cluster_fields=self.cluster_fields,
            encryption_configuration=self.encryption_configuration
        )
        for s in self.sql]
Eventually, the run_with_configuration method returns self.running_job_id from BQ.

Run Job every 4 days but first run should happen now

I am trying to set up APScheduler to run every 4 days, but I need the job to start running now. I tried using an interval trigger, but I discovered that it waits for the specified period before the first run. I also tried using cron, the following way:
sched = BlockingScheduler()
sched.add_executor('processpool')

@sched.scheduled_job('cron', day='*/4')
def test():
    print('running')
One final idea I had was using a start_date in the past:
@sched.scheduled_job('interval', seconds=10, start_date=datetime.datetime.now() - datetime.timedelta(hours=4))
but that still waits 10 seconds before running.
Try this instead:
@sched.scheduled_job('interval', days=4, next_run_time=datetime.datetime.now())
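For completeness, a minimal runnable version of this answer (assumes APScheduler 3.x; next_run_time=now fires the first run immediately, then every 4 days):

import datetime

from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', days=4, next_run_time=datetime.datetime.now())
def test():
    # Fires once right away, then on the 4-day interval.
    print('running')

sched.start()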
Similar to the above answer, the only difference being that it uses the add_job method:
scheduler = BlockingScheduler()
scheduler.add_job(dump_data, trigger='interval', days=21, next_run_time=datetime.datetime.now())

Reducers failing

We are using a 3-machine cluster, and the mapreduce.tasktracker.reduce.tasks.maximum property is set to 9. When I set the number of reducers equal to or less than 9, the job succeeds, but if I set it greater than 9 it fails with the exception "Task attempt_201701270751_0001_r_000000_0 failed to ping TT for 60 seconds. Killing!". Can anyone guide me as to what the problem might be?
There seems to be a bug in Hadoop 0.20; see https://issues.apache.org/jira/browse/MAPREDUCE-1905 for reference.
Can you please try increasing the task timeout (set mapreduce.task.timeout to a higher value, in milliseconds)? Setting it to 0 will disable the timeout entirely.