Airflow: dynamically generating tasks with BranchPythonOperator

Currently we are running the following kind of DAGs:
t1 >> t2 >> [ t3, t4 ]
t4 >> t6
t3 >> t5 >> t6
We receive files of the same structure from different clients and process the data with one DAG per client.
But the tasks are similar; only the source and destination locations are different. t2 just checks whether files exist: if yes, it runs Spark; otherwise it sends a notification and ends.
I tried dynamically creating the tasks using a for loop, as follows:
def t2(client, **kwargs):
    return PythonOperator(task_id='t2_' + client, python_callable=t2func, ...)
#### similar factory functions for t3, t4, t5

t1 = DummyOperator(task_id='t1', dag=dag)
t6 = DummyOperator(task_id='t6', dag=dag)

for c in client_list:
    t1 >> t2(c) >> [t3(c), t4(c)]
    t3(c) >> t5(c) >> t6
    t4(c) >> t6
In the for loop, the second and third statements are not working: the tasks t5 and t6 are shown as tasks without any predecessors.
I expect the DAG to be:
t1 >> t2_cl1 >> [t3_cl1,t4_cl1]
t3_cl1 >> t5_cl1 >> t6
t4_cl1 >> t6
t1 >> t2_cl2 >> [t3_cl2,t4_cl2]
t3_cl2 >> t5_cl2 >> t6
t4_cl2 >> t6
Could you please suggest a solution?

I think you did not specify the DAG in the dynamically created tasks. You should either add dag= explicitly in the operators or wrap your whole task creation in a with DAG(...): block.
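For illustration, a minimal sketch of the fixed loop, assuming Airflow 1.x-style imports; client_list and the t*func callables are placeholders standing in for the ones in the question:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def make_task(name, client, func, dag):
    # dag= is the missing piece: without it the operator is never
    # attached to the DAG, so its upstream/downstream edges are lost
    return PythonOperator(task_id=name + '_' + client,
                          python_callable=func, dag=dag)

dag = DAG('client_files', start_date=datetime(2020, 1, 1),
          schedule_interval=None)

t1 = DummyOperator(task_id='t1', dag=dag)
t6 = DummyOperator(task_id='t6', dag=dag)

for c in client_list:
    # create each task once and reuse the reference; calling a factory
    # like t3(c) twice would build two different operator objects
    t2 = make_task('t2', c, t2func, dag)
    t3 = make_task('t3', c, t3func, dag)
    t4 = make_task('t4', c, t4func, dag)
    t5 = make_task('t5', c, t5func, dag)
    t1 >> t2 >> [t3, t4]
    t3 >> t5 >> t6
    t4 >> t6

If t2 really needs the check-and-branch behavior described above (run Spark when files exist, otherwise notify and stop), a BranchPythonOperator built the same way, with dag= passed in, works: its callable returns the task_id of the branch to follow.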


UNNEST list of integers in BigQuery query returns `TypeError: Object of type int64 is not JSON serializable`

I am querying a join of tables in BigQuery for a specific list of ids that are of type INT64, and I cannot figure out what to do, as I constantly get the following error:
TypeError: Object of type int64 is not JSON serializable
My query looks like this:
client = bigquery.Client(credentials=credentials)
query = """
SELECT t1.*,
t2.*,
t3.*,
t4.* FROM `<project>.<dataset>.<table1>` as t1
join `<project>.<dataset>.<table2>` as t2
on t1.label = t2.id
join `<project>.<dataset>.<table3>` as t3
on t3.A = t2.A
join `<project>.<dataset>.<table4>` as t4
on t4.obj = t2.obj and t4.A = t3.A
where t1.id in unnest(@list)
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
bigquery.ArrayQueryParameter("list", "STRING", list),
])
choices = client.query(query, job_config=job_config).to_dataframe()
where my list is of the type:
list = [3651056, 3651049, 3640195, 3629411, 3627024,3624939]
Now, this method works perfectly whenever the list is of the type:
list = ['3651056', '3651049', '3640195', '3629411', '3627024', '3624939']
I have tried casting the column I want to pick the list items from before querying, but that implies casting the entire table, which contains over 4 billion rows. Not efficient at all.
I would be grateful for any insights on how to solve this.
EDIT:
There is one option: first cast my list to strings and then:
client = bigquery.Client(credentials=credentials)
query = """
SELECT t1.*,
t2.*,
t3.*,
t4.* FROM `<project>.<dataset>.<table1>` as t1
join `<project>.<dataset>.<table2>` as t2
on t1.label = t2.id
join `<project>.<dataset>.<table3>` as t3
on t3.A = t2.A
join `<project>.<dataset>.<table4>` as t4
on t4.obj = t2.obj and t4.A = t3.A
where cast(t1.id as STRING) in unnest(@list)
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
bigquery.ArrayQueryParameter("list", "STRING", list),
])
choices = client.query(query, job_config=job_config).to_dataframe()
But is there a more direct way to do this?
As @PratikPatil mentioned in the comments:
During comparison, both sides need to be of the same data type; otherwise BigQuery will raise an error.
Casting is still the best choice for the issue you have with the query. But if one side of the comparison can be natively INT64 (i.e. you are not expecting any strings at all), passing the parameter as INT64 avoids the extra casting.
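For completeness, a minimal sketch of that parameter-side variant: if the ids come out of a pandas DataFrame they are numpy int64, which is what the JSON encoder rejects, so convert them to plain Python int and declare the array as INT64 (no cast needed on t1.id, which the question says is INT64). Here credentials and id_list are placeholders for the objects in the question:

from google.cloud import bigquery

client = bigquery.Client(credentials=credentials)

# plain Python ints serialize fine; numpy int64 values do not
ids = [int(x) for x in id_list]

query = """
SELECT t1.*
FROM `<project>.<dataset>.<table1>` as t1
WHERE t1.id IN UNNEST(@ids)
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
bigquery.ArrayQueryParameter("ids", "INT64", ids),
])
choices = client.query(query, job_config=job_config).to_dataframe()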
Posting the answer as community wiki, as this is the best practice, for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.

How can I broadcast the tempTable instead of getting a "sort merge join" when I use Spark SQL?

I set the auto-broadcast threshold (spark.sql.autoBroadcastJoinThreshold) to 200 MB; table a is 20 KB, table b is 20 KB, and table c is 100 GB.
I found that "a left join b on ..." becomes a broadcast join, and I registered the result as a temp table (the temp table is 30 KB). My question: when I run "c left join TempTable on ...", I expect the temp table to be auto-broadcast so the plan uses a broadcast join, but it uses a sort merge join instead. I also tried caching the temp table and broadcasting the temp table's DataFrame, but it doesn't work...
How can I broadcast the TempTable to make a broadcast join with sparkSQL?
I'm using spark-1.6.1
thanks!
It's very difficult to know what you have tried so far without a code snippet, but I am sharing some sample code, tried on Spark 2.3, where a broadcast hint is applied to a temp view (registerTempTable is deprecated in Spark 2). In the following Scala code, I forcefully use a broadcast hash join:
import org.apache.spark.sql.functions.broadcast // broadcast() can also wrap a DataFrame directly
val adf = spark.range(99999999)
val bdf = spark.range(99999999)
adf.createOrReplaceTempView("a")
bdf.createOrReplaceTempView("b")
// the /*+ BROADCASTJOIN(b) */ hint forces the b side to be broadcast
val bdJoinDF = spark.sql("select /*+ BROADCASTJOIN(b) */ a.id, b.id from a join b on a.id = b.id")
val normalJoinDF = spark.sql("select a.id, b.id from a join b on a.id = b.id")
println(normalJoinDF.queryExecution) // == Physical Plan == *(5) SortMergeJoin [id#39512L], [id#39514L], Inner
println(bdJoinDF.queryExecution)     // == Physical Plan == *(2) BroadcastHashJoin [id#39611L], [id#39613L], Inner, BuildRight, false
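On the DataFrame side there is also the explicit broadcast function (present in PySpark since 1.6; note the SparkSession entry point below is Spark 2.x, so on 1.6.1 you would go through SQLContext instead). A minimal sketch, with small_df standing in for the a-join-b temp table and big_df for table c:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# stand-ins: small_df plays the a-join-b temp table, big_df plays table c
small_df = spark.range(1000).withColumnRenamed("id", "key")
big_df = spark.range(10000000).withColumnRenamed("id", "key")

# wrapping the small side in broadcast() asks the planner for a
# BroadcastHashJoin regardless of autoBroadcastJoinThreshold statistics
joined = big_df.join(broadcast(small_df), "key", "left_outer")
joined.explain()  # the physical plan should show BroadcastHashJoin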

Rule-based join in Pig script

I have the following rules_table data:
Ruleid,leftColumn,rightColumn
1,c1,c1
2,c2,c3
3,c4,c4
rules_table contains the column names of left_table and right_table, to give a hint about the join keys.
Left_table
Schema : c1,c2,c3,c4,c5,c6,c7,c8,c9
Right_table
schema : c1,c2,c3,c4,c10,c12,c13,c14
I need to join left_table and right_table according to the rules_table, applying the rules one by one (it must be sequential, as the rule_id is the rule priority). After each rule I need to get a matched_set and an unmatched_set. The unmatched_set data has to flow into the next rule, and so on. The final output will have 2 separate datasets:
matched_set, rule_id
unmatched_set
Right now I am using a unix script to read the rules table in Hive and call the Pig script repeatedly to generate the matched_set and unmatched_set. But it is taking too much time, as the initial Pig setup and the store steps take too long.
Can anybody please suggest an optimal solution to do this in a Pig script with a single execution?
You can't do it directly, but you can generate a single Pig script that will look something like this:
LeftTable = load ...;
RightTable = load ...;
joined1 = join LeftTable by c1 full, RightTable by c1;
SPLIT joined1 INTO Matched_rule1_raw IF LeftTable::c1 is not null and RightTable::c1 is not null, UnMatched_rule1 IF LeftTable::c1 is null or RightTable::c1 is null;
Matched_rule1 = foreach Matched_rule1_raw generate 1 as rule_id, ..;
At the end you can do a UNION of all the Matched_rule relations.
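One way to act on that advice is to generate the single script programmatically from the rules table. A minimal sketch in Python, under some assumptions: the rules have already been fetched from Hive, LEFT OUTER joins are acceptable (so only unmatched left rows flow to the next rule), only the left-table columns are projected into the outputs, and the file paths and relation names are illustrative:

# (rule_id, left_col, right_col) rows as they would come from rules_table
rules = [(1, "c1", "c1"), (2, "c2", "c3"), (3, "c4", "c4")]
left_cols = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"]

script = [
    "LeftTable = LOAD 'left_table' USING PigStorage(',') AS (%s);" % ",".join(left_cols),
    "RightTable = LOAD 'right_table' USING PigStorage(',') AS (c1,c2,c3,c4,c10,c12,c13,c14);",
]
unmatched = "LeftTable"  # relation holding rows not matched by any rule so far
for rule_id, lc, rc in rules:
    # re-alias the left columns after each join so the next rule can
    # reference them without the <relation>:: prefix
    proj = ", ".join("%s::%s AS %s" % (unmatched, c, c) for c in left_cols)
    script += [
        "joined%d = JOIN %s BY %s LEFT OUTER, RightTable BY %s;" % (rule_id, unmatched, lc, rc),
        "m%d = FILTER joined%d BY RightTable::%s IS NOT NULL;" % (rule_id, rule_id, rc),
        "Matched_rule%d = FOREACH m%d GENERATE %d AS rule_id, %s;" % (rule_id, rule_id, rule_id, proj),
        "u%d = FILTER joined%d BY RightTable::%s IS NULL;" % (rule_id, rule_id, rc),
        "UnMatched_rule%d = FOREACH u%d GENERATE %s;" % (rule_id, rule_id, proj),
    ]
    unmatched = "UnMatched_rule%d" % rule_id

script.append("Matched = UNION %s;" % ", ".join("Matched_rule%d" % r for r, _, _ in rules))
script.append("STORE Matched INTO 'matched_set' USING PigStorage(',');")
script.append("STORE %s INTO 'unmatched_set' USING PigStorage(',');" % unmatched)

with open("rules_join.pig", "w") as f:
    f.write("\n".join(script))

Submitting the one generated script lets Pig plan all three rules in a single execution, avoiding the per-rule startup and store cost the question describes.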

Creating a materialized view filling up the complete temp space

I want to create a materialized view (MV); please see the SQL query below. When I try to create the materialized view, my temp tablespace (~128 GB) gets completely used, and I get the error below:
SQL Error: ORA-12801: error signaled in parallel query server P007
ORA-01652: unable to extend temp segment by 64 in tablespace TEMP1
12801. 00000 - "error signaled in parallel query server %s"
Then I checked in OEM and saw it used a degree of parallelism of 8. So I disabled parallelism using an alter statement (ALTER SESSION DISABLE PARALLEL QUERY). The MV then ran long, taking several hours, but got created. Please suggest whether there are any approaches to create it without using so much temp space. The row count of the select query for this MV is around 55 million. Any suggestions are really appreciated.
DB: Oracle 11gR2
CREATE MATERIALIZED VIEW TEST NOLOGGING
REFRESH FORCE ON DEMAND
ENABLE QUERY REWRITE
AS
select
table4.num as "Number",table4.num as "SNum",
table4.status as "S_status",
'Open' as "NLP",
create_table2.fmonth as "SMN",
table6.wgrp as "SOW",
(table2.end_dt - create_table2.dt) as "elp",
table6.d_c as "SDC",
create_table2.fiscal_quarter_name as "SQN",
'TS' as "SSL",
table3.table3_id as "SR Owner CEC ID",
table4.sev as "ssev",
SUBSTR(table8.stech,1,INSTR(table8.stech,'=>')-1) as "srtech",
SUBSTR(table8.stech,INSTR(table8.stech,'=>')+2) as "srstech",
table5.sr_type as "SR Type",
table5.problem_code as "SR Problem Code",
--null as "SR Entry Channel",
--null as "SR Time in Status (Days)",
table6.center,
table6.th1col,
table6.master_theater,
table6.rol_3,
table7.center hier_17_center,
table7.rol_1,
table7.rol_2,
table7.rol_3 wg,
table2.dt as "SBD",
table2.wk_n as "SBFW",
table2.fmonth as "SBFM",
table3.defect_indicator as "Has Defect",
table2.sofw,
table2.sofm
from
A table1
join B table2 on (table1.date_id = table2.dw_date_key)
join C table3 on (table1.date_id = table3.date_id and table1.incident_id = table3.incident_id)
join D table4 on (table3.incident_id = table4.incident_id and table4.key_d <= table3.date_id and table3.table3_id = table4.current_owner_table3_id)
join E table5 on table4.incident_id = table5.incident_id
join B create_table2 on (table5.creation_dw_date_key = create_table2.dw_date_key)
join F table6 on (table1.objectnumber=table6.DW_WORKGROUP_KEY)
join G table7 on (table1.objectnumber=table7.DW_WORKGROUP_KEY)
left outer JOIN H table8 ON (table8.natural_key= table5.UPDATED_COT_TECH_KEY)
where
table4.bl_incident_key in (select max(bl_incident_key) from D b
where b.incident_id=table3.incident_id and b.key_d <= table3.date_id and b.current_owner_table3_id = table3.table3_id)
and table2.fiscal_year_name in ('FY2013','FY2014')
Without knowing your system, tables, or data, I assume that:
some of the 8 tables have many rows (>> 55 million)
join predicates and filters will not reduce the amount of data significantly
so nearly all of the data will be written to the MV
Probably the execution plan will use some hash operations and/or sort aggregations.
This hashing and sorting cannot be done in memory if the hash and sort segments are too big.
So it will be done in temp.
8 parallel slots will probably use more temp than 1 session, so this can be the reason for the ORA error.
You can:
accept the several hours; normally such operations are done at night or on weekends,
and it doesn't matter whether it takes 4 hours or 1
increase temp
try to scale the degree of parallelism by a hint: create .... as select /*+ parallel(n) */ table4.num...
Use 2, 4, or 8 for n to get 2, 4, or 8 slots.
Try some indexes on the joined columns, e.g.:
TABLE1(DATE_ID, INCIDENT_ID)
TABLE1(OBJECTNUMBER)
TABLE2(DW_DATE_KEY)
TABLE2(FISCAL_YEAR_NAME)
TABLE3(DATE_ID, INCIDENT_ID, TABLE3_ID)
TABLE3(INCIDENT_ID, TABLE3_ID, DATE_ID)
TABLE4(INCIDENT_ID, CURRENT_OWNER_TABLE3_ID, KEY_D, BL_INCIDENT_KEY)
TABLE5(INCIDENT_ID)
TABLE5(CREATION_DW_DATE_KEY)
TABLE5(UPDATED_COT_TECH_KEY)
TABLE6(DW_WORKGROUP_KEY)
TABLE7(DW_WORKGROUP_KEY)
TABLE8(NATURAL_KEY)
And use EXPLAIN PLAN on the different SQL statements to see which plan Oracle will generate.
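If you want to script that plan check rather than run it in SQL*Plus, a minimal sketch with the python-oracledb driver (the successor to cx_Oracle); the connection details are placeholders, and the stand-in query should be replaced by the MV's real defining SELECT:

import oracledb

# placeholder connection details
conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb")
cur = conn.cursor()

# substitute the MV's real SELECT here
mv_query = "SELECT /*+ parallel(4) */ COUNT(*) FROM table1"

# ask Oracle for the plan, then read it back from DBMS_XPLAN
cur.execute("EXPLAIN PLAN FOR " + mv_query)
cur.execute("SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")
for (line,) in cur:
    print(line)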

Improved way for multi-table SQL (MySQL) query?

Hoping you can help. I have three tables and would like to create a conditional query: build a subset based on a row's presence in one table, exclude those rows from the results, then query a final, third table. I thought this would be simple enough, but I'm not well practiced in SQL, and after researching/testing left joins, correlated subqueries, etc. for 6 hours, I still can't hit the correct result set. So here's the setup:
T1
arn_mkt_stn
A00001_177_JOHN_FM
A00001_177_BILL_FM
A00001_174_DAVE_FM
A00002_177_JOHN_FM
A00006_177_BILL_FM
A00010_177_JOHN_FM
Note: the relationship between the name, the 3-digit prefix (e.g. _177), and the FM part is always consistent ('_177_JOHN_FM'); only the A000XX part changes.
T2
arn_mkt
A00001_105
A00001_177
A00001_188
A00001_246
A00002_177
A00003_177
A00004_026
A00004_135
A00004_177
A00006_177
A00010_177
Example: if '_177_JOHN_FM' is a substring of arn_mkt_stn rows in T1, exclude those rows when getting arn_mkt values with a substring of '177' from T2. In this case, the desired result set would be:
A00003_177
A00004_177
A00006_177
Similarly, _177_BILL_FM would return:
A00002_177
A00003_177
A00004_177
A00010_177
Then I would like to use this result set to pull records from a third table based on the 'A00003' etc. values:
T3
arn
A00001
A00002
A00003
A00004
A00005
A00006
...
I've tried a number of methods [where here $stn_code = 'JOHN_FM' and $stn_mkt = '177']:
SELECT * FROM T2, T1
WHERE arn != SUBSTRING(T1.arn_mkt_stn, 1, 6)
AND SUBSTRING(T1.arn_mkt_stn, 12, 7) = '$stn_code'
AND SUBSTRING(arn_mkt, 8, 3) = '$stn_mkt'
(then use this result to query T3...)
I've also tried a left join and a subquery, but I'm clearly missing something!
Any pointers gratefully received, thanks,
Rich.
[EDIT: Thanks for helping out, sgeddes. I'll expand on my logic above... first, the desired result set is always in connection with one name only per query; e.g. from T1, let's use JOHN_FM. In T1, JOHN_FM is currently associated with the arns (within arn_mkt_stn) A00001, A00002 & A00010. The next step, in T2, is to find all the arns (within arn_mkt) that have JOHN_FM's 3-digit prefix (177), then exclude those that are in T1. Note: A00006 remains because it is not connected to JOHN_FM in T1. The same query for BILL_FM gives slightly different results, excluding A00001 & A00006, as BILL_FM has those associations in T1. Thanks, R]
You can use a LEFT JOIN to remove the records from T2 that match those in T1. However, I'm not sure I'm understanding your logic.
You say A00001_177_JOHN_FM should return:
A00003_177
A00004_177
A00006_177
However, wouldn't A00006_177_BILL_FM exclude A00006_177 from the above results?
This query should be close to what you're looking for (I wasn't completely sure which fields you needed returned), if I'm understanding you correctly:
SELECT T2.arn_mkt, T3.arn
FROM T2
LEFT JOIN T1 ON
T1.arn_mkt_stn LIKE CONCAT(T2.arn_mkt,'%')
INNER JOIN T3 ON
T2.arn_mkt LIKE CONCAT(T3.arn,'%')
WHERE T1.arn_mkt_stn IS NULL
Sample Fiddle Demo
--EDIT--
Reviewing the comments, this should be what you're looking for:
SELECT *
FROM T2
LEFT JOIN T1 ON
T1.arn_mkt_stn LIKE CONCAT(LEFT(T2.arn_mkt,LOCATE('_',T2.arn_mkt)),'%') AND T1.arn_mkt_stn LIKE '%JOHN_FM'
INNER JOIN T3 ON
T2.arn_mkt LIKE CONCAT(T3.arn,'%')
WHERE T1.arn_mkt_stn IS NULL
And here is the updated Fiddle: http://sqlfiddle.com/#!2/3c293/13
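As a side note, since $stn_code is interpolated from application code, it may be worth binding it as a query parameter instead. A minimal sketch with the mysql-connector-python driver; the connection details are placeholders, and the literal % wildcards are doubled to %% because the driver itself %-formats parameterized queries:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="stations")
cur = conn.cursor()

query = (
    "SELECT T2.arn_mkt, T3.arn "
    "FROM T2 "
    "LEFT JOIN T1 "
    "  ON T1.arn_mkt_stn LIKE CONCAT(LEFT(T2.arn_mkt, LOCATE('_', T2.arn_mkt)), '%%') "
    " AND T1.arn_mkt_stn LIKE CONCAT('%%', %s) "
    "INNER JOIN T3 ON T2.arn_mkt LIKE CONCAT(T3.arn, '%%') "
    "WHERE T1.arn_mkt_stn IS NULL"
)
cur.execute(query, ("JOHN_FM",))  # bind the station name instead of interpolating it
for arn_mkt, arn in cur:
    print(arn_mkt, arn)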