Pyspark - Passing inequality condition dynamically to dataframes join

I am using this code from another question. My question is: how can I pass an inequality condition to the join here, in addition to the ON clause?
For example, my join condition is ("ID == ID") & ((DATE1 < DATE2) & (DATE3 > DATE4)).
If my condition were only ID == ID, I could do that using list_of_join_columns = ['ID'], but I also want to pass the inequality condition in the code below. Please advise how that can be achieved.
Existing code:
import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of_dataframes,
)

You need to qualify each column with its dataframe in the join condition, in case the column names differ between the two dataframes.
It's also advisable to rename common columns such as ID in the right dataframe:
right_df = right_df.withColumnRenamed("ID", "right_ID")
new_df = left_df.join(
    right_df,
    (left_df.ID == right_df.right_ID)
    & (left_df.DATE1 < right_df.DATE2)
    & (left_df.DATE3 > right_df.DATE4),
)
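If you want to keep the reduce-based helper from the question and still build the inequality dynamically, one option (a minimal sketch, not the only way) is to pass a condition-builder function instead of a list of column names. The names condition_builder and my_condition below are introduced for illustration, and list_of_dataframes is assumed to be the list from your existing code:
import functools

def join_dataframes(condition_builder, left_df, right_df):
    # condition_builder receives both dataframes and returns a Column
    # expression, so any mix of equality and inequality predicates works
    return left_df.join(right_df, on=condition_builder(left_df, right_df))

def my_condition(left_df, right_df):
    # illustrative condition from the question:
    # ID == ID and DATE1 < DATE2 and DATE3 > DATE4
    return (
        (left_df["ID"] == right_df["ID"])
        & (left_df["DATE1"] < right_df["DATE2"])
        & (left_df["DATE3"] > right_df["DATE4"])
    )

joined_df = functools.reduce(
    functools.partial(join_dataframes, my_condition),
    list_of_dataframes,
)
Note that joining on a Column expression keeps both ID columns in the result, so renaming the right-hand columns as suggested above avoids ambiguity on later iterations of the reduce.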

Related

Redshift - ERROR: Target table must be part of an equijoin predicate

I am trying to run an update on a temporary table I created in Redshift. The code I am trying to run goes like this:
UPDATE #new_emp
SET rpt_to_emp_id = CAST(ht.se_value AS INTEGER),
rpt_to_extrnl_email = ht.extrnl_email_addr,
rpt_to_fst_nm = ht.first_nm,
rpt_to_lst_nm = ht.last_nm,
rpt_to_mdl_init = ht.mdl_nm,
rpt_to_nm = ht.full_nm,
rpt_to_ssn = CAST(ht.ssn AS INTEGER)
FROM #new_emp,
(SELECT DISTINCT t.se_value,h.first_nm,h.last_nm,
h.mdl_nm,h.full_nm,h.ssn,h.extrnl_email_addr
FROM spec_hr.dtbl_translate_codes_dw t, spec_hr.emp_hron h
WHERE t.inf_name = 'system'
AND t.fld_name = 'HRONDirector'
AND h.foreign_emp_id = t.se_value
) ht
WHERE #new_emp.foreign_emp_id <> ht.se_value
AND (#new_emp.emp_status_cd <> 'T'
AND (#new_emp.ult_rpt_emp_id = #new_emp.foreign_emp_id
OR #new_emp.ult_rpt_emp_id = #new_emp.psoft_id
OR #new_emp.ult_rpt_emp_id IS NULL));
I've tried both with and without specifying the updated table in the FROM clause, but it keeps throwing this error:
ERROR: Target table must be part of an equijoin predicate
Any ideas why this is failing? Thank you!
Redshift needs an equality join condition to know which rows to update and with which values. Your join condition is "#new_emp.foreign_emp_id <> ht.se_value", which is an inequality, or in Redshift speak, not "an equijoin predicate". You have a SET of "rpt_to_lst_nm = ht.last_nm", but if the only join condition is an inequality, which value of last_nm should Redshift put in the table?
To put it the other way around: you need to tell Redshift exactly which rows of the target table receive which values (an equijoin). The join condition you have doesn't meet this requirement.
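For illustration only, this is the shape the statement needs to take: at least one WHERE predicate must be an equality between the target table and the joined set. Which columns actually pair up depends on your data model; the ult_rpt_emp_id = ht.se_value pairing below is purely hypothetical.
UPDATE #new_emp
SET rpt_to_fst_nm = ht.first_nm,
    rpt_to_lst_nm = ht.last_nm
FROM (SELECT DISTINCT t.se_value, h.first_nm, h.last_nm
      FROM spec_hr.dtbl_translate_codes_dw t
      JOIN spec_hr.emp_hron h ON h.foreign_emp_id = t.se_value
      WHERE t.inf_name = 'system'
        AND t.fld_name = 'HRONDirector') ht
WHERE #new_emp.ult_rpt_emp_id = ht.se_value   -- the equijoin predicate Redshift needs (hypothetical pairing)
  AND #new_emp.emp_status_cd <> 'T';          -- additional non-equality filters are still allowed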

How can I count all NULL values, without column names, using SQL?

I'm reading and executing SQL queries from a file, and I need to inspect the result sets to count all the NULL values across all columns. Because the SQL is read from a file, I don't know the column names and thus can't refer to the columns by name when trying to find the NULL values.
I think using CTE is the best way to do this, but how can I call the columns when I don't know what the column names are?
WITH query_results AS
(
<sql_read_from_file_here>
)
select count_if(<column_name> is null) FROM query_results
If you are using Python to read the file of SQL statements, you can do something like this which uses pglast to parse the SQL query to get the columns for you:
import pglast
sql_read_from_file_here = "SELECT 1 foo, 1 bar"
ast = pglast.parse_sql(sql_read_from_file_here)
cols = ast[0]['RawStmt']['stmt']['SelectStmt']['targetList']
sum_stmt = "sum(iff({col} is null,1,0))"
sums = [sum_stmt.format(col = col['ResTarget']['name']) for col in cols]
print(f"select {' + '.join(sums)} total_null_count from query_results")
# outputs: select sum(iff(foo is null,1,0)) + sum(iff(bar is null,1,0)) total_null_count from query_results
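If parsing the SQL feels heavy, another option (an alternative I'm suggesting, not what the pglast answer above does) is to run the query once and read the column names from the DB-API cursor description, then build a portable NULL-count query. sqlite3 is used here only so the sketch is self-contained; any DB-API 2.0 connection exposes the same cursor.description:
import sqlite3

sql_read_from_file_here = "SELECT 1 AS foo, 1 AS bar"

# run the query once just to discover the column names
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(sql_read_from_file_here)
column_names = [desc[0] for desc in cur.description]

# build a portable NULL-count expression (CASE instead of iff/count_if)
sums = [f"sum(case when {col} is null then 1 else 0 end)" for col in column_names]
null_count_sql = (
    f"with query_results as ({sql_read_from_file_here}) "
    f"select {' + '.join(sums)} as total_null_count from query_results"
)
print(null_count_sql)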

How to dynamically handle nested WHERE AND/OR queries using Rails and SQL

I'm currently building a feature that requires me to loop over a hash and, for each key in the hash, dynamically modify an SQL query.
The actual SQL query should look something like this:
select * from space_dates d
inner join space_prices p on p.space_date_id = d.id
where d.space_id = ?
and d.date between ? and ?
and (
(p.price_type = 'monthly' and p.price_cents <> 9360) or
(p.price_type = 'daily' and p.price_cents <> 66198) or
(p.price_type = 'hourly' and p.price_cents <> 66198) -- This part should be added in dynamically
)
The last AND group is the part to be added dynamically; as you can see, I basically need only one of those conditions to be true, not all of them.
query = space.dates
             .joins(:price)
             .where('date between ? and ?', start_date, end_date)

# We are looping over the rails enum (hash) and getting the key for each key value pair, alongside the index
SpacePrice.price_types.each_with_index do |(price_type, _), index|
  amount_cents = space.send("#{price_type}_price").price_cents

  query = if index.positive? # It's not the first item so we want to chain it as an 'OR'
            query.or(
              space.dates
                   .joins(:price)
                   .where('space_prices.price_type = ?', price_type)
                   .where('space_prices.price_cents <> ?', amount_cents)
            )
          else
            query # It's the first item, chain it as an and
              .where('space_prices.price_type = ?', price_type)
              .where('space_prices.price_cents <> ?', amount_cents)
          end
end
The output of this in Rails is:
SELECT "space_dates".* FROM "space_dates"
INNER JOIN "space_prices" ON "space_prices"."space_date_id" = "space_dates"."id"
WHERE "space_dates"."space_id" = $1 AND (
(
(date between '2020-06-11' and '2020-06-11') AND
(space_prices.price_type = 'hourly') AND (space_prices.price_cents <> 9360) OR
(space_prices.price_type = 'daily') AND (space_prices.price_cents <> 66198)) OR
(space_prices.price_type = 'monthly') AND (space_prices.price_cents <> 5500)
) LIMIT $2
Which isn't as expected. I need to wrap the last few lines in another set of parentheses to produce the SQL I want, and I'm not sure how to go about this using ActiveRecord.
It's not possible for me to use find_by_sql since this would be dynamically generated SQL too.
So, I managed to solve this in about an hour using Arel with Rails:
dt = SpaceDate.arel_table
pt = SpacePrice.arel_table

combined_clauses = SpacePrice.price_types.map do |price_type, _|
  amount_cents = space.send("#{price_type}_price").price_cents

  pt[:price_type]
    .eq(price_type)
    .and(pt[:price_cents].not_eq(amount_cents))
end.reduce(&:or)

space.dates
     .joins(:price)
     .where(dt[:date].between(start_date..end_date).and(combined_clauses))
And the SQL output is:
SELECT "space_dates".* FROM "space_dates"
INNER JOIN "space_prices" ON "space_prices"."space_date_id" = "space_dates"."id"
WHERE "space_dates"."space_id" = $1
AND "space_dates"."date" BETWEEN '2020-06-11' AND '2020-06-15'
AND (
("space_prices"."price_type" = 'hourly'
AND "space_prices"."price_cents" != 9360
OR "space_prices"."price_type" = 'daily'
AND "space_prices"."price_cents" != 66198)
OR "space_prices"."price_type" = 'monthly'
AND "space_prices"."price_cents" != 5500
) LIMIT $2
What I ended up doing was:
- Created an array of clauses based on the enum key and the price_cents
- Reduced the clauses and joined them using or
- Added the combined_clauses to the main query with an and operator

How to concatenate date variables to create a BETWEEN in a WHERE condition

While I am trying to restrict a field to be between the user-given start date and end date in a query, I am running into run-time error 3071 (the expression is typed incorrectly, or it is too complex).
This is being used to pass user-given variables from a form to a query.
Please see below
WHERE (
IIf([Forms]![Find]![Entity]<>"",DB.Entity=[Forms]![Find]![Entity],"*")
AND IIf([Forms]![Find]![AEPS]<>"",DB.AEPSProgram=[Forms]![Find]![AEPS],"*")
AND IIf([Forms]![Find]![DeliveryType]<>"",DB.DeliveryType=[Forms]![Find]![DeliveryType],"*")
AND IIf([Forms]![Find]![ReportingYear] is not Null,DB.ReportingYear=[Forms]![Find]![ReportingYear],"*")
AND IIf([Forms]![Find]![Price] is not Null,DB.Price=[Forms]![Find]![Price],"*")
AND IIf([Forms]![Find]![Volume]is not Null,DB.Volume=[Forms]![Find]![Volume],"*")
AND IIf([Forms]![Find]![sDate] is not Null AND [Forms]![Find]![eDate] is not Null,DB.TransactionDate= ">" & [Forms]![Find]![sdate] & " and <" & [Forms]![Find]![edate],"*")
);
If I set it equal to one of the dates it works as expected. I assume I am missing something with how I am joining the dates
Thank you
This (for the date field only) works:
WHERE (((DB.TransactionDate)>[Forms]![Find]![sdate] And (DB.TransactionDate)<[Forms]![Find]![edate])) OR ((([Forms]![Find]![sDate]+[Forms]![Find]![eDate]) Is Null));
Don't use *; it is for use with Like.
Consider assigning the NULL condition to the corresponding column via Nz(), which sets the column equal to itself and so avoids filtering any rows by that respective condition. An asterisk alone does not evaluate to a boolean condition unless used with the LIKE operator.
WHERE DB.Entity = NZ([Forms]![Find]![Entity], DB.Entity)
AND DB.AEPSProgram = NZ([Forms]![Find]![AEPS], DB.AEPSProgram)
AND DB.DeliveryType = NZ([Forms]![Find]![DeliveryType], DB.DeliveryType)
AND DB.ReportingYear = NZ([Forms]![Find]![ReportingYear], DB.ReportingYear)
AND DB.Price = NZ([Forms]![Find]![Price], DB.Price)
AND DB.Volume = NZ([Forms]![Find]![Volume], DB.Volume)
AND DB.TransactionDate >= NZ([Forms]![Find]![sdate], DB.TransactionDate)
AND DB.TransactionDate <= NZ([Forms]![Find]![edate], DB.TransactionDate)
;

SQL: Can I convert a string value of a "<" operator into an operator in a where clause?

I am storing an operator in a string column for two rows of a table (">", "<=")
I am joining the table with another table and want to make the where clause as dynamic as possible.
I was wondering if it's possible to convert the string value operator into an actual operator for this line of SQL code:
ABS(DATEDIFF(dd,Table2.DUE_DT,GETDATE())) > 120
VS
ABS(DATEDIFF(dd,Table2.DUE_DT,GETDATE())) <= 120
The operator will change depending on matching columns in the row. Is it possible to change the operator based on the string value containing the correct operator? If so, how can this be done?
Below are the two rows from Table1
NEFL_TYPE GRGR_ID NEFL_KEY NEFL_VALUE NEFL_COLUMN
"PDRU" "2600" "PD" "RV" ">"
"PDRU" "2600" "RV" "PD" "<="
This is the snippet of code I use:
INNER JOIN
Table1
ON
Table2.STATUS = Table1.NEFL_KEY
AND
Table1.NEFL_TYPE = 'PDRU'
WHERE
Table1.GRGR_ID = '2600'
AND
ABS(DATEDIFF(dd, Table2.DUE_DT,GETDATE())) > 120
So Table2.STATUS should determine which operator to use in the NEFL_COLUMN
I don't think there's an easy way to do what you want in the general case, not even with dynamic queries or client-side generated queries, since the comparison is per row and performance would be an issue with dynamic queries.
I see 2 ways to solve the particular example case, though:
a) Make 2 separate queries and do a UNION on them
SELECT...
INNER JOIN
Table1
ON
Table2.STATUS = Table1.NEFL_KEY
AND
Table1.NEFL_TYPE = 'PDRU'
WHERE
Table1.GRGR_ID = '2600'
AND
ABS(DATEDIFF(dd, Table2.DUE_DT,GETDATE())) > 120 AND Table1.NEFL_COLUMN = '>'
UNION
SELECT...
INNER JOIN
Table1
ON
Table2.STATUS = Table1.NEFL_KEY
AND
Table1.NEFL_TYPE = 'PDRU'
WHERE
Table1.GRGR_ID = '2600'
AND
ABS(DATEDIFF(dd, Table2.DUE_DT,GETDATE())) <= 120 AND Table1.NEFL_COLUMN = '<='
b) Use 2 range columns instead of an "operator" column
NEFL_TYPE GRGR_ID NEFL_KEY NEFL_VALUE NEFL_COLUMN START END
"PDRU" "2600" "PD" "RV" ">" 121 9999
"PDRU" "2600" "RV" "PD" "<=" 0 120
--------------------------------------------------------------------------
INNER JOIN
Table1
ON
Table2.STATUS = Table1.NEFL_KEY
AND
Table1.NEFL_TYPE = 'PDRU'
WHERE
Table1.GRGR_ID = '2600'
AND
ABS(DATEDIFF(dd, Table2.DUE_DT,GETDATE())) BETWEEN Table1.START AND Table1.END
But neither of them is as clean or elegant as I think you would like.