How can I replicate the following SQL-esque left join in R?

I'm very well-versed in SQL, and an absolute novice in R. Unfortunately, due to an update in company policy, we must use Athena to run our SQL queries. Athena is weak, so despite having a complete/correct SQL query, I cannot run it to manipulate my large, insurance-based dataset.
I have seen similar posts, but haven't managed to crack my own problem trying to utilize the methodologies provided. Here are the details:
After running the SQL block in R (using a connection string), I have a countrywide data block denoted CW_Data in R
Each record contains a policy with a multitude of characteristics (columns) such as the Policy_Number, Policy_Effective_Date, Policy_Earned_Premium
Athena breaks down when I try to add two columns based on the already-existing ones.
Namely, I want to left join such that I can obtain new columns for Policy_Prior_Year_Earned_Premium and Policy_Second_Prior_Year_Earned_Premium.
Per the above, I know I need to add columns such that, for a given policy, I can find the record where Policy_Number = Policy_Number and Policy_Effective_Date = Policy_Effective_Date minus 1 or 2 years. This is quite simple in SQL, but I cannot get it in R for the life of me.
Here is the (watered-down) left join I attempted in SQL using CTEs that breaks Athena (even if the SQL is run via R):
All_Info as (
    Select
        PC.Policy_Number
        ,PC.Policy_Effective_Date
        ,PC.Policy_EP
    from Policy_Characteristics as PC
    left join Almost_All_Info as AAI
        on AAI.Policy_Number = PC.Policy_Number
        and AAI.Policy_Effective_Date = date_add('year', -1, PC.Policy_Effective_Date)
    left join All_Segments as AST
        on AST.Policy_Number = PC.Policy_Number
        and AST.Policy_Effective_Date = date_add('year', -2, PC.Policy_Effective_Date)
    Group by
        PC.Policy_Number
        ,PC.Policy_Effective_Date
        ,PC.Policy_EP
)

As @zephryl pointed out, examples of your data and the expected result would be very helpful.
From your description, the R equivalent might look like this:
library(dplyr)
library(lubridate) ## date helpers

All_Info <-
  Policy_Characteristics |>
  select(Policy_Number,
         Policy_Effective_Date, ## make sure this has class "Date"
         Policy_EP
  ) |>
  ## use a period (years()), not duration(): a duration is a fixed 365.25-day
  ## span, so it won't land on the same calendar date and the equality join fails
  ## (for Feb 29 edge cases consider %m-% years(1))
  mutate(one_year_earlier  = Policy_Effective_Date - years(1),
         two_years_earlier = Policy_Effective_Date - years(2)
  ) |>
  left_join(Almost_All_Info,
            by = c('Policy_Number' = 'Policy_Number',
                   'one_year_earlier' = 'Policy_Effective_Date'
            )
  ) |>
  left_join(All_Segments,
            by = c('Policy_Number' = 'Policy_Number',
                   'two_years_earlier' = 'Policy_Effective_Date'
            )
  ) |>
  group_by(Policy_Number,
           Policy_Effective_Date,
           Policy_EP
  )
## note: unlike SQL's GROUP BY, group_by() alone doesn't collapse rows;
## follow it with summarise() or distinct() if you need deduplication
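Since the earned premium already lives on every record of CW_Data, the same pattern can also be written as a self-join of the countrywide table onto itself; a minimal sketch reusing dplyr/lubridate from above (CW_Data and the premium column names come from the question; the prior helper table and the *_eff_date columns are mine):

## one slim lookup table: policy, effective date, and that year's premium
prior <- CW_Data |>
  select(Policy_Number, Policy_Effective_Date, Policy_Earned_Premium)

CW_Data_enriched <- CW_Data |>
  mutate(prior_eff_date        = Policy_Effective_Date - years(1),
         second_prior_eff_date = Policy_Effective_Date - years(2)) |>
  left_join(prior |> rename(Policy_Prior_Year_Earned_Premium = Policy_Earned_Premium),
            by = c('Policy_Number' = 'Policy_Number',
                   'prior_eff_date' = 'Policy_Effective_Date')) |>
  left_join(prior |> rename(Policy_Second_Prior_Year_Earned_Premium = Policy_Earned_Premium),
            by = c('Policy_Number' = 'Policy_Number',
                   'second_prior_eff_date' = 'Policy_Effective_Date')) |>
  select(-prior_eff_date, -second_prior_eff_date) ## drop the helper keys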

Related

ORA-01841 happens on one environment but not all

I have the following SQL code in my (SAP IdM) application:
Select mcmskeyvalue as MKV,v1.searchvalue as STARTDATE, v2.avalue as Running_Changes_flag
from idmv_entry_simple
inner join idmv_value_basic_active v1 on mskey = mcmskey and attrname = 'Start_of_company_change'
and mcentrytype = 'MX_PERSON' and to_date(v1.searchvalue,'YYYY-MM-DD')<= sysdate+3
left join idmv_value_basic v2 on v2.mskey = mcmskey and v2.attrname = 'Running_Changes_flag'
where mcmskey not in (Select mskey from idmv_value_basic_active where attrname = 'Company_change_running_flag')
I already found the solution for the ORA-01841 problem: it could either be a solution similar to MSSQL's try_to_date as mentioned here: How to handle to_date exceptions in a SELECT statement to ignore those rows?
or a solution where I change the code to something like this, to work solely on strings:
Select mcmskeyvalue as MKV,v1.searchvalue as STARTDATE, v2.avalue as Running_Changes_flag
from idmv_entry_simple
inner join idmv_value_basic_active v1 on mskey = mcmskey and attrname = 'Start_of_company_change'
and mcentrytype = 'MX_PERSON' and v1.searchvalue<= to_char(sysdate+3,'YYYY-MM-DD')
left join idmv_value_basic v2 on v2.mskey = mcmskey and v2.attrname = 'Running_Changes_flag'
where mcmskey not in (Select mskey from idmv_value_basic_active where attrname = 'Company_change_running_flag')
So for the actual problem I have a solution.
But now I got into a discussion with my customers and teammates about why the error happens at all.
Basically, for all entries of idmv_value_basic_active that satisfy the condition attrname = 'Start_of_company_change', we can be sure that those are dates. In addition, if we execute the query to check all values that would be delivered, all are in a valid format.
I learned in university that the DB engine can decide in which order it runs individual segments of a query. So for me the most logical explanation would be that, in the development environment (where we face the problem), the section to_date(v1.searchvalue,'YYYY-MM-DD') <= sysdate+3 is evaluated before the section attrname = 'Start_of_company_change',
whereas in the productive environment, where everything works like a charm, the segments are executed in the order described by the SQL statement.
Now my questions are:
First: do I remember that correctly? The teacher mentioned it only once, and at the time I could not really make sense of it.
Second: is my assumption correct, or is there another reason for the problem?
Background information:
The tool uses a kind of shifted data structure, which is why there can be quite a few different types in the actual searchvalue column of the idmv_value_basic_active view. The datatype on the database layer is always a varchar.
"the DB-Engine could decide in which order it will run individual segments of a query"
This is correct. A SQL query is just a description of the data you want and where it's stored. Oracle will calculate an execution plan to retrieve that data as best it can. That plan will vary based on any number of factors, like the number of actual rows in the table and the presence of indexes, so it will vary from environment to environment.
It sounds like you have an invalid date somewhere in your table, which makes to_date raise an exception. You can use validate_conversion to find it.
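For example, a minimal sketch (Oracle 12.2+, where validate_conversion was introduced; view and column names taken from your query) that lists the rows whose searchvalue will not convert:

Select mskey, searchvalue
from idmv_value_basic_active
where attrname = 'Start_of_company_change'
and validate_conversion(searchvalue as date, 'YYYY-MM-DD') = 0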

How to select columns from more than three tables using dplyr?

I'm developing a Shiny app using reactive programming; reactive objects are functions, so in order to refer to a table I have to append () to the name of the table I'm referring to.
The algorithm I've worked out is easily realised using SQL syntax (in this case the sqldf package). I provide one query as an example:
ratios_135_final <- sqldf("select
b.tot_cap_after_stress*100/c.rwa_0_after_stress as \"n1.0_after_stress\",
b.osn_cap_after_stress*100/c.rwa_2_after_stress as \"n1.2_after_stress\",
b.bas_cap_after_stress*100/c.rwa_1_after_stress as \"n1.1_after_stress\",
a.\"REGN\", d.\"NAME\", a.date, f.buff
from ratios a
inner join capital_final b on (a.\"REGN\" = b.\"REGN\")
inner join rwa_final c on (a.\"REGN\" = c.\"REGN\")
inner join names d on (a.\"REGN\" = d.\"REGN\")
inner join buffer_bank f on (a.\"REGN\" = f.\"REGN\") ")
As you can see, there are 5 tables that I'm referring to in order to build the query. But I can't write, for instance, ...*from ratios()*. I tried to learn dplyr syntax, but I've found that dplyr does not provide any functions for working with three or more tables at once.
Could you help me handle this problem?
Thanks in advance.
This is the equivalent code; however, it assumes that "REGN" is the only column that exists in multiple tables. If there are other column names shared among the tables, it will need further modification.
ratios_135_final <-
  ratios %>%
  inner_join(capital_final, by = "REGN") %>%
  inner_join(rwa_final, by = "REGN") %>%
  inner_join(names, by = "REGN") %>%
  inner_join(buffer_bank, by = "REGN") %>%
  mutate(n1.0_after_stress = tot_cap_after_stress * 100 / rwa_0_after_stress,
         n1.2_after_stress = osn_cap_after_stress * 100 / rwa_2_after_stress,
         n1.1_after_stress = bas_cap_after_stress * 100 / rwa_1_after_stress) %>%
  select(n1.0_after_stress, n1.2_after_stress, n1.1_after_stress, REGN, NAME, date, buff)
## with by = "REGN" the key column keeps its name, so no REGN.x suffix (or rename) is needed
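Inside the Shiny app itself, the same pipeline goes into a reactive context; a minimal sketch, assuming each of ratios, capital_final, rwa_final, names and buffer_bank is a reactive returning a data frame:

ratios_135_final <- reactive({
  ratios() %>%                                   ## call each reactive to get its current value
    inner_join(capital_final(), by = "REGN") %>%
    inner_join(rwa_final(),     by = "REGN") %>%
    inner_join(names(),         by = "REGN") %>% ## a reactive named `names` shadows base::names() here
    inner_join(buffer_bank(),   by = "REGN") %>%
    mutate(n1.0_after_stress = tot_cap_after_stress * 100 / rwa_0_after_stress,
           n1.2_after_stress = osn_cap_after_stress * 100 / rwa_2_after_stress,
           n1.1_after_stress = bas_cap_after_stress * 100 / rwa_1_after_stress) %>%
    select(n1.0_after_stress, n1.2_after_stress, n1.1_after_stress, REGN, NAME, date, buff)
})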

Generate CROSS JOIN queries with dbplyr

Given 2 remote tables (simulated with tbl_lazy for this example)
library("dplyr")
library("dbplyr")
t1 <- tbl_lazy(df = iris, con = dbplyr::simulate_mysql())   ## current dbplyr takes `con`; older versions used `src`
t2 <- tbl_lazy(df = mtcars, con = dbplyr::simulate_mysql())
How can I perform an actual* cross join between t1 and t2 using R and dbplyr?
* i.e. using CROSS JOIN in the translated SQL query
Note that I know how to perform all the other types of joins, this is precisely about CROSS joins.
I am aware of the following trick:
joined <- t1 %>%
mutate(tmp = 1) %>%
full_join(mutate(t2, tmp = 1), by = "tmp") %>%
select(-tmp)
However:
This is ugly (even if it could be hidden in a function).
I would like to take advantage of the highly optimised join capabilities of the DB, so I'd like to emit a real SQL CROSS JOIN. Using show_query(joined) shows that the generated SQL query uses LEFT JOIN.
Sadly, there is no cross_join operator in dplyr and sql_join(t1, t2, type = "cross") does not work either (not implemented for tbls, works only on DB connections).
How can I generate an SQL CROSS JOIN with dbplyr?
According to the dbplyr NEWS file, since version 1.1.0, a full_join(..., by = character()) is "promoted" to a cross join. This doesn't seem to be documented anywhere else yet, but searching the dbplyr GitHub repo for "cross" turns it up in both the code and the NEWS file.
This syntax does not seem to work for local data frames, only via SQL.
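A minimal sketch of that promotion, reusing t1 and t2 from the question (the CROSS JOIN rendering is per the NEWS entry above; the explicit cross_join() verb exists only on newer dplyr):

joined <- full_join(t1, t2, by = character())  ## by = character() requests the cross join promotion
show_query(joined)                             ## the rendered SQL should now contain CROSS JOIN

## newer dplyr (>= 1.1.0) also ships an explicit verb that dbplyr translates:
## joined2 <- cross_join(t1, t2)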

Ibis Impala JOIN problem with relabel/name 'column AS newName'

When you use the Ibis API to query Impala, for some reason Ibis forces the query to become a subquery (and when you join 4-5 tables it suddenly becomes super slow). It simply won't join normally, due to a column-name overlap problem on joins. I want a way to quickly rename the columns; isn't that how SQL usually works?
i0 = impCon.table('shop_inventory')
s0 = impCon.table('shop_expenditure')
s0 = s0.relabel({'element_date': 'spend_element_date', 'element_shop_item': 'spend_shop_item'})
jn = i0.inner_join(s0, [i0['element_date'] == s0['spend_element_date'], i0['element_shop_item'] == s0['spend_shop_item']])
jn.materialize()
jn.execute(limit=900)
Then Ibis generates SQL that wraps everything in subqueries without me asking for it:
SELECT *
FROM (
SELECT `element_date`, `element_shop_item`, `element_address`, `element_expiration`,
`element_category`, `element_description`
FROM dbp.`shop_inventory`
) t0
INNER JOIN (
SELECT `element_shop_item` AS `spend_shop_item`, `element_comm` AS `spend_comm`,
`element_date` AS `spend_date`, `element_amount`,
`element_spend_type`, `element_shop_item_desc`
FROM dbp.`shop_spend`
) t1
ON (`element_shop_item` = t1.`spend_shop_item`) AND
(`element_category` = t1.`spend_category`) AND
(`element_subcategory` = t1.`spend_subcategory`) AND
(`element_comm` = t1.`spend_comm`) AND
(`element_date` = t1.`spend_date`)
LIMIT 900
Why is this so difficult?
Ideally it should be as simple as:
jn = i0.inner_join(s0, [s0['element_date'].as('spend_date') == i0['element_date']])
to generate a single: SELECT s0.element_date AS spend_date, i0.element_date ... INNER JOIN dbp.shop_spend s0 ON s0.spend_date = i0.element_date
right?
Are we never allowed to have the same column names on tables that are being joined? I am pretty sure that in raw SQL you can just use "X AS Y" without needing a subquery.
I spent the last few hours struggling with this same issue. A better solution I found is the following: join keeping the variable names the same; then, before you materialize, select only a subset of the variables such that there isn't any overlap.
So in your code it would look something like this:
jn = i0.inner_join(s0, [i0['element_date'] == s0['element_date'], i0['element_shop_item'] == s0['element_shop_item']])
expr = jn[i0, s0['variable_of_interest_1'], s0['variable_of_interest_2']]  # all of i0 plus two non-overlapping columns from s0
expr.materialize()
See here for more resources: https://docs.ibis-project.org/sql.html
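If you do want the "X AS Y" spelling, Ibis can also rename a single column expression inline; a hedged sketch using the .name() expression method (column names from the question; jn is the join from the snippet above, and this mirrors per-column what relabel() does for a whole table):

# keep all of i0's columns, plus two s0 columns renamed to avoid the overlap
expr = jn[i0,
          s0['element_date'].name('spend_date'),
          s0['element_shop_item'].name('spend_shop_item')]
expr.materialize()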

How to improve query performance in Oracle

The SQL query below is taking too much time to execute. It might be due to repetitive use of the same table in the from clause. I am not able to find out how to fix this query so that performance improves.
Can anyone help me out with this?
Thanks in advance!
select --
from t_carrier_location act_end,
t_location end_loc,
t_carrier_location act_start,
t_location start_loc,
t_vm_voyage_activity va,
t_vm_voyage v,
t_location_position lp_start,
t_location_position lp_end
where act_start.carrier_location_id = va.carrier_location_id
and act_start.carrier_id = v.carrier_id
and act_end.carrier_location_id =
decode((select cl.carrier_location_id
from t_carrier_location cl
where cl.carrier_id = act_start.carrier_id
and cl.carrier_location_no =
act_start.carrier_location_no + 1),
null,
(select cl2.carrier_location_id
from t_carrier_location cl2, t_vm_voyage v2
where v2.hire_period_id = v.hire_period_id
and v2.voyage_id =
(select min(v3.voyage_id)
from t_vm_voyage v3
where v3.voyage_id > v.voyage_id
and v3.hire_period_id = v.hire_period_id)
and v2.carrier_id = cl2.carrier_id
and cl2.carrier_location_no = 1),
(select cl.carrier_location_id
from t_carrier_location cl
where cl.carrier_id = act_start.carrier_id
and cl.carrier_location_no =
act_start.carrier_location_no + 1))
and lp_start.location_id = act_start.location_id
and lp_start.from_date <=
nvl(act_start.actual_dep_time, act_start.actual_arr_time)
and (lp_start.to_date is null or
lp_start.to_date >
nvl(act_start.actual_dep_time, act_start.actual_arr_time))
and lp_end.location_position_id = act_end.location_id
and lp_end.from_date <=
nvl(act_end.actual_dep_time, act_end.actual_arr_time)
and (lp_end.to_date is null or
lp_end.to_date >
nvl(act_end.actual_dep_time, act_end.actual_arr_time))
and act_end.location_id = end_loc.location_id
and act_start.location_id = start_loc.location_id;
There is no single straightforward answer for your question and the query you've mentioned.
In order to get a better response time from any query, you need to keep a few things in mind while writing it. I will mention a few here that appear important for your query (the second point is sketched right after the list):
Use joins instead of subqueries.
Use EXPLAIN PLAN to determine whether queries are functioning appropriately.
Use columns that have indexes in your where clause, or create an index on those columns. Use your common sense about which columns to index, e.g. foreign key columns, deleted, orderCreatedAt, startDate, etc.
Keep the order of the select columns as they appear in the table instead of selecting columns arbitrarily.
The above four points are enough for the query you've provided.
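For the second point, a minimal Oracle sketch (DBMS_XPLAN is a standard package; the bind variable and index name are hypothetical, while the table and columns come from your query):

EXPLAIN PLAN FOR
select act_start.carrier_location_id
from t_carrier_location act_start
where act_start.carrier_id = :carrier_id;   -- :carrier_id is a hypothetical bind, for illustration

select * from table(DBMS_XPLAN.DISPLAY);    -- prints the plan Oracle chose

-- for the third point, a hypothetical index on the join columns used throughout the query:
create index ix_carrier_loc on t_carrier_location (carrier_id, carrier_location_no);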
To dig deeper into SQL optimization and tuning, refer to https://docs.oracle.com/database/121/TGSQL/tgsql_intro.htm#TGSQL130