I am developing a Shiny app using reactive programming; since reactive objects are functions, to refer to a table I have to append () to the name of the table I am referring to.
The algorithm I've worked out is neatly expressed in SQL (here via the sqldf package). Here is one query as an example:
ratios_135_final <- sqldf("select
b.tot_cap_after_stress*100/c.rwa_0_after_stress as \"n1.0_after_stress\",
b.osn_cap_after_stress*100/c.rwa_2_after_stress as \"n1.2_after_stress\",
b.bas_cap_after_stress*100/c.rwa_1_after_stress as \"n1.1_after_stress\",
a.\"REGN\", d.\"NAME\", a.date, f.buff
from ratios a
inner join capital_final b on (a.\"REGN\" = b.\"REGN\")
inner join rwa_final c on (a.\"REGN\" = c.\"REGN\")
inner join names d on (a.\"REGN\" = d.\"REGN\")
inner join buffer_bank f on (a.\"REGN\" = f.\"REGN\") ")
As you can see, there are 5 tables I'm referring to in order to build the query. But I can't write, for instance, ...*from ratios()*. I tried to learn dplyr syntax, but I found that dplyr does not seem to provide functions for working with three or more tables.
Could you help me handle this problem?
Thanks in advance.
This is the equivalent code; however, it assumes that "REGN" is the only column that exists in multiple tables. If other column names are shared among the tables, it will need further modification.
library(dplyr)

ratios_135_final <-
  ratios %>%
  inner_join(capital_final, by = "REGN") %>%
  inner_join(rwa_final, by = "REGN") %>%
  inner_join(names, by = "REGN") %>%
  inner_join(buffer_bank, by = "REGN") %>%
  mutate(n1.0_after_stress = tot_cap_after_stress * 100 / rwa_0_after_stress,
         n1.2_after_stress = osn_cap_after_stress * 100 / rwa_2_after_stress,
         n1.1_after_stress = bas_cap_after_stress * 100 / rwa_1_after_stress) %>%
  ## joining by "REGN" merges the key, so there is a single REGN column and no
  ## .x suffix or rename is needed
  select(n1.0_after_stress, n1.2_after_stress, n1.1_after_stress, REGN, NAME, date, buff)
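Since the question arises in a Shiny app, note that inside a reactive context each table reference becomes a function call, which fits the pipe naturally. A minimal sketch, assuming each of the five tables is a reactive expression (these reactive names are taken from the question, not tested code):

library(shiny)
library(dplyr)

ratios_135_final <- reactive({
  ratios() %>%                                    ## call each reactive to get its current value
    inner_join(capital_final(), by = "REGN") %>%
    inner_join(rwa_final(), by = "REGN") %>%
    inner_join(names(), by = "REGN") %>%          ## a reactive named `names` shadows base::names(); renaming it is safer
    inner_join(buffer_bank(), by = "REGN")
})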
I'm very well-versed in SQL, and an absolute novice in R. Unfortunately, due to an update in company policy, we must use Athena to run our SQL queries. Athena is weak, so despite having a complete/correct SQL query, I cannot run it to manipulate my large, insurance-based dataset.
I have seen similar posts, but haven't managed to crack my own problem trying to utilize the methodologies provided. Here are the details:
After running the SQL block in R (using a connection string), I have a countrywide data block denoted CW_Data in R
Each record contains a policy with a multitude of characteristics (columns) such as the Policy_Number, Policy_Effective_Date, Policy_Earned_Premium
Athena breaks down when I try to add two columns based on the already existing ones
Namely, I want to left join such that I can obtain new columns for Policy_Prior_Year_Earned_Premium and Policy_Second_Prior_Year_Earned_Premium
Per the above, I know I need to add columns such that, for a given policy, I can find the record where the Policy_Number matches and the Policy_Effective_Date is 1 or 2 years earlier. This is quite simple in SQL, but I cannot get it in R for the life of me.
Here is the (watered-down) left join I attempted in SQL using CTEs that breaks Athena (even if the SQL is run via R):
All_Info as (
Select
PC.Policy_Number
,PC.Policy_Effective_Date
,PC.Policy_EP
from Policy_Characteristics as PC
left join Almost_All_Info as AAI
on AAI.Policy_Number = PC.Policy_Number
and AAI.Policy_Effective_Date = date_add('year', -1, PC.Policy_Effective_Date)
left join All_Segments as AST
on AST.Policy_Number = PC.Policy_Number
and AST.Policy_Effective_Date = date_add('year', -2, PC.Policy_Effective_Date)
Group by
PC.Policy_Number
,PC.Policy_Effective_Date
,PC.Policy_EP
As #zephryl pointed out, examples of data and expected result would be very helpful.
From your description, the R equivalent might look like this:
library(dplyr)
library(lubridate) ## datetime helpers
All_Info <-
Policy_Characteristics |>
select(Policy_Number,
Policy_Effective_Date, ## make sure this has class "Date"
Policy_EP
) |>
mutate(one_year_earlier = Policy_Effective_Date - years(1),  ## period arithmetic keeps class "Date"
       two_years_earlier = Policy_Effective_Date - years(2)  ## (durations shift by 365.25 days and can return POSIXct, breaking exact-date joins)
) |>
left_join(Almost_All_Info,
by = c('Policy_Number' = 'Policy_Number',
'one_year_earlier' = 'Policy_Effective_Date'
)
) |>
left_join(All_Segments,
by = c('Policy_Number' = 'Policy_Number',
'two_years_earlier' = 'Policy_Effective_Date'
)
) |>
group_by(Policy_Number,
Policy_Effective_Date,
Policy_EP
)
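Note that the SQL GROUP BY with no aggregate functions effectively de-duplicates rows, whereas dplyr's group_by() alone does not drop any rows. If de-duplication is the intent, distinct() is the closer equivalent; a minimal sketch:

All_Info <- All_Info |>
  distinct(Policy_Number, Policy_Effective_Date, Policy_EP)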
I am fairly new to SQL and am having issues figuring out how to solve the simple problem below. I have a dataset I am trying to self-join, using (b.calendar_year_number - 1) as one of the columns to join on; I applied the -1 with the goal of matching values from the previous year. However, it is not working, and the resulting column shows (No column name). How do I alias the calculated column as b.calendar_year_number?
Code:
SELECT a.day_within_fiscal_period,
a.calendar_month_name,
a.cost_period_rolling_three_month_start_date,
a.calendar_year_number,
b.day_within_fiscal_period,
b.calendar_month_name,
b.cost_period_rolling_three_month_start_date,
(b.calendar_year_number -1)
FROM [data_mart].[v_dim_date_consumer_complaints] AS a
JOIN [data_mart].[v_dim_date_consumer_complaints] AS b
ON b.day_within_fiscal_period = a.day_within_fiscal_period AND
b.calendar_month_name = a.calendar_month_name AND
b.calendar_year_number = a.calendar_year_number
I am using (b.calendar_year_number -1) as one of the columns to join.
Nope, you're not. Look at your join statement and you'll see the third condition is:
b.calendar_year_number = a.calendar_year_number
So just change that to include the calculation. As for the 'No column name' issue, you can use the colname = somelogic syntax or somelogic as colname. Below, I used the former.
select a.day_within_fiscal_period,
a.calendar_month_name,
a.cost_period_rolling_three_month_start_date,
a.calendar_year_number,
b.day_within_fiscal_period,
b.calendar_month_name,
b.cost_period_rolling_three_month_start_date,
bCalYearNum = b.calendar_year_number
from [data_mart].[v_dim_date_consumer_complaints] a
left join [data_mart].[v_dim_date_consumer_complaints] b
on b.day_within_fiscal_period = a.day_within_fiscal_period
and b.calendar_month_name = a.calendar_month_name
and b.calendar_year_number - 1 = a.calendar_year_number;
You could use the analytic functions LAG/LEAD to get your required result; no self-join necessary:
select a.day_within_fiscal_period,
a.calendar_month_name,
a.cost_period_rolling_three_month_start_date,
a.calendar_year_number,
old_cost_period_rolling_three_month_start_date =
LAG(cost_period_rolling_three_month_start_date) OVER
(PARTITION BY calendar_month_name, day_within_fiscal_period
ORDER BY calendar_year_number),
old_CalYearNum = LAG(calendar_year_number) OVER
(PARTITION BY calendar_month_name, day_within_fiscal_period
ORDER BY calendar_year_number)
from [data_mart].[v_dim_date_consumer_complaints] a
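For readers translating this back to R, the same LAG logic maps onto dplyr's lag() within groups. A hedged sketch, assuming the view has been pulled into a data frame named v_dim_date_consumer_complaints:

library(dplyr)

result <- v_dim_date_consumer_complaints %>%
  group_by(calendar_month_name, day_within_fiscal_period) %>%
  arrange(calendar_year_number, .by_group = TRUE) %>%   ## order within each partition, as ORDER BY does
  mutate(old_cost_period_rolling_three_month_start_date =
           lag(cost_period_rolling_three_month_start_date),
         old_CalYearNum = lag(calendar_year_number)) %>%
  ungroup()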
I usually use RODBC locally to query my databases. However, our company has recently moved to Azure Databricks, which does not inherently support RODBC or other ODBC connections, but does support JDBC connections, which I have not previously used.
I have read the documentation for SparkR::read.jdbc() and sparklyr::spark_read_jdbc() but these seem to pull an entire table from the database rather than just the results of a query, which is not suitable for me as I never have to pull whole tables and instead run queries that join multiple tables together but only return a very small subset of the data in each table.
I cannot find a method for using the jdbc connector to:
(A) run a query referring to multiple tables on the same database
and
(B) store the results as an R dataframe or something that can very easily be converted to an R dataframe (such as a SparkR or sparklyr dataframe).
If possible, the solution would also only require me to specify the connection credentials once per script/notebook rather than every time I connect to the database to run a query and store the results as a dataframe.
e.g. is there a jdbc equivalent of the following:
my_server="myserver.database.windows.net"
my_db="mydatabase"
my_username="database_user"
my_pwd="abc123Ineedabetterpassword"
myconnection <- RODBC::odbcDriverConnect(paste0("DRIVER={SQL Server};
server=",my_server,";
database=",my_db,";
uid=",my_username,";
pwd=",my_pwd))
df <- RODBC::sqlQuery(myconnection,
"SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A as a
JOIN table_B as b on a.id = b.a_id
JOIN table_C as c on a.id = c.a_id
JOIN table_D as d on c.id = d.c_id
Where a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY
a.var1, b.var2
")
df2 <- RODBC::sqlQuery(myconnection,
"SELECT x.var1, y.var2, z.var3
FROM table_x as x
LEFT JOIN table_y as y on x.id = y.x_id
LEFT JOIN table_z as z on x.id = z.x_id
WHERE z.category like '%vegetable%'
AND y.category IN ('A', 'B', 'C')
")
How would I do something that gives the same results (two R dataframes df and df2) as the above using the jdbc connectors from SparkR or sparklyr inbuilt in Databricks?
I know that I can use the Spark connector and some Scala code (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector) to store the query results as a Spark dataframe, convert this to a global temp table, store the global temp table as a SparkR dataframe, and collapse this to an R dataframe. But this code is very difficult to read, requires me to change the language to Scala (which I do not know well) for one of the cells in my notebook, and takes a really long time due to the large number of steps. Because my R script often starts with several SQL queries, often against multiple different databases, this method gets very time-consuming and makes my scripts almost unreadable. Surely there is a more straightforward way?
(We are using Databricks primarily for automation via LogicApps and Azure Data Factory, and occasionally for increased RAM, rather than for parallel processing; our data (once extracted) are generally not large enough to require parallelisation and some of the models we use (e.g. lme4::lmer()) do not benefit from it.)
I worked this out eventually and want to post the answer here in case anyone else is having issues.
You can use SparkR::read.jdbc() with a query, but you must surround the query in brackets and alias the result, otherwise you will get an ambiguous-syntax error. The default port 1433 works fine for me, but if you have a different kind of SQL database you might need to change it in the URL. Then you can call SparkR::collect() on the SparkDataFrame containing the query results to convert it to an R dataframe:
e.g.
library(magrittr)  ## provides %>%

myconnection <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase;user=database_user;password=abc123Ineedabetterpassword"
df <- SparkR::read.jdbc(myconnection, "(
SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A as a
JOIN table_B as b on a.id = b.a_id
JOIN table_C as c on a.id = c.a_id
JOIN table_D as d on c.id = d.c_id
Where a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY
a.var1, b.var2) as result" ) %>%
SparkR::collect()
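The second query from the question follows the same pattern, reusing the same connection string so credentials are specified only once per script. A sketch mirroring the question's df2:

df2 <- SparkR::read.jdbc(myconnection, "(
SELECT x.var1, y.var2, z.var3
FROM table_x as x
LEFT JOIN table_y as y on x.id = y.x_id
LEFT JOIN table_z as z on x.id = z.x_id
WHERE z.category like '%vegetable%'
AND y.category IN ('A', 'B', 'C')) as result") %>%
  SparkR::collect()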
Given 2 remote tables (simulated with tbl_lazy for this example)
library("dplyr")
library("dbplyr")
t1 <- tbl_lazy(df = iris, src = dbplyr::simulate_mysql())
t2 <- tbl_lazy(df = mtcars, src = dbplyr::simulate_mysql())
How can I perform an actual* cross join between t1 and t2 using R and dbplyr?
* i.e. using CROSS JOIN in the translated SQL query
Note that I know how to perform all the other types of joins, this is precisely about CROSS joins.
I am aware of the following trick:
joined <- t1 %>%
mutate(tmp = 1) %>%
full_join(mutate(t2, tmp = 1), by = "tmp") %>%
select(-tmp)
However
This is ugly (even if it could be hidden in a function)
I would like to take advantage of the highly optimised join capabilities of the DB, so I'd like to pass a real SQL CROSS JOIN. Using show_query(joined) shows that the generated SQL query uses LEFT JOIN.
Sadly, there is no cross_join operator in dplyr, and sql_join(t1, t2, type = "cross") does not work either (it is not implemented for tbls; it works only on DB connections).
How can I generate an SQL CROSS JOIN with dbplyr?
According to the dbplyr NEWS file, since version 1.10, if you use a full_join(..., by = character()), it will "promote" the join to a cross join. This doesn't seem to be documented anywhere else yet, but searching the dbplyr Github repo for "cross" turned it up in both code and the NEWS file.
This syntax does not seem to work for local data frames, only via SQL.
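Applied to the tables above, that looks like the following sketch (assuming a dbplyr version recent enough to perform the promotion); show_query() lets you confirm the translated SQL:

joined <- t1 %>%
  full_join(t2, by = character())

show_query(joined)  ## the generated SQL should now use CROSS JOIN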
When you use the Ibis API to query Impala, Ibis for some reason forces the join to become a subquery (and when you join 4-5 tables it suddenly becomes super slow). It simply won't join normally, because of the column-name overlap problem on joins. I want a way to quickly rename the columns; isn't that how SQL usually works?
i0 = impCon.table('shop_inventory')
s0 = impCon.table('shop_expenditure')
s0 = s0.relabel({'element_date': 'spend_element_date', 'element_shop_item': 'spend_shop_item'})
jn = i0.inner_join(s0, [i0['element_date'] == s0['spend_element_date'], i0['element_shop_item'] == s0['spend_shop_item']])
jn.materialize()
jn.execute(limit=900)
Then Ibis generates SQL that subqueries without my suggesting it:
SELECT *
FROM (
SELECT `element_date`, `element_shop_item`, `element_address`, `element_expiration`,
`element_category`, `element_description`
FROM dbp.`shop_inventory`
) t0
INNER JOIN (
SELECT `element_shop_item` AS `spend_shop_item`, `element_comm` AS `spend_comm`,
`element_date` AS `spend_date`, `element_amount`,
`element_spend_type`, `element_shop_item_desc`
FROM dbp.`shop_spend`
) t1
ON (`element_shop_item` = t1.`spend_shop_item`) AND
(`element_category` = t1.`spend_category`) AND
(`element_subcategory` = t1.`spend_subcategory`) AND
(`element_comm` = t1.`spend_comm`) AND
(`element_date` = t1.`spend_date`)
LIMIT 900
Why is this so difficult?
It should ideally be as simple as:
jn = i0.inner_join(s0, [s0['element_date'].as('spend_date') == i0['element_date']]
to generate a single: SELECT s0.element_date as spend_date, i0.element_date INNER JOIN s0 dbp.shop_spend ON s0.spend_date == i0.element_date
right?
Are we never allowed to have the same column names on tables that are being joined? I am pretty sure that in raw SQL you can just use "X AS Y" without needing a subquery.
I spent the last few hours struggling with this same issue. A better solution I found is the following: join keeping the variable names the same; then, before you materialize, select only a subset of the variables such that there isn't any overlap.
So in your code it would look something like this:
jn = i0.inner_join(s0, [i0['element_date'] == s0['element_date'], i0['element_shop_item'] == s0['element_shop_item']])
expr = jn[i0, s0['variable_of_interest_1'], s0['variable_of_interest_2']]
expr.materialize()
See here for more resources:
https://docs.ibis-project.org/sql.html