Generate CROSS JOIN queries with dbplyr - sql

Given 2 remote tables (simulated with tbl_lazy for this example)
library("dplyr")
library("dbplyr")
t1 <- tbl_lazy(df = iris, src = dbplyr::simulate_mysql())
t2 <- tbl_lazy(df = mtcars, src = dbplyr::simulate_mysql())
How can I perform an actual* cross join between t1 and t2 using R and dbplyr?
* i.e. using CROSS JOIN in the translated SQL query
Note that I know how to perform all the other types of joins, this is precisely about CROSS joins.
I am aware of the following trick:
joined <- t1 %>%
mutate(tmp = 1) %>%
full_join(mutate(t2, tmp = 1), by = "tmp") %>%
select(-tmp)
However
This is ugly (even if it could be hidden in a function)
I would like to take advantage of the highly optimised join capabilities of the DB, so I'd like to pass a real SQL CROSS JOIN. Using show_query(joined) shows that the generated SQL query uses LEFT JOIN.
Sadly, there is no cross_join operator in dplyr and sql_join(t1, t2, type = "cross") does not work either (not implemented for tbls, works only on DB connections).
How can I generate an SQL CROSS JOIN with dbplyr?

According to the dbplyr NEWS file, since version 1.1.0, if you use a full_join(..., by = character()), it will "promote" the join to a cross join. This doesn't seem to be documented anywhere else yet, but searching the dbplyr GitHub repo for "cross" turned it up in both the code and the NEWS file.
This syntax does not seem to work for local data frames, only via SQL.
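To make the target concrete, here is a minimal sketch of what an explicit CROSS JOIN returns compared with the dummy-key trick from the question. It uses Python's built-in sqlite3 module purely as a stand-in database; the tables t1 and t2 and their columns a and b are made up for illustration and are not the iris/mtcars tables above.

```python
import sqlite3

# In-memory stand-in database with two tiny, hypothetical tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (a INTEGER);
    CREATE TABLE t2 (b TEXT);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES ('x'), ('y');
""")

# Explicit cross join: every row of t1 is paired with every row of t2.
cross_rows = con.execute(
    "SELECT t1.a, t2.b FROM t1 CROSS JOIN t2 ORDER BY a, b"
).fetchall()

# The dummy-key trick: join on a constant column that is equal in every row.
trick_rows = con.execute("""
    SELECT l.a, r.b
    FROM (SELECT a, 1 AS tmp FROM t1) AS l
    JOIN (SELECT b, 1 AS tmp FROM t2) AS r ON l.tmp = r.tmp
    ORDER BY a, b
""").fetchall()

print(len(cross_rows))           # 3 rows x 2 rows
print(cross_rows == trick_rows)  # both approaches return the same pairs
```

Both queries return the same rows; the difference the question cares about is only in which SQL the database receives and how it plans the join.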

Related

How can I replicate the following SQL-esque left join in R?

I'm very well-versed in SQL, and an absolute novice in R. Unfortunately, due to an update in company policy, we must use Athena to run our SQL queries. Athena is weak, so despite having a complete/correct SQL query, I cannot run it to manipulate my large, insurance-based dataset.
I have seen similar posts, but haven't managed to crack my own problem trying to utilize the methodologies provided. Here are the details:
After running the SQL block in R (using a connection string), I have a countrywide data block denoted CW_Data in R
Each record contains a policy with a multitude of characteristics (columns) such as the Policy_Number, Policy_Effective_Date, Policy_Earned_Premium
Athena breaks down when I try to add two columns based on the already-existing ones
Namely, I want to left join such that I can obtain new columns for Policy_Prior_Year_Earned_Premium and Policy_Second_Prior_Year_Earned_Premium
Per the above, I know I need to add columns such that, for a given policy, I can find the record where the Policy_Number=Policy_Number and Policy_Effective_Date = Policy_Effective_Date-1 or Policy_Effective_Date-2 years. This is quite simple in SQL, but I cannot get it in R for the life of me.
Here is the (watered-down) left join I attempted in SQL using CTEs that breaks Athena (even if the SQL is run via R):
All_Info as (
Select
PC.Policy_Number
,PC.Policy_Effective_Date
,PC.Policy_EP
from Policy_Characteristics as PC
left join Almost_All_Info as AAI
on AAI.Policy_Number = PC.Policy_Number
and AAI.Policy_Effective_Date = date_add('year', -1, PC.Policy_Effective_Date)
left join All_Segments as AST
on AST.Policy_Number = PC.Policy_Number
and AST.Policy_Effective_Date = date_add('year', -2, PC.Policy_Effective_Date)
Group by
PC.Policy_Number
,PC.Policy_Effective_Date
,PC.Policy_EP
)
As @zephryl pointed out, examples of data and expected result would be very helpful.
From your description, the R equivalent might look like this:
library(dplyr)
library(lubridate) ## datetime helpers
All_Info <-
  Policy_Characteristics |>
  select(Policy_Number,
         Policy_Effective_Date, ## make sure this has class "Date"
         Policy_EP
  ) |>
  ## subtract calendar years (a duration of 365.25 days would not land on the same day);
  ## %m-% rolls Feb 29 back to Feb 28 instead of returning NA
  mutate(one_year_earlier = Policy_Effective_Date %m-% years(1),
         two_years_earlier = Policy_Effective_Date %m-% years(2)
  ) |>
  left_join(Almost_All_Info,
            by = c('Policy_Number' = 'Policy_Number',
                   'one_year_earlier' = 'Policy_Effective_Date'
            )
  ) |>
  left_join(All_Segments,
            by = c('Policy_Number' = 'Policy_Number',
                   'two_years_earlier' = 'Policy_Effective_Date'
            )
  ) |>
  group_by(Policy_Number,
           Policy_Effective_Date,
           Policy_EP
  )
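The same "join each policy to its own record one and two years earlier" idea can be sketched end to end on toy data. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than Athena or R, and the table policy, its columns, and the sample rows are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE policy (
        policy_number  TEXT,
        effective_date TEXT,   -- ISO dates so SQLite's date() can shift them
        earned_premium REAL
    );
    INSERT INTO policy VALUES
        ('P1', '2019-03-01', 100.0),
        ('P1', '2020-03-01', 110.0),
        ('P1', '2021-03-01', 125.0),
        ('P2', '2021-06-15', 300.0);
""")

# Self-join twice: once to the record one year earlier, once to two years earlier.
rows = con.execute("""
    SELECT cur.policy_number,
           cur.effective_date,
           cur.earned_premium,
           py1.earned_premium AS prior_year_ep,
           py2.earned_premium AS second_prior_year_ep
    FROM policy AS cur
    LEFT JOIN policy AS py1
           ON py1.policy_number = cur.policy_number
          AND py1.effective_date = date(cur.effective_date, '-1 years')
    LEFT JOIN policy AS py2
           ON py2.policy_number = cur.policy_number
          AND py2.effective_date = date(cur.effective_date, '-2 years')
    ORDER BY cur.policy_number, cur.effective_date
""").fetchall()

for row in rows:
    print(row)
```

Policies without a matching earlier record keep their row and get NULL (None in Python) in the prior-year columns, which is the behaviour a left join is meant to give.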

How to use SparkR::read.jdbc() or sparklyr::spark_read_jdbc() to get results of SQL query rather than whole table?

I usually use RODBC locally to query my databases. However our company has recently moved to Azure Databricks which does not inherently support RODBC or other odbc connections, but does support jdbc connections which I have not previously used.
I have read the documentation for SparkR::read.jdbc() and sparklyr::spark_read_jdbc() but these seem to pull an entire table from the database rather than just the results of a query, which is not suitable for me as I never have to pull whole tables and instead run queries that join multiple tables together but only return a very small subset of the data in each table.
I cannot find a method for using the jdbc connector to:
(A) run a query referring to multiple tables on the same database
and
(B) store the results as an R dataframe or something that can very easily be converted to an R dataframe (such as a SparkR or sparklyr dataframe).
If possible, the solution would also only require me to specify the connection credentials once per script/notebook rather than every time I connect to the database to run a query and store the results as a dataframe.
e.g. is there a jdbc equivalent of the following:
my_server="myserver.database.windows.net"
my_db="mydatabase"
my_username="database_user"
my_pwd="abc123Ineedabetterpassword"
myconnection <- RODBC::odbcDriverConnect(paste0("DRIVER={SQL Server};
server=",my_server,";
database=",my_db,";
uid=",my_username,";
pwd=",my_pwd))
df <- RODBC::sqlQuery(myconnection,
"SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A as a
JOIN table_B as b on a.id = b.a_id
JOIN table_C as c on a.id = c.a_id
JOIN table_D as d on c.id = d.c_id
Where a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY
a.var1, b.var2
")
df2 <- RODBC::sqlQuery(myconnection,
"SELECT x.var1, y.var2, z.var3
FROM table_x as x
LEFT JOIN table_y as y on x.id = y.x_id
LEFT JOIN table_z as z on x.id = z.x_id
WHERE z.category like '%vegetable%'
AND y.category IN ('A', 'B', 'C')
")
How would I do something that gives the same results (two R dataframes df and df2) as the above using the jdbc connectors from SparkR or sparklyr inbuilt in Databricks?
I know that I can use the spark connector and some scala code (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector) to store the query results as a spark dataframe, convert this to a global temp table, store the global temp table as a SparkR dataframe and collapse this to an R dataframe, but this code is very difficult to read, requires me to change the language to scala (which I do not know well) for one of the cells in my notebook, and takes a really long time due to the large number of steps. Because my R script often starts with several SQL queries -- often to multiple different databases -- this method gets very time-consuming and makes my scripts almost unreadable. Surely there is a more straightforward way?
(We are using Databricks primarily for automation via LogicApps and Azure Data Factory, and occasionally for increased RAM, rather than for parallel processing; our data (once extracted) are generally not large enough to require parallelisation and some of the models we use (e.g. lme4::lmer()) do not benefit from it.)
I worked this out eventually and want to post the answer here in case anyone else is having issues.
You can use SparkR::read.jdbc() with a query but you must surround the query in brackets and alias the results as something, otherwise you will get an ambiguous syntax error. The "portnum" seems to work fine for me as the default 1433 but if you have a different kind of SQL database you might need to change this in the URL. Then you can call SparkR::collect() on the SparkDataFrame containing the query results to convert it to an R dataframe:
e.g.
myconnection <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase;user=database_user;password=abc123Ineedabetterpassword"
df <- read.jdbc( myconnection, "(
SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A as a
JOIN table_B as b on a.id = b.a_id
JOIN table_C as c on a.id = c.a_id
JOIN table_D as d on c.id = d.c_id
Where a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY
a.var1, b.var2) as result" ) %>%
SparkR::collect()
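The wrapping rule described above (parenthesise the query and give it an alias, so it can stand where a table name is expected) is easy to get wrong by hand when a script runs many queries. A minimal sketch of a helper that does the wrapping is shown below in Python; the function name as_jdbc_table and the default alias "result" are hypothetical and only mirror the example above.

```python
def as_jdbc_table(query: str, alias: str = "result") -> str:
    """Wrap a SELECT statement as '(query) AS alias'.

    The returned string can be passed wherever a JDBC reader expects
    a table name, so only the query result is pulled from the database.
    """
    # Drop surrounding whitespace and a trailing semicolon, which is not
    # valid inside a parenthesised subquery.
    body = query.strip().rstrip(";").strip()
    return f"({body}) AS {alias}"


wrapped = as_jdbc_table("""
    SELECT a.var1, b.var2
    FROM table_A AS a
    JOIN table_B AS b ON a.id = b.a_id;
""")
print(wrapped)
```

Keeping the connection URL in one variable and passing each query through a helper like this means the credentials are specified once per script, as the question asked.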

Ibis Impala JOIN problem with relabel/name 'column AS newName'

When you use the Ibis API to query Impala, for some reason Ibis forces the join to become a subquery (when you join 4-5 tables it suddenly becomes super slow). It simply won't join normally, due to the column name overlap problem on joins. I want a way to quickly rename the columns; isn't that how SQL usually works?
i0 = impCon.table('shop_inventory')
s0 = impCon.table('shop_expenditure')
s0 = s0.relabel({'element_date': 'spend_element_date', 'element_shop_item': 'spend_shop_item'})
jn = i0.inner_join(s0, [i0['element_date'] == s0['spend_element_date'], i0['element_shop_item'] == s0['spend_shop_item']])
jn.materialize()
jn.execute(limit=900)
Then you have IBIS generating SQL that is SUBQUERYING it without me suggesting it:
SELECT *
FROM (
SELECT `element_date`, `element_shop_item`, `element_address`, `element_expiration`,
`element_category`, `element_description`
FROM dbp.`shop_inventory`
) t0
INNER JOIN (
SELECT `element_shop_item` AS `spend_shop_item`, `element_comm` AS `spend_comm`,
`element_date` AS `spend_date`, `element_amount`,
`element_spend_type`, `element_shop_item_desc`
FROM dbp.`shop_spend`
) t1
ON (`element_shop_item` = t1.`spend_shop_item`) AND
(`element_category` = t1.`spend_category`) AND
(`element_subcategory` = t1.`spend_subcategory`) AND
(`element_comm` = t1.`spend_comm`) AND
(`element_date` = t1.`spend_date`)
LIMIT 900
Why is this so difficult?
It should be ideally as simple as:
jn = i0.inner_join(s0, [s0['element_date'].as('spend_date') == i0['element_date']])
to generate a single: SELECT s0.element_date as spend_date, i0.element_date INNER JOIN s0 dbp.shop_spend ON s0.spend_date == i0.element_date
right?
Are we never allowed to have the same column names on tables that are being joined? I am pretty sure that in raw SQL you can just use "X AS Y" without needing a subquery.
I spent the last few hours struggling with this same issue. A better solution I found is to do the following. Join keeping the variable names the same. Then, before you materialize, only select a subset of the variables such that there isn't any overlap.
So in your code it would look something like this:
jn = i0.inner_join(s0, [i0['element_date'] == s0['element_date'], i0['element_shop_item'] == s0['element_shop_item']])
expr = jn[i0, s0['variable_of_interest_1'],s0['variable_of_interest_2']]
expr.materialize()
See here for more resources
https://docs.ibis-project.org/sql.html

Syntax error. in query (MS Access sql)

I have an MS Access query which I am running in my C# application. I am able to run the query fine using SSMS (I know this isn't Access SQL, but it's all I can use), but when I import it into my C# application I get an incorrect syntax error. (My C# application reads from Access dbf files.) Here is the full SQL below:
SELECT ([T2_BRA].[REF] + [F7]) AS NewStyle,
Sum(T2_BRA.Q11) AS QTY1, Sum(T2_BRA.Q12) AS QTY2,
Sum(T2_BRA.Q13) AS QTY3, Sum(T2_BRA.Q14) AS QTY4, Sum(T2_BRA.Q15) AS QTY5, Sum(T2_BRA.Q16) AS QTY6, Sum(T2_BRA.Q17) AS QTY7, Sum(T2_BRA.Q18) AS QTY8,
Sum(T2_BRA.Q19) AS QTY9, Sum(T2_BRA.Q20) AS QTY10, Sum(T2_BRA.Q21) AS QTY11, Sum(T2_BRA.Q22) AS QTY12, Sum(T2_BRA.Q23) AS QTY13, T2_HEAD.REF,
Sum(T2_BRA.LY11) AS LY1, Sum(T2_BRA.LY12) AS LY2, Sum(T2_BRA.LY13) AS LY3, Sum(T2_BRA.LY14) AS LY4, Sum(T2_BRA.LY15) AS LY5,
Sum(T2_BRA.LY16) AS LY6, Sum(T2_BRA.LY17) AS LY7, Sum(T2_BRA.LY18) AS LY8, Sum(T2_BRA.LY19) AS LY9, Sum(T2_BRA.LY20) AS LY10,
Sum(T2_BRA.LY21) AS LY11, Sum(T2_BRA.LY22) AS LY12, Sum(T2_BRA.LY23) AS LY13, T2_BRA.BRANCH, T2_HEAD.LASTDELV, T2_EAN.EAN_CODE, T2_SIZES.S01 AS S1,
T2_SIZES.S02 AS S2,
T2_SIZES.S03 AS S3,
T2_SIZES.S04 AS S4,
T2_SIZES.S05 AS S5,
T2_SIZES.S06 AS S6,
T2_SIZES.S07 AS S7,
T2_SIZES.S08 AS S8,
T2_SIZES.S09 AS S9,
T2_SIZES.S10 AS S10,
T2_SIZES.S11 AS S11,
T2_SIZES.S12 AS S12,
T2_SIZES.S13 AS S13
FROM ((((((T2_BRA INNER JOIN T2_HEAD ON T2_BRA.REF = T2_HEAD.REF)) INNER JOIN T2_SIZES ON T2_HEAD.SIZERANGE = T2_SIZES.SIZERANGE) INNER JOIN
(SELECT Right(T2_LOOK.[KEY],3) AS NewCol, T2_LOOK.F1 AS MasterColour, Left(T2_LOOK.[KEY],3) AS Col, T2_LOOK.F7
FROM T2_LOOK
WHERE (Left(T2_LOOK.[KEY],3))='COL') as Colour ON T2_BRA.COLOUR = Colour.NewCol) LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE (SELECT ('#' + ([T2_BRA].[REF] + [F7]) + '#'))))
WHERE [T2_BRA].[REF] = '010403' AND T2_BRA.BRANCH in ('A','G')
GROUP BY ([T2_BRA].[REF] + [F7]),T2_HEAD.REF, T2_BRA.BRANCH, T2_HEAD.LASTDELV, T2_EAN.EAN_CODE, T2_SIZES.S01,
T2_SIZES.S02, T2_SIZES.S03, T2_SIZES.S04, T2_SIZES.S05, T2_SIZES.S06, T2_SIZES.S07, T2_SIZES.S08, T2_SIZES.S09, T2_SIZES.S10, T2_SIZES.S11, T2_SIZES.S12, T2_SIZES.S13
The line I am getting the syntax error is:
LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE (SELECT ('#' + ([T2_BRA].[REF] + [F7]) + '#')
Any help would be great! :)
Your problem JOIN clause has some issues for the MS Access dialect:
SELECT that uses a column and table reference must have FROM source;
String concatenation does not use + but & operator;
LIKE expressions can be used in ON clauses but the comparison will be row by row (not searching values across all rows of joining table as possibly intended).
Correcting above still imposes a challenge since you are attempting to join a table by the LIKE expression in a LEFT JOIN relationship.
Consider first comma-separating your table, T2_EAN, which equates to a cross join then add a WHERE clause running an EXISTS subquery. Doing so, WHERE becomes the implicit join and T2_EAN column will point to field in main query. Do be aware other tables in query must use INNER JOIN for this comma-separated table. And adjust parentheses with removal of LEFT JOIN.
FROM T2_EAN, (((((
...
WHERE [T2_BRA].[REF] = '010403' AND T2_BRA.BRANCH in ('A','G')
AND EXISTS
(SELECT 1 FROM [T2_BRA] t
WHERE T2_EAN.T2T_CODE LIKE ('%' & (t.[REF] & t.[F7]) & '%'))
Now, the challenge here is the WHERE will correspond to an INNER JOIN and not LEFT JOIN. To overcome this, consider adding a UNION (not UNION ALL) query exactly the same as above but without the EXISTS subquery. This will then return records that did not meet LIKE criteria and UNION will leave out duplicates. See LEFT JOIN Equivalent here. Be sure to add a NULL to SELECT wherever T2_EAN column was referenced:
SELECT ... T2_HEAD.LASTDELV, T2_EAN.EAN_CODE, T2_SIZES.S01 AS S1 ...
UNION
SELECT ... T2_HEAD.LASTDELV, NULL AS EAN_CODE, T2_SIZES.S01 AS S1 ...
Do note: performance is not guaranteed with this adjustment. Further considerations include:
Once query compiles and runs, be sure to save this large query or view as a stored object in the MS Access database and not as a scripted C# string query. Even if you do not have MS Access GUI .exe, you can save queries via code using MS Access' querydefs object with VBA (i.e., Excel VBA) or COM-interface with C# or any other language that supports COM like open-source Python, PHP, R.
Then have C# app simply retrieve the view for its purposes: SELECT * FROM mySavedQuery. Stored queries tend to be more efficient especially for many joins and complex queries than coded queries since the Access engine saves best execution plan and caches stats.
Remove the need of LIKE by saving matching values without extraneous other characters so = can be used as I believe MS Access's LIKE will not use indexes in query plans.
Upsize your Access database to SQL Server for more sophisticated handling with the T-SQL dialect. SQL Server has easy facilities in SSMS to import Access .mdb/.accdb files.
Your LEFT OUTER JOIN's ON clause makes no sense:
LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE (SELECT ('#' + ([T2_BRA].[REF] + [F7]) + '#')
You need to join T2_EAN to your values in ALREADY PRESENT T2_BRA table in your FROM clause. By sticking T2_BRA into a subquery here you are bringing the table in twice, which is nonsense. It's also not allowed to use a subquery inside a LIKE condition.
If it were allowed and did make sense, you would end up with a cartesian product between all the intermediate result set from those inner joins and your left outer join'd table, which is almost definitely not what you are after.
Instead (probably something like):
LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE '#' + [T2_BRA].[REF] + [F7] + '#'
This is now saying "Left outer join t2_ean to T2_Bra where the T2T_Code matches the concatenation of <any one digit> + T2_Bra.Ref + F7 + <any one digit>". Without knowing your data, I can't vouch for that being the thing you want, but it feels like the closest interpretation when reverse engineering your incorrect query.
You mention in a comment "I have tried using all the wildcard symbols *, # and ?" Don't just try wildcard symbols hoping something will work. They each do something VERY different. Use the one that you need for your situation. Decent explanation of the three wildcards that work with the LIKE operator in Access here. You may want to switch to the asterisk while debugging (since it's the most wide open of the wildcards) and then, once you are getting reasonable results, use the much tighter # (match only one digit) operator.
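To see how differently the three wildcards behave, here is a small sketch that emulates them with regular expressions. It is written in Python rather than Access SQL, the function name access_like is made up, and it covers only *, ? and # (not bracketed character lists); case-insensitive matching is an assumption based on Access's default text comparison.

```python
import re

# Access-style wildcards mapped to regex equivalents:
#   *  any run of characters (including none)
#   ?  exactly one character
#   #  exactly one digit
_WILDCARDS = {"*": ".*", "?": ".", "#": "[0-9]"}


def access_like(text: str, pattern: str) -> bool:
    """Return True if text matches an Access-style LIKE pattern (sketch)."""
    regex = "".join(_WILDCARDS.get(ch, re.escape(ch)) for ch in pattern)
    return re.fullmatch(regex, text, flags=re.IGNORECASE | re.DOTALL) is not None


code = "010403RED"  # stand-in for [T2_BRA].[REF] + [F7]

print(access_like("7" + code + "5", "#" + code + "#"))      # digit on both sides
print(access_like("X" + code + "Y", "#" + code + "#"))      # letters are not digits
print(access_like("X" + code + "Y", "?" + code + "?"))      # any single character
print(access_like("ABC" + code + "XYZ", "*" + code + "*"))  # anything around it
```

Note that # and ? each require exactly one character, so a code with nothing before or after the concatenated value matches only the * pattern.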

ORA-00918: column ambiguously defined, using DB Link

When I execute the query below I get the following error message :
ORA-00918: column ambiguously defined
ORA-02063: preceding line from ABC
Query:
SELECT
dos.*,
cmd.*,
cmd_r.*,
adr_inc.*,
adr_veh.*,
loc.*,
fou_d.*,
fou_r.*, --Works if I comment this line
mot.*
FROM
DOSSIERS@ABC dos
LEFT JOIN CMDS@ABC cmd ON cmd.DOS_CODE_ID = dos.dos_code_id
LEFT JOIN CMDS_RECCSTR@ABC cmd_r ON cmd_r.DOS_CODE_ID = dos.DOS_CODE_ID AND cmd_r.CMD_CODE_ID = cmd.CMD_CODE_ID AND cmd_r.CMD_DT_CREAT = cmd.CMD_DT_CREAT
LEFT JOIN HISTO_ADR@ABC adr_inc ON adr_inc.DOS_CODE_ID = dos.DOS_CODE_ID
LEFT JOIN HISTO_ADR@ABC adr_veh ON adr_veh.DOS_CODE_ID = dos.DOS_CODE_ID
LEFT JOIN LOC@ABC loc ON dos.DOS_CODE_ID = loc.DOS_CODE_ID
LEFT JOIN FOURNISS@ABC fou_d ON fou_d.PAY_CODE_ID = loc.PAY_CODE_ID_D AND fou_d.FOU_CODE_ID = loc.FOU_CODE_ID_D
LEFT JOIN FOURNISS@ABC fou_r ON fou_r.PAY_CODE_ID = loc.PAY_CODE_ID_R AND fou_r.FOU_CODE_ID = loc.FOU_CODE_ID_R
LEFT JOIN REF_MOT@ABC mot ON mot.RMR_CODE_ID = cmd_r.RMR_CODE_ID
WHERE
dos.REF_EXT = 'XXXXXXX'
If I comment fou_r.* in SELECT it works.
The following queries don't work either:
SELECT *
FROM ... ;
SELECT (SELECT count(xxx) FROM ...)
FROM ...;
I looked at similar issues on SO, but they all involved complex queries or many SELECTs inside WHERE. Mine is simple, which is why I don't understand what could be wrong.
Current Database: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
Target Database (refers to db link ABC target): Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bi
Client: Toad for Oracle 9.7.2.5
You seem to be hitting bug 13589271. I can't share details from MOS, but there isn't much to share anyway. It's related to the remote table having a column with a 30-character name, though, as you have in your remote FOURNISS table.
Unfortunately simply aliasing the column in your query, like this:
fou_d.COLUMN_WITH_30_CHARACTERS_NAME alias_a,
fou_r.COLUMN_WITH_30_CHARACTERS_NAME alias_b,
... doesn't help and still gets the same error, as the alias is applied by the local database and the problem seems to be during the remote access. What does seem to work is using an in-line view to apply a column alias before the join:
...
LEFT JOIN LOC@ABC loc ON dos.DOS_CODE_ID = loc.DOS_CODE_ID
LEFT JOIN (
SELECT PAY_CODE_ID, FOU_CODE_ID, COLUMN_WITH_30_CHARACTERS_NAME alias_a FROM FOURNISS@ABC
) fou_d ON fou_d.PAY_CODE_ID = loc.PAY_CODE_ID_D AND fou_d.FOU_CODE_ID = loc.FOU_CODE_ID_D
LEFT JOIN (
SELECT PAY_CODE_ID, FOU_CODE_ID, COLUMN_WITH_30_CHARACTERS_NAME alias_b FROM FOURNISS@ABC
) fou_r ON fou_r.PAY_CODE_ID = loc.PAY_CODE_ID_R AND fou_r.FOU_CODE_ID = loc.FOU_CODE_ID_R
LEFT JOIN REF_MOT@ABC mot ON mot.RMR_CODE_ID = cmd_r.RMR_CODE_ID
...
This even works if you give the column the same alias in both inline views. The downside is that you have to explicitly list all of the columns from the table (or at least those you're interested in) in order to be able to apply the alias to the problematic one, but having done so you can still use fou_d.* and fou_r.* in the outer select list.
I don't have an 11.2.0.2 database but I've run this successfully in an 11.2.0.3 database which still showed the ORA-00918 error from your original code. It's possible something else in 11.2.0.2 will stop this workaround being effective, of course. I don't see the original problem in 11.2.0.4 at all, so upgrading to that terminal patch release might be a better long-term solution.
Using * is generally considered a bad practice anyway though, not least because you're going to get a lot of duplicated columns from the joins (lots of dos_code_id in each row, for example); but you're also likely to be getting other data you don't really want, and anything that consumes this result set will have to assume the column order is always the same in those tables - any variation, or later addition or removal of a column, will cause problems.
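The duplicated-column point can be seen on a toy schema. The sketch below uses Python's built-in sqlite3 module, not Oracle, so it does not reproduce the database-link bug; it only illustrates how joining the same table twice with * yields repeated column names, and how explicit aliases avoid that. The tables and columns are simplified stand-ins for LOC and FOURNISS.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE loc (loc_id INTEGER, fou_id_d INTEGER, fou_id_r INTEGER);
    CREATE TABLE fourniss (fou_id INTEGER, name TEXT);
    INSERT INTO loc VALUES (1, 10, 20);
    INSERT INTO fourniss VALUES (10, 'departure'), (20, 'return');
""")

# Same table joined twice, selected with *: column names repeat.
star = con.execute("""
    SELECT fou_d.*, fou_r.*
    FROM loc
    LEFT JOIN fourniss AS fou_d ON fou_d.fou_id = loc.fou_id_d
    LEFT JOIN fourniss AS fou_r ON fou_r.fou_id = loc.fou_id_r
""")
star_names = [col[0] for col in star.description]

# Explicit select list with aliases: every output column is unambiguous.
explicit = con.execute("""
    SELECT fou_d.fou_id AS fou_d_id, fou_d.name AS fou_d_name,
           fou_r.fou_id AS fou_r_id, fou_r.name AS fou_r_name
    FROM loc
    LEFT JOIN fourniss AS fou_d ON fou_d.fou_id = loc.fou_id_d
    LEFT JOIN fourniss AS fou_r ON fou_r.fou_id = loc.fou_id_r
""")
explicit_names = [col[0] for col in explicit.description]
explicit_row = explicit.fetchone()

print(star_names)
print(explicit_names)
```

Anything consuming the first result has to rely on column position, while the second can be read by name.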