How to use SparkR::read.jdbc() or sparklyr::spark_read_jdbc() to get results of SQL query rather than whole table? - sql

I usually use RODBC locally to query my databases. However our company has recently moved to Azure Databricks which does not inherently support RODBC or other odbc connections, but does support jdbc connections which I have not previously used.
I have read the documentation for SparkR::read.jdbc() and sparklyr::spark_read_jdbc() but these seem to pull an entire table from the database rather than just the results of a query, which is not suitable for me as I never have to pull whole tables and instead run queries that join multiple tables together but only return a very small subset of the data in each table.
I cannot find a method for using the jdbc connector to:
(A) run a query referring to multiple tables on the same database
and
(B) store the results as an R dataframe or something that can very easily be converted to an R dataframe (such as a SparkR or sparklyr dataframe).
If possible, the solution would also only require me to specify the connection credentials once per script/notebook rather than every time I connect to the database to run a query and store the results as a dataframe.
e.g. is there a jdbc equivalent of the following:
my_server="myserver.database.windows.net"
my_db="mydatabase"
my_username="database_user"
my_pwd="abc123Ineedabetterpassword"
myconnection <- RODBC::odbcDriverConnect(paste0("DRIVER={SQL Server};
server=",my_server,";
database=",my_db,";
uid=",my_username,";
pwd=",my_pwd))
df <- RODBC::sqlQuery(myconnection,
"SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A as a
JOIN table_B as b on a.id = b.a_id
JOIN table_C as c on a.id = c.a_id
JOIN table_D as d on c.id = d.c_id
Where a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY
a.var1, b.var2
")
df2 <- RODBC::sqlQuery(myconnection,
"SELECT x.var1, y.var2, z.var3
FROM table_x as x
LEFT JOIN table_y as y on x.id = y.x_id
LEFT JOIN table_z on as z on x.id = z.x_id
WHERE z.category like '%vegetable%'
AND y.category IN ('A', 'B', 'C')
“)
How would I do something that gives the same results (two R dataframes df and df2) as the above using the jdbc connectors from SparkR or sparklyr inbuilt in Databricks?
I know that I can use the spark connector and some scala code (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector) to store the query results as a spark dataframe, convert this to a global temp table, store the global temp table as a SparkR dataframe and collapse this to an R dataframe, but this code is very difficult to read, requires me to change the language to scala (which I do not know well) for one of the cells in my notebook, and takes a really long time due to the large amount of steps. Because my R script often starts with several SQL queries -- often to multiple different databases -- this method gets very time-consuming and makes my scripts almost unreadable. Surely there is a more straightforward way?
(We are using Databricks primarily for automation via LogicApps and Azure Data Factory, and occasionally for increased RAM, rather than for parallel processing; our data (once extracted) are generally not large enough to require parallelisation and some of the models we use (e.g. lme4::lmer()) do not benefit from it.)

I worked this out eventually and want to post the answer here in case anyone else is having issues.
You can use SparkR::read.jdbc() with a query but you must surround the query in brackets and alias the results as something, otherwise you will get an ambiguous syntax error. The "portnum" seems to work fine for me as the default 1433 but if you have a different kind of SQL database you might need to change this in the URL. Then you can call SparkR::collect() on the SparkDataFrame containing the query results to convert it to an R dataframe:
e.g.
myconnection <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase;user=database_user;password=abc123Ineedabetterpassword"
df <- read.jdbc( myconnection, "(
SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A as a
JOIN table_B as b on a.id = b.a_id
JOIN table_C as c on a.id = c.a_id
JOIN table_D as d on c.id = d.c_id
Where a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY
a.var1, b.var2) as result" ) %>%
SparkR::collect()

Related

Generate CROSS JOIN queries with dbplyr

Given 2 remote tables (simulated with tbl_lazy for this example)
library("dplyr")
library("dbplyr")
t1 <- tbl_lazy(df = iris, src = dbplyr::simulate_mysql())
t2 <- tbl_lazy(df = mtcars, src = dbplyr::simulate_mysql())
How can I perform an actual* cross join between t1 and t2 using R and dbplyr?
* i.e. using CROSS JOIN in the translated SQL query
Note that I know how to perform all the other types of joins, this is precisely about CROSS joins.
I am aware of the following trick:
joined <- t1 %>%
mutate(tmp = 1) %>%
full_join(mutate(t2, tmp = 1), by = "tmp") %>%
select(-tmp)
However
This is ugly (even if it could be hidden in a function)
I would like to take advantage of the highly optimised join capabilities of the DB, so I'd like to pass a real SQL CROSS JOIN. Using show_query(joined) shows that the generated SQL query uses LEFT JOIN.
Sadly, there is no cross_join operator in dplyr and sql_join(t1, t2, type = "cross") does not work either (not implemented for tbls, works only on DB connections).
How can I generate an SQL CROSS JOIN with dbplyr?
According to the dbplyr NEWS file, since version 1.10, if you use a full_join(..., by = character()), it will "promote" the join to a cross join. This doesn't seem to be documented anywhere else yet, but searching the dbplyr Github repo for "cross" turned it up in both code and the NEWS file.
This syntax does not seem to work for local data frames, only via SQL.

Group by concat in SQL Server

I’m using multiple joins for a specific logic , but encountered a problem . Some of the records has 1-2 relation in one of the tables which mess up my output. I want to concat all these string so it will appear it one record, but I don’t know how to do it in sql server . In oracle and MySQL it’s easy but I tried playing with online examples and failed miserably.
My query:
SELECT c.customerName,c.Guid,p.campaignTitle ,
(SELECT k.campaignTitle FROM [DEV_TEST2].[dbo].campaigns l JOIN [DEV_TEST2].[dbo].campaignstitle k on k.campaignname = l.campaignname where l.campaignid = t.referrerurl) as Referrertitle,
t.activitydate,t.type
FROM [DEV_TEST2].[dbo].campaignknowncustomers c
join [DEV_TEST2].[dbo].[CampaignCustomerMatch] t ON(c.guid = t.visitorexternalid)
join [DEV_TEST2].[dbo].campaigns s ON(t.url = s.campaignid)
join [DEV_TEST2].[dbo].campaignstitle p on(s.campaignname = p.campaignname)
order by customername,activitydate
My problem is with campaigntitle column and referrertitle correlated query. Both come from the same table. I need to concat it and show only 1 row per ‘customername, guid, activitydate’

Syntax error. in query (MS Access sql)

I have a MS access query which I am running in my c sharp application, I am able to run the query fine using SSMS (I know this isn't an access sql but its all I can use) and when I import it into my c sharp application I get an incorrect syntax error. (My c sharp application reads from access dbf files) Here is the full sql below:
SELECT ([T2_BRA].[REF] + [F7]) AS NewStyle,
Sum(T2_BRA.Q11) AS QTY1, Sum(T2_BRA.Q12) AS QTY2,
Sum(T2_BRA.Q13) AS QTY3, Sum(T2_BRA.Q14) AS QTY4, Sum(T2_BRA.Q15) AS QTY5, Sum(T2_BRA.Q16) AS QTY6, Sum(T2_BRA.Q17) AS QTY7, Sum(T2_BRA.Q18) AS QTY8,
Sum(T2_BRA.Q19) AS QTY9, Sum(T2_BRA.Q20) AS QTY10, Sum(T2_BRA.Q21) AS QTY11, Sum(T2_BRA.Q22) AS QTY12, Sum(T2_BRA.Q23) AS QTY13, T2_HEAD.REF,
Sum(T2_BRA.LY11) AS LY1, Sum(T2_BRA.LY12) AS LY2, Sum(T2_BRA.LY13) AS LY3, Sum(T2_BRA.LY14) AS LY4, Sum(T2_BRA.LY15) AS LY5,
Sum(T2_BRA.LY16) AS LY6, Sum(T2_BRA.LY17) AS LY7, Sum(T2_BRA.LY18) AS LY8, Sum(T2_BRA.LY19) AS LY9, Sum(T2_BRA.LY20) AS LY10,
Sum(T2_BRA.LY21) AS LY11, Sum(T2_BRA.LY22) AS LY12, Sum(T2_BRA.LY23) AS LY13, T2_BRA.BRANCH, T2_HEAD.LASTDELV, T2_EAN.EAN_CODE, T2_SIZES.S01 AS S1,
T2_SIZES.S02 AS S2,
T2_SIZES.S03 AS S3,
T2_SIZES.S04 AS S4,
T2_SIZES.S05 AS S5,
T2_SIZES.S06 AS S6,
T2_SIZES.S07 AS S7,
T2_SIZES.S08 AS S8,
T2_SIZES.S09 AS S9,
T2_SIZES.S10 AS S10,
T2_SIZES.S11 AS S11,
T2_SIZES.S12 AS S12,
T2_SIZES.S13 AS S13
FROM ((((((T2_BRA INNER JOIN T2_HEAD ON T2_BRA.REF = T2_HEAD.REF)) INNER JOIN T2_SIZES ON T2_HEAD.SIZERANGE = T2_SIZES.SIZERANGE) INNER JOIN
(SELECT Right(T2_LOOK.[KEY],3) AS NewCol, T2_LOOK.F1 AS MasterColour, Left(T2_LOOK.[KEY],3) AS Col, T2_LOOK.F7
FROM T2_LOOK
WHERE (Left(T2_LOOK.[KEY],3))='COL') as Colour ON T2_BRA.COLOUR = Colour.NewCol) LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE (SELECT ('#' + ([T2_BRA].[REF] + [F7]) + '#'))))
WHERE [T2_BRA].[REF] = '010403' AND T2_BRA.BRANCH in ('A','G')
GROUP BY ([T2_BRA].[REF] + [F7]),T2_HEAD.REF, T2_BRA.BRANCH, T2_HEAD.LASTDELV, T2_EAN.EAN_CODE, T2_SIZES.S01,
T2_SIZES.S02, T2_SIZES.S03, T2_SIZES.S04, T2_SIZES.S05, T2_SIZES.S06, T2_SIZES.S07, T2_SIZES.S08, T2_SIZES.S09, T2_SIZES.S10, T2_SIZES.S11, T2_SIZES.S12, T2_SIZES.S13
The line I am getting the syntax error is:
LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE (SELECT ('#' + ([T2_BRA].[REF] + [F7]) + '#')
Any help would be great! :)
Your problem JOIN clause has some issues for the MS Access dialect:
SELECT that uses a column and table reference must have FROM source;
String concatenation does not use + but & operator;
LIKE expressions can be used in ON clauses but the comparison will be row by row (not searching values across all rows of joining table as possibly intended).
Correcting above still imposes a challenge since you are attempting to join a table by the LIKE expression in a LEFT JOIN relationship.
Consider first comma-separating your table, T2_EAN, which equates to a cross join then add a WHERE clause running an EXISTS subquery. Doing so, WHERE becomes the implicit join and T2_EAN column will point to field in main query. Do be aware other tables in query must use INNER JOIN for this comma-separated table. And adjust parentheses with removal of LEFT JOIN.
FROM T2_EAN, (((((
...
WHERE [T2_BRA].[REF] = '010403' AND T2_BRA.BRANCH in ('A','G')
AND EXISTS
(SELECT 1 FROM [T2_BRA] t
WHERE T2_EAN.T2T_CODE LIKE ('%' & (t.[REF] & t.[F7]) & '%')
Now, the challenge here is the WHERE will correspond to an INNER JOIN and not LEFT JOIN. To overcome this, consider adding a UNION (not UNION ALL) query exactly the same as above but without the EXISTS subquery. This will then return records that did not meet LIKE criteria and UNION will leave out duplicates. See LEFT JOIN Equivalent here. Be sure to add a NULL to SELECT wherever T2_EAN column was referenced:
SELECT ... T2_HEAD.LASTDELV, T2_EAN.EAN_CODE, T2_SIZES.S01 AS S1 ...
UNION
SELECT ... T2_HEAD.LASTDELV, NULL AS EAN_CODE, T2_SIZES.S01 AS S1 ...
Do note: performance is not guaranteed with this adjustment. Further considerations include:
Once query compiles and runs, be sure to save this large query or view as a stored object in the MS Access database and not as a scripted C# string query. Even if you do not have MS Access GUI .exe, you can save queries via code using MS Access' querydefs object with VBA (i.e., Excel VBA) or COM-interface with C# or any other language that supports COM like open-source Python, PHP, R.
Then have C# app simply retrieve the view for its purposes: SELECT * FROM mySavedQuery. Stored queries tend to be more efficient especially for many joins and complex queries than coded queries since the Access engine saves best execution plan and caches stats.
Remove the need of LIKE by saving matching values without extraneous other characters so = can be used as I believe MS Access's LIKE will not use indexes in query plans.
Upsize your Access database to SQL Server for more sophisticated handling with the T-SQL dialect. SQL Server has easy facilities in SSMS to import Access .mdb/.accdb files.
You LEFT OUTER JOIN's ON clause makes no sense:
LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE (SELECT ('#' + ([T2_BRA].[REF] + [F7]) + '#')
You need to join T2_EAN to your values in ALREADY PRESENT T2_BRA table in your FROM clause. By sticking T2_BRA into a subquery here you are bringing the table in twice, which is nonsense. It's also not allowed to use a subquery inside a LIKE condition.
If it were allowed and did make sense, you would end up with a cartesian product between all the intermediate result set from those inner joins and your left outer join'd table, which is almost definitely not what you are after.
Instead (probably something like):
LEFT JOIN T2_EAN ON T2_EAN.T2T_CODE LIKE '#' + [T2_BRA].[REF] + [F7] + '#'
This is now saying "Left outer join t2_ean to T2_Bra where the T2T_Code matches the concatenation of <any one digit> + T2_Bra.Ref + F7 + <any one digit>" Without knowing your data, I cant' vouch for that being the thing you want, but it feels like the closest interpretation when reverse engineering your incorrect query.
You mention in a comment "I have tried using all the wildcard symbols *, # and ?" Don't just try wildcard symbols hoping something will work. They each do something VERY different. Use the one that you need for you situation. Decent explanation of the three wildcards that work with the LIKE operator in access here. You may want to switch to the asterisk while debugging (since it's the most wide open of the wild cards) and then once you are getting reasonable results, use the much tighter # (match only one digit) operator.

sql query help multi joins?

here are my tables, im using sql developer oracle
Carowner(Carowner id, carowner-name,)
Car (Carid, car-name, carowner-id*)
Driver(driver_licenceno, driver-name)
Race(Race no, race-name, prize-money, race-date)
RaceEntry(Race no*, Car id*, Driver_licenceno*, finishing_position)
im trying to list to do the query below
which drivers have come second in races from the start of this year.
lncluding race name, driver name, and the name of the car in the output
i have attempted
select r.racename, d.driver-name, c.carowner-name
from race r, driver d, car c, raceentry re
where re.finishing_position = 2 and r.race-date is ...
Something like:
select r.racename, d.driver-name, c.carowner-name
from race r
join raceentry re on r.race_no = re.race_no
join car c on re.car_Id = c.car_id
join driver d on re.driverliscenceNo = d.driverliscenceNo
where re.finishing_position = 2 and r.race-date >='20130101'
This assumes only one car and one driver with a finsih place of 2nd in a particular race. You may need more conditions otherwise. If this is your own table design, you need to start right now learning to be consistent in your nameing between tables. It is important. Fields that are in multiple tables should have the same name and data type. Also you need to stop using implicit syntax. This ia aSQL antipattern and a very poor programming technique. It leads to mistakes such as accidental cross joins and is harder to read and maintain when things get more complex. As you are clearly learning, you need to stop this bad habit right now.
First off, multiple joins in the where clause are hard to get used to when you define more than 3 or 4 tables IMHO.
Do this instead:
Select
a.columnfroma
, b.columnfromb
, c.columnfromc
from tablea a
join tableb b on a.columnAandBShare = b.columnAandBShare
join tablec c on b.columnBandCShare = c.columnBandCShare
This while no one would say is a method you have to use, it is a much more readable method of performing joins.
Otherwise you are doing the joins in the where clause and if you have other predicates with your joins you are going to have to comment out which is which if you ever need to go back and look at it.

Apache Pig load entire relationship into UDF

I have a pig script that pertains to 2 Pig relations, lets say A and B. A is a small relationship, and B is a big one. My UDF should load all of A into memory on each machine and then use it while processing B. Currently I do it like this.
A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);
I then have every machine load from 'templocation' to get A.This works, but I have two problems with it.
My understanding is I should be using the HDFS cache somehow, but I'm not sure how to load a relationship directly into the HDFS cache.
When I reload the file in my UDF I got to write logic to parse the output from A that was outputted to file when I'd rather be directly using bags and tuples (is there a built in Pig java function to parse Strings back into Bag/Tuple form?).
Does anyone know how it should be done?
Here's a trick that will work for you.
You do a GROUP ALL on A first which "bags" all data in A into one field. Then artificially add a common field on both A and B and join them. This way, foreach tuple in the enhanced B, you will have the full data of A for your UDF to use.
It's like this:
(say originally in A, you have fields fa1, fa2, fa3, in B you have fb1, fb2)
-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
A_all = GROUP A ALL;
A_aux = FOREACH A GENERATE 'xx' AS join_key, $1;
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, A_all);
since this is replicated join, it's also only map-side join.