Extending dplyr and use of internal functions - sql

I'm working on a fork of the RSQLServer package and am trying to implement joins. With the current version of the package, joins for any DBI-connected database are implemented using sql_join.DBIConnection. However, that implementation doesn't work well for SQL Server. For instance, it makes use of USING, which is not supported by SQL Server.
I've got a version of this function, sql_join.SQLServerConnection, working (though it's not complete yet). I've based my function on sql_join.DBIConnection as much as possible. One issue I've had is that sql_join.DBIConnection calls a number of non-exported functions within dplyr, such as common_by. For now, I've worked around this by using dplyr:::common_by, but I'm aware that's not ideal practice.
Should I:
1. Ask Hadley Wickham/Romain Francois to export the relevant functions to make life easier for people developing packages that build on dplyr?
2. Copy the internal functions into the package I'm working on?
3. Continue to use the ::: operator to call the functions?
4. Something else?
Clearly with option 3, there's a chance that the interface will change (since they're not exported functions) and that the package would break in the longer term.
Sample code:
sql_join.SQLServerConnection <- function(con, x, y, type = "inner", by = NULL, ...) {
  join <- switch(type,
    left  = sql("LEFT"),
    inner = sql("INNER"),
    right = sql("RIGHT"),
    full  = sql("FULL"),
    stop("Unknown join type: ", type, call. = FALSE)
  )
  by <- dplyr:::common_by(by, x, y)
  using <- FALSE # all(by$x == by$y)
  x_names <- dplyr:::auto_names(x$select)
  y_names <- dplyr:::auto_names(y$select)
  # more code
}

It looks to me like you may not have to use those internal functions at all. Since dplyr now puts its database functionality in dbplyr, the relevant code is here, and I don't see any use of auto_names or common_by in it.
I strongly recommend following the steps in Creating New Backends after reading SQL Translation.
It may also be worth reviewing some alternative backends, such as hrbrmstr's sergeant package for Apache Drill using JDBC.
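As a rough illustration of getting by without dplyr's internals: SQL Server wants an ON clause rather than USING, and that clause can be built directly from the matched column names. This is only a minimal sketch; the function name, the LHS/RHS aliases and the column names below are all hypothetical, not part of RSQLServer or dbplyr.
build_on_clause <- function(by_x, by_y, x_alias = "LHS", y_alias = "RHS") {
  # One equality condition per matched column pair, quoted SQL Server style.
  conds <- sprintf("[%s].[%s] = [%s].[%s]", x_alias, by_x, y_alias, by_y)
  paste(conds, collapse = " AND ")
}

build_on_clause(c("id", "year"), c("id", "yr"))
#> [1] "[LHS].[id] = [RHS].[id] AND [LHS].[year] = [RHS].[yr]"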

Related

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediate and end results were previously stored in an SQL database using sqlalchemy, but we need to move them to Delta.
After lots of investigation, I've made the first part, the binary file parsing, work, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
Now I'm trying to do the same for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() __init__ method generates multiple pandas DataFrames as fields, and I'm getting errors as soon as the executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make this work?
I read in [1] that we could switch to the pandas-on-Spark API, but to me that seems to be something to do within the package methods themselves. Is the solution, then, to rewrite the entire package and its parsers to work natively in PySpark?
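For reference, the pandas-on-Spark API mentioned above looks roughly like this (the column names and values are purely illustrative, not from the package, and it assumes Spark 3.2+):
# Illustrative only: a pandas-like DataFrame backed by Spark.
import pyspark.pandas as ps

psdf = ps.DataFrame({"wavelength": [1.0, 2.0], "intensity": [0.1, 0.4]})
sdf = psdf.to_spark()  # hand back a regular Spark DataFrame when needed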
I also tried to reproduce the issue in a minimal example, but that's difficult because the package code is so complex.
After further testing, it turns out that the problem lies in serialization when producing output (with the show(), display() or save() methods).
The UDF expects an ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform those pandas objects, it works:
def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}
    for key, value in vars(spectrum).items():
        # Convert pandas objects to plain Python structures that Spark can serialize.
        if type(value) == pd.Series:
            result[key] = value.tolist()
        elif type(value) == pd.DataFrame:
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
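A hypothetical usage sketch (the raw_stream DataFrame and the payload column are assumed names, not from the original package):
# Assumed names: apply the UDF to a binary column and expand the struct result.
parsed = (
    raw_stream
    .withColumn("spectrum", spectrumparser(F.col("payload")))
    .select("spectrum.*")
)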

How can I write raw binary data to duckdb from R?

My best guess is that this simply isn't currently supported by the {duckdb} package; however, I'm not sure whether I'm doing something wrong or just not doing it in the intended way. Here's a reprex which reproduces the (fairly self-explanatory) issue:
con <- DBI::dbConnect(duckdb::duckdb())
# Note: this connection would work fine
# con <- DBI::dbConnect(RSQLite::SQLite())

DBI::dbCreateTable(
  conn = con,
  name = "raw_test",
  fields = list(file = "blob")
)

DBI::dbAppendTable(
  conn = con,
  name = "raw_test",
  value = data.frame(file = I(list(as.raw(1:3)))),
  field.types = list(file = "blob")
)
#> Error: rapi_execute: Unsupported column type for scan
#> Error: rapi_register_df: Failed to register data frame: std::exception
NB (1), I'm trying to find a way to write arbitrary R objects to SQL. To do this, I plan to serialise the objects in question to binary format, write to SQL, read back and unserialise. I also want to find a method that works reliably with as many SQL backends as possible, as I'm planning to create a package which allows the user to specify the connection.
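To illustrate the plan in NB (1), here is a minimal round-trip sketch using RSQLite (which, as noted in the reprex, handles blobs fine); the table name, column names and the fitted model are just for illustration:
# Serialise an arbitrary R object to a blob, write it, read it back, unserialise.
con2 <- DBI::dbConnect(RSQLite::SQLite())
obj <- lm(mpg ~ wt, data = mtcars)
DBI::dbWriteTable(
  con2, "objects",
  data.frame(id = "model1", payload = blob::blob(serialize(obj, NULL)))
)
restored <- unserialize(DBI::dbReadTable(con2, "objects")$payload[[1]])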
NB (2), I've posted this as an issue on the duckdb github as I have a feeling this is simply a bug/not yet a supported feature.
Edit #1
I'm now more convinced that this is simply a bug with {duckdb}. From the documentation for DBI::dbDataType():
If the backend needs to override this generic, it must accept all basic R data types as its second argument, namely logical, integer, numeric, character, dates (see Dates), date-time (see DateTimeClasses), and difftime. If the database supports blobs, this method also must accept lists of raw vectors, and blob::blob objects.
duckdb certainly supports blob types, so as far as I can see, these objects should be writeable. Note, this code produces the same issue outlined above (using blob::blob() instead of I(list())):
DBI::dbAppendTable(
  conn = con,
  name = "raw_test",
  value = data.frame(file = blob::blob(as.raw(1:3))),
  field.types = list(file = "blob")
)
#> Error: rapi_execute: Unsupported column type for scan
#> Error: rapi_register_df: Failed to register data frame: std::exception
I'm leaving this open for now in case any kindly duckdb dev can confirm this is a bug/missing feature, or if anyone can suggest a workaround.
Yup, it's just a missing feature, according to this issue.

Unable to use pickAFile in TigerJython

In JES, I am able to use:
file=pickAFile()
In TigerJython, however, I get the following error:
NameError: name 'pickAFile' is not defined
What am I doing wrong here?
You are not doing anything wrong at all. The thing is that pickAFile() is not a standard function in Python. It is rather a function that JES has added for convenience, but which you probably will not find in any other environment.
Since TigerJython and JES are both based on Jython, you can easily write a pickAFile() function on your own that uses Java's Swing. Here is a possible simple implementation (the pickAFile() found in JES might be a bit more complex, but this should get you started):
def pickAFile():
    from javax.swing import JFileChooser
    fc = JFileChooser()
    retVal = fc.showOpenDialog(None)
    if retVal == JFileChooser.APPROVE_OPTION:
        return fc.getSelectedFile()
    else:
        return None
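A quick hypothetical usage check (getSelectedFile() returns a java.io.File, so getAbsolutePath() gives the path string):
# Hypothetical usage: returns None if the dialog was cancelled.
chosen = pickAFile()
if chosen is not None:
    print(chosen.getAbsolutePath())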
Given that it is certainly a useful function, we might have to consider including it in our next update of TigerJython.
P.S. I would like to apologise for answering so late; I have only recently joined SO and was not aware of your question (I am one of the original authors of TigerJython).

How to quote values for LuaSQL?

LuaSQL, which seems to be the canonical library for most SQL database systems in Lua, doesn't seem to have any facilities for quoting/escaping values in queries. I'm writing an application that uses SQLite as a backend, and I'd love to use an interface like the one specified by Python's DB-API:
c.execute('select * from stocks where symbol=?', t)
but I'd settle for something even dumber, like:
conn:execute("select * from stocks where symbol=" .. luasql.sqlite.quote(t))
Are there any other Lua libraries that support quoting for SQLite? (LuaSQLite3 doesn't seem to.) Or am I missing something about LuaSQL? I'm worried about rolling my own solution (with regexes or something) and getting it wrong. Should I just write a wrapper for sqlite3_snprintf?
I haven't looked at LuaSQL in a while, but the last time I checked it didn't support this. I use Lua-Sqlite3.
require("sqlite3")

db = sqlite3.open_memory()

db:exec[[ CREATE TABLE tbl( first_name TEXT, last_name TEXT ); ]]

stmt = db:prepare[[ INSERT INTO tbl(first_name, last_name) VALUES(:first_name, :last_name) ]]
stmt:bind({first_name="hawkeye", last_name="pierce"}):exec()
stmt:bind({first_name="henry", last_name="blake"}):exec()

for r in db:rows("SELECT * FROM tbl") do
  print(r.first_name, r.last_name)
end
LuaSQLite3, as well as any other low-level binding to SQLite, offers prepared statements with variable parameters; these use methods to bind values to the statement parameters. Since SQLite does not interpret the bound values, there is simply no possibility of SQL injection. This is by far the safest (and best-performing) approach.
uroc shows an example of using the bind methods with prepared statements.
By the way, in LuaSQL there is an undocumented escape function for the sqlite3 driver, con:escape, where con is a connection variable.
For example with the code
print ("con:escape works. test'test = "..con:escape("test'test"))
the result is:
con:escape works. test'test = test''test
I actually tried that to see what it would do. Apparently there is such a function for their postgres driver too; I found this by looking at their tests.
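If you do lean on that escape function, a usage sketch with the stocks example from the question might look like this (the literal value is just for illustration):
-- A sketch only: escape the value, then embed it in single quotes.
local symbol = "o'reilly"
local cur = con:execute("SELECT * FROM stocks WHERE symbol = '" .. con:escape(symbol) .. "'")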
Hope this helps.

Using statement with more than one system resource

I have used the using statement in both C# and VB. I agree with all the critics regarding nesting using statements (C# seems well done, VB not so much).
So with that in mind I was interested in improving my VB using statements by "using" more than one system resource within the same block:
Example:
Using objBitmap As New Bitmap(100, 100)
    Using objGraphics As Graphics = Graphics.FromImage(objBitmap)
    End Using
End Using
Could be written like this:
Using objBitmap As New Bitmap(100, 100), objGraphics As Graphics = Graphics.FromImage(objBitmap)
End Using
So my question is: which is the better method?
My gut tells me that if the resources are related/dependent then using more than one resource in a using statement is logical.
My primary language is C#, and there most people prefer "stacked" usings when you have many of them in the same scope:
using (X)
using (Y)
using (Z)
{
    // ...
}
The problem with the single using statement that VB.NET has is that it seems cluttered and would be likely to fall off the edge of the screen. So at least from that perspective, multiple usings look better in VB.NET.
Maybe if you combine the second syntax with line continuations, it would look better:
Using objBitmap As New Bitmap(100, 100), _
      objGraphics As Graphics = Graphics.FromImage(objBitmap)
    ' ...
End Using
That gets you closer to what I would consider better readability.
They are both the same; you should choose the one that you find most readable, as they are identical at the IL level.
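As a rough sketch of why they are identical (this is a simplification, not the literal compiler output), either form expands to nested Try/Finally blocks that dispose each resource:
' Roughly what the compiler generates for either form:
' each resource gets its own Try/Finally with a Dispose call.
Dim objBitmap As New Bitmap(100, 100)
Try
    Dim objGraphics As Graphics = Graphics.FromImage(objBitmap)
    Try
        ' ...
    Finally
        If objGraphics IsNot Nothing Then objGraphics.Dispose()
    End Try
Finally
    If objBitmap IsNot Nothing Then objBitmap.Dispose()
End Try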
I personally like the way that C# handles this by allowing this syntax:
using (Foo foo = new Foo())
using (Bar bar = new Bar())
{
    // ....
}
However, I find the VB.NET equivalent of this form (your second example) less readable than the nested Using statements from your first example. But this is just my opinion. Choose the style that best suits the readability of the code in question; that is what matters most, given that the output is identical.