How can I write raw binary data to duckdb from R?

My best guess is that this simply isn't currently supported by the {duckdb} package, but I'm not sure whether I'm doing something wrong or just not doing it in the intended way. Here's a reprex which reproduces the (fairly self-explanatory) issue:
con <- DBI::dbConnect(duckdb::duckdb())
# Note: this connection would work fine
# con <- DBI::dbConnect(RSQLite::SQLite())
DBI::dbCreateTable(
  conn = con,
  name = "raw_test",
  fields = list(file = "blob")
)
DBI::dbAppendTable(
  conn = con,
  name = "raw_test",
  value = data.frame(file = I(list(as.raw(1:3)))),
  field.types = list(file = "blob")
)
#> Error: rapi_execute: Unsupported column type for scan
#> Error: rapi_register_df: Failed to register data frame: std::exception
NB (1): I'm trying to find a way to write arbitrary R objects to SQL. To do this, I plan to serialise the objects in question to binary format, write to SQL, read back and unserialise. I also want to find a method that works reliably with as many SQL backends as possible, as I'm planning to create a package which allows the user to specify the connection.
NB (2): I've posted this as an issue on the duckdb GitHub, as I have a feeling this is simply a bug / not yet a supported feature.
Edit #1
I'm now more convinced that this is simply a bug with {duckdb}. From the documentation for DBI::dbDataType():
If the backend needs to override this generic, it must accept all basic R data types as its second argument, namely logical, integer, numeric, character, dates (see Dates), date-time (see DateTimeClasses), and difftime. If the database supports blobs, this method also must accept lists of raw vectors, and blob::blob objects.
duckdb certainly supports blob types, so as far as I can see, these objects should be writable. Note that this code produces the same issue outlined above (using blob::blob() instead of I(list())):
DBI::dbAppendTable(
  conn = con,
  name = "raw_test",
  value = data.frame(file = blob::blob(as.raw(1:3))),
  field.types = list(file = "blob")
)
#> Error: rapi_execute: Unsupported column type for scan
#> Error: rapi_register_df: Failed to register data frame: std::exception
I'm leaving this open for now in case any kindly duckdb dev can confirm this is a bug/missing feature, or if anyone can suggest a workaround.

Yup, it’s just a missing feature according to this issue
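For what it's worth, the round trip described in NB (1) -- serialise to bytes, store in a BLOB column, read back, deserialise -- looks roughly like the following Java/JDBC sketch. This is purely to illustrate the pattern, not an R workaround; it assumes the raw_test table from the reprex and a JDBC driver that accepts byte[] parameters for BLOB columns.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlobRoundTrip {
    // Serialise an object, store the bytes in raw_test.file, read them back, deserialise.
    static Object roundTrip(Connection con, Serializable obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        try (PreparedStatement ins = con.prepareStatement("INSERT INTO raw_test (file) VALUES (?)")) {
            ins.setBytes(1, bos.toByteArray());
            ins.executeUpdate();
        }
        try (PreparedStatement sel = con.prepareStatement("SELECT file FROM raw_test");
             ResultSet rs = sel.executeQuery()) {
            rs.next();
            byte[] bytes = rs.getBytes(1);
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return ois.readObject();
            }
        }
    }
}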


Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediary & end results were previously stored in an SQL database using sqlalchemy, but we need to transform it to Delta.
After lots of investigation, I've made the first part (the binary file parsing) work, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
Where the _parseBytes() method takes a binary stream and outputs a dictionary of variables
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple pandas DataFrames as fields.
I'm getting errors as soon as the executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort building the Delta adaptation. Is there maybe an easy way to make these work?
I read in 1 that we could switch to the Pandas on Spark API, but to me that seems to be something to do within the package method itself. Is that maybe the solution, to rewrite the entire package & parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in the serialization when trying to output (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works.
# imports assumed by the snippet
import pandas as pd
from pyspark.sql import functions as F

def getSpectrumDict(inputDict):
    # Convert any pandas objects on the Spectrum instance into plain Python
    # structures so the UDF's return value matches the declared schema.
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    dict = {}
    for key, value in vars(spectrum).items():
        if type(value) == pd.Series:
            dict[key] = value.tolist()
        elif type(value) == pd.DataFrame:
            dict[key] = value.to_dict("list")
        else:
            dict[key] = value
    return dict

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())

Create new column from existing column in Dataset - Apache Spark Java

I am new to Spark ML and got stuck on a task which requires some data normalization, and there is very little documentation available online for Spark ML - Java. Any help is much appreciated.
Problem Description :
I have a Dataset that contains an encoded URL in a column (ENCODED_URL), and I want to create a new column (DECODED_URL) in the existing Dataset that contains the decoded version of ENCODED_URL.
For example:
Current Dataset
ENCODED_URL
https%3A%2F%2Fmywebsite
New Dataset
ENCODED_URL | DECODED_URL
https%3A%2F%2Fmywebsite | https://mywebsite
Tried using withColumn but had no clue what I should pass as the 2nd argument
Dataset<Row> newDs = ds.withColumn("new_col",?);
After reading the Spark documentation, I got the idea that it may be possible with SQLTransformer, but I couldn't figure out how to customize it to decode the URL.
This is how I read the information from the CSV:
Dataset<Row> urlDataset = s_spark.read().option("header", true).csv(CSV_FILE).persist(StorageLevel.MEMORY_ONLY());
A Spark primer
The first thing to know is that Spark Datasets are effectively immutable. Whenever you do a transformation, a new Dataset is created and returned. Another thing to keep in mind is the difference between actions and transformations -- actions cause Spark to actually start crunching numbers and compute your DataFrame, while transformations add to the definition of a DataFrame but are not computed unless an action is called. An example of an action is DataFrame#count, while an example of a transformation is DataFrame#withColumn. See the full list of actions and transformations in the Spark Scala documentation.
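As a rough illustration of that laziness, here is a small sketch (reusing the urlDataset variable from the question; the url_length column is made up for this example):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Transformation: nothing runs yet, Spark only records the definition of the new column.
Dataset<Row> withLength = urlDataset.withColumn("url_length", functions.length(functions.col("ENCODED_URL")));

// Action: this is what actually triggers the computation.
long rows = withLength.count();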
A solution
withColumn allows you to either create a new column or replace an existing column in a Dataset (if the first argument is an existing column's name). The docs for withColumn will tell you that the second argument is supposed to be a Column object. Unfortunately, the Column documentation only describes methods available to Column objects but does not link to other ways to create Column objects, so it's not your fault that you're at a loss for what to do next.
The thing you're looking for is org.apache.spark.sql.functions#regexp_replace. Putting it all together, your code should look something like this:
...
import org.apache.spark.sql.functions;

Dataset<Row> ds = ... // reading from your csv file
ds = ds.withColumn(
    "decoded_url",
    functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://"));
regexp_replace requires that we pass a Column object as the first value, but nothing requires that it even exist on any Dataset, because Column objects are basically instructions for how to compute a column; they don't actually contain any real data themselves. To illustrate this principle, we could write the above snippet as:
...
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

Dataset<Row> ds = ... // reading from your csv file
Column myColExpression = functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://");
ds = ds.withColumn("decoded_url", myColExpression);
If you wanted, you could reuse myColExpression on other datasets that have an encoded_url column.
Suggestion
If you haven't already, you should familiarize yourself with the org.apache.spark.sql.functions class. It's a util class that's effectively the Spark standard lib for transformations.
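For example, here is a small sketch composing a couple of those functions into a single Column expression (the raw_name and display_name column names are made up for illustration):
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Column expressions built from the functions class compose like ordinary values.
Column cleaned = functions.upper(functions.trim(functions.col("raw_name")));
Column labelled = functions.concat(functions.lit("user: "), cleaned);
ds = ds.withColumn("display_name", labelled);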

Lossless assignment between Field-Symbols

I'm currently trying to perform a dynamic lossless assignment in an ABAP 7.0v SP26 environment.
Background:
I want to read in a csv file and move it into an internal structure without any data loss. Therefore, I declared the field-symbols:
<lfs_field> TYPE any which represents a structure component
<lfs_element> TYPE string which holds a csv value
Approach:
My current "solution" is this (lo_field is an element description of <lfs_field>):
IF STRLEN( <lfs_element> ) > lo_field->output_length.
RAISE EXCEPTION TYPE cx_sy_conversion_data_loss.
ENDIF.
I don't know how precisely it works, but seems to catch the most obvious cases.
Attempts:
MOVE EXACT <lfs_field> TO <lfs_element>.
...gives me...
Unable to interpret "EXACT". Possible causes: Incorrect spelling or comma error
...while...
COMPUTE EXACT <lfs_field> = <lfs_element>.
...results in...
Incorrect statement: "=" missing .
As the ABAP version is too old I also cannot use EXACT #( ... )
Example:
In this case I'm using normal variables. Let's just pretend they are field-symbols:
DATA: lw_element TYPE string VALUE '10121212212.1256',
      lw_field   TYPE p DECIMALS 2.

lw_field = lw_element.
* lw_field now contains 10121212212.13 without any notice about the precision loss
So, how would I do a perfect valid lossless assignment with field-symbols?
Don't see an easy way around that. Guess that's why they introduced MOVE EXACT in the first place.
Note that output_length is not a clean solution. For example, string always has output_length 0, but will of course be able to hold a CHAR3 with output_length 3.
Three ideas for how you could go about your question:
Parse and compare types. Parse the source field to detect format and length, e.g. "character-like", "60 places". Then get an element descriptor for the target field and check whether the source fits into the target. I don't think it makes sense to start collecting the possibly large CASEs for this here. If you have access to a newer ABAP, you could try generating a large test data set there and use it to reverse-engineer the compatibility rules from MOVE EXACT.
Back-and-forth conversion. Move the value from source to target and back and see whether it changes. If it changes, the fields aren't compatible. This is imprecise, as some formats will change although the values remain the same; for example, -42 could change to 42-, although this is the same in ABAP.
To-longer conversion. Move the field from source to target. Then construct a slightly longer version of the target, and move the source there as well. If the two targets are identical, the fields are compatible. This fails at the boundaries, i.e. if it's not possible to construct a slightly-longer version, e.g. because the maximum number of decimal places of a P field is reached. The third idea could look like this:
DATA target TYPE char3.
DATA source TYPE string VALUE `123.5`.

DATA(lo_target) = CAST cl_abap_elemdescr( cl_abap_elemdescr=>describe_by_data( target ) ).
DATA(lo_longer) = cl_abap_elemdescr=>get_by_kind(
                    p_type_kind = lo_target->type_kind
                    p_length    = lo_target->length + 1
                    p_decimals  = lo_target->decimals + 1 ).

DATA lv_longer TYPE REF TO data.
CREATE DATA lv_longer TYPE HANDLE lo_longer.
ASSIGN lv_longer->* TO FIELD-SYMBOL(<longer>).

<longer> = source.
target = source.

IF <longer> = target.
  WRITE `Fits`.
ELSE.
  WRITE `Doesn't fit, ` && target && ` is different from ` && <longer>.
ENDIF.

Extending dplyr and use of internal functions

I'm working on a fork of the RSQLServer package and am trying to implement joins. With the current version of the package, joins for any DBI-connected database are implemented using sql_join.DBIConnection. However, that implementation doesn't work well for SQL server. For instance, it makes use of USING which is not supported by SQL server.
I've got a version of this function sql_join.SQLServerConnection working (though not complete yet). I've based my function on sql_join.DBIConnection as much as possible. One issue I've had is that sql_join.DBIConnection calls a number of non-exported functions within dplyr such as common_by. For now, I've worked around this by using dplyr:::common_by, but I'm aware that that's not ideal practice.
Should I:
Ask Hadley Wickham/Romain Francois to export the relevant functions to make life easier for people developing packages that build on dplyr?
Copy the internal functions into the package I'm working on?
Continue to use the ::: operator to call the functions?
Something else?
Clearly with option 3, there's a chance that the interface will change (since they're not exported functions) and that the package would break in the longer term.
Sample code:
sql_join.SQLServerConnection <- function(con, x, y, type = "inner", by = NULL, ...) {
  join <- switch(type,
                 left = sql("LEFT"), inner = sql("INNER"),
                 right = sql("RIGHT"), full = sql("FULL"),
                 stop("Unknown join type:", type, call. = FALSE))
  by <- dplyr:::common_by(by, x, y)
  using <- FALSE # all(by$x == by$y)
  x_names <- dplyr:::auto_names(x$select)
  y_names <- dplyr:::auto_names(y$select)
  # more code
}
It looks to me like you may not have to use those functions. Since dplyr now puts its database functionality in dbplyr, the relevant code is here. I don't see the use of auto_names or common_by there.
I strongly recommend following the steps in Creating New Backends after reading SQL Translation.
It may also be worth reviewing some other alternative backends, such as hrbrmstr's sergeant package for Apache Drill using JDBC.

Compare 2 datasets with dbunit?

Currently I need to create tests for my application. I used "dbunit" to achieve that and now need to compare 2 datasets:
1) The records from the database I get with QueryDataSet
2) The expected results are written in the appropriate FlatXML in a file which I read in as a dataset as well
Basically, the 2 datasets can be compared this way.
Now the problem is columns with a Timestamp. They will never match the expected dataset. I would really like to ignore them in the comparison, but it doesn't work the way I want it to.
It does work when I compare each table on its own, adding a column filter and ignoreColumns. However, this approach is very cumbersome, as many tables are used in that comparison, and it forces one to add so much code that it eventually gets bloated.
The same applies to fields which have null values.
A possible solution would also be if I had the chance to compare only the very first column of all tables - not by naming it with its column name, but only by its column index. But there's nothing I can find.
Maybe I am missing something, or maybe it just doesn't work any other way than comparing each table on its own?
For the sake of completeness, some additional information must be posted. Actually, my previously posted solution will not work at all, as the process of reading data from the database got me trapped.
The process using QueryDataSet did read the data from the database and save it as a dataset, but the data couldn't be accessed from this dataset anymore (although I could see the data in debug mode)!
Instead the whole operation failed with an UnsupportedOperationException at org.dbunit.database.ForwardOnlyResultSetTable.getRowCount(ForwardOnlyResultSetTable.java:73)
Example code to produce failure:
QueryDataSet qds = new QueryDataSet(connection);
qds.addTable("specificTable");
qds.getTable("specificTable").getRowCount();
Even if you try it this way it fails:
IDataSet tmpDataset = connection.createDataSet(tablenames);
tmpDataset.getTable("specificTable").getRowCount();
In order to make extraction work you need to add this line (the second one):
IDataSet tmpDataset = connection.createDataSet(tablenames);
IDataSet actualDataset = new CachedDataSet(tmpDataset);
Great that this was nowhere documented...
But that is not all: now you'd certainly think that one could add this line after doing a "QueryDataSet" as well... but no! This still doesn't work! It will still throw the same Exception! It doesn't make any sense to me and I wasted so much time with it...
It should be noted that extracting data from a dataset which was read in from an xml file does work without any problem. This annoyance just happens when trying to get a dataset directly from the database.
If you have done the above you can then continue as below which compares only the columns you got in the expected xml file:
// put in here some code to read in the dataset from the xml file...
// (one possible sketch for that step follows after this block)
// and name it "expectedDataset"
// then get the tablenames from it...
String[] tablenames = expectedDataset.getTableNames();

// read dataset from database table using the same tables as from the xml
IDataSet tmpDataset = connection.createDataSet(tablenames);
IDataSet actualDataset = new CachedDataSet(tmpDataset);

for (int i = 0; i < tablenames.length; i++) {
    ITable expectedTable = expectedDataset.getTable(tablenames[i]);
    ITable actualTable = actualDataset.getTable(tablenames[i]);
    ITable filteredActualTable = DefaultColumnFilter.includedColumnsTable(
        actualTable, expectedTable.getTableMetaData().getColumns());
    Assertion.assertEquals(expectedTable, filteredActualTable);
}
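For the xml-reading step that the comment above leaves open, here is a minimal sketch (not from the original answer; the file name is made up, and it assumes the surrounding test method declares throws Exception):
import java.io.File;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;

// Build the expected dataset from a FlatXML file on disk.
IDataSet expectedDataset = new FlatXmlDataSetBuilder().build(new File("expected-dataset.xml"));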
For the comparison itself, you can also use this format, which ignores specific columns by name:
// Assert actual database table match expected table
String[] columnsToIgnore = {"CONTACT_TITLE","POSTAL_CODE"};
Assertion.assertEqualsIgnoreCols(expectedTable, actualTable, columnsToIgnore);
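Regarding the idea from the question of comparing only the very first column selected by index rather than by name: this is not covered by the original answer, but one possible sketch (reusing expectedTable and actualTable from the loop above) is to keep only the expected table's first column in both tables and compare just that:
import org.dbunit.Assertion;
import org.dbunit.dataset.Column;
import org.dbunit.dataset.ITable;
import org.dbunit.dataset.filter.DefaultColumnFilter;

// Pick the first column of the expected table by index, then restrict both tables to it.
Column[] firstColumnOnly = { expectedTable.getTableMetaData().getColumns()[0] };
ITable expectedFirst = DefaultColumnFilter.includedColumnsTable(expectedTable, firstColumnOnly);
ITable actualFirst = DefaultColumnFilter.includedColumnsTable(actualTable, firstColumnOnly);
Assertion.assertEquals(expectedFirst, actualFirst);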