Dask DataFrame to_parquet return bytes instead of writing to file - pandas

Is it possible to write a dask/pandas DataFrame to parquet and then return a bytes string? I know that this is not possible with the to_parquet() function, which accepts a file path. Maybe you know some other way to do it. If there is no way to do something like this, does it make sense to add such functionality? Ideally, it should look like this:
parquet_bytes = df.to_parquet() # bytes string is returned
Thanks!

There has been work undertaken to allow such a thing, but it's not currently a one-line thing like you suggest.
Firstly, if you have data which can fit in memory, you can use fastparquet's write() method and supply an open_with= argument. This must be a function that creates a file-like object in binary-write mode; in your case, one that returns a BytesIO() would do.
To make this work directly with dask, you could make use of the MemoryFileSystem from the filesystem_spec project. You would need to add the class to Dask and write as follows:
dask.bytes.core._filesystems['memory'] = fsspec.implementations.memory.MemoryFileSystem
df.to_parquet('memory://name.parquet')
When done, MemoryFileSystem.store, which is a class attribute, will contain keys that are like filenames, and values which are BytesIO objects containing data.
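The idea behind a class-level store can be illustrated with a minimal standard-library sketch. This is not fsspec's implementation: MemoryStore and _StoredBytesIO are hypothetical names, and the only assumption is that the writer calls an opener of the form f(path, mode) and closes the file when it is done. The subclass is needed because a plain BytesIO discards its contents on close().

```python
import io


class _StoredBytesIO(io.BytesIO):
    """BytesIO that copies its contents into a shared store when closed."""

    def __init__(self, store, path):
        super().__init__()
        self._store = store
        self._path = path

    def close(self):
        # A plain BytesIO drops its buffer on close(), so save the bytes first.
        if not self.closed:
            self._store[self._path] = self.getvalue()
        super().close()


class MemoryStore:
    """Hypothetical stand-in for an in-memory filesystem."""

    # Class attribute: filename-like keys mapped to the bytes written.
    store = {}

    @classmethod
    def open(cls, path, mode="wb"):
        if mode == "wb":
            return _StoredBytesIO(cls.store, path)
        if mode == "rb":
            return io.BytesIO(cls.store[path])
        raise ValueError(f"unsupported mode: {mode!r}")


# Any writer that accepts an opener f(path, mode) can be pointed at the store.
with MemoryStore.open("name.parquet", "wb") as f:
    f.write(b"PAR1 ...payload... PAR1")

parquet_bytes = MemoryStore.store["name.parquet"]
```

Here the payload is a placeholder byte string; a real parquet writer would produce the bytes, and the lookup by name at the end would stay the same.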

Related

Accord.net Codification can't handle non-strings

I am trying to use the Accord.net library to build test methods for several of the machine learning algorithms that the library supports.
One of the issues I have run into is that when I am trying to codify my string data, the Codification class does not seem capable of dealing with any datatable columns that are not strings, despite the documentation saying otherwise.
Codification codebook = new Codification(fulldata, AllAttributeNames);
I call that line where fulldata is a datatable, and I have tried including columns of both Int32 type and Double type, and the Codification class has thrown an error saying it is unable to convert them to type String.
"System.InvalidCastException: 'Unable to cast object of type 'System.Double' to type 'System.String'.'"
EDIT: It turns out this error is because the Codification system can only handle alternate data types if it is encoding the entire table. I suppose I can see the logic here, although I would prefer a better error, or that the method was a little smarter.
I now have another issue that has cropped up related to this. After changing my code to this:
Codification codebook = new Codification(fulldata);
I then train my algorithm with learning.Learn(inputs, outputs) and want to use the newly trained algorithm. The next step is to take a bunch of test data, make sure it matches the codebook's encoding, and send it through the algorithm. Unfortunately, the following call fails:
int[][] testinput = codebook.Transform(testData, inputColumnNameArray);
It blows up claiming it could not find a mapping to transform. It does this in reference to an Integer column that the codebook correctly did not map to new values. So now it seems this Transform method is not capable of handling non-string columns, and I have not found an overload of it that can, even though the documentation indicates it should be able to handle this.
Does anyone know how to get around this issue without manually building the entire int[][] testinput array one value at a time?
Turns out I was able to answer my own question eventually.
As near as I can tell, the Codification class can be used in two ways. The constructor that takes a list of column names and the Transform methods both lack intelligence in dealing with non-string data types; perhaps these methods are going away in the future.
The constructor that just takes a datatable by itself, as well as the Apply method, are both capable of handling data types other than strings. Once I switched to using these two methods my errors went away.
Codification codebook = new Codification(fulldata);
int[][] testinput = codebook.Apply(testData, inputColumnNameArray);
The confusion for me lay in all the example code seemingly randomly using these two methods, but using the Apply method only when processing the training data, and using the Transform method when encoding test data.
I am not sure why they chose to do this in the documentation example code, but it definitely took me a long time to figure out what was going on enough to stop having this particular issue.
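The behaviour described above (string columns get integer codes, numeric columns pass through unchanged) can be sketched in a language-neutral way. The following Python is an illustration only, not Accord.NET's implementation or API; Codebook, fit, and apply are hypothetical names, and the sample rows are made up.

```python
class Codebook:
    """Hypothetical codebook: encodes string columns, passes numbers through."""

    def __init__(self):
        self.mappings = {}  # column name -> {string value: integer code}

    def fit(self, rows):
        # Learn a mapping only for columns whose values are strings.
        for row in rows:
            for column, value in row.items():
                if isinstance(value, str):
                    codes = self.mappings.setdefault(column, {})
                    codes.setdefault(value, len(codes))
        return self

    def apply(self, rows, columns):
        # Columns without a mapping are assumed numeric and copied as-is,
        # instead of failing because no mapping was found.
        encoded = []
        for row in rows:
            encoded.append([
                self.mappings[c][row[c]] if c in self.mappings else row[c]
                for c in columns
            ])
        return encoded


train = [
    {"outlook": "sunny", "temperature": 30, "play": "no"},
    {"outlook": "rainy", "temperature": 18, "play": "yes"},
]
test = [{"outlook": "rainy", "temperature": 25}]

codebook = Codebook().fit(train)
test_input = codebook.apply(test, ["outlook", "temperature"])
print(test_input)  # [[1, 25]]
```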

Java read variable number values from text file and assign to declared program variables

Is there a way in Java to have a text file listing a=10.35 b=20.57 c=30.79, and have the program read only the decimal values and assign them to the declared variables a, b, c in the program?
I searched YouTube and found nothing, and I do not know if this is possible.
Got it working.
You can certainly read in the contents of the text file and parse it down to the chars and doubles.
If you are referring to declaring named variables based on the file, there is no way to do this directly at runtime. You can, however, use a data structure like a dictionary or map to store the data and access it using the name as a key.
If you could provide more details about what you are trying to do, that would make it easier to answer your question more specifically.
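The map-based approach can be sketched as follows. It is written in Python for brevity (in Java, a Map<String, Double> would play the same role). The function name parse_assignments and the whitespace-separated name=value layout are assumptions based on the example in the question.

```python
def parse_assignments(text):
    """Parse whitespace-separated name=value tokens into a dict of floats."""
    values = {}
    for token in text.split():
        name, sep, raw = token.partition("=")
        if not sep:
            continue  # skip tokens that are not name=value pairs
        values[name] = float(raw)
    return values


# In practice the text would come from a file, e.g. open(path).read().
values = parse_assignments("a=10.35 b=20.57 c=30.79")

# Named variables cannot be created from file contents at runtime,
# so look the values up by key and assign them explicitly.
a, b, c = values["a"], values["b"], values["c"]
print(a, b, c)  # 10.35 20.57 30.79
```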

Octave: converting dataframe to cell array

Given an Octave dataframe object created as
c = cell(m,n);
%populate c...
pkg load dataframe
df = dataframe(c);
(see https://octave.sourceforge.io/dataframe/overview.html),
Is it possible to access the underlying cell array?
Is there a conversion mechanism back to a cell array?
Is it possible to save df to CSV?
Yes. A dataframe object, like any object, can be converted back into a struct.
Once you have the resulting struct, look for the fields x_name to get the column names, and x_data to get the data in the form of a cell array, i.e.
struct(df).x_data
As for conversion to csv, the dataframe package does not seem to provide any relevant methods as far as I can tell (in particular the package does not provide an overloaded @dataframe/csvwrite method). Therefore, I'd just extract the information as above, and go about writing it into a csv file from there.
If you're not dealing with strictly numerical data, you might want to have a look at the cell2csv / csv2cell methods from the io package (since the built-in csvwrite function is strictly for numerical data).
And if that doesn't do exactly what you want, I'd probably just go for creating a csv file manually via custom fprintf statements.
PS. You can generally see what methods a package provides via pkg describe -verbose dataframe, or the methods for a particular class via methods(dataframe) (or even methods(df)). Also, if you ever wanted to access the documentation for an overloaded method, e.g. say the summary method, then this is the syntax for doing so: help @dataframe/summary

Runtime method to get names of argument variables?

Inside an Objective-C method, it is possible to get the selector of the method with the keyword _cmd. Does such a thing exist for the names of arguments?
For example, if I have a method declared as such:
- (void)methodWithAnArgument:(id)foo {
...
}
Is there some sort of construct that would allow me to get access to some sort of string-like representation of the variable name? That is, not the value of foo, but something that actually reflects the variable name "foo" in a local variable inside the method.
This information doesn't appear to be stored in NSInvocation or any of its related classes (NSMethodSignature, etc), so I'm not optimistic this can be done using Apple's frameworks or the runtime. I suspect it might be possible with some sort of compile-time macro, but I'm unfamiliar with C macros so I wouldn't know where to begin.
Edit to contain more information about what I'm actually trying to do.
I'm building a tool to help make working with third-party URL schemes easier. There are two sides to how I want my API to look:
As a consumer of a URL scheme, I can call a method like [twitterHandler showUserWithScreenName:@"someTwitterHandle"];
As a creator of an app with a URL scheme, I can define my URLs in a plist dictionary, whose key-value pairs look something like @"showUserWithScreenName": @"twitter://user?screenName={screenName}".
What I'm working on now is finding the best way to glue these together. The current fully-functioning implementation of showUserWithScreenName: looks something like this:
- (void)showUserWithScreenName:(NSString *)screenName {
    [self performCommand:NSStringFromSelector(_cmd) withArguments:@{@"screenName": screenName}];
}
Where performCommand:withArguments: is a method that (besides some other logic) looks up the command key in the plist (in this case "showUserWithScreenName:") and evaluates the value as a template using the passed dictionary as the values to bind.
The problem I'm trying to solve: there are dozens of methods like this that look exactly the same, but just swap out the dictionary definition to contain the correct template params. In every case, the desired dictionary key is the name of the parameter. I'm trying to find a way to minimize my boilerplate.
In practice, I assume I'm going to accept that there will be some boilerplate needed, but I can probably make it ever-so-slightly cleaner thanks to NSDictionaryOfVariableBindings (thanks @CodaFi; I wasn't familiar with that macro!). For the sake of argument, I'm curious if it would be possible to completely metaprogram this using something like forwardInvocation:, which as far as I can tell would require some way to access parameter names.
You can use componentsSeparatedByString: with a : after you get the string from NSStringFromSelector(_cmd) and use your @selector's argument names to put the arguments in the correct order.
You can also take a look at this post, which describes the method naming conventions in Objective-C.

return a computed field in a serialized django object?

I'm writing an API using Django, and I'm running into some issues around returning data that isn't stored in the database directly, or in other cases organized differently than the database schema.
In particular, given a particular data request, I want to add a field of computed data to my model before I serialize and return it. However, if I just add the field to the model, the built-in serializer (I'm using json) ignores it, presumably because it's getting the list of fields from the model definition.
I could write my own serializer, but what a pain. Or I guess I could run model_to_dict, then serialize the dict instead of the model. Anyone have any better ideas?
Here's what the code vaguely looks like right now:
squidlets = Squidlet.objects.filter(stuff)
for squidlet in squidlets:
    squidlet.newfield = do_some_computation(squidlet)
return HttpResponse(json_serializer.serialize(squidlets, ensure_ascii=False),
                    'text/json')
But newfield ain't in the returned json.
I think you should serialize using simplejson, and the data does not have to be a queryset. To escape it as JSON, also use mark_safe:
from django.utils.safestring import mark_safe
from django.utils import simplejson
simplejson.dumps(mark_safe(your_data_structure))
I went with the dict solution, which turned out to be fairly clean.
Here's what the code looks like:
from django.forms.models import model_to_dict
from django.utils import simplejson
squiddicts = []
squidlets = Squidlet.objects.filter(stuff)
for squidlet in squidlets:
    # Convert the model instance to a plain dict, then attach the computed value.
    squiddict = model_to_dict(squidlet)
    squiddict["newfield"] = do_some_computation(squidlet)
    squiddicts.append(squiddict)
return HttpResponse(simplejson.dumps(squiddicts, ensure_ascii=False),
                    'text/json')
This is maybe slightly more verbose than necessary but I think it's clearer this way.
This does still feel somewhat unsatisfying, but seems to work just fine.
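The same pattern can be tried outside Django with the standard library alone. In this sketch, Squidlet is a plain dataclass standing in for the model, dataclasses.asdict stands in for model_to_dict, and do_some_computation is a made-up placeholder; none of this is Django's API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Squidlet:
    # Stand-in for the Django model; the fields are invented for illustration.
    name: str
    tentacles: int


def do_some_computation(squidlet):
    # Placeholder for whatever value is derived at request time.
    return squidlet.tentacles * 2


def serialize_with_computed_field(squidlets):
    squiddicts = []
    for squidlet in squidlets:
        squiddict = asdict(squidlet)  # plain dict of the stored fields
        squiddict["newfield"] = do_some_computation(squidlet)
        squiddicts.append(squiddict)
    return json.dumps(squiddicts, ensure_ascii=False)


payload = serialize_with_computed_field([Squidlet("ink", 8)])
print(payload)  # [{"name": "ink", "tentacles": 8, "newfield": 16}]
```

Because the serializer only ever sees plain dicts, any number of computed keys can be added without touching the model definition.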