What's the syntax to pass a dictionary as an input type in Flyte?

I'm trying to pass a dictionary as a parameter to a batch_sub_task, but I'm not quite sure how to define the @inputs.

Have you tried Types.Generic as the type? It maps to a JSON object in Python.
Use this to specify a simple JSON type.
When used with an SDK-decorated method, expect this behavior from the default type engine:
As input:
1) If set, a Python dict with JSON-ifiable primitives and nested lists or maps.
2) Otherwise, a None value will be received.
As output:
1) User code may pass a Python dict with arbitrarily nested lists and dictionaries. JSON-ifiable
primitives may also be specified.
2) Output can also be nulled with a None value.
From command-line:
Specify a JSON string.
@inputs(a=Types.Generic)
@outputs(b=Types.Generic)
@python_task
def operate(wf_params, a, b):
    if a['operation'] == 'add':
        a['value'] += a['operand']  # a['value'] is a number
    elif a['operation'] == 'merge':
        a['value'].update(a['some']['nested'][0]['field'])
    b.set(a)
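Since the command-line form is just a JSON string, here is a quick sketch of what the input for the task above could look like; the keys 'operation', 'operand' and 'value' are only example data for operate(), not anything Flyte itself requires:

import json

# Hypothetical dict matching what operate() expects above
a = {"operation": "add", "operand": 2, "value": 40}
print(json.dumps(a))  # {"operation": "add", "operand": 2, "value": 40}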

Related

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier transformations of spectra
The intermediate and end results were previously stored in a SQL database using sqlalchemy, but we need to move them to Delta.
After lots of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
Now I'm trying to do the same for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple pandas DataFrames as fields, and I'm getting errors as soon as the executor nodes reach that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make these work?
I read in [1] that we could switch to the pandas-on-Spark API, but to me that seems to be something to do within the package methods themselves. Is the solution perhaps to rewrite the entire package & parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in serialization when writing the output (with the show(), display() or save() methods).
The UDF's declared schema expects ArrayType(xxxType()), but it gets a pandas.Series object and does not know how to handle it.
If you explicitly tell the UDF how to transform the pandas objects, the UDF works.
import pandas as pd
import pyspark.sql.functions as F

def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    out = {}  # renamed from 'dict' to avoid shadowing the builtin
    for key, value in vars(spectrum).items():
        # Convert pandas objects into plain Python structures the UDF can serialize
        if isinstance(value, pd.Series):
            out[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            out[key] = value.to_dict("list")
        else:
            out[key] = value
    return out

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
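As a self-contained sketch of the same principle (the column name, schema and Spark session here are made up purely for illustration): a UDF has to return plain Python structures matching its declared schema, and returning a pandas object directly is exactly what triggers the unpickling error above.

import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, DoubleType

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["n"])

def make_values(n):
    s = pd.Series([float(n), float(n) * 2])
    return s.tolist()  # returning s itself would fail once show()/save() serializes the rows

to_list = F.udf(make_values, ArrayType(DoubleType()))
df.withColumn("values", to_list("n")).show()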

List comprehension - multiple inputs

I am a beginner, trying to understand how list comprehension with multiple inputs works.
Can someone explain how the below code works?
x,y = [int(x) for x in input("Enter the value ").split()]
print(x,y)
Thanks in advance!
This is actually not directly related to list comprehensions, but instead a concept called "sequence unpacking", which applies to any sequence type (list, tuple, range). What is happening here is that the user input is expected to be two whitespace-separated values. The split call will split the user input on the whitespace, returning a list of size 2. Then, the list comprehension loops over each element of this split-produced list and converts each one to an int. Thus, the list comprehension will return a list of length 2, and each of its elements will be "unpacked" separately into the x and y variables on the left-hand side of the assignment operator. Here is an excerpt from the Data Structures section of the Python tutorial that explains sequence unpacking:
The statement t = 12345, 54321, 'hello!' is an example of tuple packing: the values 12345, 54321 and 'hello!' are packed together in a tuple. The reverse operation is also possible:
>>> x, y, z = t
This is called, appropriately enough, sequence unpacking and works for any sequence on the right-hand side. Sequence unpacking requires that there are as many variables on the left side of the equals sign as there are elements in the sequence. Note that multiple assignment is really just a combination of tuple packing and sequence unpacking.
Note that this only works if the user input contains exactly two values; otherwise the sequence unpacking will fail with an error.
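To see the failure mode, here is a quick sketch with made-up input values:

parts = "1 2 3".split()          # simulating input() returning three values
x, y = [int(v) for v in parts]   # ValueError: too many values to unpack (expected 2)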

I am making my first ML model. I cannot understand what this line of code is doing.

features = {key:np.array(value) for key,value in dict(features).items()}
What is value? Why is features put as a parameter for dict constructor?
features = {key: np.array(value) for key, value in dict(features).items()} is a basic comprehension. There are list comprehensions and dict comprehensions (and more; see https://docs.python.org/3/tutorial/datastructures.html for further details). Long story short, you have a for loop running over an iterator, and you use its items directly to construct a list, or in your case a dict:
{key: np.array(value) for key, value in dict(features).items()}
{key: np.array(value) ... is the usual dict-creation syntax.
for key, value in dict(features).items() is a simple for loop over a dictionary of features.
Together they create a new dict with the same keys as dict(features), where each value is the corresponding np.array(value).
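A runnable sketch with made-up feature columns (the key names are only illustrative):

import numpy as np

features = {"age": [25, 32, 47], "income": [48000, 54000, 61000]}
features = {key: np.array(value) for key, value in dict(features).items()}
print(type(features["age"]))  # <class 'numpy.ndarray'>

If features is already a dict, dict(features) just makes a shallow copy; if it is something dict-like such as a pandas DataFrame, dict(features) turns it into a mapping of column name to column, which is why the constructor is called before .items().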

Repeated and unique in hash-like things

The repeated method takes a function as an argument for normalizing the elements before finding out which ones are repeated. However, I can't seem to make it work with values. For instance:
%(:a(3),:b(3),:c(2)).repeated( as=> *.values ).say
Returns an empty list, while I was expecting the pairs :a(3) and :b(3), same as
%(:a(3),:b(3),:c(2)).repeated( as=> .values ).say
In this case, for instance, it seems to work as expected:
(3+3i, 3+2i, 2+1i).unique(as => *.re).say # OUTPUT: «(3+3i 2+1i)␤»
Any idea of what I'm missing here?
.values is a method for returning all of the values of a container.
Since it is a List method, if you call it on a singular value, it pretends it is a List containing only that value.
say 5.values.perl;
# (5,)
The as named parameter of .repeated gets called on all of the singular values.
%(:a(3),:b(3),:c(2)).repeated( as=> *.perl.say );
# :a(3)
# :b(3)
# :c(2)
So by giving it the *.values lambda, you are effectively not doing anything useful.
The method you were looking for is .value, which is a method on Pair.
%(:a(3),:b(3),:c(2)).repeated( as=> *.value ).say
# (a => 3)

What is the lambda function doing in the info_dict parameter of the summary_col in this code?

I'm running summary statistics for a group of standard OLS regressions. The code was written by my professor and I'm trying to figure out what's going on specifically in a portion of the code.
summary_col(
    [reg0, reg1, reg2, reg3],
    stars=True,
    float_format='%0.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    })
I looked up lambda functions. I have a fairly decent understanding of how they work. Aspects of the code that I do understand:
info_dict is a dictionary of values that can be called if you wish to include them in your summary statistics
lambda functions work by defining an anonymous function: "lambda x" introduces the parameter, then after the colon you list the operation you want to take place (i.e. x + 5); if you already know the arguments you want it run with, you can supply them right away.
{0:d} will format as an integer, which makes perfect sense for observation counts. Although I don't know why you can't just say {%.f}; maybe it's because the former returns an explicit int and the latter returns a float that looks like an int.
{:.2f} will return a float with 2 decimal places
What I don't fully understand is what somestring.format() does. Somehow, I believe, x is getting defined as the results of each regression, and x.nobs is the "number of observations" attribute; similarly for x.rsquared.
Could someone fill in the gaps for me about what's going on in the formula? What exactly about the lambda function is enabling it to fetch data for each individual regression?
Let's break this out a little bit to make it obvious what is happening:
summary_col(
[reg0,reg1,reg2,reg3],
stars=True,
float_format='%0.2f',
info_dict={
'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.2f}".format(x.rsquared)
}
)
The summary_col function is taking in some input, the first argument being a list of regression objects, [reg0, reg1, reg2, reg3]. Then there are three named arguments: stars, float_format, and info_dict. When we pass in the list of regression objects as the first argument, summary_col applies each anonymous function to each object. So all info_dict is doing is creating a dictionary with two keys, N and R2, which map to functions that produce strings. When the members x.nobs and x.rsquared are referenced in the lambda functions, they are resolved against each regression object due to the context in which they are used.
If you try to use lambda in that line of code on something that does not exist in the regression objects, you'll almost certainly get an error. The key is in the context against which the lambda is applied.
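Concretely, summary_col calls each info_dict function once per results object, passing that object in as x. A runnable sketch with toy data (the data and models below are made up purely for illustration):

import numpy as np
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

# Fit two toy OLS regressions on random data
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)
reg0 = sm.OLS(y, X).fit()
reg1 = sm.OLS(y, X[:, :1]).fit()  # constant-only model for contrast

# Each lambda receives one fitted results object at a time as x
print(summary_col([reg0, reg1], stars=True, float_format='%0.2f',
                  info_dict={
                      'N': lambda x: "{0:d}".format(int(x.nobs)),
                      'R2': lambda x: "{:.2f}".format(x.rsquared)
                  }))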
A good example of the context of lambda functions is iterating over a dictionary, sorting by value first and key second.
# sort the dict by value first, and key second...
# x is inferred from the context (my_dict.items())
my_dict = {'pear': 3, 'apple': 1, 'fig': 1}  # example data, added so the snippet runs
for key, value in sorted(my_dict.items(), key=lambda x: (x[1], x[0])):
    print(key, value)