Fuzzball inside BigQuery - google-bigquery

How can I implement the fuzzball JavaScript library as a UDF inside BigQuery? Fuzzball has a fair number of dependency libraries, which makes it challenging to include as part of a UDF in BigQuery.

It's unclear where you're running into trouble, so I will walk through the process of creating a JavaScript UDF using fuzzball.
Download the fuzzball package: npm i fuzzball
Upload the appropriate file(s) to a GCS bucket. What you want is likely the UMD or ESM build; at the time of writing, that is fuzzball.umd.min.js.
Write your SQL UDF, providing the bucket path and package file in OPTIONS.
For example:
CREATE OR REPLACE FUNCTION
project.dataset.func (str_1 STRING, str_2 STRING)
RETURNS INT64
LANGUAGE js AS '''
// The UMD bundle loaded via OPTIONS exposes the global `fuzzball`.
return fuzzball.distance(str_1, str_2);
'''
OPTIONS (library='gs://bucket_name/fuzzball.umd.min.js');
And now you should be able to call your UDF as needed.
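For instance, to call it from Python with the BigQuery client library (a minimal sketch; the dataset path and the input strings are placeholders, and it assumes the function above has already been created):
from google.cloud import bigquery

# Assumes application default credentials and that the UDF created above
# exists under `project.dataset.func` (placeholder path).
client = bigquery.Client()
sql = "SELECT `project.dataset.func`('kitten', 'sitting') AS dist"
for row in client.query(sql).result():
    print(row.dist)  # Levenshtein distance computed by fuzzball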

Related

How to convert multiple LCI ecospold files to a custom Excel format / how to use parse_file from pyecospold / how to read ecospold into Brightway

I have multiple ecospold (version 1) files with LCI data that I want to convert to a custom Excel format. I need all of the data given in the ecospold files. For my own convenience I want to use Python to complete this task.
My research so far has led me to the following conclusions:
There exist at least two converters (by GLAD and openLCA) to convert ecospold formats (1 and 2) to e.g. the ILCD. But those formats do not get me anywhere, since I need to have all the data accessible in Python in order to then write it into my custom Excel format.
To get the data into Python, the package pyecospold (https://github.com/sami-m-g/pyecospold) seems to be a suitable choice.
According to the README in the pyecospold GitHub repository,
ecoSpold = parse_file("data/v1/v1_1.xml") # Replace with your own XML file
should do the job. So I implemented the following lines:
import os
from pyecospold import parse_file, save_file, Defaults
from lxml import etree
cd = os.getcwd()
path_input = cd + r'\inputs\ecospold_test.xml'
# Parse the required XML file to EcoSpold class.
es = parse_file('inputs/ecospold_test.xml')
Now I run into the error:
TypeError: parse_file() missing 2 required positional arguments: 'schema_path' and 'ecospold_lookup'
I understood that a schema in XSD format is needed, so I got the schema files from the GitHub repository and amended my last line of code:
es = parse_file('inputs/ecospold_test.xml', 'inputs/schemas/v1/EcoSpold01Dataset.xsd')
Now there is still one argument missing:
TypeError: parse_file() missing 1 required positional argument: 'ecospold_lookup'
Since I have no experience in parsing XML files in Python, I have no idea what to do with this. Additionally, I am confused as to why the README does not mention these additional required arguments.
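For reference, I can at least inspect the function's actual signature and docstring directly (plain Python introspection, nothing pyecospold-specific):
import inspect
from pyecospold import parse_file

# Show the full parameter list the installed version actually expects ...
print(inspect.signature(parse_file))
# ... and any documentation the function ships with.
help(parse_file)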
My second idea was to use Brightway to get the data into Python. But since Brightway itself is quite an extensive package, I could not find a simple (or any) way to do this. (Sadly, the notebooks linked in the answer to this question, Import Ecoinvent 2.2 Ecospold files into Brightway, no longer exist.)
Another option would of course be to write my own parser. But because I lack experience and pyecospold already does exactly this (at least in my understanding), I would like to avoid that option.
Additionally, in openLCA it is possible to read in ecospold files and then export them to an Excel format, from which I could of course build my custom Excel format. The problem here is that I have no idea how to automate this, because I do not want to import and export each file individually and manually in openLCA.
If anyone has an idea on how to solve one of my subproblems or a good alternative on how to solve my general problem, I would be very thankful. :)

Directly passing pandas data into zipline

I am currently looking for a way to directly pass in a pandas dataframe or csv file to zipline for simple backtesting WITHOUT having to ingest a data bundle. The reason is that I am planning to generate new data outside of the existing bundle during a backtest and it seems very inefficient to ingest a new bundle for every handle_data call.
I have been looking for this everywhere, including in the zipline source code. I found that an older version of zipline has a 'data' param in the run_algo function call where you could pass in a df directly, but I can't find that old version at the moment. Is anyone attempting the same thing? Is there any way other than ingesting data bundles from the command line every time?
I'm using zipline 1.3.0 and it actually does have a data param. This docstring is from zipline's run_algo.py file:
data : pd.DataFrame, pd.Panel, or DataPortal, optional
The ohlcv data to run the backtest with.
This argument is mutually exclusive with:
``bundle``
``bundle_timestamp``
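For completeness, here is a rough sketch of how that argument might be used; the exact DataFrame shape zipline accepts is an assumption on my part, so check the run_algorithm docstring in your installed run_algo.py:
import pandas as pd
from zipline import run_algorithm

def initialize(context):
    # Set up any state here; nothing needed for this sketch.
    pass

def handle_data(context, data):
    # Strategy logic goes here; `data` exposes the prices passed in below.
    pass

# Assumed shape: close prices indexed by UTC timestamps, one column per symbol.
prices = pd.DataFrame(
    {'AAPL': [100.0, 101.5, 102.2]},
    index=pd.date_range('2018-01-02', periods=3, tz='utc'),
)

result = run_algorithm(
    start=prices.index[0],
    end=prices.index[-1],
    initialize=initialize,
    handle_data=handle_data,
    capital_base=100000,
    data=prices,  # mutually exclusive with bundle / bundle_timestamp
    data_frequency='daily',
)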
Hope it helps.

AWS Lambda - dynamically import python module from S3 at runtime

I have some tens of Python modules, each of which has one common method (e.g. run(params)) but with a different implementation. I also have an AWS Lambda function which will need to call that method from within one of those modules, choosing the module depending on the Lambda's input.
It seems that I can achieve that by using Layers in Lambda.
However, if I use one single layer for all those modules, then I could see versioning problems: if I need to update one module, I'll need to re-deploy the whole layer, which could bring unexpected changes to the other modules.
If I use one layer for each module, then there will be too many layers to manage.
I thought of putting each module into an individual zip file and putting those zip files into an S3 location. My Lambda will then dynamically read the required zip file from S3 and execute it.
Is that approach viable?
=====================
My current solution is to have something like this:
import io
import zipfile
from types import ModuleType

import boto3

def read_python_script_from_zip(bucket: str, key: str, script_name: str) -> ModuleType:
    # Pull the zip archive from S3 into memory.
    s3 = boto3.resource('s3')
    raw = s3.Object(bucket, key).get()['Body'].read()
    zf = zipfile.ZipFile(io.BytesIO(raw), "r")
    # Expect exactly one file named <script_name>.py inside the archive.
    scripts = list(filter(lambda f: f.endswith(f"/{script_name}.py"), zf.namelist()))
    if len(scripts) == 0:
        raise ModuleNotFoundError(f"{script_name} not found.")
    if len(scripts) > 1:
        raise ModuleNotFoundError(f"{script_name} is ambiguous.")
    # Execute the module source in a fresh module object and return it.
    source = zf.read(scripts[0])
    mod = ModuleType(script_name, '')
    exec(source, mod.__dict__)
    return mod

read_python_script_from_zip(source_bucket, source_key, module_name).run(params)
Looks complicated to me though; I would expect an easier way.
You could try packaging each module as a separate distribution package, which would let you version them separately. However, creating a Python distribution package is not as simple as you might hope, especially if you want to publish it to a private repository hosted on S3.
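At its simplest, such a package boils down to a setup.py like the sketch below (the names are hypothetical); the difficulty lies in building, versioning, and hosting the packages, e.g. on a private S3-backed index.
from setuptools import setup

# Hypothetical layout: one distribution package per module, each exposing
# run(params), so every module can be versioned and released independently.
setup(
    name="handler-module-a",
    version="1.0.0",
    py_modules=["module_a"],
)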

Use js packages in BigQuery UDF

I was trying to create a BigQuery UDF which requires an external npm package.
CREATE TEMPORARY FUNCTION tempfn(message STRING)
RETURNS STRING
LANGUAGE js AS """
var tesfn = require('js-123');
return tesfn(message)
""";
SELECT tempfn("Hello") as test;
It gives me an error
ReferenceError: require is not defined at tempfn(STRING) line 2,
columns 15-16
Is there a way that I can use these packages?
You can't load npm packages using require from JavaScript UDFs. You can, however, load external libraries from GCS, as outlined in the documentation. The example that the documentation gives is,
CREATE TEMP FUNCTION myFunc(a FLOAT64, b STRING)
RETURNS STRING
LANGUAGE js AS
"""
// Assumes 'doInterestingStuff' is defined in one of the library files.
return doInterestingStuff(a, b);
"""
OPTIONS (
library="gs://my-bucket/path/to/lib1.js",
library=["gs://my-bucket/path/to/lib2.js", "gs://my-bucket/path/to/lib3.js"]
);
SELECT myFunc(3.14, 'foo');
Here the assumption is that you have files with these names in Cloud Storage, and that one of them defines doInterestingStuff.
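As an aside, uploading your bundled library (for example a webpack or UMD build of the npm package) to Cloud Storage can itself be scripted with the google-cloud-storage client; here is a minimal sketch with placeholder names:
from google.cloud import storage

# Placeholder names: replace the bucket, object path, and local file.
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("path/to/lib1.js")
blob.upload_from_filename("lib1.js")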

Retrieve version number from another exe [duplicate]

Salvete! I am writing a VB.NET program to update the readme files for my applications. I want to extract the version number from other compiled applications, reading the version number from the executable itself, not from its uncompiled resources.
How can I do this in VB.NET without using an external tool like ResHacker?
(I found this link, but it is for another language.)
You can use a function like this to do it:
' Reads the file version stamped into the compiled executable on disk.
Private Function GetFileVersionInfo(ByVal filename As String) As Version
    Return Version.Parse(FileVersionInfo.GetVersionInfo(filename).FileVersion)
End Function
Usage:
Debug.WriteLine(GetFileVersionInfo("C:\foo\bar\myapp.exe").ToString)
Output:
4.2.9.281