My team is interested in a feature store solution that enables rapid experimentation with features, probably using feature versioning. In the Feast Slack history, I found Benjamin Tan's post explaining their Feast workflow, which includes FeatureView versioning:
insights_v1 = FeatureView(
    name="insights_v1",
    features=[
        Feature(name="insight_type", dtype=ValueType.STRING),
    ],
)

insights_v2 = FeatureView(
    name="insights_v2",
    features=[
        Feature(name="customer_id", dtype=ValueType.STRING),
        Feature(name="insight_type", dtype=ValueType.STRING),
    ],
)
Is this the recommended best practice for FeatureView versioning? It looks like Features do not have a version field. Is there a recommended strategy for Feature versioning?
Creating a new column for each Feature version is one approach:
driver_rating_v1
driver_rating_v2
But that could get unwieldy if we want to experiment with dozens of permutations of the same Feature.
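For example, a minimal sketch in the same Feast style as the snippet above (the feature names and dtypes here are hypothetical) would pack every permutation into one FeatureView:
# Hypothetical sketch: one column per Feature version inside a single FeatureView
driver_stats = FeatureView(
    name="driver_stats",
    features=[
        Feature(name="driver_rating_v1", dtype=ValueType.FLOAT),
        Feature(name="driver_rating_v2", dtype=ValueType.FLOAT),
        # ...one Feature per experimental permutation, which is what gets unwieldy
    ],
)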
Featureform appears to have support for feature versions through the "variant" field, but their documentation is a bit unclear.
To add some clarity on Featureform: variant is analogous to version. You supply a string, which then becomes an immutable identifier for that version of the transformation, source, etc. Variant is one of the common metadata fields provided in the Featureform API.
Using the example of an ecommerce dataset & spark, here's an example of using the variant field to version a source (a parquet file in this case):
orders = spark.register_parquet_file(
    name="orders",
    variant="default",
    description="This is the core dataset. From each order you might find all other information.",
    file_path="path_to_file",
)
You can set the variant variable ahead of time:
VERSION="v1" # You can change this to rerun the definitions with with new variants
orders = spark.register_parquet_file(
name="orders",
variant=f"{VERSION}",
description="This is the core dataset. From each order you might find all other information.",
file_path="path_to_file",
)
And you can create versions or variants of the transformations -- here I'm taking a dataframe called total_paid_per_customer_per_day and aggregating it.
# Get average order value per day
@spark.df_transformation(inputs=[("total_paid_per_customer_per_day", "default")], variant="skeller88_20220110")
def average_daily_transaction(df):
    from pyspark.sql.functions import mean
    return df.groupBy("day_date").agg(mean("total_customer_order_paid").alias("average_order_value"))
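Putting those pieces together, here is a rough, illustrative sketch (the column names and the aggregation are hypothetical) of reusing one VERSION string across a source and a transformation, so that each experiment gets its own immutable set of variants:
# Illustrative only: one VERSION string shared by the source and the transformation
VERSION = "experiment_3"

orders = spark.register_parquet_file(
    name="orders",
    variant=f"{VERSION}",
    description="This is the core dataset. From each order you might find all other information.",
    file_path="path_to_file",
)

@spark.df_transformation(inputs=[("orders", f"{VERSION}")], variant=f"{VERSION}")
def total_paid_per_customer_per_day(df):
    # Hypothetical aggregation; replace with whatever the experiment actually needs
    from pyspark.sql.functions import sum as spark_sum
    return df.groupBy("customer_id", "day_date").agg(
        spark_sum("order_paid").alias("total_customer_order_paid")
    )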
There are some more details on the Featureform CLI here: https://docs.featureform.com/getting-started/interact-with-the-cli
I am an R user and I am interested in using the World Bank Group (WBG) Topical Taxonomy through SPARQL queries.
This can be done directly against the API at https://vocabulary.worldbank.org/PoolParty/sparql/taxonomy, but it can also be done through R by using the function load.rdf (to load the taxonomy.rdf RDF/XML file downloaded from https://vocabulary.worldbank.org/) and then sparql.rdf to perform the query. Both functions are available in the "rrdf" package.
These are the three lines of code:
taxonomy_file <- load.rdf("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE {<http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath}"
result_query_1 <- sparql.rdf(taxonomy_file, query)
What I obtain in result_query_1 is exactly the same as what I get through the API.
However, the load.rdf function uses all the cores available on my computer, not only one. The function is somehow parallelizing the load task over all the available cores, and I do not want that. I haven't found any option in that function to force serial execution.
Therefore, I am trying to find other solutions. For instance, I have tried rdf_parse and rdf_query from the "rdflib" package, but without any encouraging result. These are the lines of code I have used:
taxonomy_file <- rdf_parse("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE {<http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath}"
result_query_2 <- rdf_query(taxonomy_file, query = query)
Is there any other function that performs this task? The objective of my work is to run several queries simultaneously using foreach.
Thank you very much for any suggestions you can provide.
How to programmatically list available Google BigQuery locations? I need a result similar to what is in the table of this page: https://cloud.google.com/bigquery/docs/locations.
As @shollyman has mentioned:
The BigQuery API does not expose the equivalent of a list locations call at this time.
So, you should consider filing a feature request on the issue tracker.
In the meantime, I wanted to add Option 3 to the two already proposed by @Tamir.
This is a somewhat naïve option with its pros and cons, but depending on your specific use case it can be useful and easily adapted to your application.
Step 1 - load page (https://cloud.google.com/bigquery/docs/locations) html
Step 2 - parse and extract needed info
Obviously, this is super simple to implement in any client of your choice
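For instance, a deliberately crude Python sketch of those two steps might look like this (hypothetical; the regexes assume the current page layout and will break if it changes):
# Hypothetical sketch: scrape the locations table from the docs page (Steps 1 and 2)
import re
import requests

html = requests.get("https://cloud.google.com/bigquery/docs/locations").text   # Step 1: load the page HTML
html = re.sub(r"\n|<strong>|</strong>|<code>|</code>", "", html)                # strip noise, as in the query below

for row in re.findall(r"<tr>(.*?)</tr>", html):                                 # Step 2: parse and extract
    cells = re.findall(r"<td>(.*?)</td>", row)
    if len(cells) >= 2:
        print(cells)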
As I am a huge BigQuery fan, I went through a proof of concept using the BigQuery tool Magnus.
I've created a workflow with just two tasks:
API Task - to load the page's HTML into the variable var_payload
and
BigQuery Task - to parse and extract the wanted info out of the HTML
The whole workflow is as simple as that - just those two tasks.
The query I used in BigQuery Task is
CREATE TEMP FUNCTION decode(x STRING) RETURNS STRING
LANGUAGE js
OPTIONS (library="gs://my_bucket/he.js")
AS """
return he.decode(x);
""";
WITH t AS (
  SELECT html,
    REGEXP_EXTRACT_ALL(
      REGEXP_REPLACE(html,
        r'\n|<strong>|</strong>|<code>|</code>', ''),
      r'<table>(.*?)</table>'
    )[OFFSET(0)] x
  FROM (SELECT '''<var_payload>''' AS html)
)
SELECT pos,
  line[SAFE_OFFSET(0)] Area,
  line[SAFE_OFFSET(1)] Region_Name,
  decode(line[SAFE_OFFSET(2)]) Region_Description
FROM (
  SELECT
    pos, REGEXP_EXTRACT_ALL(line, '<td>(.*?)</td>') line
  FROM t,
  UNNEST(REGEXP_EXTRACT_ALL(x, r'<tr>(.*?)</tr>')) line
  WITH OFFSET pos
  WHERE pos > 0
)
As you can see, I used the he library. From its README:
he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would ...
After the workflow is executed and those two steps are done, the result is in project.dataset.location_extraction, and we can query that table to make sure we've got what we expected.
Note: obviously, the parsing and extraction of the locations info is quite simplified here and can surely be improved to be more robust to changes in the source page layout.
Unfortunately, there is no API that provides the list of supported BigQuery locations.
I see two options which might be good for you:
Option 1
You can manually manage a list and expose it to your client via an API or any other means your application supports (you will need to follow BigQuery product updates to keep this list current).
Option 2
If your use case is to provide a list of the locations where your own data is stored, you can call datasets.list to get your datasets (each entry includes its location) and display/use those locations in your app, as in the sample response and the sketch below.
{
  "kind": "bigquery#dataset",
  "id": "id1",
  "datasetReference": {
    "datasetId": "datasetId",
    "projectId": "projectId"
  },
  "location": "US"
}
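A rough sketch of Option 2 with the Python client (assuming google-cloud-bigquery is installed and the default project is the one of interest):
# Sketch: collect the distinct locations of the datasets in your own project
from google.cloud import bigquery

client = bigquery.Client()
locations = set()
for item in client.list_datasets():               # wraps the datasets.list API call
    dataset = client.get_dataset(item.reference)  # full metadata, including location
    locations.add(dataset.location)
print(sorted(locations))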
Two months ago, I was running the following code without problems, but now, after running pip install google.cloud again, I get 'TableReference' object has no attribute "exists", and I can no longer use Dataset.name either; it is gone as well. So has there been a big change to the API? It sounds like I need to restructure my code...
def createTable(client, ds, tb):
    dataset = client.dataset(ds)
    # assert not dataset.exists()
    table = dataset.table(tb)
    # assert not table.exists()
    if not table.exists():
        assert not table.exists()
        table.schema = (bigquery.SchemaField('Name', 'STRING'),
                        bigquery.SchemaField('Age', 'INTEGER'),
                        bigquery.SchemaField('Weight', 'FLOAT'),)
        table.create()
    else:
        print 'this table already existed in this dataset'
        assert table.exists()
You probably haven't updated in a while, and you've brought in the breaking changes from 0.28.0, in particular:
Remove table.exists() (#4145)
and
Functions to create, get, update, delete datasets and tables moved to
the client class.
See:
https://github.com/GoogleCloudPlatform/google-cloud-python/releases/tag/bigquery-0.28.0
https://cloud.google.com/bigquery/docs/python-client-migration
In summary, you'll either need to migrate/port your code, or remain on an older version.
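For illustration only, here is a rough sketch of how the createTable function from the question might be ported to the >= 0.28.0 client API (existence checks now go through the client and a NotFound exception):
# Illustrative port to google-cloud-bigquery >= 0.28.0; not the only way to do it
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

def create_table(client, ds, tb):
    table_ref = client.dataset(ds).table(tb)  # a TableReference; it no longer has exists()
    try:
        client.get_table(table_ref)           # raises NotFound if the table does not exist
        print('this table already existed in this dataset')
    except NotFound:
        schema = [
            bigquery.SchemaField('Name', 'STRING'),
            bigquery.SchemaField('Age', 'INTEGER'),
            bigquery.SchemaField('Weight', 'FLOAT'),
        ]
        client.create_table(bigquery.Table(table_ref, schema=schema))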
Does the search_tweets function from the rtweet package have parameters that allow me to filter by geolocation, equivalent to the geocode parameter (geocode = '40.757343,-73.848261,40km')?
Not really; you need to download the tweets via rtweet::search_tweets and then impose a geographical cut on the coordinates. Furthermore, there is also a column entitled bounding_box_coordinates that can prove useful to your analysis.
I'm currently using IBM's UniObjects and I'm trying to retrieve multiple fields from a UniFile at once to increase efficiency.
UniFile uFile = uSession.CreateUniFile("fileName");
uFile.RecordID = inputID;
string[] fieldNames = {"I_Field_1", "D_Field_1", "I_Field_2", "D_Field_2"};
UniDynArray uFields = uFile.ReadNamedFields(fieldNames);
uFields value:
þvalue1þþvalue2
þ = delimiter for the UniDynArray
The problem is that half of these fields are I-descriptors and half are D-descriptors. The I-descriptor fields will not be returned unless only a single one of them is in the fieldNames array, like so:
string[] fieldNames = {"I_Field_1"};
UniDynArray uFields = uFile.ReadNamedFields(fieldNames);
So I guess my question is: why are the I-descriptor fields not being returned, and is there a way they can be, using this or a similar method?
I'm new to stackoverflow as well as an entry-level developer so thank you for any help you can provide.
It sounds like a defect in UniObjects. Since you say it is IBM's UniObjects, you most likely have an old version (UniData/UniVerse is owned by Rocket Software now).
Assuming you are on an old version, it is possible this works in a newer one. You should look into scheduling an update of your UniData server and, with it, client software such as UniObjects.
Outside of this, you can probably raise a bug with your VAR/Support Provider or Rocket Software directly.