Difference in methods of indexing multiple documents with NEST

Using the 1.1.2 version of Nest, it seems there are at least 3 ways to index multiple documents:
IndexMany
client.IndexMany(documents, "index_name", "type_name");
Bulk with a BulkRequest parameter
client.Bulk(new BulkRequest()
{
    Index = "index_name",
    Type = "type_name",
    Operations = documents_as_list_of_BulkIndexOperation
});
Bulk with a selector
client.Bulk(s => s.IndexMany(documents,
    (bulkDescriptor, record) =>
        bulkDescriptor.Index("index_name").Type("type_name")));
If I want to perform the same operation on all the documents (i.e. I don't want to take advantage of the Bulk API's ability to perform a different operation for each document as detailed in the docs), is there any advantage to calling client.Bulk over client.IndexMany?

IndexMany() uses the BulkIndexDescriptor in its implementation, so if you are only using Bulk() for indexing, the two are functionally equivalent. IndexMany() is just a shorthand alternative to Bulk(), added for convenience.
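In other words, for plain indexing the two calls below should end up sending the same bulk request (this just recaps the snippets from the question; the index and type names are placeholders):
// Shorthand form
client.IndexMany(documents, "index_name", "type_name");

// Equivalent Bulk() form for the same operation
client.Bulk(s => s.IndexMany(documents,
    (bulkDescriptor, record) => bulkDescriptor.Index("index_name").Type("type_name")));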

Feature and FeatureView versioning

My team is interested in a feature store solution that enables rapid experimentation of features, probably using feature versioning. In the Feast Slack history, I found Benjamin Tan's post that walks through their Feast workflow and explains FeatureView versioning:
insights_v1 = FeatureView(
    features=[
        Feature(name="insight_type", dtype=ValueType.STRING)
    ]
)
insights_v2 = FeatureView(
    features=[
        Feature(name="customer_id", dtype=ValueType.STRING),
        Feature(name="insight_type", dtype=ValueType.STRING)
    ]
)
Is this the recommended best practice for FeatureView versioning? It looks like Features do not have a version field. Is there a recommended strategy for Feature versioning?
Creating a new column for each Feature version is one approach:
driver_rating_v1
driver_rating_v2
But that could get unwieldy if we want to experiment with dozens of permutations of the same Feature.
Featureform appears to have support for feature versions through the "variant" field, but their documentation is a bit unclear.
Adding additional clarity on Featureform: Variant is analogous to version. You'd supply a string which then becomes an immutable identifier for the version of the transformation, source, etc. Variant is one of the common metadata fields provided in the Featureform API.
Using an e-commerce dataset and Spark as an example, here's how the variant field can be used to version a source (a Parquet file in this case):
orders = spark.register_parquet_file(
    name="orders",
    variant="default",
    description="This is the core dataset. From each order you might find all other information.",
    file_path="path_to_file",
)
You can set the variant variable ahead of time:
VERSION="v1" # You can change this to rerun the definitions with with new variants
orders = spark.register_parquet_file(
name="orders",
variant=f"{VERSION}",
description="This is the core dataset. From each order you might find all other information.",
file_path="path_to_file",
)
And you can create versions or variants of the transformations -- here I'm taking a dataframe called total_paid_per_customer_per_day and aggregating it.
# Get average order value per day
@spark.df_transformation(inputs=[("total_paid_per_customer_per_day", "default")], variant="skeller88_20220110")
def average_daily_transaction(df):
    from pyspark.sql.functions import mean
    return df.groupBy("day_date").agg(mean("total_customer_order_paid").alias("average_order_value"))
There are some more details on the Featureform CLI here: https://docs.featureform.com/getting-started/interact-with-the-cli

How to run SPARQL queries in R (WBG Topical Taxonomy) without parallelization

I am an R user and I am interested in using the World Bank Group (WBG) Topical Taxonomy through SPARQL queries.
This can be done directly through the API at https://vocabulary.worldbank.org/PoolParty/sparql/taxonomy, but it can also be done in R by using load.rdf (to load the taxonomy.rdf RDF/XML file downloaded from https://vocabulary.worldbank.org/) and then sparql.rdf to perform the query. Both functions are available in the "rrdf" package.
These are the three lines of code:
taxonomy_file <- load.rdf("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE {http://vocabulary.worldbank.org/taxonomy/435 http://www.w3.org/2004/02/skos/core#narrower* ?nodeOnPath}"
result_query_1 <- sparql.rdf(taxonomy_file,query)
What I obtain from result_query_1 is exactly what I get through the API.
However, the load.rdf function uses all the cores available on my computer, not only one. It somehow parallelizes the loading task over all the cores on my machine, and I do not want that. I haven't found any option on that function to force serial execution.
Therefore, I am trying to find other solutions. For instance, I have tried rdf_parse and rdf_query from the "rdflib" package, but without any encouraging results. These are the lines of code I have used:
taxonomy_file <- rdf_parse("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE {http://vocabulary.worldbank.org/taxonomy/435 http://www.w3.org/2004/02/skos/core#narrower* ?nodeOnPath}"
result_query_2 <- rdf_query(taxonomy_file , query = query)
Is there any other function that performs this task? The objective of my work is to run several queries simultaneously using foreach.
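For illustration, the kind of foreach setup I am aiming at looks roughly like this (sketch only; rdf_parse and rdf_query are the rdflib functions above, while the worker count and the choice to parse inside each worker are just illustrative assumptions):
library(rdflib)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)  # illustrative worker count

queries <- c(
  "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
  # ... further queries ...
)

results <- foreach(q = queries) %dopar% {
  # parse inside each worker, since rdf objects may not transfer cleanly between processes
  taxonomy_file <- rdf_parse("taxonomy.rdf")
  rdf_query(taxonomy_file, query = q)
}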
Thank you very much for any suggestion you could provide me.

PySpark map function - send n rows instead of one to build a list

I am using Spark 3.x with Python. I have data (millions of rows) in CSV files that I have to index into Apache Solr.
I am using the pysolr module for this purpose:
import pysolr
def index_module(row):
    ...
    solr_client = pysolr.Solr(SOLR_URI)
    solr_client.add(row)
    ...
df = spark.read.format("csv").option("sep", ",").option("quote", "\"").option("escape", "\\").option("header", "true").load("sample.csv")
df.toJSON().map(index_module).count()
The index_module function simply gets one row of the data frame as JSON and then indexes it in Solr via the pysolr module. pysolr supports indexing a list of documents instead of a single one. I have to update my logic so that instead of sending one document per request, I send a list of documents. That will definitely improve performance.
How can I achieve this in PySpark? Is there any alternative or better approach instead of map and toJSON?
Also, all my work is done in transformation functions; I am using count just to trigger the job. Is there an alternative dummy function (of action type) in Spark to do the same?
Finally, I have to create a Solr object for each row; is there an alternative to this?
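One direction I am considering (sketch only, not tested; BATCH_SIZE is illustrative and SOLR_URI is assumed to be defined as above) is to switch from map to foreachPartition, so that each partition opens one Solr client and sends documents in batches:
import json
import pysolr

BATCH_SIZE = 500  # illustrative batch size

def index_partition(rows):
    # one client per partition instead of one per row
    solr_client = pysolr.Solr(SOLR_URI)
    batch = []
    for row in rows:
        batch.append(json.loads(row))
        if len(batch) >= BATCH_SIZE:
            solr_client.add(batch)
            batch = []
    if batch:
        solr_client.add(batch)

# foreachPartition is an action, so no dummy count() is needed to trigger the job
df.toJSON().foreachPartition(index_partition)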

How to convert a foreach loop to NHibernate Futures for performance

NHibernate Version: 3.4.0.4000
I'm currently working on optimizing our code so that we can reduce the number of round trips to the database and am looking at a for loop that is one of the culprits. I'm having a hard time figuring out how to batch all of these iterations into a future that gets executed once when sent to SQL Server. Essentially each iteration of the loop causes 2 queries to hit the database!
foreach (var choice in lineItem.LineItemChoices)
{
    choice.OptionVersion = _session.Query<OptionVersion>()
        .Where(x => x.Option.Id == choice.OptionId)
        .OrderByDescending(x => x.OptionVersionNumber)
        .FirstOrDefault();
    choice.ChoiceVersion = _session.Query<ChoiceVersion>()
        .OrderByDescending(x => x.ChoiceVersionIdentity.ChoiceVersionNumber)
        .Where(x => x.Choice.Id == choice.ChoiceId)
        .FirstOrDefault();
}
One option is to extract the OptionId and ChoiceId values from all the LineItemChoices into two lists in local memory, then issue just two queries, one for options and one for choices, passing these lists via .Where(x => optionIds.Contains(x.Option.Id)). This corresponds to the SQL IN operator. It requires some postprocessing: you will get two result lists (transform them to a dictionary or lookup if you expect many results) that you need to process to populate the choice objects. This postprocessing is local and tends to be very cheap compared to database round trips. This option can be a bit tricky if the existing FirstOrDefault part is absolutely necessary. Do you expect there to be more than one result for a single optionId? If not, the code could instead have used SingleOrDefault, which could simply be dropped when converting to IN-queries.
The other option is to use futures (https://nhibernate.info/doc/nhibernate-reference/performance.html#performance-future). For LINQ this means appending ToFuture or ToFutureValue, which also conflicts with FirstOrDefault, I believe. The important thing is that you need to loop over all line item choices to initialize ALL queries BEFORE you access the value of any of them. So this is likely to also require some postprocessing, where you would first store the future values in a list and then, in a second loop, access the real value of each query to populate the line item choice.
If you do expect that the queries can yield more than one result (before applying FirstOrDefault), I think you can just use Take(1) instead, as that will still return an IQueryable to which you can apply the future method.
The first option is probably the most efficient, since it will just be two queries and allow the database engine to make just one pass over the tables.
Keep the limit on the maximum number of parameters that can be given in an SQL query in mind. If there can be thousands of line item choices, you may need to split them in batches and query for at most 2000 identifiers per round trip.
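For example, chunking the identifiers to respect that limit could look roughly like this (a sketch only, reusing the names from the question; the same pattern applies to the choice identifiers):
var optionIds = lineItem.LineItemChoices.Select(x => x.OptionId).Distinct().ToList();
var optionVersions = new List<OptionVersion>();
const int batchSize = 2000;

for (var i = 0; i < optionIds.Count; i += batchSize)
{
    // one IN-query per batch of at most 2000 identifiers
    var batch = optionIds.Skip(i).Take(batchSize).ToList();
    optionVersions.AddRange(_session.Query<OptionVersion>()
        .Where(x => batch.Contains(x.Option.Id))
        .ToList());
}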
Adding to Oskar's answer: NHibernate Futures were introduced in NHibernate 2.1. They are available through the Future<T> method for collections and FutureValue<T> for single values.
In your case, you could first extract the IDs from the list into memory ...
var optionIds = lineItem.LineItemChoices.Select(x => x.OptionId);
var choiceIds = lineItem.LineItemChoices.Select(x => x.ChoiceId);
... and execute two queries using Future<T> to get the two lists in one hit against the database.
var optionVersions = _session.Query<OptionVersion>()
.Where(x => optionIds.Contains(x.Option.Id))
.OrderByDescending(x => x.OptionVersionNumber)
.Future<OptionVersion>();
var choiceVersions = _session.Query<ChoiceVersion>()
.Where(x => choiceIds.Contains(x.Choice.Id))
.OrderByDescending(x => x.ChoiceVersionIdentity.ChoiceVersionNumber)
.Future<ChoiceVersion>();
Then, with everything you need in memory, you can loop over the original collection and look up the data in memory to populate the choice objects.
foreach (var choice in lineItem.LineItemChoices)
{
    choice.OptionVersion = optionVersions
        .OrderByDescending(x => x.OptionVersionNumber)
        .FirstOrDefault(x => x.Option.Id == choice.OptionId);
    choice.ChoiceVersion = choiceVersions
        .OrderByDescending(x => x.ChoiceVersionIdentity.ChoiceVersionNumber)
        .FirstOrDefault(x => x.Choice.Id == choice.ChoiceId);
}
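If the result lists are large, building lookups once avoids rescanning them inside the loop (a small optional addition to the code above; because the queries are already ordered descending, the first element per key is the latest version):
var optionLookup = optionVersions.ToLookup(x => x.Option.Id);
var choiceLookup = choiceVersions.ToLookup(x => x.Choice.Id);

foreach (var choice in lineItem.LineItemChoices)
{
    // ToLookup preserves the descending order coming from the queries
    choice.OptionVersion = optionLookup[choice.OptionId].FirstOrDefault();
    choice.ChoiceVersion = choiceLookup[choice.ChoiceId].FirstOrDefault();
}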

Apache Flink Error Handling and Conditional Processing

I am new to Flink and have gone through the site, examples, and blogs to get started. I am struggling with the correct use of operators. Basically, I have two questions.
Question 1: Does Flink support declarative exception handling? I need to handle parse/validate/... errors.
Can I use org.apache.flink.runtime.operators.sort.ExceptionHandler or something similar to handle errors?
Or is a Rich/FlatMap function my best option?
If Rich/FlatMap is the only option, is there a way to get a handle to the stream inside the Rich/FlatMap function so that sink(s) could be attached for error processing?
Question 2: Can I conditionally attach different sink(s)?
Based on certain field(s) in the keyed split streams I need to select different sink(s); do I split the stream again or use a Rich/FlatMap to handle that?
I am using Flink 1.3.2. Here is the relevant portion of my job
.....
.....
DataStream<String> eventTextStream = env.addSource(messageSource);

KeyedStream<EventPojo, Tuple> eventPojoStream = eventTextStream
    // parse, transform or enrich
    .flatMap(new MyParseTransformEnrichFunction())
    .assignTimestampsAndWatermarks(new EventAscendingTimestampExtractor())
    .keyBy("eventId");

// split stream based on eventType as different reduce and windowing functions need to be applied
SplitStream<EventPojo> splitStream = eventPojoStream
    .split(new EventStreamSplitFunction());
// need to apply reduce function
DataStream<EventPojo> event1TypeStream = splitStream.select("event1Type");
// need to apply reduce function
DataStream<EventPojo> event2TypeStream = splitStream.select("event2Type");
// need to apply time based windowing function
DataStream<EventPojo> event3TypeStream = splitStream.select("event3Type");
....
....
env.execute("Event Processing");
Am I using the correct operators here?
Update 1:
Tried using the ProcessFunction as suggested by @alpinegizmo, but that didn't work, as it seems to depend on a keyed stream, which I don't have until I parse/validate the input. I get "InvalidProgramException: Field expression must be equal to '*' or '_' for non-composite types.".
It's such a common use case where you first parse/validate input and don't have a keyed stream yet, so how do you solve it?
Thanks for your patience and help.
There's one key building block that you've overlooked. Take a look at side outputs.
This mechanism provides a typesafe way to produce any number of additional output streams. This can be a clean way to report errors, among other uses. In Flink 1.3 side outputs can only be used with ProcessFunction, but 1.4 will add side outputs to ProcessWindowFunction.
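A rough Java sketch of how that can fit the job above (parse() and errorSink are placeholders, and the exact API should be checked against your Flink version):
final OutputTag<String> parseErrors = new OutputTag<String>("parse-errors") {};

SingleOutputStreamOperator<EventPojo> parsedStream = eventTextStream
    .process(new ProcessFunction<String, EventPojo>() {
        @Override
        public void processElement(String value, Context ctx, Collector<EventPojo> out) throws Exception {
            try {
                out.collect(parse(value));      // your parse/validate/enrich logic
            } catch (Exception e) {
                ctx.output(parseErrors, value); // route bad records to the side output
            }
        }
    });

// attach whatever sink you need for the error stream
parsedStream.getSideOutput(parseErrors).addSink(errorSink);

// key the successfully parsed stream only after parsing/validation
KeyedStream<EventPojo, Tuple> keyedStream = parsedStream.keyBy("eventId");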