What does it mean that a nodeset is unordered? - xslt-1.0

I often see it said that, in XSLT 1.0, a node-set is unordered but also that nodes in a node-set are processed in document order.
That sounds like a node set is ordered in document order.
If there is a difference between "unordered but processed in document order" and "ordered in document order", when must I actually worry about that difference?

A set and a sequence are very different.
In a sequence the same item may be present more than once.
By definition all items of a set are distinct -- there cannot exist a pair of items $it1 and $it2 in a set such that identical($it1, $it2) is true.
Let's have this XML document:
<a>
  <b>
    <a>
      <c>
        <a/>
      </c>
    </a>
  </b>
</a>
and this XPath expression:
//a/ancestor-or-self::a
This selects three nodes; however, if the result of the evaluation were a sequence, it would contain six nodes: the outermost a contributes one node via ancestor-or-self::a, the middle one two, and the innermost one three, and a sequence, unlike a set, keeps the duplicates.
If there is a difference between "unordered but processed in document order" and "ordered in document order", when must I actually worry about that difference?
There are at least two things to be aware of:
A node-set is not the same as a node-list. A node-list may have an order that differs from the document order of the nodes it contains -- for example, the node-list processed by xsl:apply-templates or xsl:for-each when these instructions have an xsl:sort child. Such a node-list in general has a different order than document order (see the sketch below).
Document order is not a total ordering relation. For example, the relative positions of two attribute nodes (of the same element) are implementation-defined and may vary between XPath implementations. Also, the "document order" of two nodes that belong to different documents is undefined and varies between XPath/XSLT implementations.
In XSLT 2.0 / XPath 2.0 one can get a very unexpected and confusing result when using a sequence of nodes in a place where a node-set is expected -- the nodes in the sequence are deduplicated and then processed not in their sequence order, but in document order.
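To illustrate the first point above, here is a minimal XSLT 1.0 sketch of my own that can be run against the sample document: the node-set selected by //* is unordered, but xsl:for-each with xsl:sort processes it as a node-list whose order is not document order.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- //* is an unordered node-set; xsl:sort makes xsl:for-each process it
         as a node-list sorted by element name descending, producing
         "c b a a a " rather than the document order a b a c a. -->
    <xsl:for-each select="//*">
      <xsl:sort select="name()" order="descending"/>
      <xsl:value-of select="concat(name(), ' ')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>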

How to handle a tree given in an array of pairs?

I'm struggling to find the best way of handling tree problems where the input is given as an array/list of pairs.
For example a tree is given as input in the format:
[(1,3),(1,2),(2,5),(2,4),(5,8)]
Where the first value in a pair is the parent, and the second value in a pair is the child.
I'm used to being given the root in tree problems. How would one go about storing this for problems such as "Lowest Common Ancestor"?
It depends on which problem you need to solve. For the problem of finding the lowest common ancestor of two nodes, you'll benefit most from a structure where you can find the parent of a given node in constant time. If it is already given that the nodes are numbered from 1 to n (without gaps), then an array is a good structure, such that arr[child] == parent. If the identifiers for the nodes are not that predictable, then use a hashmap/dictionary, such that map.get(child) == parent.
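A minimal sketch in Java (class and variable names are my own): build the child -> parent map from the pairs, then find the lowest common ancestor of two nodes by collecting the ancestors of one node and walking up from the other.

import java.util.*;

public class LcaFromPairs {

    // Build a child -> parent map from (parent, child) pairs.
    static Map<Integer, Integer> buildParentMap(int[][] pairs) {
        Map<Integer, Integer> parent = new HashMap<>();
        for (int[] p : pairs) {
            parent.put(p[1], p[0]); // p[0] is the parent, p[1] is the child
        }
        return parent;
    }

    // LCA using only parent pointers: collect the ancestors of a (including
    // a itself), then walk up from b until we hit one of them.
    static int lca(Map<Integer, Integer> parent, int a, int b) {
        Set<Integer> ancestorsOfA = new HashSet<>();
        for (Integer n = a; n != null; n = parent.get(n)) {
            ancestorsOfA.add(n);
        }
        for (Integer n = b; n != null; n = parent.get(n)) {
            if (ancestorsOfA.contains(n)) {
                return n;
            }
        }
        throw new IllegalArgumentException("nodes are not in the same tree");
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 3}, {1, 2}, {2, 5}, {2, 4}, {5, 8}};
        Map<Integer, Integer> parent = buildParentMap(pairs);
        System.out.println(lca(parent, 8, 4)); // prints 2
        System.out.println(lca(parent, 8, 3)); // prints 1
    }
}

Each query is O(h), where h is the height of the tree; if you need many LCA queries on a large tree, a more elaborate structure such as binary lifting pays off, but the same parent map is still the starting point.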

Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. the 3rd element of an ArrayList can be obtained by ArrayList.get(2) and, vice versa, ArrayList.indexOf(elem) gives us the index of (the first occurrence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start with 0? But I also suspect that they aren't necessarily assigned in the order the elements were created in (i.e. by their Event Time). (It also exists only for DataSets.)
This is what I currently tried:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
.join(predsLinJoIndexed).where(0).equalTo(0)...
And it seems to create wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
1. I could of course assign an index to each element upon the stream's creation and have e.g. a stream of Tuples.
2. Work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there was, it wouldn't be useful for joining multiple streams like this unless the timestamps are actually assigned as indices.)
3. We could try "collecting" the stream first, but then we wouldn't be using Flink anymore.
The 1. approach seems like the most viable one, but it also seems redundant, given that a stream should by definition be a sequential collection and as such its elements should have a sense of order (e.g. "I'm the 36th element because 35 elements already came before me").
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial, if you want to be reading your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the runtimeContext sub-task id and number of sub-tasks to calculate non-overlapping and monotonic indexes.
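A rough sketch of that idea (class and field names are my own; this assumes the DataSet-style Java API used in the question): each parallel sub-task counts its own elements and spreads the ids so that different sub-tasks never collide.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Assigns the id: localCount * numSubTasks + subTaskIndex.
// Ids are unique across sub-tasks and monotonically increasing within each
// sub-task, but they are not globally dense and do not reflect event time.
public class AssignIndex<T> extends RichMapFunction<T, Tuple2<Long, T>> {

    private long localCount;

    @Override
    public Tuple2<Long, T> map(T value) {
        int subTask = getRuntimeContext().getIndexOfThisSubtask();
        int numSubTasks = getRuntimeContext().getNumberOfParallelSubtasks();
        long id = localCount++ * numSubTasks + subTask;
        return new Tuple2<>(id, value);
    }
}

Applied with map(new AssignIndex<>()) to each data set (possibly with a .returns(...) type hint, since the function is generic), this yields Tuple2<Long, Double> records that can be joined on field 0, just like the zipWithIndex variant in the question.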

In Neo4j can lucene process data to modify/aggregate nodes before indexing it?

So let's say I am modelling a paragraph with chunks of text as children, but I want the indexer to operate on the whole paragraph text. Rather than duplicate the text onto the paragraph, or change the model, is there a method to get the indexer to reconstitute the paragraph (by simply joining all children) before it indexes it? I.e. can it do some processing before it indexes it?
If you're using manual legacy indexing (which is also what you need in order to use Lucene's full-text search), you basically pass the value and the node you want it to point to. The value doesn't even need to be a property on the node.
http://docs.neo4j.org/chunked/milestone/rest-api-indexes.html#rest-api-add-node-to-index
In this case, you'd have to do that processing on your side, but it's doable.
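With the embedded Java API (Neo4j 2.x era, matching the linked docs) the same idea looks roughly like this -- a sketch on my part; the HAS_CHUNK relationship type, the "paragraphs" index name and the "text" property/key are all assumptions about your model:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;

public class ParagraphIndexer {

    // Join the text of all chunk children and index the concatenation
    // against the paragraph node itself, without storing it as a property.
    public void indexParagraph(GraphDatabaseService db, Node paragraph) {
        try (Transaction tx = db.beginTx()) {
            StringBuilder fullText = new StringBuilder();
            for (Relationship rel : paragraph.getRelationships(
                    Direction.OUTGOING, DynamicRelationshipType.withName("HAS_CHUNK"))) {
                fullText.append(rel.getEndNode().getProperty("text")).append(' ');
            }

            Index<Node> index = db.index().forNodes("paragraphs",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "type", "fulltext"));
            index.add(paragraph, "text", fullText.toString());
            tx.success();
        }
    }
}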

Efficient management of hierarchyid values in MS SQL Server

With the hierarchyid datatype in SQL Server 2008 and onward, would there be any benefit to trying to optimize the issuing of the next child of /1/1/8/ [ /1/1/8/x/ ] such that x is the closest non-negative whole number to 1 possible?
An easy solution seems to be to find the maximum assigned child value and take the sibling to its right, but it seems like you'd eventually exhaust the values (in theory if not in practice), since you're never reclaiming any of them, and to my understanding negatives and non-whole numbers consume more space.
EXAMPLE: If I've got a parent /1/1/8/ who has these children (and order of the children doesn't matter and reassignment of the values is ok):
/1/1/8/-400/
/1/1/8/1/
/1/1/8/4/
/1/1/8/40/
/1/1/8/18/
/1/1/8/9999999999/
wouldn't I want the next child to be /1/1/8/2/ ?
Here's the thing.
What you are saying will be "optimal" is not necessarily optimal.
When I am inserting values into a hierarchy, I generally do not care what the order is for the child nodes of a particular node.
If I do care, that is what the two parameters of GetDescendant are for.
If I want to prepend the node (i.e. make it first among the children), I use a first parameter of NULL and a second parameter that is the lowest value of the other children.
If I want to append the node into the order (i.e. make it last), I use a first parameter of the maximum value of the other children and a second parameter of NULL.
If I want to insert between two other child nodes, I need both the one that will be before and the one that will be after the node I am inserting.
In any case, generally the values in the hierarchy field don't really matter, because you will order by a different field like Name or something.
Ergo, the most "efficient" method of adding things into a hierarchy is to either prepend or append, since finding the MIN or MAX hierarchy value is easy, and doing what you are describing requires several queries to find the first "hole" in the tree.
In other words, don't put a lot of meaning into the string representation of a hierarchy value unless you are building an application in which you use the hierarchy value to sort by.
Even in that case, you probably don't want to fill in hierarchy values as you describe, and probably want to append to the end anyway.
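To make the append case concrete, here is a sketch in T-SQL (the dbo.Tree table and its Node/Name columns are made up for illustration):

-- Append a new child under /1/1/8/: pass the current last child as the first
-- argument and NULL as the second, so the new value sorts after all siblings.
-- If there are no children yet, @lastChild stays NULL and
-- GetDescendant(NULL, NULL) simply returns the first child value.
DECLARE @parent hierarchyid = hierarchyid::Parse('/1/1/8/');

DECLARE @lastChild hierarchyid;
SELECT @lastChild = MAX(Node)
FROM dbo.Tree
WHERE Node.GetAncestor(1) = @parent;   -- direct children of /1/1/8/

INSERT INTO dbo.Tree (Node, Name)
VALUES (@parent.GetDescendant(@lastChild, NULL), N'new child');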
Hope this helped.

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, but 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher term frequency; B's score is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know ElasticSearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets a higher score above?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, and that is due to the norms for the tags field, which are taken into account when computing the score with the default tf/idf similarity.
In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is across the index, in order to compare it with other terms in the query (in your case it doesn't make any difference, since you are searching for a single term).
But there's another factor, called norms, that rewards shorter fields and takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the other one, which contains that tag multiple times but a lot of other tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. Norms are enabled by default if the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (that often makes sense, but it depends on your data and domain), or add the "omit_norms": true option to the mapping for your tags field.
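For example, a mapping along these lines keeps the tags field analyzed but drops norms (the blogpost type name is just a placeholder, and this uses the pre-5.x string-type mapping syntax that matches the options above):

{
  "blogpost": {
    "properties": {
      "tags": {
        "type": "string",
        "index": "analyzed",
        "omit_norms": true
      }
    }
  }
}

Note that norms are written at index time, so after changing the mapping you generally need to reindex the existing documents for the change to take effect.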
Are the documents found on different shards? From Elastic search documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.