get preceding-sibling relative to a node set - xslt-1.0

I query and sort a bunch of XML elements alphabetically, apply a template to them, and produce an alphabetical list. I'd like to prefix each run of nodes sharing the same initial with that initial:
A
Abe
Amel
Andrew
B
Bobby
Benny
...
The preceding-sibling axis is relative to the document, not to the node-set. What can I do?

What you've described here is a grouping problem. The standard way to handle grouping in XSLT 1.0 is the "Muenchian" method. There's a very detailed explanation of it here:
http://www.jenitennison.com/xslt/grouping/muenchian.html
The basic idea is that you create a key which specifies what you want to group by. In this case, you'd create a key matching the person nodes, using the first letter of their names.
Then you write a loop over the people which checks whether each one is the first to match the given key value (first letter). If it is, you put in one of your grouping dividers.
Then you have a nested loop (or apply-templates) which only picks up the nodes with that first letter. You can sort them using xsl:sort and output them.
If you can post a sample of your actual XML (rather than just your desired output), I can adapt the example to it.
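In the meantime, here's a rough sketch of such a stylesheet, assuming input shaped like <people><person><name>Abe</name></person>...</people> (the element names are my guess until you post your XML):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Index person elements by the first letter of their name -->
  <xsl:key name="by-initial" match="person" use="substring(name, 1, 1)"/>
  <xsl:template match="/people">
    <!-- Visit only the first person for each distinct initial -->
    <xsl:for-each select="person[generate-id() = generate-id(key('by-initial', substring(name, 1, 1))[1])]">
      <xsl:sort select="name"/>
      <!-- The grouping divider: the initial itself -->
      <xsl:value-of select="concat(substring(name, 1, 1), '&#10;')"/>
      <!-- Then every name sharing that initial, sorted -->
      <xsl:for-each select="key('by-initial', substring(name, 1, 1))">
        <xsl:sort select="name"/>
        <xsl:value-of select="concat(name, '&#10;')"/>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>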

Related

How do I perform a query in Postgres using a URL slug?

Let's say I have a URL as follows:
www.somewebsite.com/dining/caseys+grille
I have a business_listings table in Postgres that contains a column business_name. I have a record in the table with 'Casey's Grille'
How can I query 'caseys+grill' against 'Casey's Grille'?
Would I need to use full text search? How would I go about doing this?
Since you are not searching for regular words but for proper names, and you probably also want to find results with similar spellings, you should use a trigram GIN index and similarity search.
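For illustration, a minimal sketch of that, assuming the pg_trgm extension is available (table and column names taken from your question):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX business_name_trgm_idx ON business_listings USING gin (business_name gin_trgm_ops);
-- turn the slug back into words, then rank candidates by trigram similarity
SELECT business_name, similarity(business_name, replace('caseys+grille', '+', ' ')) AS score
FROM business_listings
WHERE business_name % replace('caseys+grille', '+', ' ')
ORDER BY score DESC;
The % operator matches rows whose similarity exceeds the pg_trgm threshold, and the GIN index keeps the lookup fast.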
This problem looks simple at first, but it is a can of worms.
The solution should consider all the use cases:
Is it only a matter of removing/rewriting special characters?
Do you need to consider typos (is casey grill the same)?
Do you need to consider distinctive marks (is Casey's Grill #2 the same)?
Do you need to consider abbreviations (is NY Grill the same as New-York Grill)?
Do you need to consider numbers (is 1st av. Grill the same as first avenue grill)?
If it is your database + website, the simplest approach is to record and compare the URL slug directly.
Otherwise, or if you don't control the URL (e.g., if it comes from a search box), you may want to store and compare a parsed name. Using both the DB title and the URL slug, you transform each name into common elements: expand common abbreviations to their full text, remove all special characters, remove/add spaces, strip accents if your language has them, standardize the casing, etc. Only you can find and apply the suitable transformations.
Then you can compare the two parsed names using any suitable comparison method (trigram, plain equality, LIKE queries, etc.).
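For illustration, one possible normalizer along those lines (the exact transformations are yours to choose; unaccent is an optional extension, and parse_name is just a name I made up):
CREATE EXTENSION IF NOT EXISTS unaccent;
-- lowercase, strip accents, drop special characters, collapse spaces
CREATE FUNCTION parse_name(raw text) RETURNS text AS $$
  SELECT trim(regexp_replace(regexp_replace(lower(unaccent(raw)), '[^a-z0-9 ]', '', 'g'), ' +', ' ', 'g'))
$$ LANGUAGE sql;
-- both sides normalize to 'caseys grille':
SELECT parse_name('Casey''s Grille');
SELECT parse_name(replace('caseys+grille', '+', ' '));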
I assume you actually want a single slug of the text value in business_name and you want this to be a unique identifier for this particular business.
You can create an additional column business_name_slug and create a unique index on this column.
Then you can create a before insert or update trigger that writes the slug created from business_name into this column.
The tricky part is to create logic that
generates a URL-friendly version of the business name (there are examples in blog posts, GitHub gists, etc.)
avoids naming collisions, so your unique constraint will not raise an error when inserting/updating
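A minimal sketch of the column-and-trigger part, where slugify() is a hypothetical stand-in for whatever slug logic you settle on (the collision handling from the second point is not shown):
ALTER TABLE business_listings ADD COLUMN business_name_slug text;
CREATE UNIQUE INDEX business_name_slug_idx ON business_listings (business_name_slug);
CREATE FUNCTION set_business_name_slug() RETURNS trigger AS $$
BEGIN
  -- slugify() is a placeholder for your URL-friendly transformation
  NEW.business_name_slug := slugify(NEW.business_name);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER business_name_slug_trg
BEFORE INSERT OR UPDATE ON business_listings
FOR EACH ROW EXECUTE PROCEDURE set_business_name_slug();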

OpenRefine: cell.cross for similar but not identical values

I have two datasets:
one dataset has names of countries, but dirty ones, like
Gaule Cisalpine (province romaine)
Gaule belgique
Gaule , Histoire
Gaule
etc.
the second dataset has two columns with the names of countries (clean) and a code like
Gaule | 1DDF
Is there a way to use cell.cross with value.contains()? I tried using reconcile-csv, but it didn't work properly (it matches only exact values).
I've not been able to think of a great way of doing this, but given that the substring you want to match between the two files always comes first in the 'messy' string, and assuming you want to do this in OpenRefine, one approach that might work is to create a 'match' column in each project for the cross-matching.
In the 'clean' project use 'Add column based on this column' on the 'Country name' column, and in the GREL transform use:
value.fingerprint()
The 'fingerprint' transformation is the same as the one used when clustering with key collision/fingerprint; basically, I'm just using it here to get rid of any minor differences between country names (like upper/lower case or special characters).
Then, in the 'messy' project, create a new column based on the dirty 'name of country' column, again using 'Add column based on this column', but in this case use a GREL transform something like:
value.split(/[\s,-\.\(\)]/)[0].fingerprint()
The first part of this, value.split(/[\s,-\.\(\)]/), splits the string into individual words (using space, comma, full stop, or open/close bracket as a separator). Then the [0] takes the first element (so the first word in the cell), and fingerprint() is applied again.
Now you have columns in each of the projects which should match on the exact cell content. You can use this to do the look up between the two projects.
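For example, running 'Add column based on this column' on the 'match' column in the 'messy' project, something like this would pull the code across (the project name 'clean' and the column names 'match' and 'code' here are just placeholders for yours):
cell.cross("clean", "match").cells["code"].value[0]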
This isn't going to be completely ideal - for example, it isn't going to work for country names which consist of multiple words. However, you could add some additional key columns to the 'messy' project which use the first 2, 3, 4 words etc. rather than just the first one, as given here.
e.g.
filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,2).join(" ").fingerprint()
filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,3).join(" ").fingerprint()
etc. (I've done a bit more work here to make sure blank entries are ignored - it's the get() command that's the key bit for getting the different numbers of words).
I'm guessing that most country names are going to be only a few words long, so it would only be a few columns needed.
I've not been able to come up with a better approach so far; I'll post more here if I come up with anything else. You could also try asking on the OpenRefine forum: https://groups.google.com/forum/#!forum/openrefine

Any way to use strings as the scores in a Redis sorted set (zset)?

Or maybe the question should be: What's the best way to represent a string as a number, such that sorting their numeric representations would give the same result as if sorted as strings? I devised a way that could sort up to 9 characters per string, but it seems like there should be a much better way.
Up front: I don't think Redis's lexicographical commands will work here. (See the following example.)
Example: Suppose I want to presort all of the names linked to some ID so that I can use ZINTERSTORE to quickly get an ordered list of IDs based on their names (without using Redis's SORT command). Ideally I would have the IDs as the zset's members, and the numeric representation of each name would be the zset's scores.
Does that make sense? Or am I going about it wrong?
You're trying to use an order-preserving hash function to generate a score for each id. While it appears you've written one, you've already found out that the score's range allows you to use only the first 9 characters (it would be interesting to see your function, btw).
Instead of this approach, here's a simpler one that IMO would be easier: use set members of the form <name>:<id> and set every score to 0. You'll get lexicographical ordering this way, and you can use something like split(':') to get the id back out of the set's members.
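A quick sketch with redis-cli (the key name and ids are invented for illustration):
ZADD names 0 "abe:42" 0 "amel:7" 0 "andrew:19"
ZRANGEBYLEX names - +
1) "abe:42"
2) "amel:7"
3) "andrew:19"
Since every member has the same score, Redis falls back to ordering the members lexicographically (ZRANGEBYLEX needs Redis 2.8.9 or later), and split(':') on each member recovers the id.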

pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in DB terminology), and for a given value of this identifier we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row: they are placeholders for the Supervisor pair that's missing when the unique identifier is (2, 1).
Towards this end, I started putting together this pig script:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
sorted = ORDER data BY key, value;
GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null placeholders for the missing Supervisor are not present in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of the redundant columns (the first two, which are replicated multiple times - once per key-value pair).
Short of using a UDF, is there a way to accomplish this in Pig using the built-in functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
First of all, let me point out that if most of the columns are not filled out for most rows, a better solution IMO would be to use a map. The builtin TOMAP function, combined with a custom UDF to combine maps, would enable you to do this; a rough sketch follows.
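Here, COMBINE_MAPS is the hypothetical custom UDF that merges a bag of single-entry maps into one:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
-- wrap each key/value pair in a one-entry map
pairs = FOREACH data GENERATE i1, i2, TOMAP(key, value) AS kv;
grouped = GROUP pairs BY (i1, i2);
-- merge the maps per group
result = FOREACH grouped GENERATE FLATTEN(group) AS (i1, i2), COMBINE_MAPS(pairs.kv) AS kv;
dump result;
A lookup like kv#'Supervisor' then simply yields null for (2, 1) instead of shifting the remaining columns.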
I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values, and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles and really ugly code, and I suspect it is no better than organizing your data in some other way.
You could also write a UDF that takes in a bag of key/value pairs and another bag of all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.

SEO and magic numbers in URL

Which URL is more relevant, 1 or 2?
1: http://site.com/language/country/city/category/title
2: http://site.com/language/country/city/category/articleId(number)/title
The thing is, for (1) I have to design my DB in an inefficient way (doing textual search and table joins), but for (2) I'm not sure how much relevance in search results I lose by putting a direct table ID in the URL.
The first would be the most relevant, as it doesn't contain any irrelevant data such as the articleId.
If you are concerned about keeping titles unique, add a second database column, called filename for example, which holds a URL-encoded version of the title. If the title is already in use, append an incremented value at the end.
For example, if the title 'SEO' was already in use by another article, loop until you find a free name and call it SEO-1, SEO-2, etc.
That way you only append irrelevant values when two titles clash.
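A sketch of that loop as a database function, assuming a Postgres table articles(filename text UNIQUE); url_encode_title() is a hypothetical stand-in for the encoding step:
CREATE FUNCTION unique_filename(title text) RETURNS text AS $$
DECLARE
  base text := url_encode_title(title);  -- hypothetical encoder
  candidate text := base;
  n int := 0;
BEGIN
  -- keep counting up until the name is free
  WHILE EXISTS (SELECT 1 FROM articles WHERE filename = candidate) LOOP
    n := n + 1;
    candidate := base || '-' || n;  -- SEO, SEO-1, SEO-2, ...
  END LOOP;
  RETURN candidate;
END;
$$ LANGUAGE plpgsql;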