JOIN datasets via GROUP ALL - apache-pig

This question might sound very weird, but it is always on my mind. Let's assume that I have a one-column dataset. How could I add a second column containing a static string to every line? So if the first line of the single-column dataset says "hello", the two-column equivalent should say "hello", "world".
In case you are wondering why I want to do that, it is because later in my script I need to join the single-column dataset with another one, and the former has no point of reference. This is what I have done so far:
fnl2 = FOREACH fnl1 GENERATE
var1,
(var1 == var1 ? 'World' : 'World') AS var2;
In case this can be done by GROUP ALL or something similar, please feel free to provide your hints.

You're on the right track, but the bincond is unnecessary. You can just do
fnl2 = FOREACH fnl1 GENERATE var1, 'World' AS var2;
But even this is not necessary if you are doing this so you can perform a JOIN later. JOIN takes expressions as well as fields, so you can just do
joined = JOIN fnl1 BY 1, other BY 1;
But even THIS is unnecessary, because you are just performing a cross-product, and Pig is one step ahead of you:
crossed = CROSS fnl1, other;
The last statement is what I think you are looking for, but hopefully the others illustrate some helpful points for you.
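To see why the constant-key JOIN and CROSS are interchangeable, here is a small sketch in Python rather than Pig. The row contents and the helper join_by are made up for illustration; only the relation names come from the snippets above.

```python
from itertools import product

# Hypothetical stand-ins for the Pig relations fnl1 and other.
fnl1 = [("hello",), ("goodbye",)]
other = [("x", 1), ("y", 2)]

def join_by(left, right, left_key, right_key):
    """Naive equi-join: pair every left row with every right row whose key matches."""
    return [l + r for l in left for r in right if left_key(l) == right_key(r)]

# JOIN fnl1 BY 1, other BY 1: both sides use the constant 1 as the key,
# so every pair of rows matches.
joined = join_by(fnl1, other, lambda row: 1, lambda row: 1)

# CROSS fnl1, other: the plain cross product.
crossed = [l + r for l, r in product(fnl1, other)]

print(joined == crossed)
```

Because the key is identical for every row, the join condition never filters anything out, which is why the two results contain the same rows.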

Related

SQL - Similar Update Queries Produce Varying Results

I am super new to SQL and have two queries I think should produce the same output but they don't. Can someone figure out the difference between them?
The input table for this simple example has two columns, letter and extra. The data in the first column is a random letter from the list ['a', 'b', 'c', 'd', 'e'] and extra should not matter (I think?). These are the queries:
update
tbl
set
extra = letter;
and:
update
tbl
set
extra = (select
letter
from tbl);
The resulting tables these produce are:
e|e
e|e
c|c
e|e
b|b
...
and:
e|e
e|e
c|e
e|e
b|e
...
respectively.
I expected the first output for both queries. Why does the second one turn out as it does?
EDIT:
The reason I ask this question is because what I want to do is a bit more involved than this simple example and I believe I need the subquery. I am trying to add a kind of normalisation column, like this:
update
tbl
set
extra = 1 / (select
norm
from
tbl
INNER JOIN
(SELECT
letter, count(*) as norm
FROM
tbl
GROUP BY letter) as tmp
ON
tbl.letter = tmp.letter);
Alas, this obviously doesn't work because of the above.
What your first query is saying:
Set the value of extra to the value of letter in the same row.
What the second query is saying:
Pick a value from the column "letter" in the table, and update every row in the table to have the column 'extra' contain that value.
They are different instructions, so you get different results.
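The subquery in the second statement never mentions the row being updated, so it yields the same value for every row. A correlated subquery, one that refers back to the outer row, is one way to get the per-letter normalisation described in the EDIT. Below is a minimal sketch using Python's sqlite3 module with made-up rows; the alias t2 and the choice of SQLite are assumptions, and other engines may need slightly different syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (letter TEXT, extra REAL)")
conn.executemany(
    "INSERT INTO tbl (letter) VALUES (?)",
    [("e",), ("e",), ("c",), ("e",), ("b",)],
)

# Correlated subquery: the inner query compares against tbl.letter, the
# letter of the row currently being updated, so it is evaluated per row.
# 1.0 forces floating-point division; SQLite would otherwise divide integers.
conn.execute(
    """
    UPDATE tbl
    SET extra = 1.0 / (SELECT COUNT(*) FROM tbl AS t2 WHERE t2.letter = tbl.letter)
    """
)

rows = conn.execute("SELECT letter, extra FROM tbl ORDER BY rowid").fetchall()
for row in rows:
    print(row)
```

With this data, each "e" row gets one third and the "c" and "b" rows get 1.0, because the count is taken over rows sharing the current row's letter.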

What is LINQ operator to perform division operation on Tables?

To select elements belonging to a particular group in a table, when the elements and their group types are contained in one table and all group types are listed in another, we perform division on the tables.
I am trying to write a LINQ query that performs the same operation. How can I do it?
Apparently from the definition of that blog post you'd want to intersect and except.
Table1.Except(Table1.Intersect(Table2));
or rather in your case I'd guess
Table1.Where(d => !Table2.Any(t => t.Type == d.Type));
Not so hard.
I don't think performance can be made much better, actually. Maybe with a groupby.
Table1.GroupBy(t => t.Type).Where(g => !Table2.Any(t => t.Type == g.Key)).SelectMany(g => g);
This should be better for performance: it searches the second table once for each distinct type, not once for every row in Table1.
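The grouped version can be sketched in Python as follows. The rows are hypothetical, and the sketch swaps the linear Table2.Any scan for a set lookup, which is a different technique from the LINQ above but follows the same idea of checking each type only once.

```python
from collections import defaultdict

# Hypothetical rows: (name, type) in table1, (type,) in table2.
table1 = [("a", "red"), ("b", "blue"), ("c", "red"), ("d", "green")]
table2 = [("red",), ("green",)]

# Group table1 rows by type, as GroupBy does in the LINQ query.
groups = defaultdict(list)
for row in table1:
    groups[row[1]].append(row)

# Collect table2's types once so each group needs a single membership test.
known_types = {t[0] for t in table2}

# Keep only the groups whose type is absent from table2, then flatten
# them back into rows (the SelectMany step).
result = [
    row
    for type_, rows in groups.items()
    if type_ not in known_types
    for row in rows
]
print(result)  # [('b', 'blue')]
```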
It's a bit difficult to determine exactly what you're asking. But, it sounds like you are looking to determine the elements that are common in two tables or streams. If so, I think you want Intersect.
Take a look here
It works something like this:
int[] array1 = { 1, 2, 3 };
int[] array2 = { 2, 3, 4 };
var intersect = array1.Intersect(array2);
Returns 2 and 3.
The opposite of this would be Except().

Filtering simultaneously on count of related objects and on count of related objects that satisfy a condition in Django

So I have models amounting to this (very simplified, obviously):
class Mystery(models.Model):
    name = models.CharField(max_length=100)
class Character(models.Model):
    mystery = models.ForeignKey(Mystery, related_name="characters")
    required = models.BooleanField(default=True)
Basically, in each mystery there are a number of characters, which can be essential to the story or not. The minimum number of actors that can stage a mystery is the number of required characters for that mystery; the maximum number is the number of characters total for the mystery.
Now I'm trying to query for mysteries that can be played by some given number of actors. It seemed straightforward enough given how Django's filtering and annotation features work; after all, both of these queries work fine:
# Returns mystery objects with at least x characters in all
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(max_actors__gte=x)
# Returns mystery objects with no more than x required characters
Mystery.objects.filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x)
However, when I try to combine the two...
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x, max_actors__gte=x)
...it doesn't work. Both min_actors and max_actors come out containing the maximum number of actors. The relevant parts of the actual query being run look like this:
SELECT `mysteries_mystery`.`id`,
`mysteries_mystery`.`name`,
COUNT(DISTINCT `mysteries_character`.`id`) AS `max_actors`,
COUNT(DISTINCT `mysteries_character`.`id`) AS `min_actors`
FROM `mysteries_mystery`
LEFT OUTER JOIN `mysteries_character` ON (`mysteries_mystery`.`id` = `mysteries_character`.`mystery_id`)
INNER JOIN `mysteries_character` T5 ON (`mysteries_mystery`.`id` = T5.`mystery_id`)
WHERE T5.`required` = True
GROUP BY `mysteries_mystery`.`id`, `mysteries_mystery`.`name`
...which makes it clear that while Django is creating a second join on the character table just fine (the second copy of the table being aliased to T5), that table isn't actually being used anywhere and both of the counts are being selected from the non-aliased version, which obviously yields the same result both times.
Even when I try to use an extra clause to select from T5, I get told there is no such table as T5, even as examining the output query shows that it's still aliasing the second character table to T5. Another attempt to do this with extra clauses went like this:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).extra(select={'min_actors': "SELECT COUNT(*) FROM mysteries_character WHERE required = True AND mystery_id = mysteries_mystery.id"}).extra(where=["`min_actors` <= %s", "`max_actors` >= %s"], params=[x, x])
But that didn't work because I can't use a calculated field in the WHERE clause, at least on MySQL. If only I could use HAVING, but alas, Django's .extra() does not and will never allow you to set HAVING parameters.
Is there any way to get Django's ORM to do what I want?
How about combining your Count()s:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True),min_actors=Count('characters', distinct=True)).filter(characters__required=True).filter(min_actors__lte=x, max_actors__gte=x)
This seems to work for me but I didn't test it with your exact models.
It's been a couple of weeks with no suggested solutions, so here's how I ended up going about it, for anyone else who might be looking for an answer:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(max_actors__gte=x, id__in=Mystery.objects.filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x).values('id'))
In other words, filter on the first count and on IDs that match those in an explicit subquery that filters on the second count. Kind of clunky, but it works well enough for my purposes.
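For comparison, the SQL that would answer the question in a single pass is a conditional aggregate with a HAVING clause. The sketch below runs it through Python's sqlite3 module rather than the Django ORM; the table names mirror the query shown earlier, while the sample rows and the helper playable are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample data: Manor has 2 required of 4 characters, Train 3 of 3,
# Island 1 of 2.
conn.executescript(
    """
    CREATE TABLE mysteries_mystery (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE mysteries_character (
        id INTEGER PRIMARY KEY,
        mystery_id INTEGER REFERENCES mysteries_mystery(id),
        required INTEGER
    );
    INSERT INTO mysteries_mystery VALUES (1, 'Manor'), (2, 'Train'), (3, 'Island');
    INSERT INTO mysteries_character (mystery_id, required) VALUES
        (1, 1), (1, 1), (1, 0), (1, 0),
        (2, 1), (2, 1), (2, 1),
        (3, 1), (3, 0);
    """
)

def playable(x):
    """Return (name, min_actors, max_actors) for mysteries x actors can stage."""
    return conn.execute(
        """
        SELECT m.name,
               SUM(CASE WHEN c.required THEN 1 ELSE 0 END) AS min_actors,
               COUNT(c.id) AS max_actors
        FROM mysteries_mystery m
        LEFT JOIN mysteries_character c ON c.mystery_id = m.id
        GROUP BY m.id, m.name
        HAVING min_actors <= ? AND max_actors >= ?
        ORDER BY m.id
        """,
        (x, x),
    ).fetchall()

print(playable(3))  # [('Manor', 2, 4), ('Train', 3, 3)]
print(playable(1))  # [('Island', 1, 2)]
```

Both counts come from a single join, so there is no second alias to go unused. Django releases from 2.0 onward, which postdate this thread, accept a filter argument on aggregates, for example Count('characters', filter=Q(characters__required=True)), which should let the ORM express the same conditional count.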

VFP sql prepass

I have no idea what the official name for it is, so maybe that's why I can't find anything online.
Basically, when you use SQL in VFP it does an initial pass through (sometimes 2?) without moving the record cursor or saving the results.
Unfortunately I have subroutines in my SQL that run and change things during that initial pass.
Why am I using subroutines in SQL queries? Because VFP doesn't support referencing outside a subquery within the select items (once again I don't know the official name).
Example: select id, (select detail.name from detail where master.id == detail.id) name from master
This does work though: select id, getname(id) from master
where getname() is a subroutine containing the SQL from the first example.
You could also use a join, but the above is just an example and a join does not work in my case.
Is there any way to deal with initial pass throughs? Does VFP create a boolean like firstpass or something? I suppose I could add a count to my subroutine, but that seems messier than it already is.
Alternatively, can someone explain or link me to an explanation of VFP's initial pass? I believe it was only doing one initial pass before, but now it's doing two after changing some code.
Edit: OK, I was wrong. The above example does work. What doesn't work is the following:
SELECT d2.id, (SELECT TOP 1 d1.lname l FROM dpadd d1 WHERE d1.id== d2.id ORDER BY l) FROM dpadd d2
It gives me a "SQL: Queries of this type are not supported" error.
Strangely it works if I do the following:
SELECT d2.id, (SELECT COUNT(d1.lname) FROM dpadd d1 WHERE d1.id == d2.id) FROM dpadd d2
About the subroutines, they are methods of my form. The databases are local .dbf files. I'm not interacting with any servers, just running straight sql commands with into cursor clauses and then generating reports (usually).
I'll post back in a few minutes with an actually useful select statement that "is not supported". I'm sure you've noticed the top 1 example is completely useless.
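It appears that TOP is not permitted in projections. For that example, you can use an aggregate instead. ORDER BY sorts ascending by default, so TOP 1 returns the smallest value, which MIN gives you directly:
SELECT d2.id, (SELECT MIN(d1.lname) l FROM dpadd d1 WHERE d1.id== d2.id) FROM dpadd d2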
What else is giving you a problem?
Tamar
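The relationship between "first row of a sorted subquery" and an aggregate can be checked outside VFP. The sketch below uses Python's sqlite3 module, with LIMIT 1 standing in for TOP 1; the rows are made up, and SQLite's behaviour is only an analogue of VFP's, not a statement about the VFP engine. An ascending sort followed by taking one row yields the smallest value per id, which is what MIN computes (MAX would correspond to a descending sort).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dpadd (id INTEGER, lname TEXT)")
conn.executemany(
    "INSERT INTO dpadd VALUES (?, ?)",
    [(1, "Smith"), (1, "Adams"), (2, "Young"), (2, "Baker"), (2, "Moore")],
)

# First row of an ascending sort, written with LIMIT 1
# (SQLite's counterpart to TOP 1).
top1 = conn.execute(
    """
    SELECT DISTINCT d2.id,
           (SELECT d1.lname FROM dpadd d1
            WHERE d1.id = d2.id ORDER BY d1.lname LIMIT 1)
    FROM dpadd d2
    ORDER BY d2.id
    """
).fetchall()

# The same value from an aggregate, with no ORDER BY or row limit needed.
agg = conn.execute(
    """
    SELECT DISTINCT d2.id,
           (SELECT MIN(d1.lname) FROM dpadd d1 WHERE d1.id = d2.id)
    FROM dpadd d2
    ORDER BY d2.id
    """
).fetchall()

print(top1)  # [(1, 'Adams'), (2, 'Baker')]
print(agg)   # [(1, 'Adams'), (2, 'Baker')]
```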

LINQ exclusion

Is there a direct LINQ syntax for finding the members of set A that are absent from set B? In SQL I would write this
SELECT A.* FROM A LEFT JOIN B ON A.ID = B.ID WHERE B.ID IS NULL
See the MSDN documentation on the Except operator.
var results = from itemA in A
where !B.Any(itemB => itemB.Id == itemA.Id)
select itemA;
I believe your LINQ would be something like the following.
var items = A.Except(
from itemA in A
from itemB in B
where itemA.ID == itemB.ID
select itemA);
Update
As indicated by Maslow in the comments, this may well not be the most performant query. As with any code, it is important to carry out some level of profiling to remove bottlenecks and inefficient algorithms. In this case, chaowman's answer provides a better performing result.
The reasons can be seen with a little examination of the queries. In the example I provided, there are at least two loops over the A collection, one to combine the A and B lists and the other to perform the Except operation, whereas in chaowman's answer (reproduced below), the A collection is only iterated once.
// chaowman's solution only iterates A once and partially iterates B
var results = from itemA in A
where !B.Any(itemB => itemB.Id == itemA.Id)
select itemA;
Also, in my answer, the B collection is iterated in its entirety for every item in A, whereas in chaowman's answer, it is only iterated up to the point at which a match is found.
As you can see, even before looking at the SQL generated, you can spot potential performance issues just from the query itself. Thanks again to Maslow for highlighting this.
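The iteration behaviour described above can be sketched in Python for in-memory collections. The Item class and sample data are hypothetical; any() plays the role of LINQ's Any and also stops at the first match. The set-based variant is an additional alternative that is not taken from the answers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    id: int
    name: str

# Hypothetical collections standing in for A and B.
A = [Item(1, "a"), Item(2, "b"), Item(3, "c"), Item(4, "d")]
B = [Item(2, "x"), Item(4, "y")]

# Direct translation of the Any-based query: A is walked once, and any()
# stops scanning B as soon as a matching id is found.
only_in_a = [a for a in A if not any(b.id == a.id for b in B)]

# Alternative: collect B's ids once, then test membership per item of A.
b_ids = {b.id for b in B}
only_in_a_by_set = [a for a in A if a.id not in b_ids]

print([a.name for a in only_in_a])  # ['a', 'c']
print(only_in_a == only_in_a_by_set)
```

For large in-memory collections the set-based form avoids rescanning B for every item of A; whether that matters for a database-backed LINQ provider depends on the SQL it generates.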