Comparing Elements across collections - sql

I have the following models:
class Collection(models.Model):
...
class Record(models.Model):
collection = models.ForeignKey(Collection, related_name='records')
filename = models.CharField(max_length=256)
checksum = models.CharField(max_length=64)
class Meta:
unique_together = (('filename', 'collection'),)
I want to perform the following query:
For each filename of Record I want to know the Collections that:
Do not provide a Record with that filename
or that provide such a Record but with a differing checksum
I have in mind an output like this:
| C1 C2 C3 <- collections
-----------+------------
file-1.txt | x
file-2.txt | x
file-3.txt | ! ! !
file-4.txt | x ! !
file-5.txt | ! ! x
x = missing
! = different checksum
What I've come up with so far is to create a query for each Collection that lists the filenames present in the other collections but missing from this one.
for collection in collections:
other_collections = [c for c in collections if c is not collection]
results[collection] = qs.filter(collection__in=other_collections).exclude(
filename__in=qs.filter(
collection=collection
).values_list('filename', flat=True)
).order_by('filename').values_list('filename', flat=True)
This somewhat solves the first part of my question, but is rather quirky and requires post-processing to get to the format I desire. And, more importantly, it does not address the checksum comparison.
Is it possible to perform the two queries in one combined step to get the results in the format I described above?
The solution would not necessarily have to use the QuerySet APIs, a fallback to raw SQL is fine by me too.

It is not possible to write a SQL query that returns a variable number of columns, although you can achieve that effect if you wrap everything in an array or JSON object.
If you know the collections, you could write SQL like this:
SELECT r.filename,
(SELECT r.checksum = r2.checksum FROM records r2 WHERE r.filename = r2.filename AND r2.collection_id = 1) AS c1,
(SELECT r.checksum = r2.checksum FROM records r2 WHERE r.filename = r2.filename AND r2.collection_id = 2) AS c2,
...
FROM records r
WHERE r.collection_id = 1
GROUP BY r.filename, r.checksum
For each filename/collection pair, you will get NULL if the collection doesn't have the record, true if the collection has it with the right checksum, or false if the collection has it with a different checksum.
I include WHERE r.collection_id = 1 because otherwise for the checksum comparison, you have to answer "different from what?"
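A minimal Python sketch of the query above, using the standard-library sqlite3 module instead of Django so that it runs standalone. The `records` table layout, the sample rows, and the helper name `compare_collections` are assumptions made for illustration. The technique is the one described above: one correlated subquery per known collection, compared against a reference collection.

```python
import sqlite3


def compare_collections(conn, reference_id, collection_ids):
    """Return {filename: {collection_id: mark}} relative to a reference collection.

    Marks follow the question's legend: 'x' = missing, '!' = different
    checksum, '' = present with the same checksum.
    """
    # One correlated subquery per collection; ids are bound as parameters.
    subqueries = ", ".join(
        "(SELECT r.checksum = r2.checksum FROM records r2 "
        "WHERE r2.filename = r.filename AND r2.collection_id = ?)"
        for _ in collection_ids
    )
    sql = (
        f"SELECT r.filename, {subqueries} FROM records r "
        "WHERE r.collection_id = ? ORDER BY r.filename"
    )
    rows = conn.execute(sql, [*collection_ids, reference_id]).fetchall()
    # SQLite yields NULL (None) for an empty subquery and 0/1 for a comparison.
    marks = {None: "x", 0: "!", 1: ""}
    return {
        row[0]: {cid: marks[flag] for cid, flag in zip(collection_ids, row[1:])}
        for row in rows
    }


# Hypothetical sample data, not taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (collection_id INTEGER, filename TEXT, "
    "checksum TEXT, UNIQUE (filename, collection_id))"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "file-1.txt", "aaa"),
        (2, "file-1.txt", "aaa"),
        (1, "file-2.txt", "bbb"),
        (2, "file-2.txt", "zzz"),
        (3, "file-2.txt", "bbb"),
    ],
)
table = compare_collections(conn, 1, [1, 2, 3])
```

Note that only filenames present in the reference collection appear in the result; files that exist solely in other collections would need a second pass with a different reference.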

Related

Filter neo4j result, return distinct combination of node IDs

I have a graph with Airport nodes and Flight relationships, and I want to find triangles from a specific node where the edges are all within 10% length of each other.
MATCH path = (first:Airport{ID: 12953})-[f1:Flight]->
(second:Airport)-[f2:Flight]->
(third:Airport)-[f3:Flight]->
(last:Airport{ID: 12953})
WHERE second.ID <>first.ID AND
third.ID <>first.ID AND
f1.Distance<=(1.1*f2.Distance) AND
f1.Distance<=(1.1*f3.Distance) AND
f2.Distance<=(1.1*f1.Distance) AND
f2.Distance<=(1.1*f3.Distance) AND
f3.Distance<=(1.1*f1.Distance) AND
f3.Distance<=(1.1*f2.Distance)
WITH (first.ID, second.ID, third.ID) as triplet
return count(DISTINCT triplet)
I only want to return a set of nodes once (no matter how many different flights exist between them), but the with line doesn't work. Basically what I want to create is a new type of variable "object" that has the three IDs as its properties and run distinct on that. Is that possible in neo4j? If not, is there some workaround?
You can use the APOC function apoc.coll.sort to sort each list of 3 IDs, so that the DISTINCT option will properly treat lists with the same IDs as being the same.
Here is a simplified query that uses the APOC function:
MATCH path = (first:Airport{ID: 12953})-[f1:Flight]->
(second:Airport)-[f2:Flight]->
(third:Airport)-[f3:Flight]->
(first)
WHERE second <> first <> third AND
f2.Distance<=(1.1*f1.Distance)>=f3.Distance AND
f1.Distance<=(1.1*f2.Distance)>=f3.Distance AND
f1.Distance<=(1.1*f3.Distance)>=f2.Distance
RETURN COUNT(DISTINCT apoc.coll.sort([first.ID, second.ID, third.ID]))
NOTE: the second <> first test may not be necessary since there should not be any flights (if a "flight" is the same as a "leg") that fly from an airport back to itself.
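The idea behind sorting before DISTINCT can be sketched outside the database in plain Python. This is an illustration of the principle, not of APOC itself; the function name and sample IDs are made up.

```python
def count_distinct_triplets(triplets):
    """Count triplets of IDs, treating any ordering of the same IDs as one."""
    # Sorting gives every permutation of the same IDs one canonical form,
    # so the set collapses them into a single entry.
    return len({tuple(sorted(triplet)) for triplet in triplets})


# Hypothetical airport IDs: the first three entries are the same triangle.
paths = [(12953, 10, 20), (12953, 20, 10), (12953, 10, 20), (12953, 30, 40)]
distinct = count_distinct_triplets(paths)
```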
You can return an object with keys or an array. For example:
UNWIND range(1, 10000) AS i
WITH
{
id1: toInteger(rand()*3),
id2: toInteger(rand()*3),
id3: toInteger(rand()*3)
} AS triplet
RETURN DISTINCT triplet
or
UNWIND range(1, 10000) AS i
WITH
[ toInteger(rand()*3), toInteger(rand()*3), toInteger(rand()*3) ] AS triplet
RETURN DISTINCT triplet
Update. You can simplify your query by reusing a variable in the query, specifying the length of the path and using the list functions:
MATCH ps = (A:Airport {ID: 12953})-[:Flight*3]->(A)
WITH ps
WHERE reduce(
total = 0,
rel1 IN relationships(ps) |
total + reduce(
acc = 0,
rel2 IN relationships(ps) |
acc + CASE WHEN rel1.Distance <= 1.1 * rel2.Distance THEN 0 ELSE 1 END
)) = 0
RETURN count(DISTINCT [n IN nodes(ps) | n.ID][0..3])
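The nested reduce in the WHERE clause counts the ordered pairs of relationships that break the 10% rule and keeps the path only when that count is zero. A small Python sketch of the same check, with a hypothetical function name and made-up distances:

```python
def within_tolerance(distances, factor=1.1):
    """True when every distance is at most `factor` times every other one."""
    # Mirrors the nested reduce: add 1 for each ordered pair that violates
    # d1 <= factor * d2, then require the total to be zero.
    violations = sum(
        0 if d1 <= factor * d2 else 1
        for d1 in distances
        for d2 in distances
    )
    return violations == 0


ok = within_tolerance([100, 105, 109])
too_long = within_tolerance([100, 120, 105])
```

The same condition could be written as `max(distances) <= factor * min(distances)`; the pairwise form is kept here because it matches the query.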

PIG filter out rows with improper number of columns

I have simple data loaded in a:
dump a
ahoeh,1,e32
hello,2,10
ho,3
I need to filter out all rows with number of columns/fields different than 3. How to do it?
In other words result should be:
dump results
ahoeh,1,e32
hello,2,10
I know there should be a FILTER built-in function. However I cannot figure out what condition (number of columns =3) should be defined.
Thanks!
Can you try this?
input
ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;
PigScript2:
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;
Output:
(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)
(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)
The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR instead of COUNT to count empty/null fields in your schema.
See: https://pig.apache.org/docs/r0.9.1/func.html#count-star
For example, given the following input data:
1,2,3
1,,3
and this Pig script:
a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 == 3;
result = foreach filtered generate $1..;
The filtered alias will contain both rows. The difference is that COUNT({(1),(),(3)}) returns 2 while COUNT_STAR({(1),(),(3)}) returns 3.
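The difference can be mimicked in a few lines of Python. This is only a sketch of the behaviour described above, not of Pig's implementation; the helper names are hypothetical, and empty fields are mapped to None on the assumption that they are loaded as nulls.

```python
def parse(line):
    """Split a comma-separated line, turning empty fields into None."""
    return [field if field != "" else None for field in line.split(",")]


def count(fields):
    """Like COUNT as described above: null fields are not counted."""
    return sum(1 for field in fields if field is not None)


def count_star(fields):
    """Like COUNT_STAR as described above: every field is counted."""
    return len(fields)


rows = [parse(line) for line in ["1,2,3", "1,,3", "ho,3"]]
kept_by_count = [row for row in rows if count(row) == 3]
kept_by_count_star = [row for row in rows if count_star(row) == 3]
```

Filtering on `count` drops the row with the empty middle field, while filtering on `count_star` keeps it and still rejects the two-field row.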
I see two ways to do this:
First, I think you can rephrase the filter, since it boils down to: give me all lines that do not contain a NULL value. With lots of columns, writing this filter statement is rather tedious.
Second, you could convert your columns into a bag per line, using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag) and then write a UDF that processes the input bag to check for null tuples in this bag and return true or false and use this in the filter statement.
Either way, some tediousness is required I think.

Using pig, how do I parse a mixed format line into tuples and a bag of tuples?

I'm new to pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows:
FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn
My goal here is to count the number of unique fixed field combinations for each of the KV Pairs. So considering the following input lines:
1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3
When I'm done, I'd like to be able to generate the following results (the output format doesn't really matter at this point, I'm just showing you what I'd like the results to be):
key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
It seems like I should be able to do this by grouping the fixed fields and flattening a bag of the KV Pairs to generate the cross product
I've tried reading this in with something like:
data = load 'myfile' using PigStorage('|');
A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A { sorted = order A by ff2; lim = limit sorted 1; generate group.ff1, group.ff4, flatten( lim.kvpairs ); };
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(E);
This generates records with a schema as follows:
A: {date: long,hms: long,id: long,ff1: chararray,ff2: long,ff3: chararray,ff4: chararray,kvpairs: {kvpair: (NULL)}}
While this gets me the schema that I want, there are several problems that I can't seem to solve:
By using the TOBAG with .., no schema can be applied to my kvpairs, so I can't ever filter on kvpair, and I don't seem to be able to cast this at any point, so it's an all or nothing query.
The filter in statement 'C' seems to return no data regardless of what value I use, even if I use something like '.*' or '.+'. I don't know if this is because there is no schema, or if this is actually a bug in pig. If I dump some data from statement B, I definitely see data there that would match those expressions.
So I've tried approaching the problem differently, by loading the data using:
data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 4 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach init_parse generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};
The issue here is that the TOBAG( STRSPLIT( ... ) ) results in a bag of a single tuple, with each of the kvpairs being a field in that tuple. I really need the bag to contain each of the individual kvpairs as a tuple of one field, so that when I flatten the bag, I get the cross product of the bag and the group that I'm interested in.
I'm open to other ways of attacking this problem as well, but I can't seem to find a good way to transform my tuple of multiple fields into a bag of tuples, with each tuple having one field.
I'm using Apache Pig version 0.11.1.1.3.0.0-107
Thanks in advance.
Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag, and as far as I know there is no builtin to do this. It's a simple matter to write one, however.
You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.
The UDF is pretty simple. In Java, you can just do something like this in your exec method:
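// Assumes the first input field is the tuple produced by STRSPLIT.
DataBag b = new DefaultDataBag();
Tuple t = (Tuple) input.get(0);
for (int i = 0; i < t.size(); i++) {
    Object o = t.get(i);
    // Wrap each field in its own single-field tuple.
    Tuple e = TupleFactory.getInstance().newTuple(o);
    b.add(e);
}
return b;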
Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.
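To make the target result concrete, here is a small Python sketch of the same pipeline (split, one entry per key-value pair, group, count) run on the two sample lines from the question. The function name and the `n_fixed` parameter are hypothetical. It counts distinct fixed-field combinations per pair, which is what the question asks for.

```python
from collections import defaultdict


def count_fixed_combos_per_pair(lines, n_fixed=4):
    """Map each key=value pair to the number of distinct fixed-field combos."""
    combos = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        fixed, pairs = tuple(fields[:n_fixed]), fields[n_fixed:]
        # Equivalent of flattening the bag: one (pair, fixed) entry per pair.
        for pair in pairs:
            combos[pair].add(fixed)
    return {pair: len(fixed_set) for pair, fixed_set in combos.items()}


sample = [
    "1|2|3|4|key1=value1|key2=value2",
    "2|3|4|5|key1=value7|key2=value2|key3=value3",
]
counts = count_fixed_combos_per_pair(sample)
```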

I'm looking for a LINQ solution to create a sub-list where items w/ duplicate property values are skipped

The problem is: I have a list of objects, with some containing the same PlanId property value. I want to only grab the first occurrence of those and ignore the next object with that PlanId. The root problem is a View in the database, but it's tied in everywhere and I don't know if changing it will break a ton of stuff nearing a deadline, so I'm tossing in a hack for now.
So, if I have a list of PlanObjects like such.
Plan1.PlanId = 1
Plan2.PlanId = 1
Plan3.PlanId = 2
Plan4.PlanId = 3
Plan5.PlanId = 4
Plan6.PlanId = 4
I want to take a sub-list from that with LINQ (items marked "skipped" are not included):
Plan1.PlanId = 1
Plan2.PlanId = 1 (skipped)
Plan3.PlanId = 2
Plan4.PlanId = 3
Plan5.PlanId = 4
Plan6.PlanId = 4 (skipped)
For my needs, it doesn't matter which one is taken first. The Id is used to update a database record.
If I didn't explain that well enough, let me know and I'll edit the question. I think it makes sense though.
PlanObjects.GroupBy(p => p.PlanId).Select(r => r.First());
The other answer (and its comments) supplies the fluent interface solution. Here's the query syntax:
From p In PlanObjects Group By p.PlanId Into First Select First
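For readers who want to see what "group by key, take the first of each group" does step by step, here is a Python sketch of the same idea. The function name and the tuple representation of the plans are assumptions for illustration; this is not a translation of the LINQ internals.

```python
def first_per_key(items, key):
    """Keep the first item seen for each key, preserving input order."""
    seen = {}
    for item in items:
        # setdefault stores the item only if this key has not appeared yet.
        seen.setdefault(key(item), item)
    return list(seen.values())


# (name, PlanId) pairs matching the list in the question.
plans = [
    ("Plan1", 1), ("Plan2", 1), ("Plan3", 2),
    ("Plan4", 3), ("Plan5", 4), ("Plan6", 4),
]
unique_plans = first_per_key(plans, key=lambda plan: plan[1])
```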

Performing a Django raw SQL (using "WHERE col IN" syntax) or translating raw SQL to .raw() or .extra()

Django 1.3-dev provides several ways to query the database using raw SQL. They are covered here and here. The recommended ways are to use the .raw() or the .extra() methods. The advantage is that if the retrieved data fits the Model you can still use some of its features directly.
The page I'm trying to display is somewhat complex because it uses lots of information which is spread across multiple tables with different relationships (one2one, one2many). With the current approach the server has to do about 4K queries per page. This is obviously slow due to database to webserver communication.
A possible solution is to use raw SQL to retrieve the relevant data but due to the complexity of the query I couldn't translate this to an equivalent in Django.
The query is:
SELECT clin.iso as iso,
(SELECT COUNT(*)
FROM clin AS a
LEFT JOIN clin AS b
ON a.pat_id = b.pat_id
WHERE a.iso = clin.iso
) AS multiple_iso,
(SELECT COUNT(*)
FROM samptopat
WHERE samptopat.iso_id = clin.iso
) AS multiple_samp,
(SELECT GROUP_CONCAT(value ORDER BY snp_id ASC)
FROM samptopat
RIGHT JOIN samptosnp
USING(samp_id)
WHERE iso_id = clin.iso
GROUP BY samp_id
LIMIT 1 -- Return 1st samp only
) AS snp
FROM clin
WHERE iso IN (...)
or alternatively WHERE iso = ....
Sample output looks like:
+-------+--------------+---------------+-------------+
| iso | multiple_iso | multiple_samp | snp |
+-------+--------------+---------------+-------------+
| 7 | 19883 | 0 | NULL |
| 8 | 19883 | 0 | NULL |
| 21092 | 1 | 2 | G,T,C,G,T,G |
| 31548 | 1 | 0 | NULL |
+-------+--------------+---------------+-------------+
4 rows in set (0.00 sec)
The documentation explains how one can do a query using WHERE col = %s but not the IN syntax.
One part of this question is How do I perform raw SQL queries using Django and the IN statement?
The other part is, considering the following models:
class Clin(models.Model):
iso = models.IntegerField(primary_key=True)
pat = models.IntegerField(db_column='pat_id')
class Meta:
db_table = u'clin'
class SampToPat(models.Model):
samptopat_id = models.IntegerField(primary_key=True)
samp = models.OneToOneField(Samp, db_column='samp_id')
pat = models.IntegerField(db_column='pat_id')
iso = models.ForeignKey(Clin, db_column='iso_id')
class Meta:
db_table = u'samptopat'
class Samp(models.Model):
samp_id = models.IntegerField(primary_key=True)
samp = models.CharField(max_length=8)
class Meta:
db_table = u'samp'
class SampToSnp(models.Model):
samptosnp_id = models.IntegerField(primary_key=True)
samp = models.ForeignKey(Samp, db_column='samp_id')
snp = models.IntegerField(db_column='snp_id')
value = models.CharField(max_length=2)
class Meta:
db_table = u'samptosnp'
Is it possible to rewrite the above query into something more ORM oriented?
For a problem like this one, I'd split the query into a small number of simpler ones; I think that's quite possible. I have also found that MySQL may actually return results faster with this approach.
edit ...Actually after thinking a bit I see that you need to "annotate on subqueries", which is not possible in Django ORM (not in 1.2 at least). Maybe you have to do plain sql here or use some other tool to build the query.
I tried to rewrite your models in a more conventional Django pattern; maybe it will help to understand the problem better. The Pat and Snp models are missing, though...
class Clin(models.Model):
pat = models.ForeignKey(Pat)
class Meta:
db_table = u'clin'
class SampToPat(models.Model):
samp = models.ForeignKey(Samp)
pat = models.ForeignKey(Pat)
iso = models.ForeignKey(Clin)
class Meta:
db_table = u'samptopat'
unique_together = ['samp', 'pat']
class Samp(models.Model):
samp = models.CharField(max_length=8)
snp_set = models.ManyToManyField(Snp, through='SampToSnp')
pat_set = models.ManyToManyField(Pat, through='SampToPat')
class Meta:
db_table = u'samp'
class SampToSnp(models.Model):
samp = models.ForeignKey(Samp)
snp = models.ForeignKey(Snp)
value = models.CharField(max_length=2)
class Meta:
db_table = u'samptosnp'
The following seems to mean - get count of unique patients per clinic ...
(SELECT COUNT(*)
FROM clin AS a
LEFT JOIN clin AS b
ON a.pat_id = b.pat_id
WHERE a.iso = clin.iso
) AS multiple_iso,
Sample count per clinic:
(SELECT COUNT(*)
FROM samptopat
WHERE samptopat.iso_id = clin.iso
) AS multiple_samp,
This part is harder to understand, but in Django there is no way to do GROUP_CONCAT in plain ORM.
(SELECT GROUP_CONCAT(value ORDER BY snp_id ASC)
FROM samptopat
RIGHT JOIN samptosnp
USING(samp_id)
WHERE iso_id = clin.iso
GROUP BY samp_id
LIMIT 1 -- Return 1st samp only
) AS snp
Could you explain exactly what you're trying to extract w/ the snp subquery? I see you're joining over the two tables, but it looks like what you really want is Snp objects which have an associated Clin which has the given id. If so, this becomes almost as straightforward to do as a separate query as the other 2:
Snp.objects.filter(samp__pat__clin__pk=given_clin)
or some such thing ought to do the trick. You may have to rewrite that a bit due to all the ways you're violating the conventions, unfortunately.
The others are something like:
Pat.objects.filter(clin__pk=given_clin).count()
and
Samp.objects.filter(clin__pk=given_clin).count()
if @Evgeny's reading is correct (which is how I read it as well).
Often, with Django's ORM, I find I get better results if I try to think about directly what I want in terms of the ORM, instead of trying to translate to or from the SQL I might use if I wasn't using the ORM.
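For the raw-SQL half of the question (the `WHERE iso IN (...)` part), the usual approach is to generate one placeholder per value and pass the values as parameters. The sketch below uses the standard-library sqlite3 module so that it runs on its own; the helper name `in_placeholders` and the sample rows are assumptions. With Django's `.raw()` the placeholder marker would be `%s` rather than `?`, with the values passed as the params list.

```python
import sqlite3


def in_placeholders(values, marker="?"):
    """Build 'm, m, m' with one placeholder marker per value."""
    if not values:
        raise ValueError("IN () needs at least one value")
    return ", ".join([marker] * len(values))


# Hypothetical stand-in for the clin table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clin (iso INTEGER PRIMARY KEY, pat_id INTEGER)")
conn.executemany(
    "INSERT INTO clin VALUES (?, ?)",
    [(7, 1), (8, 1), (21092, 2), (31548, 3)],
)

wanted = [7, 21092]
# Only the placeholders are interpolated; the values stay bound parameters.
sql = f"SELECT iso FROM clin WHERE iso IN ({in_placeholders(wanted)}) ORDER BY iso"
isos = [row[0] for row in conn.execute(sql, wanted)]
```

Keeping the values out of the SQL string leaves quoting and escaping to the database driver.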