Select records with highest values for each subset

Select records with highest values for each subset - ruby-on-rails-3

I have a set of records of which some, but not all, have a 'path' field, and all have a 'value' field. I wish to select only those which either do not have a path, or have the largest value of all the records with a particular path.
That is, given these records:
Name: Path: Value:
A foo 5
B foo 6
C NULL 2
D bar 2
E NULL 4
I want to return B, C, D, and E, but not A (because A has a path and it's path is the same as B, but A has a lower value).
How can I accomplish this, using ActiveRecord, ARel and Postgres? Ideally, I would like a solution which functions as a scope.

You could use something like this by using 2 subqueries (will do only one SQL query which has subqueries). Did not test, but should get you in the right direction. This is for Postgres.
scope :null_ids, -> { where(path: nil).select('id') }
scope :non_null_ids, -> { where('path IS NOT NULL').select('DISTINCT ON (path) id').order('path, value desc, id') }
scope :stuff, -> {
subquery = [null_ids, non_null_ids].map{|q| "(#{q.to_sql})"}.join(' UNION ')
where("#{table_name}.id IN (#{subquery})")
}
If you are using a different DB you might need to use group/order instead of distinct on for the non_nulls scope. If the query is running slow put an index on path and value.
You get only 1 query and it's a chainable scope.

A straightforward transliteration of your description to SQL would look like this:
select name, path, value
from (
select name, path, value,
row_number() over (partition by path order by value desc) as r
from your_table
where path is not null
) as dt
where r = 1
union all
select name, path, value
from your_table
where path is null
You could wrap that in a find_by_sql and get your objects out the other side.
That query works like this:
The row_number window function allows us to group the rows by path, order each group by value, and then number the rows in each group. Play around with the SQL a bit inside psql and you'll see how this works, there are other window functions available that will allow you to do all sorts of wonderful things.
You're treating NULL path values separately from non-NULL paths, hence the path is not null in the inner query.
We can peel off the first row in each of the path groups by selecting those rows from the derived table that have a row number of one (i.e. where r = 1).
The treatment of path is null rows is easily handled by the section query.
The UNION is used to join the result sets of the queries together.
I can't think of any way to construct such a query using ActiveRecord nor can I think of any way to integrate such a query with ActiveRecord's scope mechanism. If you could easily access just the WHERE component of an ActiveRecord::Relation then you could augment the where path is not null and where path is null components of that query with the WHERE components of a scope chain. I don't know how to do that though.
In truth, I tend to abandon ActiveRecord at the drop of a hat. I find ActiveRecord to be rather cumbersome for most of the complicated things I do and not nearly as expressive as SQL. This applies to every ORM I've ever used so the problem isn't specific to ActiveRecord.

I have no experience with ActiveRecord, but here's a sample with SQLAlchemy to silent the just-use-SQL crowd ;)
q1 = Session.query(Record).filter(Record.path != None)
q1 = q1.distinct(Record.path).order_by(Record.path, Record.value.desc())
q2 = Session.query(Record).filter(Record.path == None)
query = q1.from_self().union(q2)
# Further chaining, e.g. query = query.filter(Record.value > 3) to return B, E
for record in query:
print record.name

Related

SQL query find few strings in diferent columns in a table row (restrictive)

I have a table like this one (in a SQL SERVER):
field_name
field_descriptor
tag1
tag2
tag3
tag4
tag5
house
your home
home
house
null
null
null
car
first car
car
wheel
null
null
null
...
...
...
...
...
...
...
I'm developing a WIKI with a searchbar, which should be able to handle a query with more than one string for search. As an user enters a second string (spaced) the query should be able to return results that match restrictively the two strings (if exists) in any column, and so with a three string search.
Easy to do for one string with a simple SELECT with ORs.
Tried in the fronted in JS with libraries like match-sorter but it's heavy with a table with more than 100,000 results and more in the future.
I thought the query should do the heavy work, but maybe there is no simple way doing it.
Thanks in advance!
Tried to do the heavy work with all results in frontend with filtering and other libraries like match-sorter. Works but take several seconds and blocks the front.
Tried to create a simple OR/AND query but the posibilities with 3 search-strings (could be 1, 2 or 3) matching any column to any other possibility is overwhelming.

You can use STRING_SPLIT to get a separate row per search word from the search words string. Then only select rows where all search words have a match.
The query should look like this:
select *
from mytable t
where exists
(
select null
from (select value from string_split(#search, ' ')) search
having min(case when search.value in (t.tag1, t.tag2, t.tag3, t.tag4, t.tag5) then 1 else 0 end) = 1
);
Unfortunately, SQL Server seems to have a flaw (or even a bug) here and reports:
Msg 8124 Level 16 State 1 Line 8
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
Demo: https://dbfiddle.uk/kNL1PVOZ
I don't have more time at hand right now, so you may use this query as a starting point to get the final query.

Bigquery SQL: convert array to columns

I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!

The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This will yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length.
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.

Consider below approach
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If to apply to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

Rails: Need to scope by max version

I have this problem, I've got database table that looks like this:
"63";"CLINICAL...";"Please...";Blah...;"2014-09-23 13:15:59";37;8
"64";"CLINICAL...";"Please...";Blah...;"2014-09-23 13:22:51";37;9
The values that matter are the second to last and last one.
As you can see, the second to last (abstract_category_numbers) are the same, but the last differs (version_numbers)
Here is the problem:
When I make a scope, it returns all of the records, which i need to focus on the one with the maximum version number.
In SQL i would do something like this:
'SELECT * FROM Category c WHERE
NOT EXISTS SELECT * FROM Category c1
WHERE c.version_number < c1.version_number
AND c.abstract_category_id = c1.abstract_category_id'
But i'm totally lost at Ruby, more specifically how to do this kind of select in the scope (I understand it should be a relation)
Thanks

We can create a scope to select the category with max version_number like this:
scope :with_max_version_number, -> {
joins("JOIN ( SELECT abstract_category_id, max(version_number) AS max_version
FROM categories
GROUP BY abstract_category_id
) AS temp
ON temp.abstract_category_id = categories.abstract_category_id
AND temp.max_version = categories.version_number"
)
}
Basically, we will select the category with the max_version value on temp table in the subquery.
Btw, I expect the table name is categories, you may correct it. Then the final query will be:
Category.with_max_version_number

Scopes are suppose to return an array of values even if there is only 1 record.
If you want to ALWAYS return 1 value, use a static method instead.
def self.max_abstract_category
<your_scope>.max_by{ |obj| obj.version_number }
end

If I understand your question: you have a database table with a version_number column, which rails represents using an Active Record model--that I'll call Category because I don't know what you've called it--and you want to find the single Category record with the largest version_number?
Category.all.order(version_numbers: :DESC).limit(1).first
This query asks for all Category records ordered by version_number from highest to lowest and limits the request to one record (the first record, a.k.a the highest). Because the result of this request is an array containing one record, we call .first on the request to simply return the record.
As far as I'm aware, a scope is simply a named query (I don't actually use scopes). I think you can save this query as a scope by adding the following to your Category model. This rails guide explains more about Scopes.
scope :highest_version, -> { all.order(version_numbers: :DESC).limit(1).first }

I join implementation with baby_squeel but for some reason it was very slow on mysql. So I ended up with something like:
scope :only_latest, -> do
where(%{
NOT EXISTS (SELECT * FROM Category c
WHERE categories.version_number < version_number
AND categories.abstract_category_id = abstract_category_id')
}
end
I filed a BabySqueel bug as I spent a long time trying to do in a code proper way to no avail.

What do OrientDB's functions do when applied to the results of another function?

I am getting very strange behavior on 2.0-M2. Consider the following against the GratefulDeadConcerts database:
Query 1
SELECT name, in('written_by') AS wrote FROM V WHERE type='artist'
This query returns a list of artists and the songs each has written; a majority of the rows have at least one song.
Query 2
Now try:
SELECT name, count(in('written_by')) AS num_wrote FROM V WHERE type='artist'
On my system (OSX Yosemite; Orient 2.0-M2), I see just one row:
name num_wrote
---------------------------
Willie_Cobb 224
This seems wrong. But I tried to better understand. Perhaps the count() causes the in() to look at all written_by edges...
Query 3
SELECT name, in('written_by') FROM V WHERE type='artist' GROUP BY name
Produces results similar to the first query.
Query 4
Now try count()
SELECT name, count(in('written_by')) FROM V WHERE type='artist' GROUP BY name
Wrong path -- So try LET variables...
Query 5
SELECT name, $wblist, $wbcount FROM V
LET $wblist = in('written_by'),
$wbcount = count($wblist)
WHERE type='artist'
Produces seemingly meaningless results:
You can see that the $wblist and $wbcount columns are inconsistent with one another, and the $wbcount values don't show any obvious progression like a cumulative result.
Note that the strange behavior is not limited to count(). For example, first() does similarly odd things.

count(), like in RDBMS, computes the sum of all the records in only one value. For your purpose .size()seems the right method to call:
in('written_by').size()

Aggregate multiple columns without groupBy in Slick 2.0

I would like to perform an aggregation with Slick that executes SQL like the following:
SELECT MIN(a), MAX(a) FROM table_a;
where table_a has an INT column a
In Slick given the table definition:
class A(tag: Tag) extends Table[Int](tag, "table_a") {
def a = column[Int]("a")
def * = a
}
val A = TableQuery[A]
val as = A.map(_.a)
It seems like I have 2 options:
Write something like: Query(as.min, as.max)
Write something like:
as
.groupBy(_ => 1)
.map { case (_, as) => (as.map(identity).min, as.map(identity).max) }
However, the generated sql is not good in either case. In 1, there are two separate sub-selects generated, which is like writing two separate queries. In 2, the following is generated:
select min(x2."a"), max(x2."a") from "table_a" x2 group by 1
However, this syntax is not correct for Postgres (it groups by the first column value, which is invalid in this case). Indeed AFAIK it is not possible to group by a constant value in Postgres, except by omitting the group by clause.
Is there a way to cause Slick to emit a single query with both aggregates without the GROUP BY?

The syntax error is a bug. I created a ticket: https://github.com/slick/slick/issues/630
The subqueries are a limitation of Slick's SQL compiler currently producing non-optimal code in this case. We are working on improving the situation.
As a workaround, here is a pattern to swap out the generated SQL under the hood and leave everything else intact: https://gist.github.com/cvogt/8054159

I use the following trick in SQL Server, and it seems to work in Postgres:
select min(x2."a"), max(x2."a")
from "table_a" x2
group by (case when x2.a = x2.a then 1 else 1 end);
The use of the variable in the group by expression tricks the compiler into thinking that there could be more than one group.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas