Getting fields that have fewer than a certain number of distinct values - sql

If I have data with two columns, feature and feature_value, just like the example data set below
feature  feature_value
X        1
X        1
X        2
Y        7
Y        8
Y        9
Z        100
and I want to get the feature, feature_value rows only for features that have fewer than 3 distinct values (in this case only the rows for X and Z), what is an efficient way to do it? Using COUNT(DISTINCT) and applying a WHERE condition, or is there a faster way?

Please note that this answer uses generic SQL. Since your question isn't entirely clear about why you'd get only the "X" and "Z" records, I've taken the liberty of understanding your question to mean you're looking only for the features that appear fewer than three times in the feature column, per your wording "columns [sic] for features that have less than 3 distinct values." If you meant 3 or more, you can easily adjust the subquery below by changing < to >=.
Subquery to GROUP BY feature and get the count, then select only those records.
SELECT *
FROM my
WHERE feature IN
    (SELECT feature
     FROM my
     GROUP BY feature
     HAVING COUNT(*) < 3);
http://sqlfiddle.com/#!2/29c05/1
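If you literally want the features with fewer than 3 distinct feature_value entries (which, for your sample data, would return the X and Z rows), the same pattern should work with a distinct count; a sketch against the same table name as above:

SELECT *
FROM my
WHERE feature IN
    (SELECT feature
     FROM my
     GROUP BY feature
     HAVING COUNT(DISTINCT feature_value) < 3);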

Related

A more efficient way to sum the difference between columns in postgres?

For my application I have a table with these three columns: user, item, value
Here's some sample data:
user  item  value
---------------------
1     1     50
1     2     45
1     23    35
2     1     88
2     23    44
3     2     12
3     1     27
3     5     76
3     23    44
What I need to do is, for a given user, perform simple arithmetic against everyone else's values.
Let's say I want to compare user 1 against everyone else. The calculation looks something like this:
first_user  second_user  result
1           2            SUM(ABS(50-88) + ABS(35-44))
1           3            SUM(ABS(50-27) + ABS(45-12) + ABS(35-44))
This is currently the bottleneck in my program. For example, many of my queries are starting to take 500+ milliseconds, with this algorithm taking around 95% of the time.
I have many rows in my database, and the algorithm is O(n^2): it has to compare all of user 1's values against everyone else's matching values.
I believe I have only two options for how to make this more efficient. First, I could cache the results. But the resulting table would be huge because of the NxN space required, and the values need to be relatively fresh.
The second way is to make the algorithm much quicker. I searched for "postgres SIMD" because I think SIMD sounds like the perfect solution to optimize this. I found a couple related links like this and this, but I'm not sure if they apply here. Also, they seem to both be around 5 years old and relatively unmaintained.
Does Postgres have support for this sort of feature? Where you can "vectorize" a column or possibly import or enable some extension or feature to allow you to quickly perform these sorts of basic arithmetic operations against many rows?
I'm not sure where you get O(n^2) for this. You need to look up the rows for user 1 and then read the data for everyone else. Assuming there are few items and many users, this would be essentially O(n), where "n" is the number of rows in the table.
The query could be phrased as:
select t1.user, t.user, sum(abs(t.value - t1.value))
from t left join
     t t1
     on t1.item = t.item and
        t1.user <> t.user and
        t1.user = 1
group by t1.user, t.user;
For this query, you want an index on t(item, user, value).
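As a sketch (the index name is illustrative, and "user" is quoted because user is a reserved word in Postgres):

CREATE INDEX t_item_user_value_idx ON t (item, "user", value);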

select top N for each category w/o sorting if there are less than N rows

Given the following table, the question is to find, for example, the top N C2 values for each C1.
C1  C2
1   1
1   2
1   3
1   4
1   ...
2   1
2   2
2   3
2   4
2   ...
...
So if N = 3, the results are
C1  C2
1   1
1   2
1   3
2   1
2   2
2   3
...
The proposed solutions use window functions and PARTITION BY:
Select top 10 records for each category
https://www.the-art-of-web.com/sql/partition-over/
For example,
SELECT rs.Field1, rs.Field2
FROM (
    SELECT Field1, Field2,
           Rank() OVER (PARTITION BY Section
                        ORDER BY RankCriteria DESC) AS Rank
    FROM table
) rs
WHERE Rank <= 3
I guess what it does is sort each partition and then pick the top N.
However, if a category has fewer than N elements, we could get its top N without sorting, because the top N must include every element in that category.
The above query uses Rank(). My question also applies to other window functions such as row_number() or dense_rank().
Is there a way to skip the sorting in that case?
Also, I am not sure whether the underlying engine can optimize this case: whether the inner partition/order takes the outer WHERE constraint into account before sorting.
Using partition+order+where is a way to get the top N elements from each category. It works perfectly if each category has more than N elements, but incurs the additional sorting cost otherwise. My question is whether there is another approach that works well in both cases. Ideally it would do the following:
for each category {
    if # of elements <= N:
        keep all elements (no sort needed)
    else:
        sort and keep the top N
}
For example, something like the following, but is there better SQL?
WITH table_with_count AS (
    SELECT Field1, Field2, RankCriteria,
           count(*) OVER (PARTITION BY Section) AS c
    FROM table
),
rs AS (
    SELECT Field1, Field2,
           Rank() OVER (PARTITION BY Section
                        ORDER BY RankCriteria DESC) AS Rank
    FROM table_with_count
    WHERE c > 10
)
(SELECT Field1, Field2 FROM rs WHERE Rank <= 10)
UNION
(SELECT Field1, Field2 FROM table_with_count WHERE c <= 10)
No, and there really shouldn't be. Overall, what you describe here is an XY problem.
You seem to:
Worry about sorting, while in fact sorting (with optional secondary sort) is the most efficient way of shuffling / repartitioning data, as it doesn't lead to proliferation of file descriptors. In practice Spark strictly prefers sort over alternatives (hashing) for exactly that reason.
Worry about "unnecessary" sorting of small groups, when in fact the problem is the intrinsic inefficiency of window functions, which require a full shuffle of all the data and therefore exhibit the same behavior pattern as the infamous groupByKey.
There are more efficient patterns (MLPairRDDFunctions.topByKey being the most prominent example), but these haven't been ported to the Dataset API and would require a custom Aggregator. It is also possible to approximate the selection (for example, through quantile approximation), but this increases the number of passes over the data and in many cases won't provide any performance gains.
This is too long for a comment.
There is no such optimization. Basically, all the data is sorted when using windowing clauses. I suppose that a database engine could actually use a hash algorithm for the partition by and a sort algorithm for the order by, but I don't think that is a common approach.
In any case, the operation is over the entire set, and it should be optimized for this purpose. Trying not to order a subset would add lots of overhead -- for instance, running the sort multiple times for each subset and counting the number of rows in each subset.
Also note that the comparison to "3" occurs (logically) after the window function. I don't think window functions are typically optimized for such post-filtering (although once again, it is a possible optimization).
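Since the question also asks about row_number() and dense_rank(): as a sketch (reusing the question's placeholder table and column names), the same pattern written with ROW_NUMBER() looks like this. It is subject to the same sorting behavior described above and differs from Rank() only in tie handling, returning exactly N rows per partition even when RankCriteria values tie:

SELECT rs.Field1, rs.Field2
FROM (
    SELECT Field1, Field2,
           ROW_NUMBER() OVER (PARTITION BY Section
                              ORDER BY RankCriteria DESC) AS rn
    FROM table
) rs
WHERE rs.rn <= 3;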

What is the best way to reassign ordinal numbers for a move operation

I have a column in SQL Server called "Ordinal" that is used to indicate the display order of the rows. It starts from 0 and skips by 10 for the next row, so we have something like this:
Id  Ordinal
1   0
2   20
3   10
It skips by 10 because we wanted to be able to move an item in between other items (based on ordinal) without having to reassign ordinal numbers for the entire table.
As you can imagine, eventually the ordinal numbers will need to be reassigned somehow for a move-in-between operation, either on the surrounding rows or for the entire table, once the unused ordinal numbers between the target items are all used up.
Is there an algorithm I can use to effectively reassign the ordinal numbers for the move operation, taking into consideration things like the long-term maintainability of the table and minimizing update operations on it?
You can re-number the sequences using a somewhat complicated UPDATE statement:
UPDATE u
SET u.sequence = 10 * (c.num_below - 1)
FROM test u
JOIN (
    SELECT t.id, count(*) AS num_below
    FROM test t
    JOIN test tr ON tr.sequence <= t.sequence
    GROUP BY t.id
) c ON c.id = u.id
The idea is to obtain a count of the items with a sequence lower than that of the current row, multiply the count by ten, and assign it as the new sequence value.
The content of test before the UPDATE:
ID  Sequence
__  ________
1   0
2   20
3   10
4   12
The content of test after the UPDATE:
ID  Sequence
__  ________
1   0
2   30
3   10
4   20
Now the sequence numbers are evenly spread again, so you can continue inserting in the middle until you run out of new sequence numbers; then you can re-number again.
Demo.
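For reference, a minimal setup that reproduces the demo (the table and column names come from the answer above, and the data mirrors the "before" table):

CREATE TABLE test (id int PRIMARY KEY, sequence int);
INSERT INTO test (id, sequence) VALUES (1, 0), (2, 20), (3, 10), (4, 12);
-- running the UPDATE above should then produce the "after" table
SELECT id, sequence FROM test ORDER BY id;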
These won't answer your question directly--I just thought I might suggest some other approaches:
One possibility--don't try to do it by hand. Have your software manage the numbers. If they need re-writing, just save them with new numbers.
A second--use a "Linked List" instead. In each record, store the index of the next record you want displayed, then have your code load that directly into a linked list.
Yet another simple approach. Let's say you're inserting a new record with an ordinal equal to x.
First, check whether there's a row with an ordinal value equal to x. If there is one, update all the records having an ordinal value equal to or greater than x, increasing them by y. Then you are safe to insert the new record.
This way you're sure you won't run an update every time, and of course you'll keep the order.
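As a sketch of that approach in T-SQL (the table name Items and the variable @x are illustrative; y is taken as 10, matching the gap already used in the table):

-- Sketch only: Items and @x are placeholder names.
DECLARE @x int = 15;  -- the ordinal at which the new record should be inserted

IF EXISTS (SELECT 1 FROM Items WHERE Ordinal = @x)
    UPDATE Items
    SET Ordinal = Ordinal + 10  -- y = 10
    WHERE Ordinal >= @x;

INSERT INTO Items (Ordinal /*, other columns */)
VALUES (@x /*, ... */);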

How does the order by clause work if two values are equal?

This is my NEWSPAPER table.
Feature        Section  Page
National News  A        1
Sports         D        1
Editorials     A        12
Business       E        1
Weather        C        2
Television     B        7
Births         F        7
Classified     F        8
Modern Life    B        1
Comics         C        4
Movies         B        4
Bridge         B        2
Obituaries     F        6
Doctor Is In   F        6
When I run this query
select feature,section,page from NEWSPAPER
where section = 'F'
order by page;
It gives this output
Doctor Is In  F  6
Obituaries    F  6
Births        F  7
Classified    F  8
But in Kevin Loney's Oracle 10g Complete Reference the output is like this
Obituaries    F  6
Doctor Is In  F  6
Births        F  7
Classified    F  8
Please help me understand how this is happening.
If you need reliable, reproducible ordering when two values in your ORDER BY clause's first column are the same, you should always provide another, secondary column to order on. You might be tempted to assume that rows will sort themselves based on the order they were entered (almost always the case in my experience, but be aware that the SQL standard does not specify any default ordering) or based on an index, but you never should (unless the engine you are using specifically documents such behavior--and even then I'd personally never rely on it).
Your query, if you wanted alphabetical sorting by feature within each page, should be:
SELECT feature,section,page FROM NEWSPAPER
WHERE section = 'F'
ORDER BY page, feature;
In relational databases, tables are sets and are unordered. The order by clause is used primarily for output purposes (and a few other cases such as a subquery containing rownum).
This is a good place to start. The SQL standard does not specify what has to happen when the keys on an order by are the same. And this is for good reason. Different techniques can be used for sorting. Some might be stable (preserving original order). Some methods might not be.
Focus on whether the same rows are in the sets, not their ordering. By the way, I would consider this an unfortunate example. The book should not have ambiguous sorts in its examples.
When you use the SELECT statement to query data from a table, the order in which rows appear in the result set may not be what you expect.
In some cases, the rows appear in the order in which they are physically stored in the table. However, if the query optimizer uses an index to process the query, the rows will appear in the index key order. For this reason, the order of rows in the result set is undetermined and unpredictable.
The query optimizer is a built-in software component in the database system that determines the most efficient way for an SQL statement to query the requested data.

SQL Server SQL Select: How do I select rows where sum of a column is within a specified multiple?

I have a process that needs to select rows from a table of queued items. Each row has a quantity column, and I need to select rows where the quantities add up to a specific multiple. The multiple is on the order of around 4, 8, or 10 (but could in theory be any multiple, odd or even).
Any suggestions on how to select rows where the sum of a field is a specified multiple?
My first thought would be to use some kind of MOD function, which I believe in SQL Server is the % operator. So the criteria would be something like this:
WHERE MyField % 4 = 0 OR MyField % 8 = 0
It might not be that fast, so another way might be to make a temp table containing, say, 100 values of the X times table (where X is the multiple you are looking for) and join on that.
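As a rough sketch of that idea in T-SQL (the table name MyTable and the column MyField are illustrative, matching the criteria above, with the multiple hard-coded to 4):

-- Sketch only: MyTable / MyField are placeholder names, and 4 is the chosen multiple.
-- Build a temp table with the first 100 multiples of 4.
CREATE TABLE #Multiples (Value int PRIMARY KEY);

DECLARE @i int = 1;
WHILE @i <= 100
BEGIN
    INSERT INTO #Multiples (Value) VALUES (4 * @i);
    SET @i = @i + 1;
END;

-- Join against the multiples instead of using the % operator.
SELECT t.*
FROM MyTable t
JOIN #Multiples m ON m.Value = t.MyField;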