I need to do the following selection.
I have a list of 100,000 users with different statuses (1-4). I need to choose 10,000 users out of this list: first all with status 4, but if there are fewer than 10,000 with status 4, then fill up with all of status 3, then 2, then 1.
Help is highly appreciated.
Thanks for the replies, I tried to go with Gordon's version. As I need a union, I now have the following. But this only lets me prioritize the score for the second selection (%#test2.com), while I would need it for every selection that I create. If I put the ORDER BY before the UNION, I get an invalid-syntax error:
SELECT TOP 10000 *
FROM [table]
WHERE Email LIKE '%#test1.com' AND score IN (1, 2, 3, 4)
UNION
SELECT TOP 5000 *
FROM [table]
WHERE Email LIKE '%#test2.com' AND score IN (1, 2, 3, 4)
ORDER BY score DESC
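A way to get an ORDER BY into each selection (a sketch, assuming SQL Server since the query uses TOP) is to wrap every branch in a derived table, so each TOP gets its own ordering and only a final ORDER BY would apply to the union as a whole:

SELECT *
FROM (SELECT TOP 10000 *
      FROM [table]
      WHERE Email LIKE '%#test1.com' AND score IN (1, 2, 3, 4)
      ORDER BY score DESC  -- prioritization for the first selection
     ) a
UNION ALL
SELECT *
FROM (SELECT TOP 5000 *
      FROM [table]
      WHERE Email LIKE '%#test2.com' AND score IN (1, 2, 3, 4)
      ORDER BY score DESC  -- prioritization for the second selection
     ) b;

UNION ALL is used here because the two Email patterns cannot overlap, so the duplicate-elimination pass of plain UNION is unnecessary.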
This is a prioritization query. You can handle it using the ANSI standard function row_number():
select t.*
from (select t.*,
             row_number() over (order by status desc) as seqnum
      from t
      where status in (1, 2, 3, 4)
     ) t
where seqnum <= 10000;
You can also simplify this to:
select t.*
from t
where status in (1, 2, 3, 4)
order by status desc
fetch first 10000 rows only;
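Since the attempt in the question uses TOP, the SQL Server spelling of the same idea would be (a sketch; table and column names as in the answer above):

SELECT TOP (10000) t.*
FROM t
WHERE status IN (1, 2, 3, 4)
ORDER BY status DESC;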
I have a table with two columns, BRANCH and ACTIVITIES, where BRANCH is a unique location id and ACTIVITIES is the number of records belonging to the respective BRANCH. These records are to be distributed into 5 buckets in such a way that all buckets contain an almost equal number of records (it does not matter if the difference is +/-1000).
The challenge is that if a branch is selected into a bucket, then all activities of that branch must go into the same bucket; in other words, the activities belonging to one BRANCH cannot be split. Let's take a very simple example so that I can explain what I am trying to achieve:
Total branches = 10
Total number of activities (records) = 55,000
Average (total activities / total buckets) = 11,000
Sample data
After distribution
Here all buckets contain 11,000 records, but things are not so straightforward when we look at the real data.
All Oracle query masters are requested to please look into this. Your expert opinion will be highly appreciated.
Unfortunately, this is a bin-packing problem and a "perfect" solution requires -- essentially -- searching through all possible assignments of buckets and then choosing the "best" one. And such an approach is not really suitable for SQL.
For a "good-enough" solution, though, something like a round-robin approach often works well. Simply enumerate the branches from biggest to smallest and assign them to buckets:
select a.branch,
       1 + mod(seqnum, 5) as bucket
from (select a.branch, count(*) as cnt,
             row_number() over (order by count(*) desc) as seqnum
      from activities a
      group by a.branch
     ) a;
Because of the ordering, this will generally create buckets of different sizes (the bucket that receives the largest branch in each round of five ends up systematically bigger). So, a slight variation assigns the buckets serpentine-style as 1-2-3-4-5-5-4-3-2-1:
select a.branch,
       (case when mod(seqnum - 1, 10) in (0, 9) then 1
             when mod(seqnum - 1, 10) in (1, 8) then 2
             when mod(seqnum - 1, 10) in (2, 7) then 3
             when mod(seqnum - 1, 10) in (3, 6) then 4
             when mod(seqnum - 1, 10) in (4, 5) then 5
        end) as bucket
from (select a.branch, count(*) as cnt,
             row_number() over (order by count(*) desc) as seqnum
      from activities a
      group by a.branch
     ) a;
(The seqnum - 1 aligns row_number()'s 1-based numbering with the 0-9 pattern in the case expression, so the largest and smallest branches land in the same bucket.)
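To sanity-check either variant, the per-bucket totals can be summed (a sketch that wraps the round-robin query above in an inline view; the closer the totals, the better the heuristic did):

select bucket, sum(cnt) as bucket_total
from (select branch, cnt,
             1 + mod(seqnum, 5) as bucket
      from (select a.branch, count(*) as cnt,
                   row_number() over (order by count(*) desc) as seqnum
            from activities a
            group by a.branch
           )
     )
group by bucket
order by bucket;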
You could also try the query below. I added some stats columns in the inline view stats_cols_added_tab before applying the dense_rank analytic function to that inline view. Finally, I used the NTILE analytic function to get five groups.
with sample_data (branch, activities) as (
  select 1, 1000 from dual union all
  select 2, 2000 from dual union all
  select 3, 3000 from dual union all
  select 4, 4000 from dual union all
  select 5, 5000 from dual union all
  select 6, 6000 from dual union all
  select 7, 7000 from dual union all
  select 8, 8000 from dual union all
  select 9, 9000 from dual union all
  select 10, 10000 from dual
),
stats_cols_added_tab as (
  select s.*
       , count(*) over () total_branches
       , sum(activities) over () total_number_of_activities
       , avg(activities) over () * 2 average
       , case when row_number() over (order by s.branch) <= count(*) over () / 2 then 1 else 2 end grp
  from sample_data s
)
select branch, activities, ntile(5) over (order by ranked_grp, branch) as bucket
from (
  select branch, activities
       , dense_rank() over (
           partition by grp
           order by decode(grp, 1, activities, -1 * activities)
         ) ranked_grp
  from stats_cols_added_tab t
) t
order by ranked_grp, branch;
I have a BigQuery table with 30+ columns and I want to SELECT * where session is unique.
I have this query:
SELECT *
FROM `table.id`
WHERE session IN (
SELECT session
FROM `table.id`
GROUP BY session
HAVING COUNT(*) = 1
)
And it works, but I just learned from another question that HAVING COUNT(*) = 1 excludes the duplicate row:
Note that DISTINCT is used to show distinct records, including one record from each set of duplicates. On the other hand, HAVING COUNT(*) = 1 checks only records which are not duplicated.
For a simple example, if session has: 1, 1, 2, 3
DISTINCT will result in: 1, 2, 3
HAVING COUNT(*) = 1 will result in: 2, 3
I need the DISTINCT result, the one that includes one entry of each duplicate.
Can anyone help me? Thanks in advance, kind regards
Maybe ROW_NUMBER?
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY session) as row_num
FROM `table.id`
)
WHERE row_num = 1
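One wrinkle: the outer SELECT * also returns the helper row_num column, while the question only wants the original 30+ columns. BigQuery's SELECT * EXCEPT (the same idiom used in the answers to the next question below) can drop it; a minimal sketch:

SELECT * EXCEPT(row_num)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY session) AS row_num
  FROM `table.id`
)
WHERE row_num = 1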
To view records without duplicates, I use this SQL:
SELECT * EXCEPT(row_number)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY orderid) row_number
      FROM `TABLE`)
WHERE row_number = 1
What is the best practice to display only duplicated records from a single table?
Below is for BigQuery Standard SQL
Personally, I prefer not to rely on ROW_NUMBER() whenever possible, because with big volumes of data it tends to lead to a Resources Exceeded error.
So, from my experience, I would recommend the options below:
To view records for those orderid with only one entry:
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY orderid
HAVING COUNT(1) = 1
To view records for those orderid with more than one entry:
#standardSQL
SELECT * EXCEPT(flag) FROM (
SELECT *, COUNT(1) OVER(PARTITION BY orderid) > 1 flag
FROM `project.dataset.table`
)
WHERE flag
Note: under the hood, COUNT(1) OVER() can be calculated using as many workers as are available, while ROW_NUMBER() OVER() requires all the respective data to be moved to one worker (hence the resource issue).
OR
#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE orderid IN (
SELECT orderid FROM `project.dataset.table`
GROUP BY orderid HAVING COUNT(1) > 1
)
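As a side note, newer BigQuery also has the QUALIFY clause, which lets the window-function filter be written without a subquery; a sketch with the same table and orderid column:

#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE TRUE  -- BigQuery requires a WHERE, GROUP BY, or HAVING alongside QUALIFY
QUALIFY COUNT(1) OVER(PARTITION BY orderid) > 1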
Why not just change the row_number? You partitioned by orderid, creating partitions of duplicates, ranked the records, and took only the first element to remove the duplicates. But if you take only row_number = 2, you'll have only elements from partitions with at least 2 elements, i.e. only duplicates.
SELECT * EXCEPT(row_number)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY orderid) row_number
      FROM `TABLE`)
WHERE row_number = 2
Note: using row_number = 2 will give you only one element from each set of duplicates. If you go with row_number > 1, the result may contain duplicates again (for example, if you had 3 identical elements in the original table).
You can display the duplicated rows by showing only rows with a row_number greater than 1.
select * except(row_number)
from (
  select *, row_number() over (partition by orderid) as row_number
  from `TABLE`)
where row_number > 1
If your table has no primary key column, you are obliged to treat all the columns together as the key. Assuming my table contains 12 columns in BigQuery, I cannot find anything shorter than:
SELECT *, sum(1) as rowcount
FROM `TABLE`
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
HAVING rowcount>1;
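A way around listing every column ordinal is to serialize the whole row and group on that; a sketch of the idea (here a duplicate means all 12 columns are identical), combining TO_JSON_STRING with the SELECT AS VALUE idiom from the earlier answer:

SELECT AS VALUE ANY_VALUE(t)
FROM `TABLE` t
GROUP BY TO_JSON_STRING(t)  -- the serialized row stands in for a primary key
HAVING COUNT(1) > 1

This returns one representative row per group of fully duplicated rows.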
I need a way to put results into # of groups that I specify.
I have tried the ntile() function, which I thought would do it, but it's not working:
WITH CTE AS (
SELECT 1 as Number
UNION ALL
SELECT Number+1
FROM CTE
WHERE Number < 100
)
SELECT *, ntile(80) over (order by number desc) as 'test'
FROM CTE
For the expected results, the test column should output one number for every 2 entries (as specified in NTILE(80)), but it could be 2, 4, 10, or any number I specify.
Maybe NTILE() is not the right function, but is there a function that does what I want?
So, if I specify 3, then the result should group every 3 records. If I specify 15, then the result should group every 15 records and move on to the next group.
Hope I'm being clear.
...should output a number for every 2 entries...
No, you have 100 entries and you want to divide them in 80 groups. You'll get some groups with 1 entry and other groups with 2 entries.
Read the definition of NTILE(). If you want groups with 2 entries, you can do it as shown below by dividing the 100 rows into 50 groups:
WITH recursive
CTE AS (
SELECT 1 as Number
UNION ALL
SELECT Number + 1
FROM CTE
WHERE Number < 100
)
SELECT *,
ntile(50) -- changed here
over (order by number desc) as test
FROM CTE
You didn't say what database engine you are using, so I assumed PostgreSQL.
I think you simply want the modulus operator:
WITH CTE AS (
SELECT 1 as Number
UNION ALL
SELECT Number+1
FROM CTE
WHERE Number < 100
)
SELECT cte.*,
(ROW_NUMBER() OVER (ORDER BY Number DESC) - 1) % 3 -- or however many groups that you want
FROM CTE
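If the goal is instead blocks of n consecutive records ("group every 3 records and move on to the next group"), integer division on the row number does that; a sketch along the same lines, relying on SQL Server-style truncating integer division:

WITH CTE AS (
      SELECT 1 as Number
      UNION ALL
      SELECT Number + 1
      FROM CTE
      WHERE Number < 100
)
SELECT cte.*,
       (ROW_NUMBER() OVER (ORDER BY Number DESC) - 1) / 3 AS grp  -- first 3 rows get 0, next 3 get 1, and so on
FROM CTE;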
I want to be able to perform an avg() on a column after removing the 5 highest values in it and see that the stddev is not above a certain number. This has to be done entirely as a PL/SQL query.
EDIT:
To clarify, I have a data set that contains values in a certain range and tracks latency. I want to know whether the AVG() of those values is due to a general rise in latency, or due to a few values with a very high stddev, i.e. (1, 2, 1, 3, 12311) as opposed to (122, 124, 111, 212). I also need to achieve this via an SQL query due to our monitoring software's limitations.
You can use row_number to find the top 5 values, and filter them out in a where clause:
select avg(col1)
from (
select row_number() over (order by col1 desc) as rn
, *
from YourTable
) as SubQueryAlias
where rn > 5
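The question also asks to verify that the stddev is not above a certain number; both aggregates can be computed from the same derived table (a sketch, assuming Oracle since the question mentions PL/SQL, with the same placeholder YourTable and col1 names as above):

select avg(col1)    as avg_without_top5,
       stddev(col1) as stddev_without_top5  -- compare against the allowed threshold
from (
  select row_number() over (order by col1 desc) as rn, col1
  from YourTable
)
where rn > 5;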
select column_name1 from
(
  select column_name1 from table_name order by nvl(column_name1, 0) desc
) a
where rownum < 6
(the nvl is there so that any null values in column_name1 sort as 0 instead of floating to the top)
Well, the most efficient way to do it would be to calculate (sum(all values) - sum(top 5 values)) / (row_count - 5)
SELECT SUM(val) AS top5sum FROM (SELECT val FROM table ORDER BY val DESC LIMIT 5) t
SELECT SUM(val) AS allsum FROM table
SELECT (COUNT(*) - 5) AS bottomCount FROM table
The average is then (allsum - top5sum) / bottomCount
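The three steps can also be folded into one statement (a sketch in the same LIMIT-style dialect, reusing the placeholder table and val names):

SELECT (SUM(val) - (SELECT SUM(val)
                    FROM (SELECT val FROM table ORDER BY val DESC LIMIT 5) t5)
       ) / (COUNT(*) - 5) AS avg_without_top5  -- drop the top-5 sum, divide by the remaining row count
FROM table;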
First, get the 5 largest values (note the DESC, since we want the maximum ones):
SELECT TOP 5 RowId FROM Table ORDER BY Column DESC
Now use this in your main statement:
SELECT AVG(Column) FROM Table WHERE RowId NOT IN (SELECT TOP 5 RowId FROM Table ORDER BY Column DESC)