SQL remove rows where top 1% of values


I am using the following query to remove all rows where volume is in the top 1%. I based the query on this Stack Overflow question: Select top 10 percent, also bottom percent in SQL Server.
However, my query generates an error. I was hoping for some input on what has to be changed.
CREATE TABLE TEST AS
WITH PERCENTILE AS
(
SELECT SEGMENT,
VOLUME,
OUTLIER_VOL = NTILE(100) OVER (ORDER BY VOLUME)
FROM OFFER_PERIOD_SEGMENT
)
SELECT *
FROM PERCENTILE
WHERE OUTLIER_VOL NOT IN (99,100)
I am receiving the following error:
CLI prepare error: [Oracle][ODBC][Ora]ORA-00923: FROM keyword not found where expected

Try changing
OUTLIER_VOL = NTILE(100) OVER (ORDER BY VOLUME)
to:
NTILE(100) OVER (ORDER BY VOLUME) OUTLIER_VOL
That <column alias> = <value> syntax is specific to SQL Server (T-SQL), I believe. Oracle expects the alias after the expression, optionally preceded by AS.

If someone stumbles upon this in the future, I want to add that I was calculating the percentiles incorrectly. NTILE(100), as used above, places the rows into 100 equal-sized buckets, whereas PERCENT_RANK() returns each row's relative rank as a value between 0 and 1. The code below uses PERCENT_RANK() to calculate the percentiles:
CREATE TABLE TEST AS
WITH PERCENTILE AS
(
SELECT SEGMENT,
VOLUME,
PERCENT_RANK() OVER (ORDER BY VOLUME) AS OUTLIER_VOL
FROM OFFER_PERIOD_SEGMENT
)
SELECT *
FROM PERCENTILE
WHERE OUTLIER_VOL < 0.99
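For anyone who wants to compare the two filters on a small data set, here is a rough sketch that runs both from Python against an in-memory SQLite database. It assumes SQLite 3.25 or later (the first version with window functions); the table name offer and its 200 sample rows are made up for illustration, and Oracle may differ in details. Note that excluding NTILE buckets 99 and 100 drops about 2% of the rows, not 1%.

```python
import sqlite3

# Hypothetical stand-in for OFFER_PERIOD_SEGMENT: 200 rows with volumes 1..200.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offer (segment TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO offer VALUES (?, ?)",
    [(f"s{i}", i) for i in range(1, 201)],
)

# NTILE(100) puts 2 rows in each bucket, so dropping buckets 99 and 100
# removes 4 rows (the top 2%).
ntile_kept = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT volume, NTILE(100) OVER (ORDER BY volume) AS bucket
        FROM offer
    ) WHERE bucket NOT IN (99, 100)
""").fetchone()[0]

# PERCENT_RANK() is (rank - 1) / (row count - 1); keeping values below 0.99
# removes only the 2 largest rows here.
rank_kept = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT volume, PERCENT_RANK() OVER (ORDER BY volume) AS pr
        FROM offer
    ) WHERE pr < 0.99
""").fetchone()[0]

print(ntile_kept, rank_kept)
```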

Related

How to enforce random selection of rows from each of the different countries/cities in PostgreSQL?

I'm working with PostgreSQL in DBeaver. The table has a column addr:country and a column addr:city. The data has around 500 million rows, so I have to do random sampling for testing. I intended to randomly select 1% of the data. However, the data could be highly biased (big countries have many more rows than small ones), so I'm looking for a way to sample fairly: I want to randomly select one or two rows from each city in each country.
The script I'm using is modified from someone else's query, and my script is:
SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version
ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
COUNT(*)
OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt"
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT NULL
It returns error message: SQL Error [42601]: ERROR: syntax error at or near "(" Position: 1683.
I am very new to SQL so probably there are many mistakes in the script. Is there any way to enforce random selection of one/two rows from each addr:city in each addr:country?

You can use the window function dense_rank() with order by random() to number the records in each partition in random order, then keep the first one or two:
with base_data as
(
SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version,
ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
COUNT(*) OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt",
dense_rank() over (partition by "addr:country", "addr:city" order by random()) as ranking
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT null
)
select
*
from base_data
where ranking between 1 and 2
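The same idea can be tried out on a toy table. The sketch below is only an illustration: it uses SQLite (3.25 or later assumed) from Python instead of PostgreSQL, a made-up buildings table with plain country and city columns, and ROW_NUMBER() in place of dense_rank(), which guarantees at most two rows per group even if two random values happen to tie.

```python
import sqlite3
from collections import Counter

# Hypothetical miniature of osm_qa.buildings: 6 country/city groups of 5 rows
# each, plus one group that has only a single row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buildings (osm_id INTEGER, country TEXT, city TEXT)")
rows = [(i, "DE" if i % 2 else "FR", f"city{i % 3}") for i in range(30)]
rows.append((100, "IT", "Rome"))
conn.executemany("INSERT INTO buildings VALUES (?, ?, ?)", rows)

# Number the rows of each (country, city) group in random order,
# then keep the first two of every group.
sample = conn.execute("""
    WITH ranked AS (
        SELECT osm_id, country, city,
               ROW_NUMBER() OVER (
                   PARTITION BY country, city ORDER BY random()
               ) AS rn
        FROM buildings
    )
    SELECT osm_id, country, city
    FROM ranked
    WHERE rn <= 2
""").fetchall()

# Count how many sampled rows each group contributed.
per_group = Counter((country, city) for _, country, city in sample)
print(per_group)
```

Groups with fewer than two rows simply contribute whatever they have.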

How to write SQL to calculate running average with some additional formulae?

Following is the image that has running average calculated by me. But the requirement is a bit extra on top of the running average.
Following is the image where the requirement is in the Microsoft Excel sheet.
So, in order to calculate the running average with formulae like =(3*C4+2*C5+1*C6)/6 that have been gathered in excel sheet, what SQL Query could be written?
Also, if it's not feasible through SQL, then how could I use the Column D from the second image as my measure in SSAS?

Use LAG() with an offset to reach the previous rows, and apply the weights from your formula:
avg_val = ( (3.0 * lag(Open_, 2) over (order by M, [WEEK]) )
+ (2.0 * lag(Open_, 1) over (order by M, [WEEK]) )
+ (1.0 * Open_) ) / 6
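As a quick way to check the arithmetic, here is a small sketch of the same expression run from Python against an in-memory SQLite database (3.25 or later assumed). The prices table, its week and open_ columns, and the four sample values are invented for the example, and the ordering is simplified to a single week column.

```python
import sqlite3

# Hypothetical data: one opening value per week.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (week INTEGER, open_ REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [(1, 6), (2, 12), (3, 18), (4, 24)],
)

# Weight 3 for the value two rows back, 2 for the previous row, 1 for the
# current row. The first two rows have no full window, so LAG() returns NULL
# and the whole expression is NULL for them.
weighted = conn.execute("""
    SELECT week,
           (3.0 * LAG(open_, 2) OVER (ORDER BY week)
          + 2.0 * LAG(open_, 1) OVER (ORDER BY week)
          + 1.0 * open_) / 6 AS avg_val
    FROM prices
    ORDER BY week
""").fetchall()

print(weighted)
```

For week 3 this gives (3*6 + 2*12 + 1*18) / 6 = 10.0.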

Split the results of a query in half

I'm trying to export rows from one database to Excel and I'm limited to 65000 rows at a shot. That tells me I'm working with an Access database but I'm not sure since this is a 3rd party application (MRI Netsource) with limited query ability. I've tried the options posted at this solution (Is there a way to split the results of a select query into two equal halfs?) but neither of them work -- in fact, they end up duplicating results rather than cutting them in half.
One possibly related issue is that this table does not have a unique ID field. Each record's unique ID can be dynamically formed by the concatenation of several text fields.
This produces 91934 results:
SELECT * from note
This produces 122731 results:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY notedate) AS rn FROM note
) T1
WHERE rn % 2 = 1
EDIT: Likewise, this produces 91934 results, half of them with a tile_nr value of 1, the other half with a value of 2:
SELECT *, NTILE(2) OVER (ORDER BY notedate) AS tile_nr FROM note
However this produces 122778 results, all of which have a tile_nr value of 1:
SELECT bldgid, leasid, notedate, ref1, ref2, tile_nr
FROM (
SELECT *, NTILE(2) OVER (ORDER BY notedate) AS tile_nr FROM note) x
WHERE x.tile_nr = 1
I know that I could just use a COUNT to get the exact number of records, run one query using TOP 65000 ORDER BY notedate, and then another that says TOP 26934 ORDER BY notedate DESC, for example, but as this dataset changes a lot I'd prefer some way to automate this to save time.

SQL Percentile Calculation

I have the following query, which even without a ton of data (~3k rows) is still a bit slow to execute, and the logic is a bit over my head - was hoping to get some help optimizing the query or even an alternate methodology:
Select companypartnumber, (PartTotal + IsNull(Cum_Lower_Ranks, 0) ) / Sum(PartTotal) over() * 100 as Cum_PC_Of_Total
FROM PartSalesRankings PSRMain
Left join
(
Select PSRTop.Item_Rank, Sum(PSRBelow.PartTotal) as Cum_Lower_Ranks
from partSalesRankings PSRTop
Left join PartSalesRankings PSRBelow on PSRBelow.Item_Rank < PSRTop.Item_Rank
Group by PSRTop.Item_Rank
) as PSRLowerCums on PSRLowerCums.Item_Rank = PSRMain.Item_Rank
The PartSalesRankings table simply consists of CompanyPartNumber(bigint) which is a part number designation, PartTotal(decimal 38,5) which is the total sales, and Item_Rank(bigint) which is the rank of the item based on total sales.
I'm trying to group my parts into categories based on their percentile, so an "A" item would be in the top 5%, a "B" item would be in the next 15%, and "C" items would be in the lower 80%. The view I created works fine, it just takes almost three seconds to execute, which for my purposes is quite slow. I narrowed the bottleneck down to the above query. Any help would be greatly appreciated.

The problem you are having is the calculation of the cumulative sum of PartTotal. If you are using SQL Server 2012 or later, you can do something like the following, which accumulates the total in Item_Rank order as your original query does:
select (case when ratio <= 0.05 then 'A'
when ratio <= 0.20 then 'B'
else 'C'
end) as category,
t.*
from (select psr.companypartnumber,
(sum(PartTotal) over (order by Item_Rank) * 1.0 / sum(PartTotal) over ()) as ratio
FROM PartSalesRankings psr
) t
SQL Server 2012 also has percentile functions and other functions not in earlier versions.
In earlier versions, the question is how to get the cumulative sum efficiently. Your query is probably as good as anything that can be done in one query. Can the cumulative sum be calculated when partSalesRankings is created? Can you use temporary tables?
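To make the cumulative-ratio idea concrete, here is a minimal sketch run from Python against an in-memory SQLite database (3.25 or later assumed) rather than SQL Server. The part_sales table and its 40 rows are made up, and the running total is ordered by item_rank on the assumption that rank 1 is the best seller.

```python
import sqlite3
from collections import Counter

# Hypothetical data: rank 1 sells 40 units, rank 2 sells 39, ... rank 40 sells 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part_sales (part INTEGER, total REAL, item_rank INTEGER)")
conn.executemany(
    "INSERT INTO part_sales VALUES (?, ?, ?)",
    [(1000 + r, 41 - r, r) for r in range(1, 41)],
)

# One pass: running total in rank order divided by the grand total,
# then a CASE expression turns the ratio into a category.
classified = conn.execute("""
    SELECT part,
           CASE WHEN ratio <= 0.05 THEN 'A'
                WHEN ratio <= 0.20 THEN 'B'
                ELSE 'C'
           END AS category
    FROM (
        SELECT part, item_rank,
               SUM(total) OVER (ORDER BY item_rank) * 1.0
                   / SUM(total) OVER () AS ratio
        FROM part_sales
    )
    ORDER BY item_rank
""").fetchall()

# Tally how many parts landed in each category.
counts = Counter(category for _, category in classified)
print(counts)
```

The window functions replace the self-join from the question, so the table is read once instead of once per rank.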

How to select rows in MySQL while a condition lasts

I have something like this:
Name.....Value
A...........10
B............9
C............8
Meaning, the values are in descending order. I need to create a new table that will contain the values that make up 60% of the total values. So, this could be a pseudocode:
set Total = sum(value)
set counter = 0
foreach line from table OriginalTable do:
counter = counter + value
if counter > 0.6*Total then break
else insert line into FinalTable
end
As you can see, I'm parsing the sql lines here. I know this can be done using handlers, but I can't get it to work. So, any solution using handlers or something else creative will be great.
It should also be in a reasonable time complexity - the solution how to select values that sum up to 60% of the total
works, but it's slow as hell :(
Thanks!!!!

You'll likely need to use the lead() or lag() window function, possibly with a recursive query to merge the rows together. See this related question:
merge DATE-rows if episodes are in direct succession or overlapping
And in case you're using a MySQL version older than 8.0, which has no window functions, you can work around that by using something like this:
Mysql query problem

I don't know which analytical functions SQL Server (which I assume you are using) supports; for Oracle, you could use something like:
select v.*,
cumulative/overall percent_current,
previous_cumulative/overall percent_previous from (
select
id,
name,
value,
cumulative,
lag(cumulative) over (order by id) as previous_cumulative,
overall
from (
select
id,
name,
value,
sum(value) over (order by id) as cumulative,
(select sum(value) from mytab) overall
from mytab
order by id)
) v
Explanation:
- sum(value) over ... computes a running total for the sum
- lag() gives you the value for the previous row
- you can then combine these to find the first row where percent_current > 0.6 and percent_previous < 0.6
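Put together, the running-total approach can replace the loop from the question. The following is a sketch only, run from Python against an in-memory SQLite database (3.25 or later assumed) rather than Oracle or MySQL. The table mytab and its four rows are invented, and it assumes the values are non-negative, so filtering on the running total gives the same rows as breaking out of the loop.

```python
import sqlite3

# Hypothetical data in descending order of value, as in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytab (id INTEGER, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO mytab VALUES (?, ?, ?)",
    [(1, "A", 10), (2, "B", 9), (3, "C", 8), (4, "D", 7)],
)

# Keep every row whose running total is still within 60% of the grand total.
final_rows = conn.execute("""
    SELECT name, value
    FROM (
        SELECT id, name, value,
               SUM(value) OVER (ORDER BY id) AS cumulative,
               SUM(value) OVER ()            AS overall
        FROM mytab
    )
    WHERE cumulative <= 0.6 * overall
    ORDER BY id
""").fetchall()

print(final_rows)
```

Here the total is 34, so the cutoff is 20.4: A (running total 10) and B (19) are kept, while C would push the total to 27.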

We Keep Coding
