I have a large data set of about 100 million rows that I want to 'compress' down to a 1% sample of the entire dataset while preserving the relative distribution of the data.
How can such a query be implemented?
Step 1: create the helper table
You can use aggregation to group records by visit_number, and CROSS JOIN with a query that computes the total number of records in the table to compute each group's share of the total:
CREATE TABLE my_helper AS
SELECT
    t.visit_number,
    COUNT(*) visit_count,
    SUM(t.purchase_id) sum_purchase,
    COUNT(*) * 1.0 / total.cnt distribution  -- multiply by 1.0 to avoid integer division
FROM
    mytable t
    CROSS JOIN (SELECT COUNT(*) cnt FROM mytable) total
GROUP BY t.visit_number, total.cnt
Step 2: sample the main table using the helper table
Within a subquery, you can use ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) to assign a random rank to each record within groups of records sharing the same visit_number. Then, in the outer query, you can join on the helper table to select the correct number of records for each visit_number:
SELECT x.*
FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) rn
FROM mytable t
) x
INNER JOIN my_helper h ON h.visit_number = x.visit_number
WHERE x.rn <= 1000000 * h.distribution
Side notes:
this only works if there are indeed more than 1 million records in the source table
the exact number of records in the output might be slightly below or above 1 million (depending on the distribution in the original table)
it should be possible to combine both queries into a single one, which would avoid the need for a helper table (see the sketch below)
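A minimal sketch of that combined version, assuming the same mytable(visit_number, purchase_id) layout and a dialect that has RANDOM() (e.g. PostgreSQL), replaces the helper table with window aggregates:
SELECT x.*
FROM (
    SELECT
        t.*,
        ROW_NUMBER() OVER (PARTITION BY visit_number ORDER BY RANDOM()) rn,
        COUNT(*) OVER (PARTITION BY visit_number) visit_count,  -- per-group row count
        COUNT(*) OVER () total_cnt                              -- whole-table row count
    FROM mytable t
) x
WHERE x.rn <= 1000000.0 * x.visit_count / x.total_cnt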
This is doable. A quick way is to take every nth record only:
1) order by a roughly random column (probably ID)
2) apply a ROW_NUMBER() attribute
3) apply MOD(rownum, n) = 0 with whatever n makes sense for your percentage (e.g. 1% would be rownum MOD 100)
You may need steps 1 and 2 in a subquery and step 3 on the outside, as in the sketch below.
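A minimal sketch of those three steps, assuming a table mytable with a numeric id column (SQL Server spells the modulo as rownum % 100):
SELECT *
FROM (
    SELECT t.*, ROW_NUMBER() OVER (ORDER BY id) AS rownum  -- steps 1 and 2
    FROM mytable t
) x
WHERE MOD(x.rownum, 100) = 0  -- step 3: keep every 100th row, i.e. a 1% sample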
Enjoy and good luck!
I have the following code, which is taking a very long time to execute. What I need to do is select the rows having row number equal to 1 after partitioning by three columns (col_1, col_2, col_3), which are also the key columns, and ordering by some columns as mentioned below. The number of records in the table is around 90 million. Am I following the best approach, or is there a better one?
with cte as (SELECT
b.*
,ROW_NUMBER() OVER ( PARTITION BY col_1,col_2,col_3
ORDER BY new_col DESC, new_col_2 DESC, new_col_3 DESC ) AS ROW_NUMBER
FROM (
SELECT
*
,CASE
WHEN update_col = ' ' THEN new_update_col
ELSE update_col
END AS new_col_1
FROM schema_name.table_name
) b
)
select top 10 * from cte WHERE ROW_NUMBER=1
Currently you are applying the CASE expression to every row in the table. CASE with string comparison is a costly method.
At the end, you keep only the records with ROW_NUMBER = 1. If that filter discards a large share of your rows, you can cut the query execution time by generating the ROW_NUMBER first, keeping only the rows with RN = 1, and only then applying the CASE expression to the surviving rows (see the sketch below).
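A minimal sketch of that reordering; it is valid here because (as in the original query) the CASE output new_col_1 is not one of the partitioning or ordering columns:
with cte as (
    SELECT
        *,
        ROW_NUMBER() OVER ( PARTITION BY col_1, col_2, col_3
                            ORDER BY new_col DESC, new_col_2 DESC, new_col_3 DESC ) AS rn
    FROM schema_name.table_name
)
select top 10
    *,
    CASE
        WHEN update_col = ' ' THEN new_update_col
        ELSE update_col
    END AS new_col_1  -- CASE now runs only on rows that survived rn = 1
from cte
WHERE rn = 1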
I need to perform TRIMMEAN in Access, which does not have this function.
In a table I have many Employees, and each has many records.
I need to TRIMMEAN the Values for each Employee separately.
The following queries take the TOP 10 percent for all records:
qry_data_TOP10_ASC
qry_data_TOP10_DESC
unionqry_TOP10_ASCandDESC
qry_data_ALL_minus_union_qry
After that, I can use Avg (Average).
But I don't know how to do it for each employee.
Note: this question has been edited to simplify the problem.
You don't really give information in your pseudo code about your data fields, but using your example, which DOES have basic field information, I can suggest that the following should work as you described.
It assumes field1 is your unique record ID, since you make no mention of which fields are keys:
SELECT AVG(qry_data.field2) FROM qry_data WHERE qry_data.field1 NOT IN
(SELECT field1 FROM
    (SELECT TOP 10 PERCENT qry_data.field1, qry_data.field2
     FROM qry_data
     ORDER BY qry_data.field2 ASC)
 UNION
    (SELECT TOP 10 PERCENT qry_data.field1, qry_data.field2
     FROM qry_data
     ORDER BY qry_data.field2 DESC)
)
This should give you what you want: the two sub-queries collect the TOP 10s (ascending and descending) for every employee, the two NOT INs then remove those rows from the Table1 records, and finally you group the Employees and average the Scores.
SELECT Table1.Employee, AVG(Table1.Score) AS AvgScore
FROM Table1
WHERE ID NOT IN
(
SELECT TOP 10 ID
FROM Table1 a
WHERE a.Employee = Table1.Employee
ORDER BY Score ASC, Employee, ID
)
AND ID NOT IN
(
SELECT TOP 10 ID
FROM Table1 b
WHERE b.Employee = Table1.Employee
ORDER BY Score DESC, Employee, ID
)
GROUP BY Table1.Employee;
I'm still learning the tricks of the trade of PostgreSQL. I need a way of taking data and getting the first 100 rows of each dataset.
My problem:
I have a table on the server that has over 60 columns. One column holds the country. I need a breakdown that gives me the first 100 rows for each country, alphabetically.
There are 70 countries in this table. So the total results should be 7,000. How do I break this down?
Use the ROW_NUMBER analytic function:
CREATE VIEW my_view AS
SELECT col1, col2, col3, ...... col60
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY country) AS rn
    FROM my_table  -- "table" itself is a reserved word; use your actual table name
) x
WHERE rn <= 100
ORDER BY country
The above will give 100 arbitrary records for each country. If you do not want such randomness, then add an ORDER BY some_column clause in this way:
ROW_NUMBER() OVER (PARTITION BY country ORDER BY some_column) AS rn
I am trying to get the row number of a row. Since the table doesn't have any id column, I have used ROW_NUMBER() without any meaningful order, as shown below.
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS SNO, *
FROM [table1]
Now the challenge is that I need to find a row with a condition, which is just a SELECT statement with a WHERE clause, but together with the original row number.
SELECT TOP 1 *
FROM table1
WHERE [Total Sales] = 2555
This statement returns a single record. I have tried to use INTERSECT to combine both statements to get the result together with its row number.
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT 100)) AS SNO, *
FROM [table1]
INTERSECT
SELECT TOP 1 *
FROM table1
WHERE [Total Sales] = 2555
Of course, this throws errors since the numbers of columns differ. So what is the correct way to get the actual row number?
When you run this query:
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS SNO, t.*
FROM [table1] t;
The SNO values are unstable. That means that the same query run multiple times might return different numbers. Sorting in SQL is not stable. That means that identical keys can be in an arbitrary order when the query is run multiple times. Why? SQL tables and result sets represent unordered sets. There is nothing to base a stable sort on.
The simplistic answer to your question is to use a subquery:
SELECT t.*
FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS SNO, t.*
FROM [table1] t
) t
WHERE [Total Sales] = 2555;
However, the real answer is that you should be using multiple columns to create a stable sort if you want to use this value across more than one query (see the sketch below).
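A minimal sketch of such a stable version, assuming hypothetical columns col_a and col_b that together uniquely identify a row:
SELECT ROW_NUMBER() OVER (ORDER BY col_a, col_b) AS SNO, t.*
FROM [table1] t;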
SQL does not have an inherent "row number" for the entries. The order shown is entirely a product of the query. If you are looking to keep rows in the order they were put into the DB, then maybe add a timestamp that's generated with a trigger and attached to the row when it's inserted. Then you can have the rows sorted by that timestamp, as sketched below.
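A minimal sketch of that idea in SQL Server syntax, using a default constraint instead of a trigger since a default is enough to stamp each insert (the column and constraint names are hypothetical, and note that pre-existing rows all receive the same stamp):
ALTER TABLE [table1]
    ADD inserted_at DATETIME2 NOT NULL
        CONSTRAINT DF_table1_inserted_at DEFAULT SYSDATETIME();

SELECT ROW_NUMBER() OVER (ORDER BY inserted_at) AS SNO, t.*
FROM [table1] t;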
What's the primary key if there is no ID?
In my SP I have the following:
with Paging(RowNo, ID, Name, TotalOccurrences) as
(
    SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences
    FROM dbo.Videos V INNER JOIN ....
)
SELECT * FROM Paging WHERE RowNo BETWEEN 1 and 50
SELECT COUNT(*) FROM Paging
The result is that I get the error: invalid object name 'Paging'.
Can I query again the Paging table? I don't want to include the count for all results as a new column ... I would prefer to return as another data set. Is that possible?
Thanks, Radu
After more research I found another way of doing this:
with Paging(RowNo, ID, Name, TotalOccurrences) AS
(
    SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences
    FROM dbo.Videos V INNER JOIN ....
)
select RowNo, ID, Name, TotalOccurrences, (select COUNT(*) from Paging) as TotalResults
from Paging
where RowNo between (@PageNumber - 1) * @PageSize + 1 and @PageNumber * @PageSize;
I think this has better performance than running the query twice; a window-function variant is sketched below.
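A hedged variant of the same single-query idea: putting COUNT(*) OVER () inside the CTE captures the total before the paging filter runs (in the outer select it would count only the current page, because window functions are evaluated after WHERE). The INNER JOIN is elided here just as in the original:
with Paging as
(
    SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo,
           COUNT(*) OVER () as TotalResults,  -- total before paging
           V.ID, V.Name, R.TotalOccurrences
    FROM dbo.Videos V INNER JOIN ....
)
select * from Paging
where RowNo between (@PageNumber - 1) * @PageSize + 1 and @PageNumber * @PageSize;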
You can't do that because the CTE you are defining will only be available to the FIRST query that appears after it's been defined. So when you run the COUNT(*) query, the CTE is no longer available to reference. That's just a limitation of CTEs.
So to do the COUNT as a separate step, you'd need to skip the CTE and run the COUNT against the full underlying query instead.
Or, you could wrap the CTE up in an inline table valued function and use that instead, to save repeating the main query, something like this:
CREATE FUNCTION dbo.ufnExample()
RETURNS TABLE
AS
RETURN
(
with Paging(RowNo, ID, Name, TotalOccurrences) as
(
SELECT ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences FROM dbo.Videos V INNER JOIN ....
)
SELECT * FROM Paging
)
SELECT * FROM dbo.ufnExample() x WHERE RowNo BETWEEN 1 AND 50
SELECT COUNT(*) FROM dbo.ufnExample() x
Please be aware that Radu D's solution's query plan shows double hits to those tables; it is doing two executions under the covers. However, this may still be the best way, as I haven't found a truly scalable one-query design.
A less scalable one-query design is to dump the complete ordered list into a table variable, SELECT @@ROWCOUNT to get the full count, and select from the table variable where the row number is between X and Y. This works well for <10000 rows, but with results in the millions of rows, populating that table variable gets expensive.
A hybrid approach is to populate this temp table or table variable with up to 10000 rows. If fewer than 10000 rows come back, you're set. If all 10000 rows are filled, you'll need to rerun the search to get the full count (see the sketch below). This works well if most of your queries return well under 10000 rows; the 10000 limit is a rough approximation, and you can tune the threshold for your case.
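A minimal sketch of that hybrid, assuming the simplified case where everything needed lives on dbo.Videos, the listed column types are guesses, and @PageNumber/@PageSize are already declared:
DECLARE @buf TABLE (RowNo int, ID int, Name nvarchar(200), TotalOccurrences int);

INSERT INTO @buf (RowNo, ID, Name, TotalOccurrences)
SELECT TOP (10000)
       ROW_NUMBER() OVER (ORDER BY TotalOccurrences DESC), ID, Name, TotalOccurrences
FROM dbo.Videos
ORDER BY TotalOccurrences DESC;

DECLARE @total int = @@ROWCOUNT;
IF @total = 10000  -- cap was hit, so fall back to a full count
    SELECT @total = COUNT(*) FROM dbo.Videos;

SELECT *, @total AS TotalResults
FROM @buf
WHERE RowNo BETWEEN (@PageNumber - 1) * @PageSize + 1 AND @PageNumber * @PageSize;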
Write "AS" after the CTE table name Paging as below:
with Paging AS (RowNo, ID, Name, TotalOccurrences) as
(
ROW_NUMBER() over (order by TotalOccurrences desc) as RowNo, V.ID, V.Name, R.TotalOccurrences FROM dbo.Videos V INNER JOIN ....
)
SELECT * FROM Paging WHERE RowNo BETWEEN 1 and 50
SELECT COUNT(*) FROM Paging