I have a table with millions of rows and 940 columns. I'm really hoping there is a way to summarize this data. I want to see frequencies for each value in EVERY column. I used the code below for a handful of the columns, but I won't be able to add many more columns before the processing becomes too expensive.
SELECT
f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
,count(1) AS Frequency
FROM
(SELECT a.account, ntile(3) over (order by sum(a.seconds) desc) as ntile
,f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
FROM demo as c
JOIN aggregates a on c.customer_account = a.account
WHERE a.month IN ('201804', '201805', '201806')
GROUP BY a.account
,f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
) t
WHERE ntile = 1
GROUP BY
f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
The problem is that the GROUP BY will be far too cumbersome. Is there any other way? It would be really helpful to be able to see where the high frequencies are in such a large dataset.
Using an index can help you get much faster results for this kind of query. The best thing to do would depend on what other fields the table has and what other queries run against it. Without more details, I would suggest a non-clustered index on (month, account) that includes f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64 on aggregates (or demo, or customer; I don't know which table holds these fields). For example, this index:
CREATE NONCLUSTERED INDEX IX_fasterquery
ON aggregates (month, account)
INCLUDE (f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64);
With that index in place, SQL Server does not need to touch the actual table at all when running the query: it can find all rows for a given (month, account) in the index, and it can do that really fast because the index allows precisely for fast access on the fields that define its key. The index also carries the f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64 values for each row. In your case, wrapping the query in a stored procedure may also get you a better result.
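As a sketch of that stored-procedure suggestion (assuming SQL Server, as the index syntax above implies; the procedure name is made up and the body is just the query from the question, with a derived-table alias added):

CREATE PROCEDURE dbo.usp_AgeBandFrequencies
AS
BEGIN
    SET NOCOUNT ON;

    SELECT f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64,
           COUNT(1) AS Frequency
    FROM (
        SELECT a.account,
               NTILE(3) OVER (ORDER BY SUM(a.seconds) DESC) AS ntile,
               f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64
        FROM demo AS c
        JOIN aggregates a ON c.customer_account = a.account
        WHERE a.month IN ('201804', '201805', '201806')
        GROUP BY a.account, f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64
    ) t
    WHERE ntile = 1
    GROUP BY f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64;
END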
I am so used to python's df.shape which just gives you the number of rows and columns in a dataframe with no fuss.
However, I am having a difficult time finding a simple way of doing this in SQL.
When I search for a solution on Stack Overflow, they either output the number of rows using count(*) or output the number of columns using [INFORMATION_SCHEMA].[COLUMNS].
I have never seen an output with both dimensions.
I have hacked this together, testing on my table Employee_Location:
SELECT count(*) as Dims
FROM [INFORMATION_SCHEMA].[COLUMNS]
WHERE table_name = 'Employee_Location'
union
select count(*) from Employee_Location;
Which outputs:
Dims
7
314636
Is there a simpler way to get this information? Some sort of function like python/pandas df.shape?
I guess I could wrap my piece of code above in a stored procedure or function to make it easier.
I don't believe there is any built in function for this.
Although you already have your answer, I will just offer an alternative method which will be more performant (i.e. it doesn't actually scan the table). The row count from sys.partitions is documented as approximate, although in my experience it has never been different. If your tables are partitioned, the rows will need to be summed with SUM rather than MAX.
select Count(*) as Cols, Max(x.rows) as [Rows]
from information_schema.columns c
cross apply (
    select p.rows
    from sys.tables t
    join sys.partitions p on p.object_id = t.object_id
    where t.name = c.TABLE_NAME and p.index_id <= 1
) x
where c.TABLE_NAME = 'Employee_Location'
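If you want the one-call convenience mentioned in the question, a minimal sketch of wrapping this in a stored procedure (the procedure and parameter names here are made up):

CREATE PROCEDURE dbo.usp_TableShape
    @TableName sysname   -- hypothetical parameter: the table to measure
AS
BEGIN
    SET NOCOUNT ON;

    SELECT Count(*) AS Cols, Max(x.rows) AS [Rows]
    FROM information_schema.columns c
    CROSS APPLY (
        SELECT p.rows
        FROM sys.tables t
        JOIN sys.partitions p ON p.object_id = t.object_id
        WHERE t.name = c.TABLE_NAME AND p.index_id <= 1
    ) x
    WHERE c.TABLE_NAME = @TableName;
END

Usage would then be, for example: EXEC dbo.usp_TableShape @TableName = 'Employee_Location';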
I'm writing a query against what is currently a small table in development. In production, we expect it to grow quite large over the life of the table (the primary key is a number(10)).
My query does a selection for the top N rows of my table, filtered by specific criteria and ordered by date ascending. Essentially, we're assigning records, in bulk, to a specific user for processing. In my case, N will only be 10, 20, or 30.
I'm currently selecting my primary keys inside a subselect, using rownum to limit my results, like so:
SELECT log_number FROM (
SELECT
il2.log_number,
il2.final_date
FROM log il2
INNER JOIN agent A ON A.agent_id = il2.agent_id
INNER JOIN activity lat ON il2.activity_id = lat.activity_id
WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
AND lat.criteria2 = p_criteria2
AND lat.criteria3 = p_criteria3
AND il2.criteria3 = p_criteria4
AND il2.current_user IS NULL
GROUP BY il2.log_number, il2.final_date
ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Although I have a stopkey due to the rownum, I'm wondering if using an Oracle hint here (/*+ FIRST_ROWS(p_how_many) */) on the inner select will affect the query plan in the future. I'd like to know more about what the database does when this hint is specified; does it actually make a difference if you have to order the table? (Seems like it wouldn't.) Or does it only affect the select portion, after the access and join parts?
Looking at the explain plan now doesn't get me much as the table hasn't grown yet.
Thanks for your help!
Even with an ORDER BY, different execution plans could be selected when you limit the number of rows returned. It can be easier to select the top n rows by some order key, then sort those, than to sort the entire table then select the top n rows.
However, the GROUP BY is likely to restrict the benefit of this sort of optimization. Grouping (or a DISTINCT operation) generally prevents the optimizer from using a plan that can pipe individual rows into a STOPKEY operation.
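For reference, this is where the hint from the question would sit in the inner select. Note that, as far as I know, hint arguments have to be literal integers, so FIRST_ROWS(p_how_many) would simply be ignored; a sketch with a literal value:

SELECT log_number FROM (
    SELECT /*+ FIRST_ROWS(10) */   -- literal value; a parameter name inside the hint is not resolved
           il2.log_number,
           il2.final_date
    FROM log il2
    INNER JOIN agent A ON A.agent_id = il2.agent_id
    INNER JOIN activity lat ON il2.activity_id = lat.activity_id
    WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
      AND lat.criteria2 = p_criteria2
      AND lat.criteria3 = p_criteria3
      AND il2.criteria3 = p_criteria4
      AND il2.current_user IS NULL
    GROUP BY il2.log_number, il2.final_date
    ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;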
I've been waiting over an hour already for this query, so I know I'm probably doing something wrong. Is there a more efficient way to write this query?
select RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
where RespondentID in (
1418283,
1419863,
1421188,
1422101,
1431384,
1435526,
1437284,
1441394,
/* etc etc THOUSANDS */
1579244 )
and EntryDate between
'07-11-2011' and '07-31-2012'
GROUP BY RespondentID
I know that my date range is pretty big, but I can't change that part (the dates are spread all over).
Also, the reason for MIN(SessionID) is that otherwise we get many SessionIDs for each Respondent, and one suffices (it's taking MIN on an alphanumeric ID like ach2a23a-adhsdx123... and getting the first alphabetically).
Thanks
Put your thousands of numbers in a temporary table.
Index the number field in that table.
Index the RespondentID field in BIG_SESSIONS
Join the two tables
eg:
select BIG_Sessions.RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
inner join RespondentsFilterTable
on BIG_SESSIONS.RespondentID = RespondentsFilterTable.RespondentID
where EntryDate between '07-11-2011' and '07-31-2012'
GROUP BY BIG_Sessions.RespondentID
You could add indexes to EntryDate and SessionID as well, but if you're adding to big_sessions frequently, this could be counter productive elsewhere
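A sketch of the first three steps, to go with the join above (I'm matching the table name used in the example; the index names are made up, and the table could equally be a #temp table):

-- 1. Put the thousands of numbers in a filter table
CREATE TABLE RespondentsFilterTable (RespondentID int NOT NULL);

INSERT INTO RespondentsFilterTable (RespondentID)
VALUES (1418283), (1419863), (1421188) /* , ... the rest of the list ... */ , (1579244);

-- 2. Index the number field in that table
CREATE INDEX IX_RespondentsFilter_RespondentID ON RespondentsFilterTable (RespondentID);

-- 3. Index the RespondentID field in BIG_Sessions (skip if an index already exists)
CREATE INDEX IX_BIG_Sessions_RespondentID ON BIG_Sessions (RespondentID);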
In general, you can get hints of how the performance of a query can be improved by studying the estimated (or, if possible, actual) execution plans.
If the smallest and largest IDs in the IN list are known beforehand, then, depending on how many IDs are in the table, adding RespondentID > [smallest_known_id - 1] AND RespondentID < [largest_known_id + 1] before the IN clause would help limit the rows that have to be examined.
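Applied to the query in the question, that would look like this (the boundary values are just the first and last IDs shown in the question's list):

select RespondentID, MIN(SessionID) as 'SID'
from BIG_Sessions (nolock)
where RespondentID > 1418282 and RespondentID < 1579245   -- range guard around the IN list
  and RespondentID in (1418283, 1419863, /* ... thousands ... */ 1579244)
  and EntryDate between '07-11-2011' and '07-31-2012'
GROUP BY RespondentID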
I have the following query which takes too long to retrieve around 70000 records. I noticed that the execution time is proportional to the number of the records retrieved. I need to optimize this query so that the execution time is not proportional to the number of records retrieved. Any idea?
;WITH TT AS (
SELECT TaskParts.[TaskPartID],
PartCost,
LabourCost,
VendorPaidPartAmount,
VendorPaidLabourAmount,
ROW_NUMBER() OVER (ORDER BY [Employees].[EmpCode] asc) AS RowNum
FROM [TaskParts], [Tasks], [WorkOrders], [Employees], [Status], [Models], [SubAccounts]
WHERE 1 = 1
  AND (TaskParts.TaskLineID = Tasks.TaskLineID)
  AND (Tasks.WorkOrderID = [WorkOrders].WorkOrderID)
  AND (Tasks.EmpID = [Employees].EmpID)
  AND (TaskParts.StatusID = [Status].StatusID)
  AND (Models.ModelID = Tasks.FailedModelID)
  AND (SubAccounts.SubAccountID = Tasks.SubAccountID)
  AND (SubAccounts.GLAccountID = 5))
SELECT
COUNT(0),
SUM(ISNULL(PartCost,0)),
SUM(ISNULL(LabourCost,0)),
SUM(ISNULL(VendorPaidPartAmount,0)),
SUM(ISNULL(VendorPaidLabourAmount,0))
FROM TT
As Lieven noted, you can remove TD0, TD1 and TP1 as they are redundant.
You can also remove the row_number column, as that is not used and windowing functions are relatively expensive.
It may also be possible to remove some of the tables from the TT CTE if they are not used; however, as table names have not been included with each column selected, it isn't possible to tell which tables are not being used.
Aside from that, your query's response time will always be proportional to the number of qualifying rows, because the RDBMS has to read every one of those rows to calculate the results.
Make sure that you have a supporting index for each foreign key. Also, although it is most probably not the issue in this case, the MS SQL optimizer works better with explicit inner joins.
I also don't see any reason why you need RowNum if you only need totals.
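Putting those suggestions together, a sketch of the query rewritten with explicit inner joins and without the CTE and ROW_NUMBER (table and column names taken from the question) might look like this:

SELECT
    COUNT(0),
    SUM(ISNULL(tp.PartCost, 0)),
    SUM(ISNULL(tp.LabourCost, 0)),
    SUM(ISNULL(tp.VendorPaidPartAmount, 0)),
    SUM(ISNULL(tp.VendorPaidLabourAmount, 0))
FROM [TaskParts] tp
INNER JOIN [Tasks] t        ON tp.TaskLineID = t.TaskLineID
INNER JOIN [WorkOrders] wo  ON t.WorkOrderID = wo.WorkOrderID
INNER JOIN [Employees] e    ON t.EmpID = e.EmpID
INNER JOIN [Status] s       ON tp.StatusID = s.StatusID
INNER JOIN [Models] m       ON m.ModelID = t.FailedModelID
INNER JOIN [SubAccounts] sa ON sa.SubAccountID = t.SubAccountID
WHERE sa.GLAccountID = 5;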
I have a SQL table that contains data of the form:
Id int
EventTime dateTime
CurrentValue int
The table may have multiple rows for a given id that represent changes to the value over time (the EventTime identifying the time at which the value changed).
Given a specific point in time, I would like to be able to calculate the count of distinct Ids for each given Value.
Right now, I am using a nested subquery and a temporary table, but it seems it could be much more efficient.
SELECT [Id],
(
SELECT
TOP 1 [CurrentValue]
FROM [ValueHistory]
WHERE [Ids].[Id]=[ValueHistory].[Id] AND
[EventTime] < @StartTime
ORDER BY [EventTime] DESC
) as [LastValue]
INTO #temp
FROM [Ids]
SELECT [LastValue], COUNT([LastValue])
FROM #temp
GROUP BY [LastValue]
DROP TABLE #temp
Here is my first go:
select ids.Id, count( distinct currentvalue)
from ids
join valuehistory vh on ids.id = vh.id
where vh.eventtime < @StartTime
group by ids.id
However, I am not sure I understand your table model very clearly, or the specific question you are trying to solve.
This would be: the count of distinct 'currentvalues' from valuehistory before a certain date, for each Id.
Is that what you are looking for?
I think I understand your question.
You want to get the most recent value for each id, group by that value, and then see how many ids have that same value? Is this correct?
If so, here's my first shot:
declare @StartTime datetime
set @StartTime = '20090513'
select ValueHistory.CurrentValue, count(ValueHistory.id)
from
(
select id, max(EventTime) as LatestUpdateTime
from ValueHistory
where EventTime < @StartTime
group by id
) CurrentValues
inner join ValueHistory on CurrentValues.id = ValueHistory.id
and CurrentValues.LatestUpdateTime = ValueHistory.EventTime
group by ValueHistory.CurrentValue
No guarantee that this is actually faster though - for this to work with any decent speed you'll need an index on EventTime.
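For example (the index name is made up; the single-column index is what the paragraph above calls for, and the commented variant is just an optional covering alternative for the (Id, EventTime) lookup pattern in these queries):

CREATE INDEX IX_ValueHistory_EventTime ON ValueHistory (EventTime);

-- optional covering variant for lookups by id and time:
-- CREATE INDEX IX_ValueHistory_Id_EventTime ON ValueHistory (Id, EventTime) INCLUDE (CurrentValue);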
Let us keep in mind that, because the SQL language describes what you want and not how to get it, there are many ways of expressing a query that will eventually be turned into the same query execution plan by a good query optimizer. Of course, the level of "good" depends on the database you're using.
In general, subqueries are just a syntactically different way of describing joins. The query optimizer is going to recognize this and determine, to the best of its knowledge, the optimal way to execute the query. Temporary tables may be created as needed. So in many cases, re-working the query is going to do nothing for your actual execution time -- it may come out to the same query execution plan in the end.
If you're going to attempt to optimize, you need to examine the query plan by doing a describe on that query. Make sure it's not doing full-table scans against large tables, and is picking the appropriate indices where possible. If, and only if, it is making sub-optimal choices here, should you attempt to manually optimize the query.
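Since this looks like SQL Server (temp tables and bracketed names), one way to get the estimated plan without running the query is, for instance, to enable SHOWPLAN for the batch (the sample query below is just a stand-in built from the tables in the question):

SET SHOWPLAN_XML ON;
GO
-- the query under investigation is compiled but not executed
SELECT vh.CurrentValue, COUNT(DISTINCT vh.Id)
FROM ValueHistory vh
WHERE vh.EventTime < '20090513'
GROUP BY vh.CurrentValue;
GO
SET SHOWPLAN_XML OFF;
GO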
Now, having said all that, the query you pasted isn't entirely compatible with your stated goal of "calculat[ing] the count of distinct Ids for each given Value". So forgive me if I don't quite answer your need, but here's something to perf-test against your current query. (Syntax is approximate, sorry -- away from my desk).
SELECT ids.[Id], vh1.[CurrentValue], COUNT(vh2.[CurrentValue])
FROM [IDs] AS ids
JOIN [ValueHistory] AS vh1 ON ids.[Id] = vh1.[Id]
JOIN [ValueHistory] AS vh2 ON vh1.[CurrentValue] = vh2.[CurrentValue]
GROUP BY ids.[Id], vh1.[CurrentValue];
Note that you'll probably see better performance increases by adding indices to make those joins optimal than re-working the query, assuming you're willing to take the performance hit to update operations.