I am so used to python's df.shape which just gives you the number of rows and columns in a dataframe with no fuss.
However I am having a difficult time finding a simple way of doing this in SQL.
When I search for a solution on Stack Overflow, they either output the number of rows using count(*) or output the number of columns using [INFORMATION_SCHEMA].[COLUMNS].
I have never seen an output with both dimensions.
I have hacked this together, testing on my table Employee_Location:
SELECT count(*) as Dims
FROM [INFORMATION_SCHEMA].[COLUMNS]
WHERE table_name = 'Employee_Location'
union
select count(*) from Employee_Location;
Which outputs:
Dims
7
314636
Is there a simpler way to get this information? Some sort of function like pandas' df.shape?
I guess I could wrap my piece of code above in a stored procedure or function to make it easier.
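For illustration, such a wrapper could look roughly like this (a minimal sketch, assuming SQL Server; the procedure name dbo.usp_TableShape is made up for the example):
-- Hypothetical helper that mimics df.shape: returns row and column counts for a table.
-- Assumes SQL Server; the procedure name is illustrative only.
CREATE PROCEDURE dbo.usp_TableShape @table_name sysname
AS
BEGIN
    DECLARE @sql nvarchar(max) = N'
        SELECT
            (SELECT COUNT(*) FROM ' + QUOTENAME(@table_name) + N') AS [Rows],
            (SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS
             WHERE TABLE_NAME = @tn) AS Cols;';
    EXEC sp_executesql @sql, N'@tn sysname', @tn = @table_name;
END;
GO

-- Usage:
EXEC dbo.usp_TableShape 'Employee_Location';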
I don't believe there is any built-in function for this.
Although you already have your data, I'll just offer an alternative method which will be more performant (i.e. it doesn't actually scan the table). The row count from sys.partitions is documented as approximate, although it has never differed in my experience. If your tables are partitioned, the rows will need to be summed with SUM.
select Count(*) as Cols, Max(x.[rows]) as [rows]
from information_schema.columns c
cross apply (
    select p.[rows]
    from sys.tables t
    join sys.partitions p on p.object_id = t.object_id
    where t.name = c.TABLE_NAME and p.index_id <= 1
) x
where c.TABLE_NAME = 'Employee_Location';
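If the table is partitioned, the row count needs to be summed across partitions as mentioned above; a sketch of that variant:
-- Variant that sums sys.partitions rows, for partitioned tables (sketch).
select
    (select count(*)
     from information_schema.columns
     where TABLE_NAME = 'Employee_Location') as Cols,
    (select sum(p.[rows])
     from sys.tables t
     join sys.partitions p on p.object_id = t.object_id
     where t.name = 'Employee_Location' and p.index_id <= 1) as [rows];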
Related
I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count sys_timestamp
100 167877672772
110 167877672769
121 167877672987
111 167877673877
... ...
With both fields defined as numeric.
With the help of answers from Stack Overflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
       (SELECT COUNT(*) FROM tableA t2
        WHERE t2.sys_timestamp > t1.sys_timestamp AND
              t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
              t2.event_count >= t1.event_count + 10)
       AS positive,
       (SELECT COUNT(*) FROM tableA t2
        WHERE t2.sys_timestamp > t1.sys_timestamp AND
              t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
              t2.event_count <= t1.event_count - 10)
       AS negative
FROM tableA AS t1;
The query works as expected and, in this particular example, returns for each row a count of positive and negative excesses (range +/- 10) within the defined time window (+ 1000 [milliseconds]).
However, I will have to run such queries on tables with several million (perhaps even 100+ million) rows, and even with about 500k rows the query takes a very long time to complete. Furthermore, while the time frame always remains the same within a given query (though the window size can change from query to query), in some instances I will have to use maybe 10 additional conditions similar to the positive/negative excesses in the same query.
Thus, I am looking for ways to improve the above query, primarily to achieve better performance given the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it is not executed twice (or several times), i.e. how can I reuse it within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query, aside from reusing the (partial) subquery, that materially increase the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python, etc.?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
    SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
         , COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
    FROM tableA t2
    WHERE t2.sys_timestamp > t1.sys_timestamp
      AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
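A conversion along the lines of these suggestions could look like this (a sketch only; it assumes event_count holds whole numbers and that sys_timestamp stores Unix epoch milliseconds, which the "+ 1000 [milliseconds]" comment in the question suggests):
-- Sketch: assumes whole-number event counts and epoch-millisecond timestamps.
ALTER TABLE tableA
    ALTER COLUMN event_count TYPE bigint USING event_count::bigint;

ALTER TABLE tableA
    ALTER COLUMN sys_timestamp TYPE timestamptz
    USING to_timestamp((sys_timestamp / 1000.0)::double precision);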
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
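For example (a sketch; the index name is made up):
-- Multicolumn index supporting the range condition on sys_timestamp;
-- including event_count allows index-only scans for the counting subquery.
CREATE INDEX tablea_sys_timestamp_event_count_idx
    ON tableA (sys_timestamp, event_count);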
Depending on the exact data distribution (most importantly, how much the time frames overlap) and other DB characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is "with" statements (AKA subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step creates parent-child detail rows: the table joined to itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary index or you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
    SELECT distinct t1.event_count as startEventCount, t1.event_count + 10 as pEndEventCount,
           t1.event_count - 10 as nEndEventCount, t2.event_count as t2EventCount
    FROM tableA t1, tableA t2
    where t2.sys_timestamp between t1.sys_timestamp AND t1.sys_timestamp + 1000
), posSummary as
(
    select bq.startEventCount, count(*) as positive
    from baseQuery bq
    where bq.t2EventCount between bq.startEventCount and bq.pEndEventCount
    group by bq.startEventCount
), negSummary as
(
    select bq.startEventCount, count(*) as negative
    from baseQuery bq
    where bq.t2EventCount between bq.nEndEventCount and bq.startEventCount
    group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count = ps.startEventCount
inner join negSummary ns on t1.event_count = ns.startEventCount;
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
WITH statements are used in multi-dimensional data warehouse queries because, when you have so much data to join across so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed and perhaps how to minimize the rows the query needs to deal with further down the line.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.
I have a table with millions of rows and 940 columns. I'm really hoping there is a way to summarize this data. I want to see frequencies for each value for EVERY column. I used this code with a few of the columns, but I won't be able to get many more columns in before the processing is too large.
SELECT
f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
,count(1) AS Frequency
FROM
(SELECT a.account, ntile(3) over (order by sum(a.seconds) desc) as ntile
,f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
FROM demo as c
JOIN aggregates a on c.customer_account = a.account
WHERE a.month IN ('201804', '201805', '201806')
GROUP BY a.account
,f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
) t
WHERE ntile = 1
GROUP BY
f19_24
,f25_34
,f35_44
,f45_49
,f50_54
,f55_59
,f60_64
The problem is that the GROUP BY will be far too cumbersome. Is there any other way? It would be really helpful to be able to see where the high frequencies are in such a large dataset.
Using an index can help you get a much faster result for this kind of query. The best thing to do would depend on what other fields the table has and what other queries run against that table. Without more details, a non-clustered index on (month, account) that includes
f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64 on aggregates or demo (I don't know which table contains these fields) could help, for example this index:
CREATE NONCLUSTERED INDEX IX_fasterquery
ON aggregates (month, account)
INCLUDE (f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64);
That's because if you have that index in place, SQL Server will not need to access the actual table at all when running the query: it can find all rows with a given (month, account) in the index, and it can do that really fast because the index allows precisely for fast access on the fields that define its key. The index also carries the f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64 values for each row. In your case, wrapping this query in a stored procedure may also give a better result.
Is it possible to determine the type of data of each column after a SQL selection, based on the received results? I know it is possible through information_schema.columns, but the data I receive comes from multiple tables, is joined together, and the columns are renamed. Besides that, I'm not able to see or use this query or execute other queries myself.
My job is to store this received data in another table, but without knowing beforehand what I will receive. I'm obviously able to check, for example, whether a certain column contains numbers or text, but not whether it was originally stored as a TINYINT(1) or a BIGINT(128). How to approach this? To clarify, it is alright if the data types of the columns of the source and destination aren't entirely the same, but I don't want to reserve too much space beforehand (or too little, for that matter).
As I'm typing, I realize I'm formulating the question wrong. What would be the best approach to handle the described situation? I thought about altering tables on the run (e.g. increasing size if needed), but that seems a bit, well, wrong and not the proper way.
Thanks
Can you issue the following query against your new table after you create it?
SELECT *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'JoinedQueryResults'
Is the query too big to run before knowing how big the results will be? Get an idea of how many rows it may return; the trick with queries that involve joins is to group on the columns you are joining on, so that your estimate returns more quickly. Here's an example of just returning a row count for the query that would have created the JoinedQueryResults table above.
SELECT SUM(A.NumRows * B.NumRows)
FROM (SELECT ID, COUNT(*) AS NumRows
FROM TableA
GROUP BY ID) AS A
INNER JOIN (SELECT ID, COUNT(*) AS NumRows
FROM TableB
GROUP BY ID) AS B ON A.ID = B.ID
The query above will run faster if all you need is a record count to help you estimate a size.
Also try instantiating a table for your results with a query like this.
SELECT TOP 0 *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID
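The INFORMATION_SCHEMA query shown earlier can then be run against this empty table to see which data types were inferred for the joined, renamed columns (a sketch):
-- Inspect the column names and data types inferred for the result set.
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'JoinedQueryResults';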
Please take a look at the attached image for the question as well as the corresponding data table and an example of what the output should look like.
You can solve this problem by using a nested query with the AVG aggregate function and a left outer join:
select z2.zip_code, z2.measurement_date, z2.noon_temp
from zip_temps z2 left outer join
(
select avg(z1.noon_temp) noon_temp, z1.zip_code
from zip_temps z1
group by z1.zip_code ) z3
on (z2.zip_code=z3.zip_code)
where z3.noon_temp < z2.noon_temp;
Here is a solution using analytic functions.
select zip_code, measurement_date, noon_temp
from (
       select zip_code, measurement_date, noon_temp,
              avg(noon_temp) over (partition by zip_code) as avg_temp
       from   zip_temps
     )
where noon_temp > avg_temp
;
(Add an order by clause at the end if needed.)
The older approach is to have an aggregate subquery that processes the entire input table once and produces the average temperature for each zip code. Then the main query reads the main table a second time, joins to this aggregate subquery by zip code, and outputs the needed rows. So the base table is read twice, and we have a join as well. (Other approaches, such as correlated subqueries, would require even more work, but the optimizer is smart enough to transform them into a join.)
Analytic functions were introduced specifically for this kind of problem, to reduce the amount of work needed. The average temperature (by zip code) is calculated essentially the same way (by partitioning or grouping by zip code), it is attached to every row in the input, and then the where clause in the outer query compares two values in the same row output by the subquery. There is no need to read the base data a second time, and there is no join.
Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered Index
name nvarchar(25)
Function someFunction
does some math, then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) = anotherTable.id
Question:
While timing these two queries I found that at large data sets the first query using IN is much faster than the second query using an INNER JOIN. I do not understand why; can someone help explain, please?
Execution Plan
Generally speaking IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan, though, it can be seen that in this case the two queries are semantically the same:
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
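As a tiny worked illustration (made-up values, not from the question): if a single row in A with Col2 = 5 matches two rows in B, the JOIN feeds the value 5 into the aggregate twice. MAX still returns 5, but SUM would return 10 and COUNT would return 2:
-- Duplicate value aggregated twice: MAX is unchanged, SUM and COUNT are not.
SELECT MAX(v) AS max_v, SUM(v) AS sum_v, COUNT(*) AS count_v
FROM (VALUES (5), (5)) AS dup(v);   -- returns 5, 10, 2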
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won't necessarily be very reliable as the join predicate is unsargable.
Your second query is a bit funny - can you try this one instead?
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans, and possibly post them here. Without knowing a lot more about your tables (amount and distribution of data, etc.) and your system (RAM, disk, etc.), it's really hard to give a "globally" valid statement.
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
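A sketch of what such a replacement could look like (the function body is a placeholder, since the real math inside dbo.SomeFunction isn't shown in the question):
-- Hypothetical inline table-valued replacement for the scalar UDF.
-- The calculation below is a placeholder for whatever dbo.SomeFunction really does.
CREATE FUNCTION dbo.SomeFunctionTVF (@id int)
RETURNS TABLE
AS
RETURN
(
    SELECT @id * 2 + 1 AS id   -- placeholder math
);
GO

-- Used with CROSS APPLY so the optimizer can inline the function:
SELECT y.name
FROM dbo.y
CROSS APPLY dbo.SomeFunctionTVF(y.id) AS f
WHERE f.id IN (SELECT id FROM dbo.AnotherTable);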
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.