I have a requirement to check the space occupied by a specific table per day per system. A short background: I have about 10 systems; from each system we process the daily ETL loads, and the counts can be observed based on the date field.
Database: Oracle 11g. I want the size in GB.
Example:
SYSTEM     PROCESS_DATE   COUNT(*)
RETAIL     26.02.2021     100
PHARMACY   26.02.2021     200
BANKING    26.02.2021     300
Query 1 - to check the daily counts per system:
select system, count(*)
from AUDIT_SCH.DWH_ADT_TBL
where trunc(process_date) = trunc(sysdate)
group by system
order by count(*) desc;
But what I want is to capture the space consumed by the daily loads per system from this table. Is this possible?
It's confusing going through the various suggestions; this is the reference I was looking at: "How do I calculate tables size in Oracle".
Any suggestions with a query?
Use the function VSIZE to sum the number of bytes used in each column, per system:
select
    system,
    count(*),
    round(sum
    (
        nvl(vsize(system), 0) +
        nvl(vsize(process_date), 0) +
        nvl(vsize(column1), 0) +
        nvl(vsize(column2), 0)
        --Add all other columns here
    )/1024/1024/1024) gb
from DWH_ADT_TBL
where trunc(process_date) = trunc(sysdate)
group by system
order by count(*) desc;
Unfortunately, calculating the size of things in a database can be ridiculously complicated. You may need to worry about:
Overhead. The VSIZE function does not account for row overhead, block overhead, segment overhead, and unused space in files/ASM diskgroups/volumes, etc.
Compression. If the table, tablespace, or LOBs are compressed or encrypted, VSIZE will return the uncompressed size, not the space actually used.
Indexes. VSIZE does not include index sizes. But if you're only interested in comparing systems, then the percentage of data will still be the same even if the absolute sizes are off. (Unless you have indexed columns that are only used by one system.)
LOBs. You may need to use DBMS_LOB.GETLENGTH to calculate the size of LOBs. For CLOBs you may need to multiply the result by 2 depending on the character set - for UCS2, each character uses 2 bytes.
But in practice the above query is still good enough to give you a decent understanding of where the space is used.
If you have multiple tables with many columns you could generate the queries using the data dictionary, by querying from DBA_TAB_COLUMNS.
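For example, something along these lines could build the column list for you. This is only a sketch: it assumes 11gR2 or later (for LISTAGG), uses the owner and table name from the question, and may hit the 4000-byte LISTAGG limit on very wide tables:

-- Sketch: generate the per-column NVL(VSIZE(...)) expression from the data dictionary.
select 'nvl(vsize(' ||
       listagg(column_name, '), 0) + nvl(vsize(') within group (order by column_id) ||
       '), 0)' as vsize_expression
from   dba_tab_columns
where  owner      = 'AUDIT_SCH'
and    table_name = 'DWH_ADT_TBL';

The generated expression can then be pasted into the SUM(...) in the query above.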
I've been trying to visualise how to do this with a CTE, as on the surface that appears to be the best way, but I just can't get it going. Maybe it needs a temp table as well. I am using SQL Server 2008 R2.
I need to create intercepts (a length along a line essentially) with the following parameters.
The average of the intercept must be greater than .7
The aim is to get the largest intercept possible
There can be up to 2 consecutive meters of values less than .7 (internal waste) but no more
There is no limit to the total internal waste within an intercept
There is no minimum intercept length (well there is but I'll take care of it later)
Note: there will be no gaps, as I have taken care of that, and the from and to values can be decimal.
An example is shown in the first attached image; the second image shows the same data in space, with the assay on the left and depth on the right.
For a little more clarity if needed: intervals 6 to 7 and 17 to 18 are not part of the larger intercept because the internal waste (7-9 and/or 15-17) would bring the average below 0.7, not because of the amount of internal waste.
However, the result for 21-22 is not included because there are 3 meters of internal waste between it and the result for 17-18.
Note that there are multiple sites and areas which form part of the original table's primary key, so I imagine a partition by area and site would be used in any ROW_NUMBER() OVER statements; a first building block along those lines is sketched after the sample data below.
Edit: the original code had errors in the from and to values (multiple 14 to 15 rows), which would have been confusing, sorry. There will be no overlapping from-to ranges, which hopefully simplifies things.
Example values to use:
create table #temp_inter (
area nvarchar(10),
site_ID nvarchar(10),
d_from decimal (18,3),
d_to decimal (18,3),
assay decimal (18,3))
insert into #temp_inter
values ('area_1','abc','0','5','0'),
('area_1','abc','5','6','0.165'),
('area_1','abc','6','7','0.761'),
('area_1','abc','7','8','0.321'),
('area_1','abc','8','9','0.292'),
('area_1','abc','9','10','1.135'),
('area_1','abc','10','11','0.225'),
('area_1','abc','11','12','0.983'),
('area_1','abc','12','13','0.118'),
('area_1','abc','13','14','0.438'),
('area_1','abc','14','15','0.71'),
('area_1','abc','15','16','0.65'),
('area_1','abc','16','17','2'),
('area_1','abc','17','18','0.367'),
('area_1','abc','18','19','0.047'),
('area_1','abc','19','20','0.71'),
('area_1','abc','20','21','0'),
('area_1','abc','21','22','0'),
('area_1','abc','22','23','0'),
('area_1','abc','23','24','2'),
('area_1','abc','24','25','0'),
('area_1','abc','25','26','0'),
('area_1','abc','26','30','0'),
('area_2','zzz','0','5','0'),
('area_2','zzz','5','6','1.165'),
('area_2','zzz','6','7','0.396'),
('area_2','zzz','7','8','0.46'),
('area_2','zzz','8','9','0.111'),
('area_2','zzz','9','10','0.053'),
('area_2','zzz','10','11','0.057'),
('area_2','zzz','11','12','0.055'),
('area_2','zzz','12','13','0.03'),
('area_2','zzz','13','14','0.026'),
('area_2','zzz','14','15','0.194'),
('area_2','zzz','15','16','0.367'),
('area_2','zzz','16','17','0.431'),
('area_2','zzz','17','18','0.341'),
('area_2','zzz','18','19','0.071'),
('area_2','zzz','19','20','0.26'),
('area_2','zzz','20','21','0.659'),
('area_2','zzz','21','22','0.602'),
('area_2','zzz','22','23','2.436'),
('area_2','zzz','23','24','0.874'),
('area_2','zzz','24','25','3.173'),
('area_2','zzz','25','26','0.179'),
('area_2','zzz','26','27','0.065'),
('area_2','zzz','27','28','0.024'),
('area_2','zzz','28','29','0')
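Not a full solution, but as a starting point here is a minimal gaps-and-islands sketch against the sample data above. It only finds the runs of consecutive internal waste (assay < 0.7) per area and site and flags any run longer than 2 m, since such a run must break an intercept; choosing the intercepts themselves and enforcing the > 0.7 average would still have to be built on top of this.

;with flagged as (
    select area, site_ID, d_from, d_to, assay,
           case when assay >= 0.7 then 1 else 0 end as is_ore,
           row_number() over (partition by area, site_ID order by d_from) as rn
    from #temp_inter
),
waste_runs as (
    -- rn minus a row_number over the waste rows only is constant
    -- within each unbroken run of waste
    select area, site_ID, d_from, d_to,
           rn - row_number() over (partition by area, site_ID order by d_from) as grp
    from flagged
    where is_ore = 0
)
select area, site_ID,
       min(d_from) as waste_from,
       max(d_to)   as waste_to,
       max(d_to) - min(d_from) as waste_length
from waste_runs
group by area, site_ID, grp
having max(d_to) - min(d_from) > 2
order by area, site_ID, waste_from;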
The problem I have is simple:
I have a set of datasets. Each dataset has within it a set of points. Each set of points is an identical 6 km spaced grid (this grid never changes). Each point has an associated value. Each dataset is unrelated, so the problem can be treated as just a single set of points.
If the value of a point exceeds a predefined threshold value then the point has to be queried against an oracle spatial database to find all line segments within a certain distance of the point.
Which is a simple enough problem to solve.
The line segments have a non-unique ID, which allows them to be grouped together into features of 1 to 700 segments (it's all predefined topology).
Ultimately I need to know which feature IDs match against which points, as well as the number of line segments per feature that match each point.
In terms of dataset sizes:
There are around 200 datasets.
There are 56,000 points per dataset.
There are a little over 180,000 line segments in the spatially indexed database.
The line segments can be grouped into a total of 1900 features.
Usually no more than on the order of 10^3 points exceed the threshold per dataset.
I have created a solution and it works adequately; however, I'm unhappy with the overall run times - it takes around 3 minutes per dataset.
Normally I wouldn't mind a precomputation task taking that long, but due to constraints this task cannot take more than an hour to run, and ideally would take only half an hour.
Currently I use SDO_WITHIN_DISTANCE to do the query, and I run this query for each and every point that exceeds the threshold:
SELECT id, count(shape) AS segments, sum(length) AS length
FROM (
    SELECT shape, id, length
    FROM lines_1
    UNION ALL
    SELECT shape, id, length
    FROM lines_2
)
WHERE SDO_WITHIN_DISTANCE(
        shape,
        sdo_geometry(
            3001,
            8307,
            SDO_POINT_TYPE(:lng, :lat, 0),
            null,
            null
        ),
        'distance=4 unit=km'
      ) = 'TRUE'
GROUP BY id
This query takes around 0.4s to execute, which isn't all that bad, but it adds up for a single dataset, and is compounded over all of the datasets.
I am not overly experienced with Oracle spatial databases, so I'm not sure how to improve the speed.
Note that I cannot change the format of the incoming set of points, nor can I change the format of the database.
The only way to speed it up that I can think of is precomputing the query for each point and storing the results in a separate table, but I'd rather not do that as it more or less creates another copy of the data.
So the question is - is there a better way to do this query?
I ended up precomputing my query into the following table.
+---------+---------+
| LINE_ID | VARCHAR |
| LAT     | FLOAT   |
| LNG     | FLOAT   |
+---------+---------+
There were just too many multiline segments for it to be efficient.
By precomputing it, I can just look up the relevant IDs in the table (which ultimately was all I cared about).
The query takes less than 1/10th of the time, so it works out a lot faster.
Ultimately the tradeoff of having to recompute the point to ID mapping every week (takes about 2 hours) was worth the speed up.
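For illustration, the lookup then reduces to something like this. The table name is made up here, and because the 6 km grid never changes, the stored coordinates match the incoming ones exactly, so equality on the FLOAT columns is safe:

-- Hedged sketch of the lookup against the precomputed mapping table
-- (the name point_line_map is illustrative; columns are as described above).
select line_id, count(*) as segments
from   point_line_map
where  lat = :lat
and    lng = :lng
group by line_id;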
Which one of the following queries will be faster and more optimal (and why):
SELECT * FROM items WHERE w = 320 AND h = 200 (w and h are INT)
SELECT * FROM items WHERE dimensions = '320x200' (dimensions is VARCHAR)
Here are some actual measurements. (Using SQLite; may try it with MySQL later.)
Data = All 1,000,000 combinations of w, h ∈ {1...1000}, in randomized order.
CREATE TABLE items (id INTEGER PRIMARY KEY, w INTEGER, h INTEGER)
Average time (of 20 runs) to execute SELECT * FROM items WHERE w = 320 and h = 200 was 5.39±0.29 µs.
CREATE TABLE items (id INTEGER PRIMARY KEY, dimensions TEXT)
Average time to execute SELECT * FROM items WHERE dimensions = '320x200' was 5.69±0.23 µs.
There is no significant difference, efficiency-wise.
But
There is a huge difference in terms of usability. For example, if you want to calculate the area and perimeter of the rectangles, the two-column approach is easy:
SELECT w * h, 2 * (w + h) FROM items
Try to write the corresponding query for the other way.
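For comparison, here is roughly what that calculation looks like against the single dimensions column, sketched with SQLite's string functions (assuming a SQLite version that has instr()); it is noticeably clumsier:

-- Same area/perimeter calculation against the '320x200' representation.
select cast(substr(dimensions, 1, instr(dimensions, 'x') - 1) as integer)
       * cast(substr(dimensions, instr(dimensions, 'x') + 1) as integer) as area,
       2 * (cast(substr(dimensions, 1, instr(dimensions, 'x') - 1) as integer)
          + cast(substr(dimensions, instr(dimensions, 'x') + 1) as integer)) as perimeter
from items;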
Intuitively, if you do not create INDEXes on those columns, integer comparison seems faster.
In an integer comparison, you directly compare 32-bit values for equality.
Strings, on the other hand, are character arrays, so they have to be compared character by character, which is more work.
However, another point is that in the second query you have one field to compare, while in the first you have two. With 1,000,000 records and no indexes on the columns, that can mean 1,000,000 string comparisons in the worst case (the row you are looking for is the last one, or is not found at all).
On the other hand, if you have 1,000,000 records and all of them have w = 320, you will then also be comparing them on h, which means up to 2,000,000 comparisons. If you create indexes on those fields, however, IMHO they will perform almost identically, since a B-tree index lookup is O(log n) whether the key is a VARCHAR or an INT.
Conclusion: it depends. Prefer indexes on searchable columns, and use ints.
Probably the only way to know that is to run it. I would suspect that if all columns used are indexed, there would be basically no difference. If INT is 4 bytes, it will be almost the same size as the string.
The one wrinkle is in how VARCHAR is stored. A fixed-size CHAR might be faster than VARCHAR, but mostly because your SELECT * needs to go and fetch it.
The huge advantage of using INT is that you can do much more sophisticated filtering. That alone should be a reason to prefer it. What if you need a range, or just width, or you want to do math on width in the filtering? What about constraints based on the columns, or aggregates?
Also, when you get the values into your programming language, you won't need to parse them before using them (which takes time).
EDIT: Some other answers mention string compares. If the column is indexed, there won't be many string compares done. And it's possible to implement very fast compare algorithms that don't need to loop byte by byte. You'd have to know the details of what MySQL does to know for sure.
Second query, as the chance of matching the exact string is smaller (which means a smaller set of records, but with greater cardinality).
First query: the chance of matching the first column is higher, so more rows are potentially matched (lower cardinality).
Of course, this assumes indexes are defined for both scenarios.
The first one, because it is faster to compare numeric data.
It depends on the data and the available indexes. But it is quite possible for the VARCHAR version to be faster, because searching a single index can be faster than searching two. If the combination of values gives a unique (or "mostly" unique) result while each individual h/w value has many entries, then it could narrow things down to a much smaller set using the single index.
On the other hand, if you have a multi-column index on the two integer columns, that would likely be the most efficient.
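For reference, such a multi-column index might look like this (the index name is just illustrative):

-- Composite index covering both integer columns.
CREATE INDEX idx_items_w_h ON items (w, h);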