Supose I would like to store a table with 440 rows and 138,672 columns, as SQL limit is 1024 columns I would like to transform rows into columns, I mean to convert the
440 rows and 138,672 columns to 138,672 rows and 440 columns.
Is this possible?
SQL Server limit is actually 30000 columns, see Sparse Columns.
But creating a query that returns 30k columns (not to mention +138k) will be basically uncontrollable, the sheer size of the metadata on each query result would halt the client to a crawl. One simply does not design databases like that. Go back to the drawing board, when you reach 10 columns stop and think, when you reach 100 column erase the board and start anew.
And read this: Best Practices for Semantic Data Modeling for Performance and Scalability.
The description of the data is as follows....
Each attribute describes the measurement of the occupancy rate
(between 0 and 1) of a captor location as recorded by a measuring
station, at a given timestamp in time during the day.
The ID of each station is given in the stations_list text file.
For more information on the location (GPS, Highway, Direction) of each
station please refer to the PEMS website.
There are 963 (stations) x 144 (timestamps) = 138,672 attributes for
each record.
This is perfect for normalision.
You can have a stations table and a measurements table. Two nice long thin tables.
Related
I have a requirement to check the space occupied by a specific table per day per system, just a short back ground i have some 10 systems from each system we process the daily etl loads and the counts can be observed based in the date field.
Database oracle 11g
size per GB
example
SYSTEM PROCESS_DATE count(*)
RETAIL 26.02.2021 100
PHARMACY 26.02.2021 200
BANKING 26.02.2021 300
query 1 - to check the daily counts per system
select distinct system,count(*) from AUDIT_SCH.DWH_ADT_TBL
where trunc(process_date)=trunc(sysdate)
group by system
order by count(*) desc;
but what i want is how to capture the space of daily loads consumed per system from this table ? , is this possible
it's confusing checking various suggestions below is the reference
How do I calculate tables size in Oracle
any suggestions with query ?
Use the function VSIZE to sum the number of bytes used in each column, per system:
select
system, count(*),
round(sum
(
nvl(vsize(system), 0) +
nvl(vsize(process_date), 0) +
nvl(vsize(column1), 0) +
nvl(vsize(column2), 0)
--Add all other columns here
)/1024/1024/1024) gb
from DWH_ADT_TBL
where trunc(process_date)=trunc(sysdate)
group by system
order by count(*) desc;
Unfortunately, calculating the size of things in a database can be ridiculously complicated. You may need to worry about:
Overhead. The VSIZE function does not account for row overhead, block overhead, segment overhead, and unused space in files/ASM diskgroups/volumes, etc.
Compression. If the table, tablespace, or LOBs are compressed or encrypted, the VSIZE will incorrectly return the uncompressed size.
Indexes. VSIZE does not include index sizes. But if you're only interested in comparing systems, then the percentage of data will still be the same even if the absolute sizes are off. (Unless you have indexed columns that are only used by one system.)
LOBs. You may need to use DBMS_LOB.GETLENGTH to calculate the size of LOBs. For CLOBs you may need to multiple the result by 2 depending on the characterset - for UCS2, each character uses 2 bytes.
But in practice the above query is still good enough to give you a decent understanding of where the space is used.
If you have multiple tables with many columns you could generate the queries using the data dictionary, by querying from DBA_TAB_COLUMNS.
I have a table with 2 integer fields x,y and few millions of rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
It scans more than double of the original data scanned.
I would expect it will first find the relevant rows based on where clause and then bring the extra field without scanning the entire second field.
Any inputs on why it doubles the data scanned and how to avoid it will be appreciated.
In my application I have hundred of possible fields which I need to fetch for a very small number of rows (50) which answer the query.
Does this means I will need to processed all fields data?
* I'm aware how columnar database works, but wasn't aware for the huge price when you want to brings lots of fields based on a very specific where clause.
The following link provide very clear answer:
best-practices-performance-input
BigQuery does not have a concept of index or something like that. When you query a field column, BigQuery will scan through all the values of that column and then make the operations you want (for a deeper deep understanding they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find where x = 1.
This ends up being an amazing feature of BQ, you just load your data there and it just works. It does force you to be aware on how much data you retrieve from each query. Queries of the type select * from table should be used only if you really need all columns.
Let's say I want to get the last 50 records of a query that returns around 10k records, in a table with 1M records. I could do (at the computational cost of ordering):
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
I could also do (at the cost of 2 database hits):
# assume I don't care about new records being added between
# the two queries being executed
index = MyModel.objects.filter(criteria=something).count()
data = MyModel.objects.filter(criteria=something)[index-50:]
Which is better for just an ordinary relational database with no indexing on the criteria (eg postgres in my case; no columnar storage or anything fancy)? Most importantly, why?
Does the answer change if the table or queryset is significantly bigger (eg 100k records from a 10M row table)?
This one is going to be very slow
data = MyModel.objects.filter(criteria=something)[index-50:]
Why because it translates into
SELECT * FROM myapp_mymodel OFFEST (index-50)
You are not enforcing any ordering here, so the server is going to have to calulcate the result set and jump to the end of it and that's going to involve a lot of reading and will be very slow. Let us not forgot that count() queries aren't all that hot either.
OTH, this one is going to be fast
data = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
You are reverse ordering on the primary key and getting the first 50. And the first 50 you can fetch equally quickly with
data = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
So this is what you really should be doing
data1 = MyModel.objects.filter(criteria=something).order_by('-pk')[:50]
data2 = MyModel.objects.filter(criteria=something).order_by('pk')[:50]
The cost of ordering on the primary key is very low.
tables
I want to calculate the minimum and maximum number of rows that will fit in one database page for Microsoft Access for each one of the tables in the image, string for customerName and orderMessage, long integer for all ID and two primary keys, single for the taxrate, doubles for orderTotal and taxableAmount, and Boolean for taxableFlag.
The minimum length for the customerName and orderMessage should be 0 characters for both and maximum length of 71 characters for customerName and 255 for orderMessage.
Also, is there a page overhead for Microsoft Access (Jet 4)?
Thanks.
Generally, such exercises yield unreliable results. If you are trying to do capacity planning, the best thing to do is create your table and load it up with 1% or 10% of your estimated total, and see how big it is. You have to consider room for indexes, how many of your rows have null values for some fields, etc.
The problem I have is simple:
I have a set of datasets. Each dataset has within it a set of points. Each set of points is an identical a 6km spaced grid (this grid never changes). Each point has an associated value.Each dataset is unrelated, so the problem can be seen as just a single set of points.
If the value of a point exceeds a predefined threshold value then the point has to be queried against an oracle spatial database to find all line segments within a certain distance of the point.
Which is a simple enough problem to solve.
The line segments have a non-unique ID, which allow them to be grouped together into features of size 1 to 700 segments (it's all predefined topology).
Ultimately I need to know which feature IDs match against which points as well as the number of line segments for each feature match against each point.
In terms of dataset sizes:
There are around 200 datasets.
There are 56,000 points per dataset.
There is a little over 180,000 line segments in the spatially indexed database.
The line segments can be grouped into a total of 1900 features.
Usually there aren't many more than in the order of 10^3 points that exceed the threshold per dataset.
I have created a solution and it works adequately,
however I'm unhappy with the overall run times - it takes around 3min per dataset.
Normally I wouldn't mind if a precomputation task takes that long, but due to constraints this task cannot take more than an hour to run, and ideally would only take 1/2 an hour.
Currently I use SDO_WITHIN_DISTANCE to do the query, and I run this query for each and every point that exceeds the threshold:
SELECT id, count(shape) AS segments, sum(length) AS length
FROM (
SELECT shape, id, length
FROM lines_1
UNION ALL
SELECT shape, id, length
FROM lines_2
)
WHERE SDO_WITHIN_DISTANCE(
shape,
sdo_geometry(
3001,
8307,
SDO_POINT_TYPE(:lng,:lat, 0),
null,
null
),
'distance=4 unit=km'
) = 'TRUE'
GROUP BY id
This query takes around 0.4s to execute, which isn't all that bad, but it adds up for a single dataset, and is compounded over all of the datasets.
I am not overly experienced with Oracle spatial databases, so I'm not sure how to improve the speed.
Note that I cannot change the format of the incoming set of points, nor can I change the format of the database.
The only way to speed it up that I can think of is by pre computing the query for each point and storing that in a separate table, but I'd rather not do that as it more or less creates another copy of the data.
So the question is - is there a better way to do query?
I ended up precomputing my query into the following table.
+---------+---------+
| LINE_ID | VARCHAR |
| LAT | FLOAT |
| LNG | FLOAT |
+---------+---------+
There were just too many multiline segments for it to be efficient.
By precomputing it I can just lookup in the table for the relevant IDs (which ultimately was all I cared about).
The query takes less than 1/10th of the time, so it works out a lot faster.
Ultimately the tradeoff of having to recompute the point to ID mapping every week (takes about 2 hours) was worth the speed up.