What does "width" mean in a Hive query plan?

TableScan [TS_0] (rows=1217 width=292)
Rows is how many rows are in the table (1217 in this case), but what does "width" mean?

Based on this Statistics class, the width is dataSize/numRows, so it is essentially the average data size per row, in bytes.
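As a quick sanity check, you can compare the plan against the statistics Hive keeps for the table. A minimal sketch in HiveQL, assuming a placeholder table name my_table:
-- Gather table-level statistics (my_table is a placeholder name).
ANALYZE TABLE my_table COMPUTE STATISTICS;
-- In the output below, look at the numRows and rawDataSize table parameters.
DESCRIBE FORMATTED my_table;
-- width ~= data size / row count, i.e. the average row size in bytes.
-- For the plan above: rows=1217, width=292 => estimated data size ~= 1217 * 292 ~= 355 KB.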

Related

Does a query against a nested field in BigQuery count only the size of the subfield as the "Processed Data Amount" in on-demand pricing?

The other possibility is that the "Processed Data Amount" includes the size of the whole enclosing STRUCT/RECORD type, even though only one subfield of the STRUCT/RECORD column is selected.
The online doc says a RECORD costs "0 bytes + the size of the contained fields", which is not explicit to me. Can someone help clarify? Thanks.
Think of a record as a storage mechanism. When you query against it (like you would a regular table), you are still only charged for the columns you use (select, filter, join, etc).
Check out the following query estimates for these similar queries.
-- This query would process 5.4GB
select
* -- everything
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-- This query would process 33.7MB
select
visitorId, -- integer
totals -- record
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-- This query would process 6.9MB
select
visitorId, -- integer
totals.hits -- specific column from record
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
On-demand pricing is based on the number of bytes processed by a query; for the STRUCT/RECORD data type you are charged according to the columns you select within the record.
The expression "0 bytes + the size of the contained fields" means that the size will depend on the data types of the columns within the record.
Moreover, you can estimate costs before running your query using the query validator.
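To make "0 bytes + the size of the contained fields" concrete, here is a hedged worked example with a made-up schema; the per-type sizes are the documented on-demand data sizes (INT64 = 8 bytes, STRING = 2 bytes + the UTF-8 encoded length):
-- Hypothetical column: rec STRUCT<a INT64, b STRING>
-- SELECT rec.a -> charged 8 bytes per row (INT64)
-- SELECT rec.b -> charged 2 bytes + the UTF-8 length of the value per row (STRING)
-- SELECT rec   -> charged 0 bytes for the record itself,
--                 plus 8 bytes + (2 bytes + UTF-8 length) per row for its fields
select
rec.a
from `myproject.mydataset.mytable` -- placeholder project, dataset and table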

Does SQL Server Table-Scan Time depend on the Query?

I observed that a full table scan takes a different amount of time depending on the query. I believed that under similar conditions (same set of selected columns, same column data types) a table scan should take roughly the same time, but it seems that is not the case. I just want to understand the reason behind that.
I used CHECKPOINT and DBCC DROPCLEANBUFFERS before each query to make sure there was no effect from cached data.
Table:
10 Columns
10M rows
Each column has a different density, ranging from 0.1 to 0.000001
No indexes
Queries:
Query A: returned 100 rows, time taken: ~900 ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL07 = 50000
Query B: returned 910595 rows, time taken: ~15000 ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL01 = 5
Note: column COL07 was randomly populated with integers ranging from 0 to 100000, and column COL01 was randomly populated with integers ranging from 0 to 10.
Time Taken:
Query A: around 900 ms
Query B: around 18000 ms
What's the point I'm missing here?
Query A: (returned 100 rows, time taken: ~900 ms)
Query B: (returned 910595 rows, time taken: ~15000 ms)
I believe what you are missing is that the second query returns roughly 9,000 times more rows (910,595 vs 100). Fetching and returning that many more rows alone could explain why it took about 20 times longer.
The two columns have different data densities.
Query A, COL07: 10000000/100000 = 100
Query B, COL01: 10000000/10 = 1000000
The fact that both search parameters are in the middle of their data ranges doesn't necessarily impact the speed of the search; what matters is how many values the engine has to return for the search predicate.
To see whether this is indeed the case, I would try the following:
COL04: 10000000/1000 = 10000. Filtering on WHERE COL04 = 500
COL08: 10000000/10000 = 1000. Filtering on WHERE COL08 = 5000
Considering the times from the initial test, you would expect to see COL04 at ~7200ms and COL08 at ~3600ms; a sketch of these two test queries follows below.
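A minimal sketch of those two follow-up tests, reusing the table name and the cache-clearing steps from the question (the filter values are the ones suggested above):
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
-- Expected to return roughly 10,000 rows
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL04 = 500

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
-- Expected to return roughly 1,000 rows
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL08 = 5000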
There is an interesting article on this topic: SQL Server COUNT() Function Performance Comparison.
A full table scan (also known as a sequential scan) is a scan in which each row of the table is read in sequential (serial) order.
Reference
In your case, the full table scan reads rows sequentially (in order), so Query 1 does not need to scan the whole table to advance to the next record because COL07 is ordered.
But in Query 2 that is not the case: COL01 is randomly distributed, so a scan of the whole table is needed.
Query 1 is an optimistic scan, whereas Query 2 is a pessimistic scan.

Why does selecting an extra result field double the data scanned in BigQuery

I have a table with two integer fields, x and y, and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
Adding the second field more than doubles the data scanned.
I would expect it to first find the relevant rows based on the WHERE clause and then fetch the extra field for those rows only, without scanning the entire second column.
Any input on why it doubles the data scanned, and how to avoid it, would be appreciated.
In my application I have hundreds of possible fields that I need to fetch for a very small number of rows (50) that match the query.
Does this mean I will need to process all of those fields' data?
I'm aware of how a columnar database works, but I wasn't aware of the high cost of bringing back many fields based on a very selective WHERE clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery does not have a concept of an index or anything like that. When you query a column, BigQuery will scan through all the values of that column and then perform the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find the rows where x = 1.
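For intuition, here is a back-of-the-envelope check using the documented on-demand data size of 8 bytes per INT64 value (the byte counts are the ones from the question):
-- One INT64 column scanned in full: 32.4 MB / 8 bytes per value ~= 4 million rows,
-- which matches the "few million rows" in the question.
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50 -- ~32.4 MB
-- A second INT64 column of the same length doubles the bytes scanned,
-- regardless of the WHERE clause or the LIMIT:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50 -- ~64.9 MB ~= 2 x 32.4 MB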
This ends up being an amazing feature of BQ: you just load your data there and it just works. It does force you to be aware of how much data each query retrieves. Queries of the type select * from table should be used only if you really need all columns.

Transform rows into columns in a SQL table

Suppose I would like to store a table with 440 rows and 138,672 columns. As the SQL Server limit is 1024 columns, I would like to transform rows into columns, i.e. convert the 440 rows and 138,672 columns into 138,672 rows and 440 columns.
Is this possible?
The SQL Server limit is actually 30,000 columns; see Sparse Columns.
But creating a query that returns 30k columns (not to mention 138k+) would be basically unmanageable: the sheer size of the metadata on each query result would slow the client to a crawl. One simply does not design databases like that. Go back to the drawing board; when you reach 10 columns, stop and think, and when you reach 100 columns, erase the board and start anew.
And read this: Best Practices for Semantic Data Modeling for Performance and Scalability.
The description of the data is as follows....
Each attribute describes the measurement of the occupancy rate
(between 0 and 1) of a captor location as recorded by a measuring
station, at a given timestamp in time during the day.
The ID of each station is given in the stations_list text file.
For more information on the location (GPS, Highway, Direction) of each
station please refer to the PEMS website.
There are 963 (stations) x 144 (timestamps) = 138,672 attributes for
each record.
This is perfect for normalisation.
You can have a stations table and a measurements table. Two nice long thin tables.
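A minimal sketch of that design, with illustrative table and column names (the keys follow the data description quoted above):
-- Stations: one row per measuring station (IDs come from the stations_list file).
CREATE TABLE stations (
    station_id INT NOT NULL PRIMARY KEY
);
-- Measurements: one row per original record, station and timestamp slot.
CREATE TABLE measurements (
    record_id      INT          NOT NULL, -- one of the 440 original rows
    station_id     INT          NOT NULL REFERENCES stations (station_id),
    timestamp_slot SMALLINT     NOT NULL, -- 1..144 timestamps within the day
    occupancy_rate DECIMAL(7,6) NOT NULL, -- between 0 and 1
    PRIMARY KEY (record_id, station_id, timestamp_slot)
);
-- 963 stations x 144 timestamps = 138,672 measurement rows per original record,
-- instead of 138,672 columns in a single row.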

SQL Execution Plan shows an "Actual Number of Rows" which is larger than the table size

I have an execution plan for a fairly complex join which shows an index seek being performed on a table with the "Actual Number of Rows" reading ~70,000, when there are in fact only ~600 rows in the table in total (the estimated number of rows is only 127).
Note that all of the statistics are up to date and the input parameters to the query are exactly the same as the parameters that were entered when the proc was compiled.
Why is the actual number of rows so high, and what does the number "Actual Number of Rows" really mean?
My only theory is that the high number of rows is related to the nested loops, and that this index seek is being executed a number of times; the "Actual Number of Rows" then represents the total number of rows over all executions. If this is the case, is the estimated number of rows also meant to be the total number of rows over all executions?
ActualRows counts the number of times GetNext() was called on a physical operator.
You should also look at ActualRebinds, ActualRewinds and ActualEndOfScans to get an idea of how many times the inner loop was re-evaluated:
A rebind means that one or more of the correlated parameters of the join changed and the inner side must be reevaluated. A rewind means that none of the correlated parameters changed and the prior inner result set may be reused.
The "Actual Number of Rows" value is the total of all rows processed by that node in the execution plan, so yes, it takes the nested loops join into account.
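One way to check this on the plan in question is to capture the runtime counters and compare per-operator totals with execution counts; a sketch (the procedure call is a placeholder):
-- Capture per-operator runtime counters (SQL Server).
SET STATISTICS PROFILE ON;
-- EXEC dbo.MyComplexProc @param = ...; -- placeholder: run the actual query/proc here
SET STATISTICS PROFILE OFF;
-- In the output, "Rows" is the total across all executions of each operator and
-- "Executes" is how many times it ran (e.g. once per outer row of a nested loops join).
-- The estimated row count, by contrast, is per execution, so compare
-- estimate x executions against the actual total.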