How to know data is coming form partitioned filegroup or it is reading total records in table - sql

I applied a partition on a DateTime column in a MSSQL table .
Created Partition function, Scheme and 4 file groups and given boundary values.
I have queried a result in this table with where condition on partitioned column.
In this how-to know, the query is reading total records or related filegroup.
How to know the query is using partition or not ?.

One way is with the actual query execution plan. The Actual Partition Count of the seek/scan operator will show the actual number of partitions touched.
Another method is to run the query with SET STATISTICS IO ON, where the scan count of the table will reflect the number of partitions used.

Related

Why does BigQuery do a full table scan for `SELECT *` with a WHERE clause referencing just one column?

Why does BigQuery perform a full table scan for SELECT * when a WHERE clause is provided?
SELECT *
FROM `githubarchive.month.202012`
WHERE login='__ThisUserDoesNotExist__'
This query performs a full table scan, even though it really just needs to do a full scan of the login column to determine that there are no records to return. Interested in references to relevant sections of BQ docs as well as papers on query planning for columnar databases.
BigQuery Is a columnar storage, with out partition/clustering on table it will do a full table scan.
BigQuery is doing a full table scan because the query asks for all the column to be output.
You'll find only the login column is scanned with query below:
SELECT login
FROM `githubarchive.month.202012`
WHERE login='__ThisUserDoesNotExist__'
Check public documentation here: https://cloud.google.com/bigquery/pricing#on_demand_pricing
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see Data size calculation.

Need suggestion on splitting table in bigquery based on non-date column along with date partition

We are having a date partitioned table with 5 yrs(with daily incremental load) data running into millions & millions of records. To improve the performance, thinking of splitting the table based on a non-date field(id) as all the queries will include a where clause on that column(id). And also partition each of split tables with date partition so that we can query on a smaller dataset with a date range. we will not be using wildcarded table as we will know the id and planning to append that to the table and run a query against that specific table. Need to know whether that would be good option to pursue to improve performance and reduce the query cost.
[Update]: We went ahead and split the tables based on id column(tablename_id) and made the table date partitioned and clustered with 4 other columns(max supported) which are commonly used in queries. With that we were able to get a better performance and also reduced the data accessed for each query. Based on the testing looks like it is a good option to puruse as long as wildcarded querying of tables is avoided and till Bigquery supports partitioning based on non-date/non-datetime columns.
We split the tables based on the id columns creating multiple tables. Each of the split tables are partition on date column. Apart from that we had it as clustered table on 4 other columns as needed. Find below the performance on a sample dataset. Old Table(UserInfo) has more than 500,000 rows. The stats we captured is for a given date range and id, the performance of old table(non split/combined table) and split table(split based on the ID) in terms of amount of data processed and the time taken for the same query.
This is not possible. BigQuery doesn't support partitioned on non-date columns.
There's a feature request for it. I suggest subscribing to it to keep receiving information regarding its availability.

How to limit SQL Server query to single table partition

I want to query a massive table, and need to get my query runtime down.
I'm trying to breakup my target query, into many steps, by running my summarizing query against each individual table partition (I will then aggregate the outputs). All the columns in my where clauses are indexed (nonclustered) -- all the columns I'm pulling in my query are indexed. The "Month" column is our partition index.
How do I write my query, so that I'm explicitly telling SQL Server to only use one "Month" partition?
edit to include Execution plan:
Per the comment, used this site: https://www.brentozar.com/pastetheplan/?id=SJRAIUD3V
Assuming that you have read the suggestions and still want to use partitions the query would be as follows once you have portioned the table with Month as the key_column.
SELECT <Your_select_list>
FROM <dbo.partitioned_table>
WHERE key_column = <target_month>

Google BigQuery clustered table not reducing query size when running query with WHERE clause on clustered field

I have a Google BigQuery table of 500,000 rows that I have setup to be partitioned by a TIMESTAMP field called Date and clustered by a STRING field called EventCategory (this is just a sample of a table that is over 500 million rows).
I have a duplicate of the table that is not partitioned and not clustered.
I run the following query on both tables:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "email"
There are only 2400 rows where EventCategory is "email". When I run the query on the non clustered table I get the following:
When I run the query on the clustered table I get the following:
Here is the schema of both the non clustered and the clustered table:
Date TIMESTAMP NULLABLE
UserId STRING NULLABLE
EventCategory STRING NULLABLE
EventAction STRING NULLABLE
EventLabel STRING NULLABLE
EventValue STRING NULLABLE
There is basically no difference between the two queries and how much data they look through and I can't seem to figure out why? I have confirmed that the clustered table is partitioned and clustered because in the BigQuery UI in table details it actually says so and running a query by filtering by Date greatly reduces the size of the data searched and shows the estimated query size to be much smaller.
Any help here would be greatly appreciated!
UPDATE:
If I change the query to:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "ad"
I get the following result:
There are 53640 rows with EventCategory is "ad" and it looks like clustering did result in less table data being scanned, albeit not much less (529.2MB compared to 586MB).
So it looks like clustering is working but the data is not clustered properly in the table? How would I fix that? I have tried re-creating the table multiple times using DDL and even saving the table data to a JSON in GCS and then importing it into a new partitioned and clustered table but it hasn't changed anything.
Does the date partitioning sit on top of the clustering? Meaning that BigQuery first groups by date and then groups by cluster within those date groups? If so, I think that would probably explain it but it would render clustering not very useful.
If you have less than 100MB of data per day, clustering won't do much for you - you'll probably get one <=100MB cluster of data for each day.
You haven't mentioned how many days of data you have (# of partitions, as Mikhail asked), but since the total data scanned is 500MB, I'll guess that you have at least 5 days of data, and less than 100MB per day.
Hence the results you are getting seem to be the expected results.
See an example of this at work here:
How can I improve the amount of data queried with a partitioned+clustered table?
The reason clustering wasn't helping very much was specific to the table data. The table was event based data that was partitioned by day and then clustered by EventCategory (data is clustered on each day's partition). Since every day would have a large amount of rows for each EventCategory type, querying the entire table for a specific EventCategory would still have to search every single partition, which would then almost definitely have some data with that EventCategory meaning almost every cluster would have to be searched too.
The data are partitioned by day and inside that they are clustered,
the clustering works best when you load whole partitions (days) at once or export the partition (day) to Google Storage (which should be free) and import it again to another table, when we tried loading something like 4GB JSONS the difference was something like 60/10.

Best Index For Partitioned Table

I am querying a fairly large table that has been range partitioned (by someone else) by date into one partition per day. On average there are about 250,000 records per day. Frequently queries will be by a range of days -- usually looking for one day, 7 day week or a calendar month. Right now querying for more than 2 weeks is not performing well--have a normal date index created. If I query for more than 5 days it doesn't use the index, if I use an index hint it performs o.k. from about 5 days to 14 days but beyond that the index hint doesn't help much.
Given that the hint does better than the optimizer I am doing a gather statistics on the table.
However, my question going forward is, in general, if I wanted to create an index on the date field in the table, is it best to create a range partitioned index? Is it best to create a range index with a daily range similar to the table partition? What would be the best strategy?
This is Oracle 11g.
Thanks,
related to your question, partitioning strategy will depend on how you are going to query the data, the best strategy would be to query as fewer partitions as possible. e.g. if you are going to run monthly reports you'd rather create montly range partitioning and not daily range partitioning. If all your queryies will be around data that's within a couple of days then daily range partitioning would be fine.
Given numbers you provided in my opininon you overpartition data.
p.s. quering each partition requires additional reading(than if it were just one partition), so optimizer opts for table access full to reduce reading of indexes.
Try to create a global index on date column. If the index is partitioned and you select -let's say- 14 days, then Oracle has to read 14 indexes. Having a single index on the entire table, i.e. "global index" it has to read only 1 index.
Note, when you truncate or drop a partition then you have to rebuild the index afterwards.
I'm guessing that you could be writing your SQL wrong.
You said you're querying by date. If your date column has time part and you want to extract records from one day, from specific time of the day, e.g. 20:00-21:00, then yes, an index would be beneficial and I would recommend a local index for this (partitioned by day as just like table).
But since your queries span a range of days, it seems this is not the case and you just want all data (maybe filtered by some other attributes). If so, a partition full scan will always be much faster than index access... provided you benefit from partition pruning! Because if not - and you're actually performing a full table scan - this is expected to be very, very slow (in most cases).
So what could go wrong? Are you using plain date in WHERE clause? Note that:
SELECT * FROM trx WHERE trx_date = to_date('2014-04-03', 'YYYY-MM-DD');
will scan only one partition, whereas:
SELECT * FROM trx WHERE trunc(trx_date) = to_date('2014-04-03', 'YYYY-MM-DD');
will scan all partitions, as you apply a function to partitioning key and the optimizer can no longer determine which partitions to scan.
It would be much easier to tell for sure if you provided table definition, total number of partitions, sample data and your queries with explain plans. If possible, please edit your question and include more details.