Which is faster: querying with criteria in one shot, or subsetting a large table into a smaller table and then applying the criteria? - sql

I have a large table (TB size, ~10 billion rows, ~100 million IDs).
I want to run a query to get counts for some specific IDs (say, 100k IDs). The list of needed IDs is in another table.
I know that I can run a join query to get the results, but it is extremely time-consuming (~5 days of processing).
I am wondering: if I break the script into two phases (1. subset the whole table based on just the IDs, 2. apply the selection criteria on the subset table), will that give any performance improvement?
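For illustration, here is a minimal sketch of the two approaches. The table and column names (big_table, id_list, some_criteria) are placeholders, and the staging step uses T-SQL temp-table syntax; whether splitting into phases helps depends heavily on the engine, indexing and partitioning, so treat this as a starting point rather than an answer.

-- Approach 1: one-shot join, criteria applied directly
SELECT b.id, COUNT(*) AS cnt
FROM big_table AS b
INNER JOIN id_list AS l ON l.id = b.id
WHERE b.some_criteria = 'X'              -- placeholder filter
GROUP BY b.id;

-- Approach 2: stage the rows for the needed IDs first, then apply the criteria
SELECT b.*
INTO #subset                             -- temp table; use CREATE TEMPORARY TABLE / CTAS on other engines
FROM big_table AS b
INNER JOIN id_list AS l ON l.id = b.id;

SELECT id, COUNT(*) AS cnt
FROM #subset
WHERE some_criteria = 'X'
GROUP BY id;

Either way, an index (or partitioning/clustering) on big_table.id is usually what decides whether the ID lookup is fast, not the number of phases.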

Related

Designing a Cloud BigTable: Millions of Rows X Millions of Columns?

I'm wondering if the following table design for BigTable is legit. From what I read, having millions of sparse columns should work, but would it work well?
The idea is to keep time-based "samples" in columns (each is a few KB). I expect to have millions of rows, where each would have a limited number of entries (~10-50) as values in the table. Each column in the table represents a timespan of (say) 10 seconds, and since there are roughly 2.6 million seconds in a month, a year would take about 3M columns. I intend to use row-scans to fetch rows by prefix - usually just a handful of rows per fetch.
So, to sum up:
the table will contain about 50M items (a million rows X ~50 samples per row, each a few KB),
but the table's dimensions are (a million rows X millions of columns): on the order of a trillion cells.
Now, I know that empty cells don't take space and the whole "table" metaphor isn't really apt to BT, but I'm still wondering: does the above represent a valid use-case for BigTable?
Based on the Google docs, Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns. Regarding the limits on Cloud Bigtable rows and columns: rows can be big but are not infinite. A row can contain ~100 column families and millions of columns, but the recommendation is to keep each row under 100 MB and each column value under 10 MB.
Therefore, in Bigtable the limit on the data within a table is based on data size rather than on the number of columns or rows (except for the "column families per table" limit). I believe your use case is valid, and you could have millions of rows and columns as long as the values stay within the hard limits. As a best practice, design your schema to keep the size of your data per row and per cell below the recommended limits.

SQL Server filtered index "Estimated number of rows to be read" is size of full table

I have a table in an Azure SQL database with ~2 million rows. On this table I have two pairs of columns that I want to filter for NULL, so I have a filtered index on each pair, checking for NULL on each column.
One of the indexes covers about twice as many rows as the other (~400,000 vs. ~800,000 rows).
However, the larger one seems to take about 10x as long to query.
I'm running the exact same query on both:
SELECT DomainObjectId
FROM DomainObjects
INNER JOIN
    (SELECT <id1> AS Id
     UNION ALL
     SELECT <id2> etc...
    ) AS DomainObjectIds ON DomainObjects.DomainObjectId = DomainObjectIds.Id
WHERE C_1 IS NULL
  AND C_2 IS NULL
Where C_1/C_2 are the columns with the filtered index (and get replaced with other columns in my other query).
The query plans both involve an Index Seek - but in the fast one, it's 99% of the runtime. In the slow case, it's a mere 50% - with the other half spent filtering (which seems suspect, given the filtering should be implicit from the filtered index), and then joining to the queried IDs.
In addition, the "estimated number of rows to be read" for the index seek is ~2 million, i.e. the size of the full table. Based on that, it looks like what I'd expect a full table scan to look like. I've checked the index sizes and the "slow" one only takes up twice the space of the "fast" one, which implies it's not just been badly generated.
I don't understand:
(a) Why so much time is spent re-applying the filter from the index
(b) Why the estimated row count is so high for the "slow" index
(c) What could cause these two queries to be so different in speed, given the similarities in the filtered indexes? I would expect the slow one to be about twice as slow, based purely on the number of rows matching each filter.
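Two things worth checking, sketched below using the table name from the query (the statistics step is generic, not specific to your schema). First, whether the slow index's filter_definition really matches the WHERE clause exactly; if it does not, the optimizer may not treat the predicate as implied by the filter and will re-apply it. Second, whether the statistics behind the estimate are stale, which can inflate "estimated number of rows to be read".

-- Compare the filter definitions of the two filtered indexes
SELECT i.name, i.has_filter, i.filter_definition
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.DomainObjects')
  AND i.has_filter = 1;

-- Refresh statistics so the estimates are based on current data
UPDATE STATISTICS dbo.DomainObjects;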

Estimating result count of a query

I'd like to estimate the result count of a query which is a JOIN between 5-10 tables, each containing a large amount of data (~500 million rows per table).
Such a query takes, for example, 20-30 minutes to execute in the environment in which I was testing.
I want to get an estimate within 5 seconds. What are the ways to get this in MSSQL?
I'm thinking of taking a sample from each table and running the query against the samples, but I'm not sure whether this will return a fair estimate.
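Two options to sketch, with placeholder tables t1 and t2 standing in for the real join: ask the optimizer for its estimate without executing the query, or sample one large table and extrapolate. Both are rough: optimizer estimates can be far off for multi-join queries, and TABLESAMPLE samples whole pages, so skewed data will bias the extrapolation.

-- Option 1: get the estimated plan without running the query;
-- the root node's EstimateRows is the estimated result count
SET SHOWPLAN_XML ON;
GO
SELECT a.id
FROM t1 AS a
JOIN t2 AS b ON a.id = b.id;    -- stand-in for the real 5-10 table join
GO
SET SHOWPLAN_XML OFF;
GO

-- Option 2: sample one table and scale up (here 1 percent, so multiply by 100)
SELECT COUNT(*) * 100 AS extrapolated_count
FROM t1 AS a TABLESAMPLE (1 PERCENT)
JOIN t2 AS b ON a.id = b.id;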

Inner join returns duplicate data which does not exist in the table

When I execute this it returns more than a million rows, yet the first table has 315,000 rows and the second about 14,000. What should I do to get all rows from both tables? Also, if I don't stop the server, it breaks down while listing the unexpected rows.
select *
from tblNormativiIspratnica
inner join tblNormativiSubIspratnica on tblNormativiIspratnica.ZaklucokBroj = tblNormativiSubIspratnica.ZaklucokBroj
If the first table has 315,000 rows and the second one has 14,000 rows, then the fields you are joining on do not form a proper primary-key / foreign-key relationship, with the result that you are getting Cartesian-product-style duplicates. If you want a well-defined result, you must have well-defined key fields that serve these purposes. By the way, take care with the server breakdowns: do not run queries that fetch huge result sets if you have no idea what they will return. Quickly look up the basics and understand the design by writing simple queries with more specific criteria that fetch small result sets, before you run large ones.
If server performance and breakdowns were not an issue, I would have suggested DISTINCT as a (poor) quick fix for getting unique rows, but remember that DISTINCT queries can, at times, take a performance toll.
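To confirm where the multiplication comes from, you can count how often each join key repeats on each side; a key appearing m times in one table and n times in the other contributes m x n rows to the inner join. This sketch uses the table and column names from your query.

-- Keys that repeat in the first table
SELECT ZaklucokBroj, COUNT(*) AS cnt
FROM tblNormativiIspratnica
GROUP BY ZaklucokBroj
HAVING COUNT(*) > 1;

-- Keys that repeat in the second table
SELECT ZaklucokBroj, COUNT(*) AS cnt
FROM tblNormativiSubIspratnica
GROUP BY ZaklucokBroj
HAVING COUNT(*) > 1;

-- If the goal really is "all rows from both tables", a FULL OUTER JOIN also keeps
-- unmatched rows from either side (matching keys still multiply, though)
SELECT *
FROM tblNormativiIspratnica AS a
FULL OUTER JOIN tblNormativiSubIspratnica AS b ON a.ZaklucokBroj = b.ZaklucokBroj;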

Checking a large set of columns for null-ness

On my current project (a redesign), I'm tasked with checking whether or not a series of soon-to-be-deleted columns have data, so we can decide if and how we should migrate them into new and improved tables / columns. This task is - per se - not the problem, merely the background.
The problem is, there are about 30 columns to check, out of a total of 150. The table is fairly large, so I fear that a chained select * from table where x is not null or y is not null or ... would be a bit slow.
Is there a better, or more elegant way to check multiple columns for null-ness?
Am I better advised to just check the columns independently, or in smaller groups, and not bother with an optimal solution?
It's just one table. It will get read record by record (a full table scan) and the criteria checked. This is not slow. No sorting, no joining, no sub-selects or intermediate results. This can't be slow. Don't worry.
BTW: shouldn't that be select * from table where x is not null OR y is not null ...?
You want to find all records that contain data in any of the columns, right?
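If the question is only which of the ~30 columns contain any data at all, a single scan that counts non-NULL values per column may be more convenient than returning rows. A sketch with placeholder table and column names; COUNT(column) counts only non-NULL values, so a result of 0 means the column is entirely empty and safe to drop.

SELECT
    COUNT(*)          AS total_rows,
    COUNT(old_col_1)  AS old_col_1_filled,
    COUNT(old_col_2)  AS old_col_2_filled,
    COUNT(old_col_3)  AS old_col_3_filled   -- ...repeat for the remaining candidate columns
FROM the_table;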