SSAS Tabular Data Cube - Fact table with a 160K-row partition takes almost an hour to process... Why?

I have an SSAS Tabular data model I developed in Visual Studio. The cube has many fact and dimension tables with lots of measures. However, there is one fact table with 158 million rows in total, and processing all 158 million rows in that one table takes over an hour. To speed up the processing I created two partitions based on a date column: Partition 1 holds the historical data and loads about 157 million rows, while Partition 2 (one month of data) has about 160,000 rows, so it is very, very small. I only want to process Partition 2 daily. Unfortunately, when I process just Partition 2, the processing time is still almost an hour. How can simply refreshing a 160K-row partition take 58 minutes? It seems like it is still trying to process the full table…
I will say that when I process a separate table with only 200K rows in total, it finishes in under 30 seconds. Shouldn't Partition 2 above also process in under a minute? What am I doing wrong here, and why would it take so long to process such a small partition?
In Summary:
Table A = 158,000,000 rows = 1 hour 13 min to process the total table
Partition 1 = 157,840,000 rows = 1 hour to process in full
Partition 2 = 160,000 rows = 58 minutes to process in full
Table B = 200,000 rows = 30 seconds to process in full
Partition 1 = 200,000 rows = 30 seconds to process!
Shouldn't Table A/Partition 2 take 30 seconds to process just like Table B?
I just want to do a full process of Partition 2 of Table A. I expected the processing time to be under 5 minutes, similar to the Table B result. Instead, processing Partition 2 with its 160K rows takes almost the same time as the entire Table A (Partition 1 + 2).

If you have calculated columns or DAX tables that refer to this table, they will have to be reprocessed after the partition loads, which can result in an extended load time. You might be able to test for this by creating a new table with a filter like the partition's and seeing how long it takes to load.
I would also make sure the sort on the table is set to the date.
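For example, the test could be as simple as pointing a throwaway table at the same source with the same filter that defines Partition 2 and timing the load. A rough sketch - the fact table name, date column, and cutoff below are placeholders, not your actual objects:
-- Source query for a temporary test table mirroring Partition 2's filter.
-- dbo.FactTableA, DateKey and the cutoff date are assumptions; substitute your own.
SELECT *
FROM dbo.FactTableA
WHERE DateKey >= '20200601'  -- same boundary that defines Partition 2
If that table loads in seconds, the hour is being spent recalculating dependent objects after the partition loads, not reading the 160K rows.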

Related

PL/SQL check time period, repeat, up until 600 records, from large database

What would be the best way to check if there has been data within a 3 month period up until a maximum of 600 records, then repeat for the 3 months before that if 600 hasn't been reached? Also it's a large table so querying the whole thing could take a few minutes or completely hang Oracle SQL Developer.
ROWNUM seems to give row numbers to the whole table before returning the result of the query, so that seems to take too long. The way we are currently doing it is entering a time period explicitly that we guess there will be enough records within and then limiting the rows to 600. This only takes 5 seconds, but needs to be changed constantly.
I was thinking to do a FOR loop through each row, but am having trouble storing the number of results outside of the query itself to check whether or not 600 has been reached.
I was also thinking about creating a data index? But I don't know much about that. Is there a way to sort the data by date before grabbing the whole table that would be faster?
Thank you
check if there has been data within a 3 month period up until a maximum of 600 records, then repeat for the 3 months before that if 600 hasn't been reached?
Find the latest date and filter to only allow the rows that are within 6 months of it and then fetch the first 600 rows:
SELECT *
FROM  (
        SELECT t.*,
               MAX(date_column) OVER () AS max_date_column
        FROM   table_name t
      )
WHERE  date_column > ADD_MONTHS(max_date_column, -6)
ORDER BY date_column DESC
FETCH FIRST 600 ROWS ONLY;
If there are 600 or more rows within the latest 3 months then only those will be returned; otherwise the result set extends back into the preceding 3-month period.
If you intend to repeat the extension over more than two 3-month periods then just use:
SELECT *
FROM table_name
ORDER BY date_column DESC
FETCH FIRST 600 ROWS ONLY;
I was also thinking about creating a data index? But I don't know much about that. Is there a way to sort the data by date before grabbing the whole table that would be faster?
Yes, creating an index on the date column would, typically, make filtering the table faster.
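For example (a minimal sketch, using the placeholder names from the queries above):
-- Index the date column so the 6-month filter and the descending sort can use it.
CREATE INDEX table_name_date_idx ON table_name (date_column);
With the index in place, Oracle can typically use an index range scan for the date filter instead of reading every row.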

Slow query performance in sqlite but sqlite studio reports fast execution

I have two large (~100 million rows) tables I'm trying to join. I have indices on both columns used in the join. Selecting the first 1000 rows takes several hours, but when it's done, SQLite Studio reports that it only took a minute. Then, it takes another several hours for SQLite Studio to count the rows for my results and if I try to open another query window, it becomes unresponsive for these hours. The entire time, task manager shows around 25% CPU usage and 7-8 MB/s disk usage for the process. I also tried selecting the top 10k rows and it took 11 hours to complete and another 11 hours to get the row count, but reported that the query finished in 4 minutes. Here is the query:
SELECT d.PRC, s.prccd, abs(abs(d.PRC) - s.prccd), *
FROM dsf d
JOIN secd.secd s
  ON s.datadate = d.DATE AND substr(s.cusip, 1, 8) = d.CUSIP
WHERE abs(abs(d.PRC) - s.prccd) > .0006
LIMIT 10000
Why is this taking so long? I know 100 million rows is a lot, but with sorted indices, shouldn't joining happen in linear time? Adding the indices took several minutes, not hours, and that should be O(n log n) since it has to sort. I get the same results without using substr(). So why is it taking so long?
Why is SQLite Studio reporting that it only takes a minute or two?
Why does SQLite Studio take so long to count the result rows, after the results are already displayed?
EDIT:
Output of EXPLAIN QUERY PLAN
5 0 0 SCAN TABLE dsf AS d
7 0 0 SEARCH TABLE secd AS s USING INDEX secd_datadate (datadate=?)
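One hedged observation based on that plan (a sketch, not a definitive fix): the search on secd uses only datadate, so every row sharing a date still has to be checked against the cusip condition one by one. An index that also covers the cusip expression might let the planner narrow the lookup, assuming SQLite 3.9+ (which added indexes on expressions) and that secd.secd means a table secd in an attached database secd; the index name here is made up:
-- Composite index in the attached secd database covering both join keys.
CREATE INDEX secd.secd_datadate_cusip8 ON secd (datadate, substr(cusip, 1, 8));
The abs(abs(d.PRC)-s.prccd) > .0006 filter still has to be evaluated row by row and cannot use any index.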

Should time partition column be set as clustered column in BigQuery? [duplicate]

We are using a public dataset to benchmark BigQuery. We took the same table and partitioned it by day, but it's not clear we are getting many benefits. What's a good balance?
SELECT sum(score)
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
WHERE creation_date > "2019-01-01"
Takes 1 second, and processes 270.7MB.
Same, with partitions:
SELECT sum(score)
FROM `temp.questions_partitioned`
WHERE creation_date > "2019-01-01"
Takes 2 seconds and processes 14.3 MB.
So we see a benefit in MBs processed, but the query is slower.
What's a good strategy to decide when to partition?
(from an email I received today)
When partitioning a table, you need to consider having enough data for each partition. Think of each partition as a separate file: opening 365 small files can be slower than opening one huge one.
In this case, the table used for the benchmark has 1.6 GB of data for 2019 (through June). That's 1.6 GB / 180 ≈ 9 MB of data for each daily partition.
For such a low amount of data, arranging it in daily partitions won't bring much benefit. Consider partitioning the data by year instead. See the following question to learn how:
Partition by week/month/quarter/year to get over the partition limit?
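If you want to try that without following the link, here is a rough sketch using integer-range partitioning on an extracted year column (the destination table name and the 2008-2030 bounds are assumptions):
-- Partition by year via an integer range over EXTRACT(YEAR FROM creation_date).
CREATE TABLE `temp.questions_partitioned_by_year`
PARTITION BY RANGE_BUCKET(creation_year, GENERATE_ARRAY(2008, 2030, 1))
AS
SELECT *, EXTRACT(YEAR FROM creation_date) AS creation_year
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
Note that queries then have to filter on creation_year for partition pruning to kick in, not just on creation_date.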
Another alternative is not partitioning the table at all, and instead using clustering to sort the data by date. Then BigQuery can choose the ideal size of each block.
If you want to run your own benchmarks, do this:
CREATE TABLE `temp.questions_partitioned`
PARTITION BY DATE(creation_date)
AS
SELECT *
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
vs no partitions, just clustering by date:
CREATE TABLE `temp.questions_clustered`
PARTITION BY fake_date
CLUSTER BY creation_date
AS
SELECT *, DATE('2000-01-01') fake_date
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
Then my query over the clustered table would be:
SELECT sum(score)
FROM `temp.questions_clustered`
WHERE creation_date > "2019-01-01"
And it took 0.5 seconds, 17 MB processed.
Compared:
Raw table: 1 sec, 270.7MB
Partitioned: 2 sec, 14.3 MB
Clustered: 0.5 sec, 17 MB
We have a winner! Clustering organized the daily data (which isn't much for this table) into more efficient blocks than strictly partitioning it by day.
It's also interesting to look at the execution details for each query on these tables:
Slot time consumed
Raw table: 10.683 sec
Partitioned: 7.308 sec
Clustered: 0.718 sec
As you can see, the query over the raw table used a lot of slots (parallelism) to get the results in 1 second: 50 workers processed the whole table, with multiple years of data, reading 17.7M rows. The query over the partitioned table also had to use a lot of slots, because each slot was assigned a smallish daily partition: 153 parallel workers read 0.9M rows. The clustered query was able to use a very low number of slots; the data was well organized, so 57 parallel workers read 1.12M rows.
See also:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
How can I improve the amount of data queried with a partitioned+clustered table?
how clustering works in BigQuery


PostgreSQL - divide sum by total already in table

I have a table with several time intervals as rows with one "total" row. I have four columns; car, bus, truck, and total, that refer to the number of vehicles leaving a warehouse at each time interval by category and the total number of vehicles at each time interval. My table looks like this:
time       car  truck  bus  total
12-6am      10     15   10     35
7am-12pm     8     12    8     28
Total       18     27   18     63
I want to create a percent total row that takes the total value in each row (35 and 28) and divides it by the maximum value in the total row (63).
How do I do this?
If you look at the schema of your table, it doesn't make sense to have an extra row in it, but only an extra column.
However, even that is a bad idea. A database is not a spreadsheet, where you have largely free-form data; it's a collection of tables. Total rows should be calculated with SELECT statements, not stored in the table. Unlike a spreadsheet, Postgres won't auto-update a stored total as rows are added and deleted. (Note: yes, sometimes you need to materialize this summary stuff for efficiency, but that's the advanced course.)
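To make that concrete, here is a rough sketch of the kind of SELECT meant here. The table name departures is an assumption, and it assumes the stored summary row is labelled 'Total' in the time column:
-- Total and percent-of-grand-total per time interval, computed on the fly.
SELECT time,
       car + truck + bus AS total,
       round(100.0 * (car + truck + bus) / sum(car + truck + bus) OVER (), 1) AS pct_of_total
FROM departures
WHERE time <> 'Total';
The window function sum(...) OVER () recomputes the grand total (63) from the detail rows, so nothing has to be stored or maintained by hand.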