SSAS tabular mode processing fails with "a lot of rows" - ssas

I have an SSAS tabular mode cube that reads data from an Actian Matrix database using ODBC. The project processes fine when I use a data set with 1 million rows, but when I try a bigger one (300 million rows), the process runs for around 15 minutes and fails with the message:
The operation failed because the source database does not exist, the source table does not exist, or because you do not have access to the data source.
More Details:
OLE DB or ODBC error: [ParAccel][ODBC Driver][PADB]57014:ERROR: Query (25459) cancelled on user's request
DETAIL: Query (25459) cancelled on user's request
; 57014.
An error occurred while processing the partition 'XXXX' in table 'YYYY'.
The current operation was cancelled because another operation in the transaction failed.
The message says that the database doesn't exist, but that doesn't make sense because it works perfectly fine in the first case (and the only difference is a "where clause" to limit the number of rows).
I'm using a server that has 96 GB of free RAM, and I can see all of the memory being consumed while the "processing" process is running. Once it is all consumed, the process runs for a few extra seconds and fails. Also, I know for a fact that the 300 million row dataset exported to a CSV file is 36 GB in its raw format, so it should fit entirely in memory even without any compression.
I can also guarantee that the query works fine on its own on the source database so the "Query (25459) cancelled on user's request" message also doesn't make much sense.
Does anyone have any idea on what may be going on?

The memory consumed by what is derived from the input rows (the resulting cube) cannot be estimated from the byte size of the input. It is a function of the Cartesian product of all distinct values of the cube dimensions.
If you were building a cube from 2 input rows over 2 dimensions (State, City) with 2 measures (Population, Number of records), the input would be:
State|City|Population
---------------------
NY|New York|8406000
CA|Los Angeles|3884000
and the resulting cube would hold 7 rows, one per combination of dimension values (NULL meaning "all values"):
State|City|Population|Number of records
---------------------------------------
NULL|NULL|12290000|2
NY|NULL|8406000|1
NY|New York|8406000|1
CA|NULL|3884000|1
CA|Los Angeles|3884000|1
NULL|Los Angeles|3884000|1
NULL|New York|8406000|1
You can't expect the output that is generated as the input rows are processed to be equivalent in size to the input. If the ODBC driver keeps the entire input in memory before it lets you read it, then you have to account for both the input and the output residing in memory until the cube generation is complete.
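To make the growth concrete, here is a minimal sketch that reproduces the 7-row result above from the 2 input rows. It assumes the cube aggregates over every subset of the dimensions; the function name `build_cube` is made up for illustration and is not how SSAS works internally.

```python
from collections import defaultdict
from itertools import combinations

rows = [
    {"State": "NY", "City": "New York", "Population": 8406000},
    {"State": "CA", "City": "Los Angeles", "Population": 3884000},
]
dimensions = ["State", "City"]


def build_cube(rows, dimensions, measure):
    """Aggregate the measure over every subset of the dimensions."""
    cube = defaultdict(lambda: [0, 0])  # key -> [sum of measure, record count]
    for size in range(len(dimensions) + 1):
        for kept in combinations(dimensions, size):
            for row in rows:
                # Dimensions outside the kept subset are rolled up to None.
                key = tuple(row[d] if d in kept else None for d in dimensions)
                cube[key][0] += row[measure]
                cube[key][1] += 1
    return dict(cube)


cube = build_cube(rows, dimensions, "Population")
for key, (total, count) in cube.items():
    print(key, total, count)
print(len(rows), "input rows ->", len(cube), "cube rows")
```

With more dimensions and more distinct values per dimension, the number of keys grows much faster than the number of input rows.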
This answer is much clearer on the subject: How to calculate the likely size of an OLAP cube

Related

Big Query External table - Query performance degrades with increased number of files in the Source URI

I have an external big query table created to read "Parquet" files from a GCS bucket.
The folder layout in the GCS bucket is as follows:
gs://mybucket/root/year=2022/model=abc/
gs://mybucket/root/year=2022/model=.../
gs://mybucket/root/year=2021/model=abc/
gs://mybucket/root/year=2021/model=.../
The layout is organized so that it follows the hive partitioning layout as explained in the BigQuery documentation. The columns "year" and "model" are seen as partition columns in the external table.
**External Data Configuration**
Source URI(s)- gs://mybucket/root/*
Source format - PARQUET
Hive Partitioning Mode - CUSTOM
Hive Partitioning Source URI Prefix - gs://mybucket/root/{year:INTEGER}/{model:STRING}
Hive Partitioning Column(s)- year, model
Problem: When I run queries on the external table as given below, I have observed that every query runs for an initial 2-3 minutes before the actual run happens. Big Query console shows "Query pending" during this time and as soon as it turns "Query Running" the output gets displayed with minimal slot time consumption (Slot time shows in 1-2 seconds.)
Select * from myTable Where year = 2022 and model = 'abc'
The underlying file count will vary and increases for every year and model. For years with more parquet files the initial time sometimes is around 4-5 minutes.
My understanding from the documentation is that if the partition columns are present in the query, some sort of partition pruning happens, so I expect the query to respond immediately.
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs#partition_pruning
But my observations are contrary to this. If the source URIs are restricted to 1 year, so that the table reads data from that year only, the initial time (where the console shows "Query pending") is reduced to 1-2 minutes (or even less).
Source URI(s)- gs://mybucket/root/year=2022/*
Question: Is this the expected behavior? As the volume of files in the GCS bucket increases, the query takes even longer to run (especially the initial time; the actual run time doesn't change much), even though the where clause filters on the year and model partition columns.
This is likely expected behavior. Before partition pruning can happen, the objects in GCS need to be listed, which is likely where the time is being taken. We are working on improvements in this area. The fact that slot time is so low is a good indicator that pruning is in fact happening (since most files are not being read, there isn't a lot of slot time to consume).
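As a purely conceptual illustration (not BigQuery's actual implementation), the sketch below shows why listing cost can grow with the total file count even when pruning leaves few files to read. The object names and the `prune` helper are hypothetical.

```python
import re

# Hypothetical object names, as a listing under gs://mybucket/root/ might return them.
listed_objects = [
    "root/year=2022/model=abc/part-0.parquet",
    "root/year=2022/model=abc/part-1.parquet",
    "root/year=2022/model=xyz/part-0.parquet",
    "root/year=2021/model=abc/part-0.parquet",
]

PARTITION_RE = re.compile(r"year=(?P<year>\d+)/model=(?P<model>[^/]+)/")


def prune(objects, year, model):
    """Keep only the files whose partition keys match the filter."""
    kept = []
    for name in objects:  # every listed object is inspected once
        match = PARTITION_RE.search(name)
        if match and int(match.group("year")) == year and match.group("model") == model:
            kept.append(name)
    return kept


files_to_read = prune(listed_objects, 2022, "abc")
print(len(listed_objects), "objects listed,", len(files_to_read), "files read")
```

In this toy model the listing step touches every object, while the read step only touches the files that survive pruning, which matches the pattern of a long pending phase followed by very little slot time.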

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the Crux dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, having mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but I am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics but I am not sure if the Min/Avg/Max lcp (in milliseconds) for a time period (month for example) could be extracted from this table. What other queries could I possibly try which requires less data processing. (Some of the queries that I tried expired my free TB of data processing on big query).
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and how to query it is not very clear.
For details about the fields used in the report, you can check the main documentation for the Chrome UX Report, especially the last part on data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension
origin "https://example.com"
effective_connection_type.name 4G
form_factor.name "phone"
first_paint.histogram.start 1000
first_paint.histogram.end 1200
first_paint.histogram.density 0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a “4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
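The cumulative value described above can be sketched in a few lines. The bins below are made-up sample data (only the 1000-1200 bin with density 0.123 comes from the record shown), and `cumulative_density` is an illustrative helper, not part of any CrUX tooling.

```python
# Hypothetical histogram bins as (start_ms, end_ms, density) tuples.
bins = [
    (0, 200, 0.10),
    (200, 400, 0.25),
    (1000, 1200, 0.123),  # the sample record from the documentation
    (1200, 1400, 0.05),
]


def cumulative_density(bins, threshold_ms):
    """Sum the density of every bin whose end is at or below the threshold."""
    return sum(density for _start, end, density in bins if end <= threshold_ms)


share = cumulative_density(bins, 1200)
print(f"{share:.1%} of page loads painted within 1200 ms")
```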
For the metrics, the initial link has a section called methodology where you can get information about the metrics and dimensions of the report. I recommend querying the actual source tables (per country and per origin) rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this one relevant:
SELECT
SUM(bin.density) AS density
FROM
`chrome-ux-report.chrome_ux_report.201710`,
UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
bin.start < 1000 AND
origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
For query cost estimation, see the guide on estimating query costs in the official BigQuery documentation. Because of their nature, queries against these tables consume a lot of processing, so filter as much as possible.

Connection Timeout Error while reading the table having more than 100 columns in Mosaic Decisions

I am reading a table via the Snowflake reader node. When the table has a small number of columns/attributes (around 50-80), it is read successfully on the Mosaic Decisions canvas. But when the number of attributes increases (approx. 385 columns), the Mosaic reader node fails. As a workaround I tried using a where clause with 1=2, and in that case it pulls the structure of the table. But when I try to read records, even with a limit of only 10 records applied to the query, it throws a connection timeout error.
I faced a similar issue while reading a table with approx. 300 columns, and I managed it with the help of the input parameters available in Mosaic. In your case, you will have to make the 1=2 condition used for validation become 1=1 at run time by driving it from a parameter.
The steps below can be followed to achieve this:
1. Create a parameter (e.g. copy_variable) with the default value 2.
2. In the reader node, write the SQL with 1 = $(copy_variable). While validating, this is the same as the 1=2 condition, so it should validate fine.
3. Once the query is validated and the schema is generated, update the default value of $(copy_variable) to 1 so that at run time the condition becomes 1=1 and you still get all records.
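The effect of the parameter can be illustrated outside Mosaic with plain string substitution. This is only a sketch of the idea: `render_query` and the table name `my_table` are hypothetical, and Mosaic performs its own substitution of `$(copy_variable)` internally.

```python
def render_query(template, params):
    """Replace each $(name) placeholder with the matching parameter value."""
    query = template
    for name, value in params.items():
        query = query.replace(f"$({name})", str(value))
    return query


template = "SELECT * FROM my_table WHERE 1 = $(copy_variable)"

# During validation the default value is 2, so no rows are returned.
validation_query = render_query(template, {"copy_variable": 2})
# At run time the value is 1, so the condition is always true.
run_query = render_query(template, {"copy_variable": 1})

print(validation_query)
print(run_query)
```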

How to interpret query process GB in Bigquery?

I am using a free trial of Google bigquery. This is the query that I am using.
select * from `test`.events where subject_id = 124 and id = 256064 and time >= '2166-01-15T14:00:00' and time <='2166-01-15T14:15:00' and id_1 in (3655,223762,223761,678,211,220045,8368,8441,225310,8555,8440)
This query is expected to return at most 300 records and not more than that.
However I see a message like this as below
But the table on which this query operates is really huge. Does this indicate the table size? However, I ran this query multiple times a day
Due to this, it resulted in error below
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go away? Is the daily limit 1 TB? If yes, then I did not use even close to 400 GB.
How to view my daily usage?
If I can edit quota, can you let me know which option should I be editing?
Can you help me with the above questions?
According to the official documentation
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output size is. What this means is that if you do a count(*) on a 1TB table, you will supposedly be charged $5, even though the final output is very minimal.
Note that due to storage optimizations that BigQuery performs internally, the bytes processed might not be equal to the raw size of the table when you created it.
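As a rough back-of-the-envelope sketch, the charge follows directly from the bytes processed. The $5 per TB figure is the one quoted above and may not reflect current pricing, and treating 1 TB as 2^40 bytes is an assumption made here for the arithmetic.

```python
BYTES_PER_TB = 2 ** 40      # assumption: 1 TB counted as 2^40 bytes
PRICE_PER_TB_USD = 5.0      # figure quoted in this answer; check current pricing


def estimated_cost_usd(bytes_processed):
    """Estimate the on-demand charge from the bytes a query processes."""
    return bytes_processed / BYTES_PER_TB * PRICE_PER_TB_USD


# A query that scans a full 1 TB table, however small its result set.
print(estimated_cost_usd(2 ** 40))
# A day of queries adding up to 400 GB processed.
print(estimated_cost_usd(400 * 2 ** 30))
```

The number of rows returned does not appear anywhere in the calculation, which is why a query returning 300 records can still consume a large share of a quota.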
For the error you're seeing, browse the Google Console to "IAM & admin" then "Quotas", where you can then search for quotas specific to the BigQuery service.
Hope this helps!
Flavien

SQL connection lifetime

I am working on an API to query a database server (Oracle in my case) to retrieve massive amount of data. (This is actually a layer on top of JDBC.)
The API I created tries to limit as much as possible the loading of queried data into memory. That is, I prefer to iterate over the result set and process the returned rows one by one instead of loading every row into memory and processing them later.
But I am wondering if this is the best practice since it has some issues:
The result set is kept open during the whole processing; if the processing takes as long as retrieving the data, my result set will be open twice as long.
Doing another query inside my processing loop means opening another result set while I am already using one; it may not be a good idea to open too many result sets simultaneously.
On the other side, it has some advantages:
I never have more than one row of data in memory for a result set. Since my queries tend to return around 100k rows, it may be worth it.
Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time.
Starting the processing on the first rows returned while the database engine is still returning other rows is a great performance boost.
In response to Gandalf, I add some more information:
I will always have to process the entire result set
I am not doing any aggregation of rows
I am integrating with a master data management application and retrieving data in order to either validate them or export them using many different formats (to the ERP, to the web platform, etc.)
There is no universal answer. I personally implemented both solutions dozens of times.
This depends on what matters more to you: memory or network traffic.
If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.
If you work over the Internet, then batch fetching will help you.
You can tune the prefetch count or your database layer's properties to find a happy medium.
The rule of thumb is: fetch everything that you can keep in memory without noticing it.
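A middle ground between row-by-row and load-everything is to fetch in batches while still processing one row at a time. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the Oracle/JDBC layer from the question; the table `items` and the helper `iter_rows` are made up for illustration, and with an in-memory database the batch size has no real network effect.

```python
import sqlite3


def iter_rows(cursor, batch_size):
    """Yield rows one at a time while fetching them from the driver in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(1000)],
)

cursor = conn.execute("SELECT id, name FROM items ORDER BY id")
processed = 0
for row in iter_rows(cursor, batch_size=100):
    processed += 1  # per-row processing would go here
conn.close()

print(processed, "rows processed, at most 100 held in memory per batch")
```

The batch size plays the role of the prefetch count: a larger value means fewer round trips but more rows held in memory at once.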
If you need a more detailed analysis, there are six factors involved:
Row generation response time / rate (how soon Oracle generates first row / last row)
Row delivery response time / rate (how soon can you get first row / last row)
Row processing response time / rate (how soon can you show first row / last row)
One of them will be the bottleneck.
As a rule, rate and responsiveness pull in opposite directions.
With prefetching, you can control the row delivery response time and the row delivery rate: a higher prefetch count increases the rate but makes the first rows arrive later, while a lower prefetch count does the opposite.
Choose which one is more important to you.
You can also do the following: create separate threads for fetching and processing.
Select just enough rows to keep the user amused in low prefetch mode (which delivers the first rows quickly), then switch into high prefetch mode.
It will fetch the rows in the background and you can process them in the background too, while the user browses over the first rows.
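The separate-threads idea can be sketched with a bounded queue between a fetching thread and a processing thread. This is a minimal sketch under simplifying assumptions: `range(10)` stands in for the result set, doubling a value stands in for real processing, and the buffer size of 4 is arbitrary.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the result set


def fetch(source, out_queue):
    """Producer: push rows into the buffer as they arrive."""
    for row in source:
        out_queue.put(row)  # blocks while the buffer is full
    out_queue.put(SENTINEL)


def process(in_queue, results):
    """Consumer: handle rows as soon as they are available."""
    while True:
        row = in_queue.get()
        if row is SENTINEL:
            break
        results.append(row * 2)  # placeholder for real processing


rows = range(10)                # stand-in for a database result set
buffer = queue.Queue(maxsize=4)  # bounds how many rows sit in memory
results = []

fetcher = threading.Thread(target=fetch, args=(rows, buffer))
processor = threading.Thread(target=process, args=(buffer, results))
fetcher.start()
processor.start()
fetcher.join()
processor.join()

print(results)
```

The bounded queue keeps memory use capped while still letting fetching and processing overlap, so neither side has to wait for the other to finish completely.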