I want to break up one of the partitions of my measure group, which has 450 million rows, into sub-partitions to make processing and querying faster. I currently split my measure group on attr1 of dimension1. Is it possible to split each partition further based on a different attribute, attr2, of the same dimension? I know that I can write a SQL query to do this; what I want to know is how to create the slice hint so that SSAS knows to look at the sub-partitions.
I tried something like this in the slice expression for the partition:
{dim1.Attr1&val1, dim1.Attr2&value2}
but processing failed with an error saying the tuple can't contain two different dimensions!?
What can I do here? Can multiple partitions have the same slice hint? Will giving the same slice hint to all related partitions solve my problem?
I verified that setting the same Slice property on all the related partitions solves my problem.
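As an aside on the error itself: in MDX every attribute hierarchy behaves as its own dimension, and braces { } build a set, whose members must all come from a single hierarchy. Parentheses build a tuple, which can combine members from different attribute hierarchies of the same dimension. A minimal sketch of a slice along both attributes, with hypothetical member names (whether this fits your exact partitioning scheme is an assumption):

```mdx
-- Hypothetical names. ( ) builds a tuple across attribute hierarchies,
-- whereas { } builds a set and requires members of one hierarchy.
([Dimension1].[Attr1].&[val1], [Dimension1].[Attr2].&[val2])
```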
They say there are no stupid questions, but this might be an exception.
I understand that BigQuery, being a columnar database, does a full scan of every column referenced in a query.
I also understand that query results can be cached or a named table can be created with the results of a query.
However, I also see tabledata.list() in the documentation, and I'm unsure how it fits in with query costs. Once a table has been created from a query, am I free to access that table without cost through the API?
Let's say, for example, I run a query that is grouped by UserID, and I want to then present the results of that query to individual users based on that ID. As far as I understand, there are two obvious ways of getting the appropriate row out:
1. I can write another query over the destination table with a WHERE userID=xxx clause.
2. I can use the tabledata.list() endpoint to get all the (potentially paginated) data and pick out the appropriate row myself in my code.
Situation 1 would incur a query cost, while situation 2 would not. Am I getting this right?
The tabledata.list API is free, as it does not use the BigQuery engine at all,
so you are right on both 1 and 2.
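For concreteness, tabledata.list is a plain REST read of the table's stored rows; no SQL is executed, hence no query cost (the destination table does still accrue normal storage charges). Per the v2 REST API, the request looks roughly like this, with the path parameters being your own project/dataset/table IDs:

```
GET https://www.googleapis.com/bigquery/v2/projects/{projectId}/datasets/{datasetId}/tables/{tableId}/data?maxResults=500&pageToken=<token-from-previous-response>
```

The pageToken parameter is how the pagination mentioned in the question works: each response includes a token for fetching the next page.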
Our use case for BigQuery is a little unique. I want to start using date-partitioned tables, but our data is very much eventual: it doesn't get inserted when it occurs, but eventually, when it's provided to the server. At times this can be days or even months after the fact. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question: is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still get the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using date-partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo-column will still work for you!
The only thing you need to do is insert/load each batch of data into its respective partition by referencing not just the table name but the table with a partition decorator - like yourtable$20160718.
This way you can load data into the partition it belongs to.
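A minimal sketch with the bq command-line tool (dataset, bucket, and file names are all hypothetical; quoting the table reference keeps the shell from eating the $):

```sh
# Load a batch directly into the 2016-07-18 partition, whenever it arrives:
bq load 'mydataset.yourtable$20160718' gs://mybucket/events_20160718.csv ./schema.json

# The same decorator works as a query destination:
bq query --destination_table='mydataset.yourtable$20160718' \
  'SELECT user_id, event_ts FROM mydataset.staging WHERE DATE(event_ts) = "2016-07-18"'
```

Queries can then prune partitions by filtering on the _PARTITIONTIME pseudo-column.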
I am testing Pentaho 5.3 with Impala. My schema defines a dimension by relating it to its dimension table. However, when I try to use the Filter, Mondrian tries to obtain the dimension members from the fact table, and since that table is big it takes a long time just to load the members to filter by. I do use approxRowCount in my dimension definition.
I also have an installation of Pentaho 5.0 with PostgreSQL using a similar dataset and the exact same schema, and when I use the Filter there, dimension members load instantaneously. So it seems to me the issue is not schema-related.
Could anyone tell me whether this behavior (Mondrian aggregating dimension data from the fact table instead of the dimension table when the Filter is used) is due to Pentaho settings, or what else could cause it?
Thank you in advance!
If anyone else ever wonders about this behavior, it could be related to the fact that joins are much more efficient in PostgreSQL than in a distributed SQL-on-Hadoop engine like Impala. I thought that, when using the Filter, Mondrian obtains the members from the dimension table if one is provided. However, it seems it joins the dimension and fact tables before displaying the members.
Hope that helps!
I have a 260-column table in SQL Server. When we run "Select count(*) from table" it takes almost 5-6 to get the count. The table contains close to 90-100 million records across 260 columns, and more than 50% of the columns contain NULLs. Apart from that, users can also build dynamic SQL queries against the table from the UI, so searching 90-100 million records takes time to return results. Is there a way to improve search on a SQL table where the filter criteria can be anything? Can anyone suggest the fastest way to get aggregate data over 25 GB of data? The UI shouldn't hang or time out.
Investigate horizontal partitioning. This will really only help query performance if you can force users to put the partitioning key into the predicates.
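A minimal sketch of what that could look like, range-partitioning on a hypothetical date column (names are illustrative, not from the question):

```sql
-- Hypothetical names throughout. Range-partition the big table by year so
-- predicates on EventDate let SQL Server touch only the relevant partitions.
CREATE PARTITION FUNCTION pf_ByYear (date)
    AS RANGE RIGHT FOR VALUES ('2013-01-01', '2014-01-01', '2015-01-01');

CREATE PARTITION SCHEME ps_ByYear
    AS PARTITION pf_ByYear ALL TO ([PRIMARY]);

-- Rebuilding the clustered index on the scheme moves the data:
CREATE CLUSTERED INDEX cix_BigTable
    ON dbo.BigTable (EventDate)
    ON ps_ByYear (EventDate);
```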
Try vertical partitioning, where you split one 260-column table into several tables with fewer columns. Put all the values which are commonly required together into one table. Queries will then only reference the table(s) which contain the required columns. This gives you more rows per page, i.e. fewer pages per query.
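A sketch with hypothetical names: hot columns in one narrow table, the wide remainder split out and joined on the shared key only when needed:

```sql
-- Frequently queried columns live in the narrow table.
CREATE TABLE dbo.BigTable_Core (
    RowID      bigint        NOT NULL PRIMARY KEY,
    CustomerID int           NOT NULL,
    EventDate  date          NOT NULL,
    Amount     decimal(12,2) NULL
);

-- The mostly-NULL, rarely-used columns move to a companion table.
CREATE TABLE dbo.BigTable_Extended (
    RowID bigint        NOT NULL PRIMARY KEY
        REFERENCES dbo.BigTable_Core (RowID),
    Notes nvarchar(max) NULL
    -- ...the remaining rarely-used columns
);
```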
You have a high fraction of NULLs. Sparse columns may help, but calculate your percentages first, as they can hurt if used inappropriately. There's an SO question on this.
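Marking a mostly-NULL column sparse is a one-line change (hypothetical column name; whether it clears Microsoft's documented NULL-percentage thresholds is something to verify first):

```sql
-- NULLs in a sparse column take no space, but non-NULL values get wider,
-- so this only pays off when the column is overwhelmingly NULL.
ALTER TABLE dbo.BigTable
    ALTER COLUMN RarelyPopulatedCol varchar(50) SPARSE NULL;
```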
Filtered indexes and filtered statistics may be useful if the DB often runs similar queries.
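For the repeated-query case, a filtered index covers just the slice those queries touch; a sketch with hypothetical columns:

```sql
-- Index only the non-NULL rows that the recurring queries actually filter on.
CREATE NONCLUSTERED INDEX ix_BigTable_Status
    ON dbo.BigTable (StatusCode, EventDate)
    WHERE StatusCode IS NOT NULL;
```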
As others have said in the comments, you need to analyse a few of the queries and see which indexes would help you the most. If your queries do a lot of text searching, you could use the full-text search feature of SQL Server.
Things that came to mind (sketches for the first three follow the list):
[SQL Server 2012+] If you are using SQL Server 2012, you can use the new Columnstore Indexes.
[SQL Server 2005+] If you are filtering a text column, you can use Full-Text Search
If you have some function that you apply frequently to some column (like SOUNDEX of a column, for example), you could create a PERSISTED computed column so you don't have to compute that value every time.
Use temp tables (indexed ones will be much better) to reduce the number of rows to work on.
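Hedged sketches of options 1-3, with hypothetical table, column, and index names throughout:

```sql
-- 1. Columnstore index (SQL Server 2012+). Note: in 2012 a nonclustered
--    columnstore makes the table read-only; updatable variants came later.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_BigTable
    ON dbo.BigTable (CustomerID, EventDate, Amount);

-- 2. Full-text search (SQL Server 2005+) on a text column:
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.BigTable (Description)
    KEY INDEX pk_BigTable;  -- the table's unique key index
SELECT RowID FROM dbo.BigTable WHERE CONTAINS(Description, N'widget');

-- 3. Persisted computed column: SOUNDEX is stored once and becomes indexable.
ALTER TABLE dbo.BigTable
    ADD NameSoundex AS SOUNDEX(CustomerName) PERSISTED;
CREATE INDEX ix_BigTable_NameSoundex ON dbo.BigTable (NameSoundex);
```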
@Twelfth's comment is very good:
"I think you need to create an ETL process and start changing this into a fact table with dimensions."
Changing my comment into an answer...
You are moving from a transactional world, where these 90-100 million records are recorded, into a data-warehousing scenario where you are now trying to slice, dice, and analyze the information you have. There's no easy solution, but odds are you're hitting the limits of what your current system can scale to.
In a past job, each record had several (6) fields that were pretty much free text, randomly populated depending on where the data was generated (they were search queries, and people entered roughly what they would enter in Google). With 6 fields like this, I created a dim_text table that took each distinct entry from any of those 6 columns and replaced it with an integer. This left me a table with two columns, text_id and text. Any time a user searched for a specific entry in any of those 6 columns, I would first search the dim_text table, which was optimized (indexed) for this sort of query, to get the integer matching the query, and then search for all occurrences of that integer across the 6 fields in the main table. Searching one table highly optimized for this type of free-text lookup and then querying the main table for instances of the integer is far quicker than searching 6 free-text fields directly.
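A hedged reconstruction of that layout in T-SQL (all names hypothetical; the original schema isn't shown):

```sql
-- Every distinct free-text value is stored once and mapped to an integer.
CREATE TABLE dbo.dim_text (
    text_id int IDENTITY(1,1) PRIMARY KEY,
    text    nvarchar(400) NOT NULL UNIQUE  -- the UNIQUE index serves the lookups
);

-- Search the one optimized table first, then scan integers, not text:
DECLARE @id int = (SELECT text_id FROM dbo.dim_text WHERE text = N'some query');

SELECT *
FROM dbo.fact_main
WHERE @id IN (text_id1, text_id2, text_id3, text_id4, text_id5, text_id6);
```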
I'd also create aggregate tables (reporting tables, if you prefer the term) for your common aggregates. There are quite a few options here that your business setup will determine. For example, if each row is an item on a sales invoice and you need to show sales by date, it may be better to aggregate total sales by invoice and save that to a table; then, when a user wants totals by day, an aggregate is run on the aggregate of the invoices to determine the totals by day (so you've 'partially' aggregated the data in advance).
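A small sketch of that two-stage rollup (hypothetical names):

```sql
-- Stage 1: collapse invoice lines to one row per invoice, stored permanently.
CREATE TABLE dbo.agg_invoice_totals (
    InvoiceID   bigint        PRIMARY KEY,
    InvoiceDate date          NOT NULL,
    TotalAmount decimal(14,2) NOT NULL
);

INSERT INTO dbo.agg_invoice_totals (InvoiceID, InvoiceDate, TotalAmount)
SELECT InvoiceID, MIN(InvoiceDate), SUM(LineAmount)
FROM dbo.fact_invoice_lines
GROUP BY InvoiceID;

-- Stage 2: daily totals come from the much smaller invoice-level table.
SELECT InvoiceDate, SUM(TotalAmount) AS SalesForDay
FROM dbo.agg_invoice_totals
GROUP BY InvoiceDate;
```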
Hope that makes sense...I'm sure I'll need several edits here for clarity in my answer.
We have a problem with a Fact table in our cube.
We know it is not best practice when developing a dimensional database, but we have the dimension and fact table combined in one table.
This is because there isn't much dimensional data (5 fields). But moving on to the problem.
We have added this table to our cube and, for testing, added one measure (a count of rows). As the image showed, we get the grand total for every sub-category, which isn't correct.
Does anyone have an idea where we should look for the problem?
Kind regards,
Phoenix
You have not defined a relationship between your sub-category dimension and your fact table. That has resulted in the full count mapping to all of the sub-category attributes, hence the same value repeating.
Add a relationship between the cube measure group and your dimension on the second tab (Dimension Usage) in the cube (it's 'Regular' and key-level on both sides in most cases).
If this relationship already exists, try recreating it. Sometimes this is needed after several manual changes in 'advanced' mode.
Check the dimension mapping in the fact table. If everything is OK, try adding a new dimension with only one level at first, then add another, etc. I know it sounds like shamanic tricks, but still...
And always use SQL Server Profiler on both servers (SQL and SSAS) to capture the exact query that returns the wrong value. Maybe the mistake is somewhere else.