I am having some trouble with SSAS and data mining - specifically the Microsoft Clustering package.
I intend to ultimately do my work in AMO and MDX, but for now I'm just happy to understand how it works in BIDS via Visual Studio. One step at a time!
The whole problem is around clustering both "vertically" and "horizontally" (separately) from a table that is organized vertically. My main source data table in my OLTP database looks like this:
ID_NUM - integers 1 to 20,000
TECK_ID - integers 1 to 500, for each ID_NUM (though I just grabbed a few of these for playing around with the data in the screencaps)
TECK_VALUE - a double, the 'fact' bit
So: 10 million rows of two ints and a double.
Which looks like this- http://i.imgur.com/KG1LhaJ.jpg
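In T-SQL terms the table is roughly the following (a simplified sketch - the table name is hypothetical, and the primary key is my assumption that each ID_NUM/TECK_ID pair appears once):

CREATE TABLE dbo.TeckData (
    ID_NUM     int   NOT NULL,  -- 1 to 20,000
    TECK_ID    int   NOT NULL,  -- 1 to 500 for each ID_NUM
    TECK_VALUE float NOT NULL,  -- the 'fact' value
    CONSTRAINT PK_TeckData PRIMARY KEY (ID_NUM, TECK_ID)  -- assumed unique pair
);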
So I create a new Analysis Services project in Visual Studio, set up a Data Source, and bring in the above table, as well as two "dimension tables" (identity of what id_num is, names of what each teck_id is) into a Data Source View and link it up, matching up the appropriate keys.
Which looks like this- http://i.imgur.com/Q0vgwIc.jpg
Next I want to manipulate how my data is represented, so I go to set up a cube from this Data Source View. I create dimensions based on my two "dimension" tables (the above "id_num" primary key one, and the "teck_id" primary key one), and create a single measure (as sum) of the teck_value column from my main table. This all seems to compile successfully.
Which looks like this- http://i.imgur.com/y5pUSjh.jpg
The reason I think everything has worked is that I can arrange my data how I want by browsing the cube. I am able to define my "rows" as either the id_num or the teck_id, with the other one filling up the columns. The measure "Teck_value" always makes up the dataset of the table. This is exactly how I want it: the flexibility to arrange my data both ways.
Which looks like this- http://i.imgur.com/ugLUkgg.jpg
And this- http://i.imgur.com/RwQgj58.jpg
Beautiful! Now I wish to do some mining on this basis!
I wish, quite simply, to use Microsoft Clustering to (separately):
Assign each TECK_ID a cluster number based on how it varies across each ID_NUM
Assign each ID_NUM a cluster based on how it varies across each TECK_ID
Seemingly a simple requirement - just changing what is represented as "rows" and what as "columns" - which I already appear to be able to do through the cube browser. This seems to be one of the main points of OLAP rather than OLTP from my uneducated perspective!
Yet when I try to set this up I fail utterly!
The Clustering Wizard leaves me confounded and I come up with nonsense results. I am given the option of selecting a key (for which I can choose either of the above), but no option to parse by the other dimension. Indeed, the only thing I can choose to mine on is TECK_VALUE, which isn't any good as that doesn't separate out the different fields!
My wizard looks like this- http://i.imgur.com/lHfasv0.jpg
So, I am left in a pickle. I really don't want to go back and line up my OLTP databases horizontally because 1) this would mean having 20k columns when I try to categorize my TECK_IDs, and 2) I was hoping SSAS and OLAP could give me the flexibility I need to mine the fields that I want - isn't that part of the reason you set up a cube, to "chop up the data how you like"?
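For what it's worth, here is a rough sketch of the horizontal layout I'm trying to avoid (hypothetical table name carried over from above; only three of the 500 TECK_ID columns shown):

-- One row per ID_NUM, one column per TECK_ID. Tolerable at 500 columns,
-- but doing the same for ID_NUMs would mean 20,000 columns.
SELECT ID_NUM, [1] AS TECK_1, [2] AS TECK_2, [3] AS TECK_3  -- ... up to [500]
FROM (SELECT ID_NUM, TECK_ID, TECK_VALUE FROM dbo.TeckData) AS src
PIVOT (SUM(TECK_VALUE) FOR TECK_ID IN ([1], [2], [3])) AS pvt;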
Bonus points for helping me with the AMO / MDX side as well! :)
I'm trying to add row numbers (by some sorting column), or at least ranks, to a calculated table inside an Azure Analysis Services tabular model (I need this rank column for TOPNSKIP; it should be a precalculated column in the model's table. Really it will not be a precalculated column for one sorting column only; it will be some expression I want to have precalculated). I used some thoughts from here: DAX - Get current row number
I use Visual Studio with the Microsoft Analysis Services Projects extension to deploy the model to SSAS, and SSMS to debug the queries. But it seems there are differences between DAX requests in SSMS and in the tabular model definitions.
SSMS uses the EVALUATE syntax, while the tabular constructor uses = VAR ... RETURN. It also seems some functions are not available for tabular models - ADDCOLUMNS and RANKX appear not to work there. Also, "," is used as the argument separator in SSMS and ";" in the model's DAX editor (possibly due to some internationalization options on my computer), which is not very convenient.

I could not find a way to get an error report for a bad DAX request in the model definition. For an incorrect request I only see that 0 records were retrieved while I deploy the model, and if I try to query the table after deploying I get an error message that it's empty because of the incorrect construction, but no details.

There is also a possibility to use calculated columns during the table construction, but I could not embed the RANKX function in the formula for them. I also don't know how to use a MEASURE inside a construction function. The filter approach from the mentioned Stack Overflow question was very slow on my amount of data; I waited about half an hour and then stopped the deployment.
Because I can't get ADDCOLUMNS to work, I've tried GENERATE and SELECTCOLUMNS. For example:
=
VAR NewTable = GENERATE(BaseTable; ROW("RowNumber"; RANKX(BaseTable; BaseTable[SortName];;1)))
RETURN NewTable
It works nicely in SSMS but can't be used to construct a table, while:
=
VAR NewTable = GENERATE(BaseTable; ROW("RowNumber"; 50))
RETURN NewTable
works fine everywhere. So I think RANKX is not working in the tabular model definitions.
Is there any way to rank or number rows in an SSAS calculated table by some sorting column?
Adding high-cardinality columns such as an index number column is generally not advisable for tables in a tabular model, since these columns compress very poorly, thereby reducing query performance - especially if the table is large. But if you must do it, use the technique shown here. If you don't want to add a calculated column to an existing table, you can wrap everything in a call to ADDCOLUMNS:
=
ADDCOLUMNS(
    BaseTable,
    "RowNumber",
    -- rank each row by counting the rows that sort before it
    VAR CurrentSortName = BaseTable[SortName]
    RETURN
        COUNTROWS(
            FILTER(BaseTable, BaseTable[SortName] < CurrentSortName)
        ) + 1
)
How do I link my activities variable to only the corresponding KPIs variable?
Using guidance from a number of sources, but primarily the genius of Jeffery Shafer articulated through the SuperDataScience video, I built a Sankey Diagram for my work. For the most part it works; however, I have been trying to figure out how to adjust my Sankey Diagram model to line up each activity with ONLY the corresponding KPIs, but am having no luck.
The data structure looks like this:
You'll note I changed the binary values to 1, 2 instead of 0, 1, as it makes visual calculations easier. For the "Viz" variable, I have "Activity" for the raw data set; then I copy/paste/replicate the data to mirror it (required for the model), but with "KPI" for the mirrored data.
In the following image, you'll see my main issue is that the smallest represented activity still shows as corresponding to all KPIs when in fact it does not. I want each activity to line up only with its corresponding KPIs, as some activities don't correspond with all, or even any, KPIs.
Finally, here is the model very similar to what the above video link shows:
Can someone help provide insight into how I can adjust the model to fit activities linking only to corresponding KPIs? I appreciate any insight. Thanks!
I have a solution to the issue, thanks to a helpful Tableau support member named Anthony. It was in the data structure. The data was structured so that every "Activities" value was associated with every "KPI" value, rather than only with its corresponding "KPI" values as Tableau requires. As a result, to achieve the desired result, the data needs to be restructured to contain a row only for every valid "Activities" and "KPI" combination. See the visual below, where data is removed to format properly:
[before -> after: rows for non-corresponding Activities/KPI pairs removed]
Once the table is restructured, the desired visual result should configure with the model. It works like a charm!
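If the source data happened to live in a SQL table, the restructure would amount to something like this (entirely hypothetical names - the point is just to drop the cross-join filler rather than keep every pairing):

-- Keep only Activity/KPI rows that appear in the list of valid pairs.
DELETE s
FROM dbo.SankeyData AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.ValidActivityKPI AS v
    WHERE v.Activity = s.Activity
      AND v.KPI      = s.KPI
);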
Good luck out there!
I am reviewing a coworker's sqlgen job and I am unable to figure out what this means in the table generation settings.
Specify number of rows by: "Same as mapped data"
My coworker has this selected on each table; I just need to know what is meant by it. I have looked through the documentation and been unable to find a definition.
I am on version 2 at the moment. Probably not the best question, but I need an answer: he is gone for a long period of time and our data is not working correctly with this tool.
The "Same as mapped data" option is only available when you're using an existing table or view as a data source - it just means that the generator will insert all the rows from the source table or view. The other options are:
Numeric value - a set number of rows
Proportion of table - a proportion of the source table/view
Generation time - as much data as the tool can generate in a set time
There's a little more about using an existing table/view as a data source here on the website, but it doesn't have much else useful in it.
I am building a data warehouse for the core ERP application of the company I am working for, for a particular client.
Most of the data in the source database that relates to hierarchies in the data warehouse is stored in columns, as shown below:
But traditionally, to my knowledge, the model for storing dimension data is as follows:
I could pivot the data and fit it into the model shown above. But the issue comes when a user introduces a new hierarchy value. Say, for instance, the user in the future decides to define a new level called Product Sub Category. Then my entire data warehouse model will collapse, with no way to accommodate the newly defined hierarchy level.
Do let me know a way to overcome this situation.
I hope my question is clear enough. Just let me know if further details are needed.
Well, nothing should collapse -- the ETL should extract and load the data as always.
Here are a few options to consider:
Simply add one more column to dimProduct for the new hierarchy level.
Try using a hierarchy helper table.
Consider adding a path string attribute to dimProduct (see the sketch after this list).
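To make the path string idea concrete, here is a rough T-SQL sketch (hypothetical names). Each dimProduct row stores its parent, and the lineage is derived as a path string, so a new level like Product Sub Category becomes one more node in the path rather than a new column:

-- Parent-child dimProduct: hierarchy depth is data, not schema.
CREATE TABLE dbo.dimProduct (
    ProductKey       int           NOT NULL PRIMARY KEY,
    ParentProductKey int           NULL,   -- NULL for top-level members
    ProductName      nvarchar(100) NOT NULL
);

-- Derive the path string for every member with a recursive CTE.
WITH h AS (
    SELECT ProductKey,
           CAST('/' + ProductName AS nvarchar(4000)) AS HierarchyPath
    FROM dbo.dimProduct
    WHERE ParentProductKey IS NULL
    UNION ALL
    SELECT c.ProductKey,
           CAST(h.HierarchyPath + '/' + c.ProductName AS nvarchar(4000))
    FROM dbo.dimProduct AS c
    JOIN h ON c.ParentProductKey = h.ProductKey
)
SELECT ProductKey, HierarchyPath FROM h;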
At the moment the team I am working with is looking into the possibility of storing data which is entered by users from a series of input wizard screens as an XML blob in the database. The main reason for this is that I would like to write the input wizard as a component which can be brought into a number of systems without having to bring a large table structure with it.
To try to clarify: if the wizard has 100 input fields (for example), then if I go with the normal relational DB structure there will be a 1-to-1 relationship, so I will have 100 columns in the database. So to get this working in another system I will have to bring the tables, stored procedures, etc. into the new system.
I have a number of reservations about this, but I would like people's opinions.
thanks
If those inputted fields don't need to be updated or used later for calculating or computing other values, using XML or JSON is a smart choice.
So for your scenario it seems like a perfect solution.
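For example, a minimal sketch of the single-table design (hypothetical names):

-- One generic table holds every wizard submission as an XML blob, so the
-- wizard component ships with one table instead of ~100 columns.
CREATE TABLE dbo.WizardSubmission (
    SubmissionID int           IDENTITY(1,1) PRIMARY KEY,
    WizardName   nvarchar(100) NOT NULL,
    SubmittedAt  datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
    Payload      xml           NOT NULL  -- all the wizard's input fields
);

-- Individual fields can still be pulled out ad hoc if ever needed
-- (the XML shape here is made up for illustration):
SELECT SubmissionID,
       Payload.value('(/wizard/field[@name="email"])[1]', 'nvarchar(200)') AS Email
FROM dbo.WizardSubmission;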