Add attribute to cube and reprocess without original database - sql

Every year we keep a historical copy of one of our cubes. This year someone decided they wanted to pay us money to add an attribute to the cube which did not previously exist. Fine, I like money, but the issue is we don't have a backup of the database that we built this cube from.
So a question arises in my head: do we need that original database to add a new attribute to this cube? Is it possible for us to add a new attribute to the cube and process only this attribute without having the cube's original data source?
Not having a great understanding of what is happening under the hood when I add an attribute to an SSAS cube and process it, I can't say whether this is or isn't possible. I could imagine that the cube has a snapshot in memory of the data source that it can work from. I can also imagine that this would be ridiculously inefficient, so there is a chance this is no way in heck possible.
EDIT: It at least would seem feasible to add a calculated member that makes use of existing data in the cube.
I also should mention that I tried to add an attribute to such a cube and received an error:
"Dimension [Partner] cannot be saved File system error failed to copy
file C:\\MYSQLSERVER\OLAP\DATA\2013_Cube.db\\.dim\.dstore to C:\\MYSQLSERVER\OLAP\DATA\2013_Cube.db\\.dim\.dstore file exists"
Sorry I faked those filepaths a little.

This task is very difficult. The only way I can imagine would be to manually reconstruct the original database based on the Data Source View (it has cached metadata), and then try to generate the data to populate it using an SSAS query tool (e.g. Excel, SSRS, OLE DB Provider for Analysis Services).
If you want to add one attribute in a dimension, you might be able to limit that effort to the source data for the dimension in question.

First, let me explain how a cube stores its data, following the steps of the process.
Get the data source. That is, get access to the original databases, files, etc. At this point all the data is at the primary source and is normalized one way or another.
Construct a data warehouse (ELT process). At this point you combine all your data in a denormalized warehouse, without foreign keys or any other constraints. All data is now in an intermediate state in a denormalized SQL database, ready to be used in the cube.
Construct the OLAP cube. The data warehouse is now your data source. All data is now aggregated in rows inside the cube with the corresponding values. The redundancy is enormous and the data is 100% denormalized; it hardly seems to follow a pattern (of course it does, but the pattern is not always easy to understand).
An example at this state would be a row like this
Company -> Department -> Room | Value(Employees)
ET LTD -> IT -> Room 4 | 4
Exactly the same row would exist for Value(Revenue).
So in essence all the data exists inside the SSAS database (the cube).
Reconstructing the database would mean a great deal of reverse engineering.
You could write a new C# program that uses MDX connectors and queries to get the data, and SQL Server connectors to save it in an OLTP database. MDX has a steep learning curve and few resources online, so this method is not advisable.
There is no way that I know of to get the data from Excel, as Excel fetches the pivot table data dynamically from the data connection.
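As a rough illustration of the reverse engineering described above, the sketch below takes rows that have already been extracted from the cube (the MDX extraction itself is not shown) and splits them back into a dimension table and a fact table. It uses Python and the standard-library sqlite3 module instead of C# and SQL Server, purely to stay self-contained. The table names, column names, and all sample rows other than the ET LTD one are assumptions made up for the example.

```python
import sqlite3

# Rows as they might come back from the cube: one row per member path and measure.
# The first row mirrors the example above; the others are invented sample data.
cube_rows = [
    ("ET LTD", "IT", "Room 4", "Employees", 4),
    ("ET LTD", "IT", "Room 4", "Revenue", 1000),
    ("ET LTD", "HR", "Room 1", "Employees", 2),
]

def rebuild_tables(rows):
    """Split flattened cube rows into a dimension table and a fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dim_room ("
        " room_key INTEGER PRIMARY KEY, company TEXT, department TEXT, room TEXT,"
        " UNIQUE (company, department, room))"
    )
    conn.execute(
        "CREATE TABLE fact_measure (room_key INTEGER, measure TEXT, value REAL)"
    )
    for company, department, room, measure, value in rows:
        # The UNIQUE constraint makes repeated member paths a no-op.
        conn.execute(
            "INSERT OR IGNORE INTO dim_room (company, department, room)"
            " VALUES (?, ?, ?)",
            (company, department, room),
        )
        (room_key,) = conn.execute(
            "SELECT room_key FROM dim_room"
            " WHERE company = ? AND department = ? AND room = ?",
            (company, department, room),
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_measure VALUES (?, ?, ?)", (room_key, measure, value)
        )
    return conn

conn = rebuild_tables(cube_rows)
dim_count = conn.execute("SELECT COUNT(*) FROM dim_room").fetchone()[0]
fact_count = conn.execute("SELECT COUNT(*) FROM fact_measure").fetchone()[0]
print(dim_count, fact_count)
```

Note that values read from a cube are already aggregated, so a table rebuilt this way holds data at the cube's grain, not the original transaction-level rows.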

Related

How to decide on a data warehouse schema for clinical dataset?

I am trying to design a schema for a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events so that users can query the dataset with ease, possibly using a star schema. Almost all of the entries, such as diagnoses, procedures, notes, charts, etc., are related to a single admission. I had a few things in mind, but I have no experience, so I am finding it difficult to figure out the best way to do this.
Create multiple fact tables, for example I would have one fact table for diagnoses, one for procedures, one for lab notes but this just seems like too many fact tables with little to gain. Like I could have a fact table with an entry for each diagnosis per user per admission but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission and multiple columns/dimensions such as diagnoses, procedures, etc. The issue with this is that for most of them there are multiple diagnoses per admission, so I will have to link to a bridge table in most fact tables, and then it would look like the image below. The drawback is the number of joins the queries would require.
The third way that I read about is using something like an entity-attribute-value (EAV) fact table, where each row is one fact. For example, one row could be something like (patientid - admissionid - (Attribute) Heart Rate Reading - (VALUE) 120bpm). This would create a single fact table with almost everything inside and fewer joins required, but because of the nature of EAV it would require the end user to pivot the table after querying, making it more complicated for the end user. It would look like the i2b2 star schema.
The last way I thought of was adding an entry per event to the fact table but having many columns in the fact table to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.), in which patientid and admissionid will be in all rows but the rest will depend on the entry, so one entry could have just patientid, admissionid and a single procedure. Because of my lack of experience, I don't know what the end result would be like in terms of querying. I also don't know whether having all these entries with almost every column being irrelevant is the way to go.
Any help would be greatly appreciated, I'm trying to have this implemented into BigQuery.
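To make the pivoting cost of the EAV option above more concrete, here is a minimal sketch in plain Python of what the end user would have to do after querying such a fact table. The attribute names and values are invented sample data, and the tuple layout is an assumption about how the rows might be returned.

```python
from collections import defaultdict

# EAV-style fact rows: (patient_id, admission_id, attribute, value).
# All values here are invented sample data.
eav_rows = [
    (1, 100, "heart_rate", 120),
    (1, 100, "systolic_bp", 135),
    (2, 200, "heart_rate", 80),
]

def pivot_eav(rows):
    """Turn EAV rows into one dict per (patient, admission), keyed by attribute."""
    wide = defaultdict(dict)
    for patient_id, admission_id, attribute, value in rows:
        wide[(patient_id, admission_id)][attribute] = value
    return dict(wide)

wide_rows = pivot_eav(eav_rows)
for key, attributes in sorted(wide_rows.items()):
    print(key, attributes)
```

In this sketch a repeated attribute for the same admission overwrites the earlier value, so real event data would probably need a timestamp in the key.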
Some tests have concluded that reducing the number of joins can improve BigQuery's performance. In other words, if you denormalize most of your data into a single table you will probably get better results, since you won't need to perform many joins.
There are some points that you should think about when deciding the data modeling:
Is it easy to write queries in your model?
If you need many joins, it will probably be difficult to write queries against your model. If you use nested fields to reduce the number of joins, you can simplify the queries you write. However, if you create very complex nested structures, the model will become hard to query again.
Is it easy to join data in your model?
If you have many tables to be joined, your data will be difficult to join. The more denormalized the data is, the easier it is to join.
Easy to update data
If you need to update your data, consider that denormalization can help you: if you reduce the number of tables, you will need to update fewer tables. It's important to note that very complex nested fields will also be difficult to update.
Make data understandable
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get better performance by denormalizing your data, as long as the data remains understandable.
Use nested fields to denormalize the model, but try not to create very complex structures (more than 2 levels of nesting).
Keep in mind that your model will need more disk space when you denormalize it so your billing for storage in BigQuery will probably be higher.
I hope it helps
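As a small illustration of the nested-field advice above, the sketch below folds a child table into its parent as a repeated field with one level of nesting, and prints the result as newline-delimited JSON. The table layout, field names, and codes are assumptions chosen for the example, not the actual MIMIC-III schema.

```python
import json

# Flat rows as they might exist in two separate tables (invented sample data).
admissions = [
    {"admission_id": 100, "patient_id": 1},
    {"admission_id": 200, "patient_id": 2},
]
diagnoses = [
    {"admission_id": 100, "code": "I10"},
    {"admission_id": 100, "code": "E11"},
    {"admission_id": 200, "code": "J45"},
]

def nest_diagnoses(admissions, diagnoses):
    """Embed each admission's diagnoses as a repeated field (one nesting level)."""
    by_admission = {}
    for d in diagnoses:
        by_admission.setdefault(d["admission_id"], []).append({"code": d["code"]})
    return [
        {**a, "diagnoses": by_admission.get(a["admission_id"], [])}
        for a in admissions
    ]

nested = nest_diagnoses(admissions, diagnoses)
for row in nested:
    # One JSON object per line.
    print(json.dumps(row))
```

With this shape, a query on admissions and their diagnoses reads a single table instead of joining two, at the cost of having to unnest the repeated field when filtering on individual diagnoses.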
Initial Reading
Try reading The Data Warehouse Toolkit; chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
But what you should try and figure out is what is the important and high value data, and what is the less important and low value data. Only model and load into the database the high value data. If you try and build the perfect data model, you are never going to deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind. E.G. Blood Pressure Normal, Heart Rate High, etc.
Prescription drugs are another example of low value information. In the database you may set a flag if multiple prescriptions are present or something similar. Freeform notes are another. Once some cohort has been decided on based on many other factors, you enable a data scientist to process the note with some machine learning but doing this for all notes is not very useful.
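The kind of ETL determination described above could look roughly like the following sketch, which collapses the raw heart-rate readings for an admission into a single label. The cut-off values, the use of a simple average, and the "Unknown" label are assumptions for illustration only; they are not clinical guidance.

```python
from statistics import mean

# Hypothetical cut-offs for the example only; they are not clinical guidance.
HEART_RATE_LOW = 60
HEART_RATE_HIGH = 100

def summarize_heart_rate(readings):
    """Collapse the raw readings for one admission into a single label."""
    if not readings:
        return "Heart Rate Unknown"
    average = mean(readings)
    if average > HEART_RATE_HIGH:
        return "Heart Rate High"
    if average < HEART_RATE_LOW:
        return "Heart Rate Low"
    return "Heart Rate Normal"

# Raw readings keyed by admission ID (invented sample data).
raw_readings = {100: [118, 122, 120], 200: [72, 75, 70], 300: []}
determinations = {
    admission_id: summarize_heart_rate(readings)
    for admission_id, readings in raw_readings.items()
}
print(determinations)
```

Only the resulting label would be loaded into the database; the individual readings stay in the data lake.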
Group/Junk Dimensions
A number of these measurement determinations could be lumped together in a group/junk dimension as a way of maintaining the grain of the fact and keeping the fact from exploding with low value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long execution times for specific types of data extracts. You can then design a junk dimension to serve these higher value measurement determinations, enabling additional aggregation or slicing before the insightful data is extracted from the data lake.
File Structure
In your data lake I would have many file schemas for the low value data. These could be JSON, parquet, csv, or whatever you prefer. You would include the data needed to connect it back to the fact along with the data specific to the file type.
PatientId
AdmissionId
MeasureType
MeasureValue
Date
Time
The point is most of this data will never be looked at, but occasionally a few records are high value. You just don't know which they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language or platform to create these files: Python, C#, an Azure Function App, AWS Lambda, whatever. This will depend on your own skills and the resources available.
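For instance, a minimal Python sketch that writes the file schema listed above as CSV might look like this. The field names come from the list; the sample record, the date and time formats, and the use of an in-memory buffer in place of a real data lake file are assumptions.

```python
import csv
import io

# Field names taken from the file schema listed above.
FIELDS = ["PatientId", "AdmissionId", "MeasureType", "MeasureValue", "Date", "Time"]

def write_measurements(records, out):
    """Write measurement records as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

# Invented sample record.
records = [
    {"PatientId": 1, "AdmissionId": 100, "MeasureType": "HeartRate",
     "MeasureValue": 120, "Date": "2020-01-01", "Time": "08:30:00"},
]

buffer = io.StringIO()  # stands in for a file in the data lake
write_measurements(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Adding a field later only means extending `FIELDS`, which is the flexibility the data lake approach relies on.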

SSAS - adding new fields to fact table without errors

gurus!
I've inherited an SSAS 2014 multidimensional cube at work. I've been doing SQL Server database work (queries, tables, stored procs, etc) for many years now. But I'm a complete SSAS newbie. And, even in my ignorance, I can tell that this cube I've inherited is a mess!
I've been able to keep the thing updated with new data each month, but now our company has rolled out a new product and I'm having to add five new fields to the fact table / view for keys related to that product, along with the related dimension views. I've taken a couple of shots at it, but wind up hitting numerous errors when I process the fact table partitions.
BTW, heading off the natural question, there's no way I can roll the "five new fields" data into fields that already exist unless I completely rebuild the cube from scratch, which is out of the question right now.
So, I'll try to boil down what I THINK is the problem here. Hoping someone can answer my question.
The fact data is located in four different data warehouse databases (names changed to protect company data) -
DB_Current
DB_2018
DB_2017
DB_2016
There is a fact view within each of those databases to stage the fact data. That view is called "vw_fact" and is identical across all databases. When that view gets pulled into the cube, it gets partitioned into four different partitions (per month-year) due to data size.
The new product was just rolled out this year, so I added the five new fields to "vw_fact" only in "DB_Current". I didn't change the prior years' views in their respective databases. My shot-in-the-dark guess was that the prior years' views would automatically line up with the current year's view on the matching field names, without needing the new fields.
When I tried processing the four years' worth of partitions, I then ran into numerous "field doesn't exist errors".
So, my questions are these:
1. Do I have to add five new fields to ALL FOUR views? That is, the individual views within all four years' databases?
2. If I have to do #1 above, do I then need to run a "Process Full" on all partitions for all four years? Or do I need to run one of the other process options?
Thank you so much in advance for any advice you can offer here!
Joel
You need matching result sets for the source queries of all partitions. That doesn't necessarily mean you have to add the fields to all the views, though. You can edit the source query for each partition in Visual Studio. If for some reason you don't want to edit the four views (editing them is what I would probably do), you could hard-code the surrogate key for the unknown member, or something similar, in the queries of the partitions where the new fields are not relevant (if they are dimension foreign keys; use 0 or something similar if they are measures). If you have new dimensions I would go for a Process Full.
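The idea of giving every partition the same result-set shape can be sketched as follows. This Python snippet only builds the query text, one query per source database with any partition filter omitted. The column names, the `dbo` schema, and the unknown-member key of -1 are hypothetical placeholders, not values taken from the actual cube.

```python
# Hypothetical names: the base columns, the five new key columns, and the
# unknown-member key are placeholders, not taken from the actual cube.
BASE_COLUMNS = ["date_key", "customer_key", "amount"]
NEW_COLUMNS = ["new_key_1", "new_key_2", "new_key_3", "new_key_4", "new_key_5"]
UNKNOWN_MEMBER_KEY = -1

def partition_query(database, has_new_columns):
    """Build a partition source query whose column list matches all the others."""
    select_list = list(BASE_COLUMNS)
    for column in NEW_COLUMNS:
        if has_new_columns:
            select_list.append(column)
        else:
            # Older views lack the column, so supply a constant under the same name.
            select_list.append(f"{UNKNOWN_MEMBER_KEY} AS {column}")
    return f"SELECT {', '.join(select_list)} FROM {database}.dbo.vw_fact"

queries = {
    db: partition_query(db, has_new_columns=(db == "DB_Current"))
    for db in ["DB_Current", "DB_2018", "DB_2017", "DB_2016"]
}
for db, sql in queries.items():
    print(db, "->", sql)
```

Each generated statement returns the same eight columns in the same order, which is the property the partitions need.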

Google BigQuery move to SQL Server, Big Data table optimisation

I have a curious question and as my name suggests I am a novice so please bear with me, oh and hi to you all, I have learned so much using this site already.
I have an MSSQL database for customers where I am trying to track their status on a daily basis, with various attributes being recorded in several tables, which are then joined together using a data table to create a master table which yields approximately 600million rows.
As you can imagine querying this beast on a middling server (Intel i5, SSD HD OS, 2tb 7200rpm HD, Standard SQL Server 2017) is really slow. I was using Google BigQuery, but that got expensive very quickly. I have implemented indexes which have somewhat sped up the process, but still not fast enough. A simple select distinct on customer id for a given attribute is still taking 12 minutes on average for a first run.
The whole point of having a daily view is to make it easier to have something like tableau or QLIK connect to a single table to make it easy for the end user to create reports by just dragging the required columns. I have thought of using the main query that creates the master table and parameterizes it, but visualization tools aren't great for passing many variables.
This is a snippet of the table, there are approximately 300,000 customers and a row per day is created for customers who join between 2010 and 2017. They fall off the list if they leave.
My questions are:
1) Should I even bother creating a flat file, or should I just parameterize the query?
2) Are there any techniques I can use, aside from setting the smallest data type for each column, to keep the DB size to a minimum?
3) There are in fact over a hundred attribute columns, and a lot of them seldom change once they are set to either 0 or 1. Is there another way to achieve this and save space?
4) What types of indexes should I have on the master table if many of the attributes are binary?
Any ideas would be gratefully received.

How do I create an Excel Pivot connected to an Access DB that downloads only the queried data?

I have a table of around 60 columns and 400,000 rows and increasing. Our company laptops and MS Excel cannot handle this much data in RAM. So I decided to store the data in MS Access and link it to Excel.
However, the pivot in Excel still downloads all the data into Excel and then performs the filters and operations on it. That worked with less data, but with more data it has now started giving memory errors. Also, even though the data in the pivot might be only 50 cells, the file size is 30+ MB...
So is it possible to create a connection to Access in such a way that it downloads only the data that is queried, does the operations before hand and then sends the revised data to Excel?
I saw this setup in my previous company (where the Excel pivot would only download what it needed). But it was querying an SQL DB as far as I remember. (Sadly couldn't learn more about it since the IT director was intent on being the only guy who knew core operations (He basically had the company's IT operations hostage in exchange for his job security))... But I digress.
I've tried searching for this on the internet for a few days, but it's a very specific problem that I can't find in Google :/
Any help or even pointers would be highly appreciated!
Edit: I'd just like to point out that I'm trying to create an OLAP connection for analysis, so the pivot would be changing fields. My understanding of how pivots work, was that when we select the fields in the pivot, excel would design a query (based on the select fields) and send it to the connection DB to retrieve the data requested. If this is not how it happens, how do I make something like this happen? Hope that elaborates.
I suppose that you created a single massive table in Access to store all your data, so if you just link that table as the data source, Excel won't know which particular bit of data is relevant and will most probably have to go through all of it itself.
Instead, you can try a combination of different approaches:
Create a query that pre-filters the data from Access and link that query to Excel.
Use a SQL Command Type for your Connection Properties instead of a Table.
Test that query in Access to make sure it runs well and is fast enough.
Make sure that all important fields have a sensible index: fields you filter on, fields you group by, and any field that Excel has to go through to decide whether a record should be included in the pivot.
Make sure that you have set a Primary Key in your table(s) in Access. Just use the default auto-increment ID if it's not already used.
If all else fails, break down that huge table: the problem is not so much the number of records as the high number of columns.
If you use calculated fields in your pivot or filter data based on some criteria, consider adding columns to your table(s) in Access that contain pre-calculated data. For instance you could run a query from Access to update these additional fields or add some VBA to do that.
It should work pretty well, though. To give you an idea, I've run some tests with Excel 2013 linked to a 250MB ACCDB containing 33 fields and 392498 rows (a log of stock operations): most operations on the pivot in Excel take only a fraction of a second, maybe a couple of seconds for the most data-intensive ones.
Another thing: Access supports pivot tables and pivot charts, so maybe you don't need Excel if Access is enough. You can use the Access Runtime 2013 (it's free) as a front-end on each machine that needs access to the data. Each front-end can then be linked to the back-end database that holds the data on a network share. The tools are a bit clunkier than in Excel, but they work.
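The pre-filtering and pre-aggregating approach from the list above can be illustrated with a small sketch. It uses Python's standard-library sqlite3 module as a stand-in for the Access back end, so the table, columns, and data are invented; the point is only that the query returns summary rows, not the whole table.

```python
import sqlite3

# sqlite3 stands in for the Access back end; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_log (id INTEGER PRIMARY KEY, product TEXT, qty INTEGER)"
)
# Index the field used for filtering and grouping.
conn.execute("CREATE INDEX idx_stock_log_product ON stock_log (product)")
conn.executemany(
    "INSERT INTO stock_log (product, qty) VALUES (?, ?)",
    [("widget", 5), ("widget", 7), ("gadget", 3), ("gizmo", 9)],
)

# Pre-filter and pre-aggregate in the database, so only summary rows come back.
summary_sql = """
    SELECT product, SUM(qty) AS total_qty
    FROM stock_log
    WHERE product IN ('widget', 'gadget')
    GROUP BY product
    ORDER BY product
"""
summary = conn.execute(summary_sql).fetchall()
print(summary)
```

A query of this shape is the kind of statement you would save in Access or paste into the connection's SQL command text, so the client receives two rows here instead of four.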
Another possible solution, which avoids creating queries in the Access DB, is to use the PowerPivot add-in in Excel and implement the queries and normalization there.

Tableau - blend, join, or modify raw?

I have several large files which have metric information. I am linking these files to a roster database so I know who an employee reported to on a certain date (this allows me to track performance per manager or per site, etc).
The way I am doing this is creating a RosterID field which is EmployeeID+Date (Example: x112x20141022). So if I filter on one sheet on my dashboard, it will filter for all sheets.
Question: Do I use a blend between the files (there are about 8 different data sources right now), a left join, or do I just use python/pandas to literally add the information to the raw data source? I want the twbx file to be as fast as possible for the end users.
(What I mean by adding the information to the raw data is literally adding columns like Manager, Director, Location, Hire Date, Training Class. This would add to the file sizes marginally, but would it speed up Tableau?)
Data blend (creating a relationship between different datasources in Tableau) is the worst performing solution. While it is easy to use, it's highly inefficient, because it will virtually perform the join every single time a calculation is made.
Performing a left join in Tableau when connecting to the data (I'm assuming you're using csv files) is a very good solution, as is making a single table with the information in pandas. For the end user, there shouldn't be any difference in performance (especially if you extract the data to a tde in Tableau). Here you need to weigh whether it's easier for you to maintain a process in python/pandas or in Tableau's extract tool. I believe it's simpler to have Tableau do the join, but it's your call.
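If you go the pre-joining route, the logic amounts to a left join on the RosterID key. Below is a minimal sketch in plain Python; in pandas the same step would be a `DataFrame.merge` with `how="left"`. The employee ID format, column names, and sample rows are assumptions made for the example.

```python
def roster_id(employee_id, date_yyyymmdd):
    """Build the RosterID key as EmployeeID followed by the date."""
    return f"{employee_id}{date_yyyymmdd}"

# Invented sample data; the employee ID format is an assumption.
roster = {
    roster_id("x112x", "20141022"): {"Manager": "A. Smith", "Location": "Site 1"},
}
metrics = [
    {"EmployeeID": "x112x", "Date": "20141022", "Calls": 40},
    {"EmployeeID": "x999x", "Date": "20141022", "Calls": 25},
]

def add_roster_columns(metrics, roster):
    """Left join: keep every metric row, adding roster columns when a match exists."""
    empty = {"Manager": None, "Location": None}
    joined = []
    for row in metrics:
        key = roster_id(row["EmployeeID"], row["Date"])
        joined.append({**row, "RosterID": key, **roster.get(key, empty)})
    return joined

enriched = add_roster_columns(metrics, roster)
for row in enriched:
    print(row)
```

Metric rows without a roster match are kept with empty roster columns, which mirrors what a left join in Tableau would show.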