I am designing a data pipeline in Mosaic Decisions. I have a database where some of the values in the data are null. I need to use a CustomSQL Node to query the database, but because of these null values the result is unexpected.
How can I avoid such situations and convert the nulls into some other values before adding the CustomSQL Node?
I have previously used another CustomSQL Node just to replace all the null values, but that means I keep having to add complex queries to do it. Is there any other way to achieve the same thing without using a separate CustomSQL Node to handle the null values?
Mosaic Decisions has recently introduced a new Process Node. The Impute Process Node is capable of handling the null values present in the dataset.
You can follow the steps below to get rid of the null values in your data with the help of the Impute Node and avoid writing complex SQL queries in the CustomSQL Node.
After dragging the Impute Node onto the canvas and connecting its input, open the configuration menu.
Drag and drop the column(s) whose null values you want to handle.
From the list of Impute Strategies, select the most suitable option based on the data type of your column(s). You can apply the same strategy or different strategies to the selected column(s), then Add them.
Refer to the user manual under the Help section for detailed information on the Impute Strategies.
You can then direct the output of this node to the CustomSQL Node where you write the main query for your data pipeline.
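For comparison, this is a sketch of the kind of null-handling query a separate CustomSQL Node would otherwise need before the main logic; the table and column names here are hypothetical.
-- Hypothetical pre-processing step the Impute Node replaces:
-- substitute defaults for NULLs before the main query runs.
SELECT order_id,
       COALESCE(customer_name, 'Unknown') AS customer_name,
       COALESCE(quantity, 0)              AS quantity,
       COALESCE(order_date, '1900-01-01') AS order_date
FROM orders;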
I'm displaying a large table on a website and now I want to add server-side filtering.
To build these filters, I need all distinct values for each column. What's the best practice to do that?
I feel like performing a SELECT DISTINCT or GROUP BY statement for each column and on every page load would be too expensive for the database.
Note: Unfortunately I can't change the database so creating tables for foreign keys is not possible.
If indexing the columns isn't possible and you cannot change the database structure (which would be the right thing to do from a database perspective), then I see a few options:
Select Distinct
You just bite the bullet and run this every time, but it isn't ideal if the values rarely change.
The results will be cached on the database so you will get some minimal efficiency.
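A minimal sketch of what this option implies, assuming a hypothetical products table with a filterable category column; the same query would be repeated for every filterable column.
-- One query per filterable column.
SELECT DISTINCT category
FROM products;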
Select Distinct Everything
Run a single query that gets all the possible combinations of distinct values. You would then need to deduplicate each column, but you would only make a single trip to the database.
This obviously depends on cardinality.
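A sketch of the single-trip variant, again with hypothetical column names; the application then deduplicates each column from the combined result.
-- One round trip: distinct combinations of all filterable columns.
SELECT DISTINCT category, brand, colour
FROM products;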
Cache the values in your app
You haven't mentioned what you are using for your web app, but you should be able to use some sort of persistence/caching framework so you load the distinct list of values once, keep it in memory, and update it as necessary.
This approach also depends on where updates are taking place. If the database is changing outside of your webapp then this becomes problematic.
I have an SSIS package that is supposed to insert data from a flat file into a database table. For the sake of this example, let's say I want to insert User records. The records come from other existing databases, so they already include a previously generated primary ID, which we would like to preserve and continue using. The records also include an email field that should be unique in the destination table; this is enforced by the schema. A given batch could include records that have previously been "migrated", and a user might exist in more than one of the original systems with the same email address. In addition to avoiding errors, I would also like to track any possible duplicates (on either the UserID or the email field) by writing them to a file.
Because matches can be made on either of the two fields, do I need to chain two Lookup Transformations? Or is there a way to specify an OR operation instead of AND when using multiple columns? Or is there a better-suited transform that I should be using?
Thank you in advance.
Well, let's split your question.
Can I do a Lookup with OR condition on two fields?
Yes, you can.
Suppose you are doing a Lookup against the User table. In the Lookup transformation's General section, specify Partial cache or No cache as the Cache mode. Then design your query in the Connection section. Important: map your data flow fields to the query columns in the Columns section. So far, the preparation is done.
Go to the Advanced section and tick the Modify the SQL statement flag. Modify the SQL statement below it with something like
select * from (SELECT [ColA], [ColB], ...
FROM [User]) [refTable]
where [refTable].[ColA] = ? OR [refTable].[ColB] = ?
Then hit the Parameters button and specify which data flow columns should be mapped to the first ? (Parameter0), and so on.
As you see, it is possible but not easy.
Should you use two lookups or single complex lookup?
I would go for two lookups, as that allows you finer control and error reporting; with an OR statement you can only report that something among the unique fields matched. Doing specific lookups allows you to be more precise and to design special flow steps if needed.
I am unable to use the Pivot component. Please show me, with an example, how to use the Pivot and Concatenate components in CloverETL.
I always describe Concatenate as boarding queues at the airport. You have multiple queues (edges), and in all of them people are standing (the same kind of records, i.e. the same metadata). Everyone gets boarded by 1) queue (army first, then first class, then economy...) and 2) order within the queue (whoever is standing first in the queue goes first, then second, third...).
I haven't used the Pivot component much, but I think of it as a kind of denormalizing component. You have a bunch of simple records (think key/value pairs from JSON or a NoSQL database) and want to merge them into one wide record, where the fields are the keys of the incoming tuples and the values are the values from the incoming records. It's a grouping component, used when you have multiple records for the same group key (customer_id, for example) and you want to produce one wide record with all the available properties.
I prefer the Denormalizer component as it gives me more control, but it requires a bit more CTL; for simpler things you might be fine with Pivot.
The Concatenate component serves a totally different use case (collecting identically structured records from multiple input sources; similar components: SimpleGather, Merge) than Pivot (transforming simple records into one denormalized, wide record; similar component: Denormalizer).
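For illustration only, here is the same key/value-to-wide-record idea expressed in SQL rather than in a CloverETL graph; the table and key names are hypothetical.
-- Collapse key/value rows per customer_id into one wide record,
-- which is roughly what Pivot / Denormalizer do on the incoming edge.
SELECT customer_id,
       MAX(CASE WHEN property_key = 'email' THEN property_value END) AS email,
       MAX(CASE WHEN property_key = 'phone' THEN property_value END) AS phone,
       MAX(CASE WHEN property_key = 'city'  THEN property_value END) AS city
FROM customer_properties
GROUP BY customer_id;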
I'm considering designing a table with a computed column in Microsoft SQL Server 2008. It would be a simple calculation like (ISNULL(colA,(0)) + ISNULL(colB,(0))) - like a total. Our application uses Entity Framework 4.
I'm not completely familiar with computed columns, so I'm curious what others have to say about when they are appropriate to use as opposed to other mechanisms that achieve the same result, such as views or a computed Entity column.
Are there any reasons why I wouldn't want to use a computed column in a table?
If I do use a computed column, should it be persisted or not? I've read about different performance results using persisted and non-persisted, indexed and non-indexed computed columns here. Given that my computation seems simple, I'm inclined to say it shouldn't be persisted.
In my experience, they're most useful/appropriate when they can be used in other places like an index or a check constraint, which sometimes requires that the column be persisted (physically stored in the table). For further details, see Computed Columns and Creating Indexes on Computed Columns.
If your computed column is not persisted, it will be calculated every time you access it in e.g. a SELECT. If the data it's based on changes frequently, that might be okay.
If the data doesn't change frequently, e.g. if you have a computed column to turn your numeric OrderID INT into a human-readable ORD-0001234 or something like that, then definitely make your computed column persisted - in that case, the value will be computed and physically stored on disk, and any subsequent access to it is like reading any other column on your table - no re-computation over and over again.
We've also come to use (and highly appreciate!) computed columns to extract certain pieces of information from XML columns and surface them on the table as separate (persisted) columns. That makes querying against those items much more efficient than constantly having to poke into the XML with XQuery to retrieve the information. For this use case, I think persisted computed columns are a great way to speed up your queries!
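A minimal sketch of the kind of column the question describes, persisted so the total is stored rather than recomputed on every read; the table and column names are hypothetical.
-- Total is computed from colA and colB, treating NULLs as zero, and stored on disk.
ALTER TABLE dbo.Orders
ADD Total AS (ISNULL(colA, 0) + ISNULL(colB, 0)) PERSISTED;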
Let's say you have a computed column called ProspectRanking that is the result of the evaluation of the values in several columns: ReadingLevel, AnnualIncome, Gender, OwnsBoat, HasPurchasedPremiumGasolineRecently.
Let's also say that many decentralized departments in your large mega-corporation use this data, and they all have their own programmers on staff, but you want the ProspectRanking algorithms to be managed centrally by IT at corporate headquarters, who maintain close communication with the VP of Marketing. Let's also say that the algorithm is frequently tweaked to reflect some changing conditions, like the interest rate or the rate of inflation.
You'd want the computation to be part of the back-end database engine and not in the client consumers of the data, if managing the front-end clients would be like herding cats.
If you can avoid herding cats, do so.
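A sketch of how that centrally managed rule might live in the table itself, using a made-up scoring expression over the columns mentioned above; the table name and weights are assumptions.
-- Hypothetical ranking rule maintained centrally by IT; clients only read ProspectRanking.
ALTER TABLE dbo.Prospects
ADD ProspectRanking AS (
        CASE WHEN AnnualIncome > 100000 THEN 3 ELSE 1 END
      + CASE WHEN OwnsBoat = 1 THEN 2 ELSE 0 END
      + CASE WHEN HasPurchasedPremiumGasolineRecently = 1 THEN 1 ELSE 0 END
    );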
Make Sure You Are Querying Only Columns You Need
I have found computed columns to be very useful, even if not persisted, especially in an MVVM model where you are only getting the columns you need for that specific view. As long as you are not putting less-performant logic in the computed-column code, you should be fine. The bottom line is that those computed (non-persisted) columns are going to have to be calculated anyway if you are using that data.
When it Comes to Performance
For performance, narrow your query to the rows and the computed columns you actually need. If you were putting an index on the computed column (which is only allowed when the expression is deterministic), I would be cautious, because the execution engine might decide to use that index and end up computing those columns anyway, hurting performance. Most of the time you are just getting a name or description from a join table, so I think this is fine.
Don't Brute Force It
The only time it wouldn't make sense to use a lot of computed columns is if you are using a single view-model class that captures all the data in all columns, including the computed ones. In that case, your performance will degrade with the number of computed columns and the number of rows you are selecting from your database.
Computed Columns Work Great with an ORM
An object-relational mapper such as Entity Framework allows you to query only a subset of the columns. This works especially well using LINQ to Entities. By using computed columns, you don't have to clutter your ORM class with mapped views for each of the model types.
var data = from e in db.Employees
           select new NarrowEmployeeView { Id = e.Id, Name = e.Name };
Only the Id and Name are queried.
var data = from e in db.Employees
           select new WiderEmployeeView { Id = e.Id, Name = e.Name, DepartmentName = e.DepartmentName };
Assuming DepartmentName is a computed column, its computation is only executed for the latter query.
Performance Profiler
If you use a performance profiler and filter on the SQL queries, you can see that the computed columns are in fact ignored when they are not in the select statement.
Computed columns can be appropriate if you plan to query by that information.
For instance, if you have a dataset that you are going to present in the UI, having a computed column allows you to page the view while still supporting sorting and filtering on the computed value. If that computed value exists only in code, it will be much more difficult to sort or filter the dataset for display based on it.
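For example, a minimal sketch of sorting, filtering, and grabbing the first page directly against a hypothetical computed Total column, all in the database.
-- Filter and sort on the computed column, then take the first page of 25 rows.
SELECT TOP (25) OrderId, colA, colB, Total
FROM dbo.Orders
WHERE Total > 100
ORDER BY Total DESC;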
A computed column is a business rule, and it's more appropriate to implement it on the client and not in storage. A database is for storing and retrieving data, not for business rule processing. The fact that it can do something doesn't mean you should do it that way. You, too, are free to jump from the Eiffel Tower, but it would be a bad decision :)
I have a fact table with a date column loaded from an MS Access source. The thing is, some of the values are NULL, and SSAS won't let me relate my DATE dimension to it.
Is it better to solve it at the SSIS stage or is there some solution at the SSAS?
Thank you very much for your help.
Best practice is not to have any NULL key (i.e. Dimension key) values in a Fact table.
Instead, create a MAX date in the Date dimension table (or an 'UnknownValue', -1 for instance) and key to that.
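A minimal sketch of that mapping during the fact load, assuming an "Unknown" row with key -1 has been added to the date dimension; all names here are hypothetical.
-- Substitute the 'Unknown' date key (-1) for NULL source dates during the fact load.
SELECT SalesAmount,
       ISNULL(OrderDateKey, -1) AS OrderDateKey
FROM StagingFactSales;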
Sometimes it is undesirable for non-technical reasons to fix the nulls in the DSV or source system. In that case you can use the unknown member and null processing properties to work around this issue:
http://technet.microsoft.com/en-us/library/ms170707.aspx
I have done this when trying to highlight data quality problems or for fast prototyping purposes.
Each member of a hierarchy has a property "HideMemberIf". Setting this to "NoName" should hide the null values from the Dimension Browser and allow the cube to be processed.
You could also create Named Calculations in the Data Source View. A Named Calculation can use the ISNULL function, which fills in values in place of nulls. Then, of course, build your Time dimension off these Named Calculations instead of the raw data fields.
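The expression behind such a Named Calculation could be as simple as the following, where the column name and the fallback date are assumptions.
-- Named Calculation expression in the DSV: substitute a sentinel date for NULLs.
ISNULL(OrderDate, '1900-01-01')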
Again, it's better not to have any nulls in your data altogether, but you can usually fix this inside the Cube.