Select columns mid Matillion transformation job

I'm a new Matillion user so suspect I'm overlooking an obvious answer...
I am currently reading a number of identically formatted tables (yearly sales data) and vertically stacking them using a Unite component. As I am exploring the data rather than building a pipeline with a specific function, I would like to keep as much flexibility as possible. Hence, I would like to select columns after the Unite component has run. I appreciate that I can easily make such column selections when reading the data in at the Table Input components.
Am I missing the obvious solution?

A humble Rename Component can help with that. Add one immediately after the Unite, and in the column mapping choose Add All. Then you can choose which columns to drop, mid-transformation, using the minus button.
The Rename Component's primary purpose is to add alias names, but it also doubles up as a good way to slim down the column selection. As a bonus, doing so can improve performance.
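For context, Matillion transformation components are pushed down to the target warehouse as SQL, so trimming the column mapping after the Unite amounts to selecting only the kept columns from the stacked result. A rough sketch, with hypothetical table and column names:

SELECT order_id, product, amount      -- only the columns kept in the mapping
FROM (
    SELECT * FROM sales_2019
    UNION ALL
    SELECT * FROM sales_2020
) AS united;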

Related

Querying SQL using a Code column vs extended table

I am setting up a fairly large dataset (catalogue) on a SQL database (I'd guesstimate ~100k records) to store information regarding products. Each product is characterized by about 20-30 properties, so that would basically mean 20-30 columns. The system is set up so that each of these properties is actually linked to a code, and each product is therefore characterized by a unique string made by concatenating all these properties (the string has to be unique; if two product codes are the same then the two products are actually the same product). What I am trying to figure out is whether, SQL-wise, there is any difference between storing the catalogue as a table of 20-30 columns, or whether I am better off just having one column with the code and decoding the properties from the code. The difference being that in one case I would do
SELECT * FROM Catalogue WHERE Color='RED'
versus
SELECT * FROM Catalogue WHERE Code LIKE '____R____________'
It might also make it easier to check whether a product already exists, as I would only be comparing a single column instead of 20-30 columns. I could also just add an extra column to the complete table to store the code and use one method when doing one operation and the other when doing another.
I have almost no knowledge of how the SQL engine works so I might be completely off with my reasoning here.
The code approach seems silly. Why do I phrase it this way?
You have a few dozen columns with attributes and you know what they are. Why would you NOT include that information in the data model?
I am also amused by how you are going to distinguish these comparisons:
WHERE Code LIKE '____R____________'
WHERE Code LIKE '___R_____________'
WHERE Code LIKE '_____R___________'
WHERE Code LIKE '____R___________'
That just seems like a recipe for spending half the rest of your future life on debugging -- if not your code then someone else's.
And, with separate columns, you can create indexes for commonly used combinations.
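For example (a sketch with a hypothetical Size column alongside Color), the column-per-attribute layout lets the optimiser seek on an index, whereas the positional LIKE pattern's leading wildcards force the code of every row to be examined:

CREATE INDEX ix_catalogue_color_size ON Catalogue (Color, Size);

-- Can seek on the index:
SELECT * FROM Catalogue WHERE Color = 'RED' AND Size = 'L';

-- Cannot seek on an ordinary index; the pattern is matched row by row:
SELECT * FROM Catalogue WHERE Code LIKE '____R____________';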
If not all rows have all attributes -- or if the attributes can be expanded in the future -- you might want a structure with a separate row for each attribute:
entityId   code    value
1          Color   Red
This is called an entity-attribute-value (EAV) model and is appropriate under some circumstances.
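A minimal sketch of that EAV shape (hypothetical table name):

CREATE TABLE ProductAttribute (
    entityId  int NOT NULL,           -- the product
    code      varchar(50) NOT NULL,   -- the attribute, e.g. 'Color'
    value     varchar(100) NOT NULL,  -- the value, e.g. 'Red'
    PRIMARY KEY (entityId, code)
);

-- Find all red products:
SELECT entityId
FROM ProductAttribute
WHERE code = 'Color' AND value = 'Red';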

How can I divide a single table into multiple tables in Access?

Here is the problem: I am currently working with a Microsoft Access database that the previous employee created by just adding all the data into one table (yes, all the data into one table). There are about 186 columns in that one table.
I am now responsible for dividing each category of data into its own table. Everything is going fine, although progress is too slow. Is there perhaps an SQL command that will somehow divide each category of data into its proper table? As of now I am manually looking at the main table and carefully transferring groups of data into each respective table along with their proper IDs, making sure data is not corrupted. Here is the layout I have so far:
Note: I am probably one of the very few at my campus with database experience.
I would approach this as a classic normalisation process. Your single, hugely wide table should contain all of the entities within your domain, so as long as you understand the domain you should be able to normalise the structure until you're happy with it.
To create your foreign key lookups, run distinct queries against the columns you're going to remove and then add the key values back in.
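For example, in Access SQL (a sketch with hypothetical names, assuming a Department column is being split out of a wide table called tblEverything), each statement run as its own query:

SELECT DISTINCT Department INTO tblDepartment FROM tblEverything;

ALTER TABLE tblDepartment ADD COLUMN DepartmentID COUNTER;
ALTER TABLE tblEverything ADD COLUMN DepartmentID LONG;

UPDATE tblEverything INNER JOIN tblDepartment
    ON tblEverything.Department = tblDepartment.Department
SET tblEverything.DepartmentID = tblDepartment.DepartmentID;

Once the DepartmentID values are populated, you can drop the old text column from the wide table and define the relationship.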
It sounds like you know what you're doing already? Are you just looking for reassurance that you're on the right track? (Which it looks like you are.)
Good luck though, and enjoy it - it sounds like a good little piece of work.

Combining similar values in a single column

I have a column that is being used to list competitors' names in a table I'm putting together. Right now I don't have a lot of control over how these inputs are made, and it causes some serious headaches. There are random spaces and misspellings throughout our data, and yet we need to list the data by competitor.
As an example (not actual SQL I'm using), list of competitors:
Price Cutter
PriceCutter
PriceCuter
Price Cuter
If I ran the query:
SELECT Competitor_Name, SUM(Their_Sales)
FROM Cmdata.Competitors
WHERE Their_Sales BETWEEN 10000 AND 100000000
GROUP BY Competitor_Name
I would get a different entry for each version of Price Cutter, something I clearly want to avoid.
I would think this problem would come up a lot, but I did a Google search and came up dry. I will admit the question is kind of hard to articulate in a few words; maybe that's why I didn't come up with anything. Either that, or this is so basic I should already know...
(PS: Yes, we're moving to a drop-down menu, but it's going to take some time. In the meantime, is there a solution?)
You need to add a Competitor table, that has a standard name for each competitor.
Then, use foreign key references in other tables.
The problem that you are facing is a data cleansing and data modeling issue. It is not particularly hard to solve, but it does require a fair amount of work. You can get started by getting a list of all the current spellings and standardizing them -- probably in an Excel spreadsheet.
If you do that, you can then create a lookup table and change the values by looking them up.
However, in the medium term, you should be creating a Competitor table and modelling the data in the way that your application needs.
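A minimal sketch of that lookup-and-update step (hypothetical mapping table populated from the spreadsheet of spellings, using SQL Server-style UPDATE ... FROM syntax):

CREATE TABLE CompetitorNameMap (
    RawName       varchar(100) PRIMARY KEY,  -- spelling as currently entered
    StandardName  varchar(100) NOT NULL      -- the agreed standard name
);

INSERT INTO CompetitorNameMap (RawName, StandardName)
VALUES ('PriceCutter', 'Price Cutter'),
       ('PriceCuter',  'Price Cutter'),
       ('Price Cuter', 'Price Cutter');

UPDATE c
SET c.Competitor_Name = m.StandardName
FROM Cmdata.Competitors AS c
JOIN CompetitorNameMap AS m ON m.RawName = c.Competitor_Name;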
This is a very hard problem in general. If your database supports it, you could try grouping by SOUNDEX(Competitor_Name) instead of just Competitor_Name.
Really, the Competitor_Name column should be a foreign key into a Competitors table anyway, instead of a bare text field.
Whatever you do to fix it, you should also UPDATE the table so that you don't have to do this sort of hoop-jumping in the future.
(I'm a bit hazy on the syntax, but this is close)
ALTER TABLE Competitors ADD COLUMN cleanedName varchar(100);
UPDATE Competitors SET cleanedName = REPLACE(UPPER(Competitor_Name), ' ', '');
Then GROUP BY cleanedName instead of Competitor_Name.
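For example, the reporting query from the question would then group on the cleaned value:

SELECT cleanedName, SUM(Their_Sales)
FROM Cmdata.Competitors
WHERE Their_Sales BETWEEN 10000 AND 100000000
GROUP BY cleanedName;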

When to use a recursive table

I need to build a schema structure to support a table of contents (so the level of sections/sub-sections could change for each book or document I add). One of my first thoughts was that I could use a recursive table to handle that. I want to make sure that my structure is normalized, so I was trying to stay away from denormalising the table of contents data into a single table (and then having to add columns when there are more sub-sections).
It doesn't seem right to build a recursive table, though, and it could be kind of ugly to populate.
Just wanted to get some thoughts on some alternate solutions or if a recursive table is ok.
Thanks,
S
It helps that SQL Server 2008 has both the recursive WITH clause and hierarchyid to make working with hierarchical data easier - I was pointing out to someone yesterday that MySQL doesn't have either, making things difficult...
The most important thing is to review your data - if you can normalize it to be within a single table, great. But don't shoehorn it in to fit a single table setup - if it needs more tables, then design it that way. The data & usage will show you the correct way to model things.
When in doubt, keep it simple. Where you have a collection of similar items, e.g. employees, a table that references itself makes sense. Here you can argue (quite rightly) that each item within the table is a 'section' of some form or another, but unless you're comfortable with modelling the data as sections and handling the different types of sections through relationships to these entities, I would avoid the complexity of a self-referencing table and stick with a normalized approach.
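For reference, a sketch of what the self-referencing option could look like (hypothetical table and column names), using the SQL Server 2008 recursive WITH clause mentioned above:

CREATE TABLE Section (
    SectionId        int IDENTITY(1,1) PRIMARY KEY,
    DocumentId       int NOT NULL,
    ParentSectionId  int NULL REFERENCES Section (SectionId),  -- NULL for top-level sections
    Title            varchar(200) NOT NULL
);

-- Read back the table of contents for one document, with nesting depth
WITH Toc AS (
    SELECT SectionId, ParentSectionId, Title, 1 AS Depth
    FROM Section
    WHERE DocumentId = 1 AND ParentSectionId IS NULL
    UNION ALL
    SELECT s.SectionId, s.ParentSectionId, s.Title, t.Depth + 1
    FROM Section AS s
    JOIN Toc AS t ON s.ParentSectionId = t.SectionId
)
SELECT SectionId, Title, Depth
FROM Toc;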

SQL Server 2005 Computed Column Result From Aggregate Of Another Table Field's Value

Sorry for the long question title.
I guess I'm on to a loser on this one but on the off chance.
Is it possible to make the calculation of a calculated field in a table the result of an aggregate function applied to a field in another table?
i.e.
You have a table called 'mug'; this has a child called 'color' (which makes my UK head hurt, but the vendor is from the US, so what are you going to do?) and this, in turn, has a child called 'size'. Each table has a field called sold.
The size.sold increments by 1 for every mug of a particular colour and size sold.
You want color.sold to be an aggregate of SUM size.sold WHERE size.colorid = color.colorid
You want mug.sold to be an aggregate of SUM color.sold WHERE color.mugid = mug.mugid
Is there any way to make mug.sold and color.sold just work themselves out, or am I going to have to go mucking about with triggers?
You can't have a computed column directly reference a different table, but you can have it reference a user-defined function. Here's a link to an example of implementing a solution like this:
http://www.sqlservercentral.com/articles/User-Defined+functions/complexcomputedcolumns/2397/
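A minimal sketch of that idea using the tables from the question (hypothetical function and column names; note that such a column can't be persisted or indexed because the function reads another table):

CREATE FUNCTION dbo.ColorSold (@colorid int)
RETURNS int
AS
BEGIN
    -- total units sold across all sizes of this colour
    RETURN (SELECT SUM(sold) FROM size WHERE colorid = @colorid);
END;
GO

-- Computed column on the parent table that calls the function
ALTER TABLE color ADD sold_total AS dbo.ColorSold(colorid);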
No, it is not possible to do this. A computed column can only be derived from the values of other fields on the same row. To calculate an aggregate from another table, you need to create a view.
If your application needs to show the statistics, ask the following questions:
Is it really necessary to show this in real time? If so, why? If it really is necessary, then you would have to use triggers to maintain a denormalised table (see the Wikipedia article on denormalisation for background). Triggers will affect write performance on table updates and rely on the triggers being active.
If it is only necessary for reporting purposes, you could do the calculation in a view or a report.
If it is necessary to support frequent ad-hoc reports, you may be into the realms of a data mart and an overnight ETL process.
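As a sketch of the view approach mentioned above (hypothetical view name, using the mug/color/size tables from the question):

CREATE VIEW mug_sales AS
SELECT m.mugid, SUM(s.sold) AS total_sold
FROM mug AS m
JOIN color AS c ON c.mugid = m.mugid
JOIN size AS s ON s.colorid = c.colorid
GROUP BY m.mugid;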