As an example, let's say my dataset holds:
EMPLOYEE_ID
EMPLOYEE_NAME
EMPLOYEE_ACCT_ID
EMPLOYEE_ACCT_TYPE
EMPLOYEE_ACCT_BALANCE
I would like to present the data in the following way:
EMPLOYEE | CHECKING | SAVINGS  | INVESTMENT | XMAS
---------|----------|----------|------------|--------
Mary     |   100.00 |   700.00 |   3,000.00 |  175.00
Jim      |   850.00 |   600.00 |   1,500.00 |    0.00
TOTAL    |   950.00 | 1,300.00 |   4,500.00 |  175.00
Where I'm stuck is how to break EMPLOYEE_ACCT_TYPE out into columns, with each account type listed with its balance. Thanks in advance.
What you are trying to do is called a pivot. Some systems (e.g. SQL Server) have native support for this in SQL, but only if you know the columns in advance (i.e. you would have to hard-code the account types into the SQL). Other systems (e.g. MySQL) don't support pivoting natively, so you would need to write a stored procedure or some dynamic SQL to do it.
Since you don't mention what DBMS you are using, that's about as specific as I can get.
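If it helps, here is a rough sketch of the conditional-aggregation approach, which works on most DBMSs without vendor-specific PIVOT syntax. The table name EMPLOYEE_ACCOUNTS is an assumption, and note the account types are hard-coded:

SELECT
    EMPLOYEE_NAME AS EMPLOYEE,
    SUM(CASE WHEN EMPLOYEE_ACCT_TYPE = 'CHECKING'   THEN EMPLOYEE_ACCT_BALANCE ELSE 0 END) AS CHECKING,
    SUM(CASE WHEN EMPLOYEE_ACCT_TYPE = 'SAVINGS'    THEN EMPLOYEE_ACCT_BALANCE ELSE 0 END) AS SAVINGS,
    SUM(CASE WHEN EMPLOYEE_ACCT_TYPE = 'INVESTMENT' THEN EMPLOYEE_ACCT_BALANCE ELSE 0 END) AS INVESTMENT,
    SUM(CASE WHEN EMPLOYEE_ACCT_TYPE = 'XMAS'       THEN EMPLOYEE_ACCT_BALANCE ELSE 0 END) AS XMAS
FROM EMPLOYEE_ACCOUNTS   -- assumed table name; substitute your own
GROUP BY EMPLOYEE_NAME;

The TOTAL row can be produced with GROUP BY ROLLUP (where supported) or a separate UNION ALL query.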
It sounds to me like you need to do some serious normalization first. Break the employee account data out into its own table:
table: employee_account_data
EMPLOYEE_ACCT_ID int
EMPLOYEE_ACCT_TYPE varchar(15)
EMPLOYEE_ACCT_BALANCE decimal
You'll also need a bridge table, since many employees can have many accounts (many to many):
table: employee_account_lookup
EMPLOYEE_ID int
EMPLOYEE_ACCT_ID int
This way, you won't be repeating employee_name for each account type (as I suspect you are now). If you really wanted to normalize well, you could also create a table to hold the different employee account types. That way you wouldn't have to worry about someone misspelling "Checking" or "Savings" on data entry.
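A rough DDL sketch of that layout, including the optional account-type table; names and types are only illustrative:

CREATE TABLE employee_account_types (
    EMPLOYEE_ACCT_TYPE varchar(15) PRIMARY KEY      -- 'Checking', 'Savings', ...
);

CREATE TABLE employee_account_data (
    EMPLOYEE_ACCT_ID      int PRIMARY KEY,
    EMPLOYEE_ACCT_TYPE    varchar(15) REFERENCES employee_account_types (EMPLOYEE_ACCT_TYPE),
    EMPLOYEE_ACCT_BALANCE decimal(12,2)
);

-- bridge table for the many-to-many relationship
CREATE TABLE employee_account_lookup (
    EMPLOYEE_ID      int,
    EMPLOYEE_ACCT_ID int REFERENCES employee_account_data (EMPLOYEE_ACCT_ID),
    PRIMARY KEY (EMPLOYEE_ID, EMPLOYEE_ACCT_ID)
);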
Background
I need to compare two tables in two different datacenters to make sure they're the same. The tables can have hundreds of millions, even a billion, rows.
An example of this is having a production data pipeline and a development data pipeline. I need to verify that the tables at the end of each pipeline are the same, however, they're located in different datacenters.
The tables are the same if all the values and datatypes for each row and column match. There are primary keys for each table.
Here's an example input and output:
Input
table1:
Name | Age |
Alice| 25.0|
Bob | 49 |
Jim | 45 |
Cal | 52 |
table2:
Name | Age |
Bob | 49 |
Cal | 42 |
Alice| 25 |
Output:
table1 missing rows (empty):
Name | Age |
| |
table2 missing rows:
Name | Age |
Jim | 45 |
mismatching rows:
Name | Age | table |
Alice| 25.0| table1|
Alice| 25 | table2|
Cal | 52 | table1|
Cal | 42 | table2|
Note: The output doesn't need to be exactly like the above format, but it does need to contain the same information.
Question
Is it faster to import these tables into a new, common SQL environment, then use SQL to produce my desired output?
OR
Is it faster to use something like JDBC, retrieve all rows for each table, sort each table, then compare them line by line to produce my desired output?
Edits:
The above solutions would be executed at a datacenter that's hosting one of the tables. In the first solution, the only purpose for creating a new database would be to compare these tables using SQL; there are no other uses.
You should definitely start with the database option. Especially if the databases are connected with a database link, you can easily set up the transfer of the data.
Such a comparison often comes down to a full outer join of the two sources, and experience tells us that DIY joins are notoriously less performant than the native database implementation (which can, for example, use a parallel option).
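For illustration, a full outer join comparison over the Name/Age example might look roughly like this, assuming both tables are reachable from one database (e.g. via a database link or after importing) and that Name is the primary key:

SELECT COALESCE(t1.Name, t2.Name) AS Name,
       t1.Age AS table1_age,
       t2.Age AS table2_age,
       CASE
           WHEN t1.Name IS NULL THEN 'missing from table1'
           WHEN t2.Name IS NULL THEN 'missing from table2'
           ELSE 'mismatch'
       END AS status
FROM table1 t1
FULL OUTER JOIN table2 t2 ON t1.Name = t2.Name
WHERE t1.Name IS NULL
   OR t2.Name IS NULL
   OR t1.Age <> t2.Age;

Datatype differences (e.g. 25.0 vs 25) may need an explicit cast or a comparison of the declared column types as well.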
Anyway, you could try to implement a more sophisticated algorithm that can do the comparison without transferring the whole table.
An example is based on Merkle trees, where you first scan both sources in place to recognise which parts are identical (and can be ignored), and then transfer and compare only the parts that differ.
So if you expect the tables to be nearly identical and they have keys that allow some hierarchy, such an approach could work out better than a brute-force full compare.
The faster solution is to load both tables into variables (memory) in your programming language and then compare them with your favorite algorithm.
Copying them to a new table first costs more than double the time in read/write operations to disk, especially the writes.
I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients | salad_cost |
| apple | fruity | apple | cheap |
| unlikely | meaty | sausages, chorizo | expensive |
| normal | standard | leaves, cucumber, tomatoes | mid |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
just enter a single, comma-separated string and separate at run-time. Seems hacky, and couldn't search by salad_ingredients!
have another table, for each salad, such as "apple_ingredients", which could have a varying number of rows for each ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
Have a separate salad_ingredients table, where each row is a salad_name, and there is an arbitrary number of ingredients fields, say 10, so you could have up to 10 ingredients. Again, seems slightly hacky, as I don't like unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
Based on my experience, the best solution is a normalized set of tables:
table salads
id
salad_name
salad_type
salad_cost
table ingredients
id
name
and
table salad_ingredients
id
id_salad
id_ingredients
where id_salad is the corresponding id from salads
and id_ingredients is the corresponding id from ingredients.
Using proper joins you can get (SELECT) and filter (WHERE) all the values you need.
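For example, "which salads contain cucumber?" becomes (a sketch against the tables above):

SELECT s.salad_name, s.salad_type, s.salad_cost
FROM salads s
JOIN salad_ingredients si ON si.id_salad = s.id
JOIN ingredients i        ON i.id = si.id_ingredients
WHERE i.name = 'cucumber';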
While trying to build a data warehousing application using Talend, we are faced with the following scenario.
We have two tables that look like this:
Table master
ID | CUST_NAME | CUST_EMAIL
------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM
Events Table
ID | CUST_ID | EVENT_NAME | EVENT_DATE
---------------------------------------
1 | 1 | ACC_APPLIED | 2014-01-01
2 | 1 | ACC_OPENED | 2014-01-02
3 | 1 | ACC_CLOSED | 2014-01-02
There is a one-to-many relationship between master and the events table. Since there is a limited number of event names, I am proposing that we denormalize this structure into something that looks like this:
ID | CUST_NAME | CUST_EMAIL | ACC_APP_DATE_ID | ACC_OPEN_DATE_ID |ACC_CLOSE_DATE_ID
-----------------------------------------------------------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM | 20140101 | 20140102 | 20140103
The DATE_ID columns refer to entries in the time dimension table.
First question: Is this a good idea? What are the other alternatives to this scheme?
Second question: How do I implement this using Talend Open Studio? I figured out a way in which I moved the data for each event name into its own temporary table along with cust_id using the tMap component, and later linked them together using another tMap. Is there another way to do this in Talend?
To do this in Talend you'll need to first sort your data so that it is reliably in the order of applied, opened and closed for each account and then denormalize it to a single row with a single delimited field for the dates using the tDenormalizeRows component.
After this you'll want to use tExtractDelimitedFields to split the single dates field.
Yeah, this is a good idea; it's called an accumulating snapshot fact. http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
Not sure how to do this in Talend (I don't know the tool), but it would be quite easy to implement in SQL using a CASE expression or a PIVOT statement.
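For reference, a CASE-based pivot over the tables from the question might look roughly like this (the events table name is an assumption, and converting each date into its DATE_ID surrogate key would be an extra lookup against the time dimension):

SELECT m.ID,
       m.CUST_NAME,
       m.CUST_EMAIL,
       MAX(CASE WHEN e.EVENT_NAME = 'ACC_APPLIED' THEN e.EVENT_DATE END) AS ACC_APP_DATE,
       MAX(CASE WHEN e.EVENT_NAME = 'ACC_OPENED'  THEN e.EVENT_DATE END) AS ACC_OPEN_DATE,
       MAX(CASE WHEN e.EVENT_NAME = 'ACC_CLOSED'  THEN e.EVENT_DATE END) AS ACC_CLOSE_DATE
FROM master m
JOIN events e ON e.CUST_ID = m.ID    -- events table name assumed
GROUP BY m.ID, m.CUST_NAME, m.CUST_EMAIL;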
Regarding only your first question, it's certainly a good idea -- unless there is any possibility of the same persons applying-opening-closing their account more than once AND you want to keep all this information in their history (so UPDATE wouldn't help).
Snowflaking is definitely not a good option if you are going to design a data warehouse, so denormalizing will certainly be a good choice in this case. The following article fits such scenarios almost perfectly and should clear the air:
http://www.kimballgroup.com/2008/09/design-tip-105-snowflakes-outriggers-and-bridges/
I have a table with data along the (massively simplified) lines of:
User | Value
-----|------
UsrA | 100
UsrA | 102
UsrB | 100
UsrA | 100
UsrB | 101
and, for reasons far too obscure to go into, I need to store the COUNT of each value in a table for future retrieval - ending up with something like
User | Value100Count | Value101Count | Value102Count
-----|---------------|---------------|--------------
UsrA | 2 | 0 | 1
UsrB | 1 | 1 | 0
However, there could be up to 255 different Values - meaning potentially 255 different ValueXCount columns. I know this is a horrible way to do things, but is there an easy way to get the data into a format that can be easily INSERTed into the destination table? Is there a better way to store the COUNT of values per user (unfortunately I do need to store this information; grabbing it from the source table each time isn't an option)?
The whole thing isn't very pretty, but you know that. Rather than your table with 255 columns, I'd consider setting up another table with:
User | Value | CountOfValue
And set a primary key over User and Value.
You could then insert the counts for given user/value combos into the CountOfValue field.
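Populating it is then a single aggregate insert, roughly (source and destination table names are assumptions):

INSERT INTO UserValueCounts ([User], [Value], CountOfValue)
SELECT [User], [Value], COUNT(*)      -- [User] bracketed because USER is a reserved word
FROM SourceTable
GROUP BY [User], [Value];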
As I said, the design is horrible and it feels like you would be better off starting from scratch, normalizing and doing counts live.
Check out indexed views. The table gets maintained automatically and with integrity, and as a bonus it can be used by queries that already do COUNT(*) on that data.
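A minimal sketch of such an indexed view on SQL Server (table and column names are assumptions; indexed views require SCHEMABINDING and a COUNT_BIG(*) column):

CREATE VIEW dbo.UserValueCounts
WITH SCHEMABINDING
AS
SELECT [User], [Value], COUNT_BIG(*) AS CountOfValue
FROM dbo.SourceTable
GROUP BY [User], [Value];
GO

CREATE UNIQUE CLUSTERED INDEX IX_UserValueCounts
    ON dbo.UserValueCounts ([User], [Value]);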
I have a data-intensive problem which requires a lot of massaging and data manipulation, and I'm putting this out there to see if anyone has an idea how to approach it.
In its simplest form: I have a lot of tables which can be joined together to give me a price listing for dentists and how much each charges for a procedure.
So we have multiple tables that look like this:
Dentist | Procedure1 | Procedure2 | Procedure3 | .........| Procedure?
John | 500 | 342 | 434 | .........| 843
Dave | 343 | 434 | 322 | NULLs....|
Mary | 500 | 342 | 434 | .........| 843
Linda | 500 | 342 | Null | .........| 843
Dentists can have different numbers of procedures and different pricing for each procedure. But there are a lot of dentists that have the same number of procedures and the same rates that go with them. Internally, we create a unique ID for each of these so-called fee listings.
For example, John would be 001 and Dave would be 002, but Mary would also be fee 001 (same rates as John) and Linda would be 003.
It's not so bad if I have to deal with this data once, but these fee listings come in flat files (CSVs) which I basically have to DTS up to a SQL Server to work with, and they come on a monthly basis. The pricing could change from month to month for each dentist, which would then put them under a different unique ID internally.
Can someone shed some light on how to best approach this problem so that it's most efficient to process on a monthly basis without having to do tons of data manipulation?
What's the best approach to finding the duplicate fee listings?
How do I keep track of updating a dentist's fee listing in case they change their rates the next month? If Mary decides to charge a different fee for procedure2, then she would have a different unique ID internally. How do I keep track of that on a monthly basis without having to delete everything and re-insert?
There are a few million fee listings that I'm working with. Some have standard rules that are based on zip codes and some are just unique fee listings; what's the approach here?
I could write some kind of ad-hoc .NET program to work with it, but it's a lot of data, and working straight in SQL Server would be easier for me.
Any help would be great, thanks guys.
You probably need to unpivot the data to normalize it - so that you end up with:
Doctor: DoctorID, DoctorDetails...
FeeSchedule: DoctorID, ScheduleID, EffectiveDate, OtherDetailAtThisLevel...
FeeScheduleDetail: ScheduleID, ProcedureCode, Fee, OtherDetailAtThisLevel...
When the data comes in for a doctor, it is pivoted, a new schedule is created and the detail rows are created from the unpivoted data.
SSIS has an unpivot component which is fine - you would load the schedule first and then the detail. If the format varies significantly, you might need a custom data source or just avoid SSIS.
This system would keep track of new schedules for doctors. If the schedule is identical for a doctor, you could simply not insert it.
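One way to check whether an incoming schedule is identical to a doctor's current one is a two-way EXCEPT, sketched here against the tables above. @NewScheduleID and @CurrentScheduleID are assumed variables identifying the freshly loaded schedule and the doctor's latest stored one, with both sets of detail rows already in FeeScheduleDetail (adapt if the new rows sit in a staging table):

IF NOT EXISTS (
        SELECT ProcedureCode, Fee FROM FeeScheduleDetail WHERE ScheduleID = @NewScheduleID
        EXCEPT
        SELECT ProcedureCode, Fee FROM FeeScheduleDetail WHERE ScheduleID = @CurrentScheduleID
   )
   AND NOT EXISTS (
        SELECT ProcedureCode, Fee FROM FeeScheduleDetail WHERE ScheduleID = @CurrentScheduleID
        EXCEPT
        SELECT ProcedureCode, Fee FROM FeeScheduleDetail WHERE ScheduleID = @NewScheduleID
   )
    PRINT 'Identical schedule - no new schedule needed';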
If this logic is extensive, you could load the data to staging tables (SSIS or whatever) and do all this in SQL (T-SQL also has an UNPIVOT operator). That can have advantages in that the code is all in one place and can do all its operations in sets.
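For reference, the UNPIVOT step against a staging table could look roughly like this (StagingFees and the procedure column list are assumptions; since the number of procedure columns varies, the column list would typically be built dynamically):

SELECT Dentist, ProcedureCode, Fee
FROM StagingFees
UNPIVOT (Fee FOR ProcedureCode IN (Procedure1, Procedure2, Procedure3)) AS u;

Conveniently, UNPIVOT drops rows where the fee is NULL, which takes care of the procedures a dentist doesn't offer.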
Regarding the zip codes: if the doctor doesn't have a fee, are these something like a usual and customary (UCR) fee? This could simply be determined from the zip code on the doctor row. In this case you have a few options. You can overlay the doctor fee schedule over a zip code fee schedule:
ZipCodeSchedule: ZipScheduleID, ZipCode, EffectiveDate
ZipCodeScheduleDetail: ZipScheduleID, ProcedureCode, Fee
Or you could save this in the regular FeeSchedule (potentially with some kind of flag indicating that it was defaulted to the UCR).
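A sketch of resolving a fee with the zip code schedule as the fallback; this assumes the Doctor row carries a ZipCode column and ignores effective dates for brevity:

SELECT d.DoctorID,
       zsd.ProcedureCode,
       COALESCE(fsd.Fee, zsd.Fee) AS Fee      -- doctor-specific fee if present, otherwise the zip code fee
FROM Doctor d
JOIN ZipCodeSchedule zs         ON zs.ZipCode = d.ZipCode          -- ZipCode on Doctor is an assumption
JOIN ZipCodeScheduleDetail zsd  ON zsd.ZipScheduleID = zs.ZipScheduleID
LEFT JOIN FeeSchedule fs        ON fs.DoctorID = d.DoctorID
LEFT JOIN FeeScheduleDetail fsd ON fsd.ScheduleID = fs.ScheduleID
                               AND fsd.ProcedureCode = zsd.ProcedureCode;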