I have been practicing data manipulation in SQL Server Express 2017, and I have been provided with a data source I can't seem to make sense of. I was hoping there might be someone more familiar here that might be able to point me in the right direction. I need to work on some SQL queries on the dataset, but I haven't the faintest idea on where to start.
The data looks like this for instance:
Company Code - Field - Value (3 fields)
1001 - Vendor Name - 7 Eleven
1001 - Vendor Name - Bob Jane
1001 - Vendor Name - Krispy Kreme
1001 - Vendor Address - 102 Reservoir Street
1001 - Vendor Address - 110 Pitt Road
1001 - Vendor Address - 23 Foxy Place
Usually, I would expect to see it in a somewhat relational type of table like
Company Code Vendor Name Vendor Address
1001 7 Eleven 102 Reservoir Street.
1001 Bob Jane 110 Pitt Road.
What you have appears to be a design called EAV - entity attribute value. That term you can google. Unfortunately either you left out important information or your design is fundamentally broken. Given what you posted, there is no way to know that "Bob Jane" goes with "110 Pitt Road". Rows in a table have no reliable or specific order. You need a column in your table to define "order" if you want to associate rows based on "order".
Related
I do have a large sql table (from the automotive industry) containing information similar to the following format:
NAME ID PARENTID CHILDID IN_LAW_ID
---------- ---------- ---------- ---------- ----------
Bill 1 - - 10
Faye 2 - - -
Joe 3 2 1 -
Billy 4 2 1 -
Bob 5 2 1 9
Catherine 6 7 - -
Calvin 7 6 4 -
Achmed 8 - - -
Rachel 9 - - 5
(well, the names are in fact spare parts, parents would be predecessor parts, children successor parts and inlaws would be optional parts...)
My goal would be to add an additional column with a unique ID showing which family a NAME belongs to.
For example, Achmed has no parent, no child and no inlaws, so he would be tagged as having no family,
Bill on the other hand would belong to the same family as Joe and anyone else who has an ID belonging to the same tree (no matter whether it is a PARENTID, CHILDID or IN_LAW_ID relationship).
To complicate things a bit, relationships in the tree can be circular.
I.e. Catherine can have Calvin as PARENTID and Calvin can have Catherine as PARENTID.
Ah, and trees can get quite large, with up to 3000 members.
My current approach is to use tools dedicated for Network Mining and extract and name every distinct subnetwork (thus the full subnetwork with no connection to other networks). Yet these tools run on my laptop and take a full week to generate the final list of FAMILYIDs.
I imagine that a simple (or also a rather complex) SQL query will be much more performant - but I have no idea on how to attack the problem in SQL. (I intend to run it on our Microsoft SQL server btw.)
Any help will be much appreciated!
Our organization is currently in the process of building a new data warehouse. We are actually able to use some techniques borrowed from the DW community such as ETL processing to conform data, de-normalized dimensions in the "kimbal" style, etc. etc. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go along.
The problem: We have multiple sources of data, with often conflicting sources of facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person, so even if the inbound record doesn't exactly match, we can score based on other things like zip code radius.
Here's the question: What is the standard way to handle multiple versions of a fact from two or more sources?
I understand one of the main ideas of the data warehouse is to keep a running history of any fact, which we are doing. That's all fine and dandy when a record is being maintained by one inbound source, we keep the history of that fact over time. The problem occurs when two different sources perhaps updating on a daily basis have two different facts, e.g. source A says the name is Mary Smith, source B says the name is Mary Jane changing this value every day! Based on the matching algorithm we're confident it's the same person, but due to our history style table, it basically keeps flopping back and forth to both names every day because it is reading the name as a "change" from each data source.
An example table:
first_name last_name source last_updated
Mary Smith A 5/2/12 1:00am
Mary Jane B 5/2/12 2:00am
Mary Smith A 5/3/12 1:00am
Mary Jane B 5/3/12 2:00am
Mary Smith A 5/4/12 1:00am
Mary Jane B 5/4/12 2:00am
...
Have one table that stores your external data:
id | first_name | last_name | source | external_unique_id | import_date
----+------------+-----------+--------+--------------------+-------------
1 | Mary | Smith | A | abcdefg123 | 5/2/12 1:00am
2 | Mary | Jane | B | 1234567abc | 5/2/12 2:00am
Then have a second table that contains your cleaned data:
id | first_name | last_name
----+------------+-----------
1 | Mary | Jane-Smith (or whatever)
Then have a mapping table between the two.
local_person_id | foreign_person_id
-----------------+-------------------
1 | 1
1 | 2
Or something broadly similar.
The objective is to load the facts from your source once, and keep them.
Then use your fuzzy logic to relate them to master records somewhere. Which you only need to do when new facts are loaded or old facts are changed.
Still, you have the choice on what last_name to use. But that can be almost arbitrary in the absence of determining data. For example : Whichever pick the last name from the fact loaded most recently.
You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.
One thing about terminology - What you've listed are "Attributes", not "Facts". A fact is a measure that you take on a set of dimensional Attributes. (for example, an order that this "person" places, or the dollar value of this customer's recent order, etc). In this case, you have multiple sources of dimensional attributes, each one considered the "same".
#Dems method is one way (and a good one) to keep your cleaned data separate from your staging / operational data set.
Another, if you need to have access to both data sets in reporting, while still keeping a "clean" version, would be to put all the attributes on your person/customer dimension:
FIRST_NAME
LAST_NAME
SOURCE1_FIRST_NAME
SOURCE1_LAST_NAME
SOURCE2_FIRST_NAME
SOURCE2_LAST_NAME
For reports on measures where the user community is expecting to see the name from Source 2, you can use the source2 attribute. For people expecting source 1, use that. For people looking for the results of the processing which "conforms" the name, use the main attribute.
This query is supposed to run with ms access 2003 using SQL. the function JOIN is NOT supported explicitly. implicitly in the WHERE clause is fine...implicity anywhere is fine as long as the word JOIN INNER JOIN Etc is not used.
DayNumnber PastTime
.
.
.
333 Homework
333 TV
334 Date
620 Chores
620 Date
620 Homework
725 Chores
725 Date
888 Internet
888 TV
.
.
.
Hey I would like a query that can Show the most important past time done for each day (TV and internet do not count!) .So importance would be Homework > Chores > Date.So:
DayNumber PastTime
333 Homework
334 Date
620 Homework
725 Chores
Something that might change this problem. Altho all the different past times are listen in a table together. but that was because i appended the table. originally the homework entries. chore entries and date entriess . internet entriess. tv entries. came from different tables.
eg homework 333
homework 620
Is it easier to do it without appending these tables first? I would hopefully like it to be done with the appended table but ya
I was thinking of a mixture of insert. delete... but the hardest part is checking that there is something there for a date a few things and how to put the more important thing done that day . Thank you
Create another table with:
Pri | PastTime
--------------
1 | Homework
2 | Chores
3 | Date
This is a priority list for the items.
Next do:
SELECT MIN(Pri), DayNumber
FROM PastTime_table, Priority_table
WHERE PastTime_table.PastTime = Priority_table.PastTime
GROUP BY DayNumber
This will give you the most important past time for each day. And because TV and Internet are not listed they will not show up.
But it will give you a number, and not the name.
If you had a better SQL you could then join this back to the Priority_table and lookup the name. But I guess you will have to do that part manually.
If you are willing to change the name and call them:
A_Homework
B_Chores
C_Date
instead then you could do (without any extra table):
SELECT MIN(PastTime), DayNumber
FROM PastTime_table
GROUP BY DayNumber
Since it sorts the name alphabetically it will always give you the best one.
You can add a WHERE to remove TV and Internet.
using Crystal reports 10 linked to an excel document. Would like to pull the dinner field but also pull country and Company name from row that dont have it, this are linked via Bookingref. Example below. I've tried sub-reports and supressing unwanted fields but can't get it right. Also I can't make changes in excel doc as it's 1000+ records, which is exported from an online system weekly.
Id BookingRef Country CompanyName Surname Forname Dinner
1 001 UK Company1 John Andrews
2 001 Mary Jane 1
3 001 Tom Andrews 1
4 002 Germany Company2 Lee Jones
5 003 Germany Company3 Peter Lee 1
6 003 Sofie Lee 1
OK I am not sure I understand the full extent of your problem but let's start with the Country and Company name and see if I can get you moving forward. Instead of putting the Country field directly on the report you could use a formula field and do something like this:
IF {#BookingRef} = "001" Then
"UK"
Else IF {#BookingRef} = "002" Then
"Germany"
Else
"Unnamed"
Now you just put the formula field where the country field used to be and it will put the right country in bases on the BookingRef code. This, however, is only practical if you are working with a small number of Country / Company Names or possibly a big list that never changes although I would caution against the latter.
The other thing you could do is create a table in any database that holds the BookingRef, Company and Country values, link the BookingRef fields from both "databases" and then just drop the fields on your report.
If I am missing the point of your question please be real specific about what it is you are trying to accomplish and what is and is not working in your current solution.
I have a Data intensive problem which requires a lot of massaging and data manipulation and I'm putting this out there to see if anyone has an idea as to how to approach it.
In simplest form. I have a lot of tables which can be joined together to give me a price listing for dentists and how much each charges for a procedure.
so we have multiple tables that looks like this.
Dentist | Procedure1 | Procedure2 | Procedure3 | .........| Procedure?
John | 500 | 342 | 434 | .........| 843
Dave | 343 | 434 | 322 | NULLs....|
Mary | 500 | 342 | 434 | .........| 843
Linda | 500 | 342 | Null | .........| 843
Dentists can have different number of procedures and different pricing for each procedures. But there are a lot of Dentists that have the same number of procedures and the same rates that goes with it. Internally, we create a unique ID for each of these so-called fee listings.
like John would be 001, Dave would be 002, but Mary would be fee 001 and Linda would be 003
It's not so bad if I have to deal with this data once but these fee listings comes in flat files (csvs) which i basically have to DTS up to a SQL server to work with. and they come on a monthly bases. The pricing could change from month to month for each dentist which then would put them in a different unique ID internally.
Can someone shed some light on as to how to best approach this problem so that it's most efficient to process on a monthly basis without having to do tons of data manipulation?
what's the best approach to finding out the duplicates of the fee listings?
How do i keep track of updating a Dentist's fee listing incase they change their rates the next month? if Mary decides to charge a different fee for procedure2, then she would have a different unique ID internally. how do i keep track of that on a monthly bases without having to delete everything and re-insert?
There are a few million fee listings that I'm working with and some have standard rules that are based on zipcodes and some are just unique fee listings, what's the approach here?
I can write some kind of ad-hoc .net program to work with it but it's a lot of data and working straight in SQL server would be easier for me.
any help would be great, thanks guys.
You probably need to unpivot the data to normalize it - so that you end up with:
Doctor: DoctorID, DoctorDetails...
FeeSchedule: DoctorID, ScheduleID, EffectiveDate, OtherDetailAtThisLevel...
FeeScheduleDetail: ScheduleID, ProcedureCode, Fee, OtherDetailAtThisLevel...
When the data comes in for a doctor, it is pivoted, a new schedule is created and the detail rows are created from the unpivoted data.
SSIS has an unpivot component which is fine - you would load the schedule first and then the detail. If the format varies significantly, you might need a custom data source or just avoid SSIS.
This system would keep track of new schedules for doctors. If the schedule is identical for a doctor, you could simply not insert it.
If this logic is extensive, you could load the data to staging tables (SSIS or whatever) and do all this in SQL (T-SQL also has an UNPIVOT operator). That can have advantages in that the code is all in one place and can do all its operations in sets.
Regarding the zip codes, if the doctor doesn't have a fee, are these like usual and customary fee? This could simply be determined from the zip code of the doctor row. In this case you have a few options. You can overlay the doctor fee schedule over a zip code fee schedule:
ZipCodeSchedule: ZipScheduleID, ZipCode, EffectiveDate
ZipCodeScheduleDetail: ZipScheduleID, ProcedureCode, Fee
Or you could save this in the regular feeschedule (potentially with some kind of flag that it was defaulted to the UCR).