Matching disambiguating data to existing duplicate records


I have a table called transactions that has the ledger from a storefront. Let's say it looks like this, for simplicity:
trans_id | cust | date | num_items | cost
---------+------+------+-----------+------
1 | Joe | 4/18 | 6 | 14.83
2 | Sue | 4/19 | 3 | 8.30
3 | Ann | 4/19 | 1 | 2.28
4 | Joe | 4/19 | 4 | 17.32
5 | Sue | 4/19 | 3 | 8.30
6 | Lee | 4/19 | 2 | 9.55
7 | Ann | 4/20 | 1 | 2.28
For the credit card purchases, I subsequently get an electronic ledger that has the full timestamp. So I have a table called cctrans with date, time, cust, cost, and some other info. I want to add a column trans_id to the cctrans table, that references the transactions table.
The update statement for this is simple enough, except for one hitch: I have an 11 AM transaction from Sue on 4/19 for $8.30 and a 3 PM transaction from Sue on 4/19 for $8.30 that are the same in the transactions table except for the trans_id field. I don't really care which record of the cctrans table gets linked to trans_id 2 and which one gets linked to trans_id 5, but they can't both be assigned the same trans_id.
The question here is: How do I accomplish that (ideally in a way that also works when a customer makes the same purchase three or four times in a day)?
The best I have so far is to do:
UPDATE cctrans AS cc
SET trans_id = t.trans_id
FROM transactions AS t
WHERE cc.cust = t.cust AND cc.date = t.date AND cc.cost = t.cost;
And then fix them one-by-one via manual inspection. But obviously that's not my preferred solution.
Thanks for any help you can provide.
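One possible way to keep the duplicates apart, sketched here under assumptions rather than taken from the original thread: number the rows inside each (cust, date, cost) group on both sides with ROW_NUMBER(), then join on the group columns plus that number, so the first Sue purchase pairs with the lowest trans_id, the second with the next one, and so on. The example uses Python's sqlite3 module (window functions need SQLite 3.25 or later); the cc_id key column and the sample cctrans rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (trans_id INTEGER PRIMARY KEY, cust TEXT, date TEXT,
                           num_items INTEGER, cost REAL);
INSERT INTO transactions VALUES
  (1,'Joe','4/18',6,14.83), (2,'Sue','4/19',3,8.30), (3,'Ann','4/19',1,2.28),
  (4,'Joe','4/19',4,17.32), (5,'Sue','4/19',3,8.30), (6,'Lee','4/19',2,9.55),
  (7,'Ann','4/20',1,2.28);
-- cc_id is an assumed surrogate key; the rows below are made-up sample data
CREATE TABLE cctrans (cc_id INTEGER PRIMARY KEY, date TEXT, time TEXT,
                      cust TEXT, cost REAL, trans_id INTEGER);
INSERT INTO cctrans (date, time, cust, cost) VALUES
  ('4/19','11:00','Sue',8.30), ('4/19','15:00','Sue',8.30),
  ('4/19','09:30','Joe',17.32);
""")

# Rank duplicates within each (cust, date, cost) group on both sides,
# then pair the n-th ledger row with the n-th credit card row.
pairs = conn.execute("""
WITH t AS (
  SELECT trans_id, cust, date, cost,
         ROW_NUMBER() OVER (PARTITION BY cust, date, cost ORDER BY trans_id) AS rn
  FROM transactions
), c AS (
  SELECT cc_id, cust, date, cost,
         ROW_NUMBER() OVER (PARTITION BY cust, date, cost ORDER BY time) AS rn
  FROM cctrans
)
SELECT t.trans_id, c.cc_id
FROM c
JOIN t ON t.cust = c.cust AND t.date = c.date
      AND t.cost = c.cost AND t.rn = c.rn
""").fetchall()

# Apply the pairing; each cctrans row receives a distinct trans_id.
conn.executemany("UPDATE cctrans SET trans_id = ? WHERE cc_id = ?", pairs)

linked = conn.execute(
    "SELECT time, cust, trans_id FROM cctrans ORDER BY cc_id"
).fetchall()
print(linked)
```

Because the numbering is per group, the same pairing applies when a customer repeats a purchase three or four times in a day. Comparing REAL costs for equality only works here because both sides store the same literal; integer cents would be more robust.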

Related

SQL - specific requirement to compare tables

I'm trying to merge two queries into one (which would cut the number of daily queries in half). I have two tables: I want to run a query against one table, then the same query against the other table, which holds the same list with fewer entries.
Basically it's a list of (let's call it, for obfuscation) people and hobbies. One table is ALL people and hobbies; the other, shorter list is the people and hobbies that I've met. Everything in table 2 is also found in table 1. Table 1 includes entries (people I have yet to meet) not found in table 2.
The tables are synced up from elsewhere. What I'm looking to do is print a list of ALL people in the first column, then print the hobby ONLY for people that are on both lists. That way I can see the lists merged and track the rate at which the gap between the two lists is closing. I have tried a number of SQL combinations, but they either filter the first table down to only the items present in both (i.e. just giving me table 2) or simply append table 2 to table 1.
Example of what I'm trying to do below:
+---------+----------+--+----------+---------+--+---------+----------+
| table1 | | | table2 | | | query | |
+---------+----------+--+----------+---------+--+---------+----------+
| name | hobby | | activity | person | | name | hobby |
| bob | fishing | | fishing | bob | | bob | fishing |
| bill | vidgames | | hiking | sarah | | bill | |
| sarah | hiking | | planking | sabrina | | sarah | hiking |
| mike | cooking | | | | | mike | |
| sabrina | planking | | | | | sabrina | planking |
+---------+----------+--+----------+---------+--+---------+----------+
Normally I'd just take a few days to learn SQL a bit better; however, I'm stretched pretty thin at work as it is!
I should mention that table 2 has its columns flipped and the headings are all unique (I don't think this matters)!

I think you just want a left join:
select t1.name, t2.activity as hobby
from table1 t1 left join
table2 t2
on t1.name = t2.person;
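As a quick check of the left join above, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. The ORDER BY on rowid is an addition made only to keep the output in insertion order, and the gap count at the end is an illustrative extra.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (name TEXT, hobby TEXT);
CREATE TABLE table2 (activity TEXT, person TEXT);
INSERT INTO table1 VALUES ('bob','fishing'), ('bill','vidgames'),
  ('sarah','hiking'), ('mike','cooking'), ('sabrina','planking');
INSERT INTO table2 VALUES ('fishing','bob'), ('hiking','sarah'),
  ('planking','sabrina');
""")

# Every row of table1 is kept; hobby is NULL (None in Python)
# when the person has no match in table2.
rows = conn.execute("""
SELECT t1.name, t2.activity AS hobby
FROM table1 t1
LEFT JOIN table2 t2 ON t1.name = t2.person
ORDER BY t1.rowid
""").fetchall()

# People in table1 with no matching row in table2, i.e. the remaining gap.
not_yet_met = sum(1 for _, hobby in rows if hobby is None)

for name, hobby in rows:
    print(name, hobby or "")
print("gap:", not_yet_met)
```

The NULL hobbies mark exactly the people who are in table 1 but not yet in table 2, so counting them gives the size of the gap between the two lists.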

Returning singular row/value from joined table date based on closest date

I have a Production Table and a Standing Data table. The relationship of Production to Standing Data is actually Many-To-Many which is different to how this relationship is usually represented (Many-to-One).
The standing data table holds a list of tasks and the score each task is worth. Tasks can appear multiple times with different "ValidFrom" dates for changing the score at different points in time. What I am trying to do is query the Production Table so that the TaskID is looked up in the table and uses the date it was logged to check what score it should return.
Here's an example of how I want the data to look:
Production Table:
+----------+------------+-------+-----------+--------+-------+
| RecordID | Date | EmpID | Reference | TaskID | Score |
+----------+------------+-------+-----------+--------+-------+
| 1 | 27/02/2020 | 1 | 123 | 1 | 1.5 |
| 2 | 27/02/2020 | 1 | 123 | 1 | 1.5 |
| 3 | 30/02/2020 | 1 | 123 | 1 | 2 |
| 4 | 31/02/2020 | 1 | 123 | 1 | 2 |
+----------+------------+-------+-----------+--------+-------+
Standing Data
+----------+--------+----------------+-------+
| RecordID | TaskID | DateActiveFrom | Score |
+----------+--------+----------------+-------+
| 1 | 1 | 01/02/2020 | 1.5 |
| 2 | 1 | 28/02/2020 | 2 |
+----------+--------+----------------+-------+
I have tried the below code but unfortunately due to multiple records meeting the criteria, the production data duplicates with two different scores per record:
SELECT p.[RecordID],
p.[Date],
p.[EmpID],
p.[Reference],
p.[TaskID],
s.[Score]
FROM ProductionTable as p
LEFT JOIN StandingDataTable as s
ON s.[TaskID] = p.[TaskID]
AND s.[DateActiveFrom] <= p.[Date];
What is the correct way to return the correct and singular/scalar Score value for this record based on the date?

You can use APPLY:
SELECT p.[RecordID], p.[Date], p.[EmpID], p.[Reference], p.[TaskID], s.[Score]
FROM ProductionTable as p OUTER APPLY
( SELECT TOP (1) s.[Score]
FROM StandingDataTable AS s
WHERE s.[TaskID] = p.[TaskID] AND
s.[DateActiveFrom] <= p.[Date]
ORDER BY s.[DateActiveFrom] DESC
) s;
If you want the score determined at the record level instead, change the WHERE clause inside the APPLY accordingly.
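OUTER APPLY is SQL Server syntax. For engines without it, the same "latest score on or before the logged date" lookup can be sketched as a correlated scalar subquery with ORDER BY ... DESC LIMIT 1. The example below runs on SQLite through Python's sqlite3 module and is an assumed equivalent, not the answerer's code. It also assumes ISO-formatted dates so that string comparison orders correctly, and records 3 and 4 use hypothetical dates after the score change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductionTable (RecordID INTEGER PRIMARY KEY, Date TEXT,
                              EmpID INTEGER, Reference INTEGER, TaskID INTEGER);
CREATE TABLE StandingDataTable (RecordID INTEGER PRIMARY KEY, TaskID INTEGER,
                                DateActiveFrom TEXT, Score REAL);
-- ISO dates (assumed) so that <= on text compares chronologically
INSERT INTO ProductionTable VALUES
  (1,'2020-02-27',1,123,1), (2,'2020-02-27',1,123,1),
  (3,'2020-02-28',1,123,1), (4,'2020-02-29',1,123,1);
INSERT INTO StandingDataTable VALUES
  (1,1,'2020-02-01',1.5), (2,1,'2020-02-28',2.0);
""")

# For each production row, pick the single most recent score whose
# DateActiveFrom is not later than the production date.
scored = conn.execute("""
SELECT p.RecordID, p.Date, p.TaskID,
       (SELECT s.Score
        FROM StandingDataTable AS s
        WHERE s.TaskID = p.TaskID
          AND s.DateActiveFrom <= p.Date
        ORDER BY s.DateActiveFrom DESC
        LIMIT 1) AS Score
FROM ProductionTable AS p
ORDER BY p.RecordID
""").fetchall()
print(scored)
```

As with OUTER APPLY, a production row with no applicable standing-data row still appears, with a NULL score.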

Calculate Equation From Separate Tables Data

I'm working on my senior High School Project and am reaching out to the community for help! (As my teacher doesn't know the answer to my question).
I have a simple "Products" table as shown below:
I also have an "Orders" table shown below:
Is there a way I can create a field in the "Orders" table named "Total Cost", and make that automatically calculate the total cost from all the products selected?

Firstly, I would advise against storing calculated values, and would also strongly advise against using calculated fields in tables. In general, calculations should be performed by queries.
I would also strongly advise against the use of multivalued fields, as your images appear to show.
In general, when following the rules of database normalisation, most sales databases are structured in a very similar manner, containing the following main tables (amongst others):
Products (aka Stock Items)
Customers
Order Header
Order Line (aka Order Detail)
A good example for you to learn from would be the classic Northwind sample database provided free of charge as a template for MS Access.
With the above structure, observe that each table serves a purpose with each record storing information pertaining to a single entity (whether it be a single product, single customer, single order, or single order line).
For example, you might have something like:
Products
Primary Key: Prd_ID
+--------+-----------+-----------+
| Prd_ID | Prd_Desc | Prd_Price |
+--------+-----------+-----------+
| 1 | Americano | $8.00 |
| 2 | Mocha | $6.00 |
| 3 | Latte | $5.00 |
+--------+-----------+-----------+
Customers
Primary Key: Cus_ID
+--------+--------------+
| Cus_ID | Cus_Name |
+--------+--------------+
| 1 | Joe Bloggs |
| 2 | Robert Smith |
| 3 | Lee Mac |
+--------+--------------+
Order Header
Primary Key: Ord_ID
Foreign Keys: Ord_Cust
+--------+----------+------------+
| Ord_ID | Ord_Cust | Ord_Date |
+--------+----------+------------+
| 1 | 1 | 2020-02-16 |
| 2 | 1 | 2020-01-15 |
| 3 | 2 | 2020-02-15 |
+--------+----------+------------+
Order Line
Primary Key: Orl_Order + Orl_Line
Foreign Keys: Orl_Order, Orl_Prod
+-----------+----------+----------+---------+
| Orl_Order | Orl_Line | Orl_Prod | Orl_Qty |
+-----------+----------+----------+---------+
| 1 | 1 | 1 | 2 |
| 1 | 2 | 3 | 1 |
| 2 | 1 | 2 | 1 |
| 3 | 1 | 1 | 4 |
| 3 | 2 | 3 | 2 |
+-----------+----------+----------+---------+
You might also opt to store the product description & price on the order line records, so that these are retained at the point of sale, as the information in the Products table is likely to change over time.
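With that structure, the total cost per order can be computed by a query instead of being stored. The following is a minimal sketch using Python's sqlite3 module rather than MS Access, with the sample rows above. The table name OrderLine (no space) and the use of the current Prd_Price are simplifying assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (Prd_ID INTEGER PRIMARY KEY, Prd_Desc TEXT, Prd_Price REAL);
CREATE TABLE OrderLine (Orl_Order INTEGER, Orl_Line INTEGER,
                        Orl_Prod INTEGER REFERENCES Products(Prd_ID),
                        Orl_Qty INTEGER,
                        PRIMARY KEY (Orl_Order, Orl_Line));
INSERT INTO Products VALUES (1,'Americano',8.00), (2,'Mocha',6.00), (3,'Latte',5.00);
INSERT INTO OrderLine VALUES (1,1,1,2), (1,2,3,1), (2,1,2,1), (3,1,1,4), (3,2,3,2);
""")

# Total cost per order = sum over its lines of quantity * unit price.
totals = conn.execute("""
SELECT ol.Orl_Order, SUM(ol.Orl_Qty * p.Prd_Price) AS total_cost
FROM OrderLine AS ol
JOIN Products AS p ON p.Prd_ID = ol.Orl_Prod
GROUP BY ol.Orl_Order
ORDER BY ol.Orl_Order
""").fetchall()

for order_id, total in totals:
    print(order_id, total)
```

If the price is copied onto each order line at the point of sale, the join to Products can be dropped and the sum taken over the line's own price column.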

Editing a row in a database table affects all previous records that query that information. How should prior versions be stored/managed?

I’ve been working on a Windows Form App using vb.net that retrieves information from a SQL database. One of the forms, frmContract, queries several tables, such as Addresses, and displays them in various controls, such as Labels and DataGridViews. Every year, the customer’s file is either renewed or expired, and I’m just now realizing that a change committed to any record today will affect the information displayed for the customer in the past. For example, if we update a customer’s mailing address today, this new address will show up in all previous customer profiles. What is the smartest way to avoid this problem without creating separate rows in each table with the same information? Or to put it another way, how can versions of a customer’s profile be preserved?
Another example would be a table that stores customer’s vehicles.
VehicleID | Year | Make | Model | VIN | Body
---------------------------------------------------------------
1 | 2005 | Ford | F150 | 11111111111111111 | Pickup
2 | 2001 | Niss | Sentra | 22222222222222222 | Sedan
3 | 2004 | Intl | 4700 | 33333333333333333 | Car Carrier
If today vehicle 1 is changed from a standard pickup to a flatbed, then if I load the customer contract from 2016 it will also show as flatbed even though back then it was a pickup truck.
I have a table for storing individual clients.
ClientID | First | Last | DOB
---------|----------|-----------|------------
1 | John | Doe | 01/01/1980
2 | Mickey | Mouse | 11/18/1928
3 | Eric | Forman | 03/05/1960
I have another table to store yearly contracts.
ContractID | ContractNo | EffectiveDate | ExpirationDate | ClientID (foreign key)
-----------|------------|---------------|-------------------|-----------
1 | 13579 | 06/15/2013 | 06/15/2014 | 1
2 | 13579 | 06/15/2014 | 06/15/2015 | 1
3 | 24680 | 10/05/2016 | 10/05/2017 | 3
Notice that the contract number can remain the same across different periods. In addition, because the same vehicle can be related to multiple contracts, I use a bridge table to relate individual vehicles to different contracts.
Id | VehicleID | ContractID <-- both foreign keys
---|-----------|------------
1 | 1 | 1
2 | 3 | 1
3 | 1 | 2
4 | 3 | 2
5 | 2 | 3
6 | 2 | 2
When frmContract is loaded, it queries the database and displays information about that particular contract year. However, if Vehicle 1 is changed from pickup to flatbed right now, then all the previous contract years will also show it as a flatbed.
I hope this illustrates my predicament. Any guidance will be appreciated.

Some DB systems have built-in temporal features so you can keep audit history of rows. Check to see if your DB has built-in support for this.

Remove newest redundant row and update timestamp

I'm working with a SQLite database that receives large data dumps on a regular basis from several sources. Unfortunately, those sources aren't intelligent about what they dump, and I end up with a lot of repeated records from one time to the next. I'm looking for a way to remove these repeated records without affecting the records that have legitimately changed from the past dump to this one.
Here's the general structure of the data (_id is the primary key):
| _id | _dateUpdated | _dateEffective | _dateExpired | name | status | location |
|-----|--------------|----------------|--------------|------|--------|----------|
| 1 | 2016-05-01 | 2016-05-01 | NULL | Fred | Online | USA |
| 2 | 2016-05-01 | 2016-05-01 | NULL | Jim | Online | USA |
| 3 | 2016-05-08 | 2016-05-08 | NULL | Fred | Offline| USA |
| 4 | 2016-05-08 | 2016-05-08 | NULL | Jim | Online | USA |
| 5 | 2016-05-15 | 2016-05-15 | NULL | Fred | Offline| USA |
| 6 | 2016-05-15 | 2016-05-15 | NULL | Jim | Online | USA |
I'd like to be able to reduce this data to something like this:
| _id | _dateUpdated | _dateEffective | _dateExpired | name | status | location |
|-----|--------------|----------------|--------------|------|--------|----------|
| 1 | 2016-05-01 | 2016-05-01 | 2016-05-07 | Fred | Online | USA |
| 2 | 2016-05-15 | 2016-05-01 | NULL | Jim | Online | USA |
| 3 | 2016-05-15 | 2016-05-08 | NULL | Fred | Offline| USA |
The idea here is that rows 4, 5, and 6 exactly duplicate rows 2 and 3 except for the timestamps (I'd need to compare by all three fields - name, status, location). However, row 3 does not duplicate row 1 (status changed from Online to Offline), so the _dateExpired field is set in row 1, and row 3 becomes the most recent record.
I'm querying this table with something like this:
SELECT * FROM Data WHERE
date(_dateEffective) <= date('now')
AND (_dateExpired IS NULL OR date(_dateExpired) > date('now'))
Is this sort of reduction possible in SQLite?
I am still a beginner to SQL and database design in general, so it's possible that I haven't structured the database in the best way. I'm open to suggestions there as well...I'm going for the ability to query data at a given point in time - for example, "what was Jim's status around 2016-05-06?"
Thanks in advance!

Consider using a staging table where the dump file goes into a DumpTable (regularly cleaned out before each dump) and then an INSERT...SELECT query migrates to your final table.
The SELECT portion uses a correlated subquery (to calculate the new [_dateExpired] for the rows that need it) and a derived-table subquery (to keep only the earliest record for each name and status combination, dropping the repeats). Finally, the LEFT JOIN ... IS NULL check against FinalTable ensures no duplicate records are appended, assuming [_id] is a unique identifier. Below is the routine:
Clean Out DumpTable
DELETE FROM DumpTable;
Run Dump Routine to be appended into DumpTable
Append Records to FinalTable
INSERT INTO FinalTable ([_id], [_dateUpdated], [_dateEffective], [_dateExpired],
                        [name], status, location)
SELECT d.[_id], d.[_dateUpdated], d.[_dateEffective],
       -- day before the next record for this name with a different status
       (SELECT Min(date(sub.[_dateEffective], '-1 day'))
        FROM DumpTable sub
        WHERE sub.[name] = d.[name]
          AND sub.[_dateEffective] > d.[_dateEffective]
          AND sub.status <> d.status) AS calcExpired,
       d.name, d.status, d.location
FROM DumpTable d
-- keep only the earliest record per name/status pair
INNER JOIN
  (SELECT Min(DumpTable.[_id]) AS min_id,
          DumpTable.name, DumpTable.status
   FROM DumpTable
   GROUP BY DumpTable.name, DumpTable.status) AS c
  ON (c.name = d.name)
 AND (c.min_id = d.[_id])
 AND (c.status = d.status)
-- skip records already present in FinalTable
LEFT JOIN FinalTable f
  ON d.[_id] = f.[_id]
WHERE f.[_id] IS NULL;
-- INSERTED RECORDS:
-- _id _dateUpdated _dateEffective _dateExpired name status location
-- 1 2016-05-01 2016-05-01 2016-05-07 Fred Online USA
-- 2 2016-05-01 2016-05-01 Jim Online USA
-- 3 2016-05-08 2016-05-08 Fred Offline USA

"Is this sort of reduction possible in SQLite?"
The answer to any "reduction" question in SQL is always Yes. The trick is to find what axes you're reducing along.
Here's a partial solution to illustrate; it gives the first Online date for each name & location.
select min(_dateEffective) as start_date
, name
, location
from Data
where status = 'Online'
group by
name
, location
With an outer join back to the table (on name & location) where the status is 'Offline' and the _dateEffective is greater than start_date, you get your _dateExpired.
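The outer join described above might look like the following sketch, run on SQLite through Python's sqlite3 module with the sample rows from the question. Subtracting one day from the first later 'Offline' date is an assumption made to match the asker's desired output (2016-05-07); the CTE name first_online is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Data (_id INTEGER PRIMARY KEY, _dateUpdated TEXT, _dateEffective TEXT,
                   _dateExpired TEXT, name TEXT, status TEXT, location TEXT);
INSERT INTO Data VALUES
  (1,'2016-05-01','2016-05-01',NULL,'Fred','Online','USA'),
  (2,'2016-05-01','2016-05-01',NULL,'Jim','Online','USA'),
  (3,'2016-05-08','2016-05-08',NULL,'Fred','Offline','USA'),
  (4,'2016-05-08','2016-05-08',NULL,'Jim','Online','USA'),
  (5,'2016-05-15','2016-05-15',NULL,'Fred','Offline','USA'),
  (6,'2016-05-15','2016-05-15',NULL,'Jim','Online','USA');
""")

# first_online: earliest Online date per name and location (the partial solution).
# The LEFT JOIN finds later Offline rows; names that never go Offline keep NULL.
periods = conn.execute("""
WITH first_online AS (
  SELECT MIN(_dateEffective) AS start_date, name, location
  FROM Data
  WHERE status = 'Online'
  GROUP BY name, location
)
SELECT f.name, f.location, f.start_date,
       date(MIN(o._dateEffective), '-1 day') AS date_expired
FROM first_online AS f
LEFT JOIN Data AS o
       ON o.name = f.name
      AND o.location = f.location
      AND o.status = 'Offline'
      AND o._dateEffective > f.start_date
GROUP BY f.name, f.location, f.start_date
ORDER BY f.name
""").fetchall()
print(periods)
```

This only covers the first Online period per name and location, in keeping with the partial solution it extends.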
"_id is the primary key"
There is a commonly held misunderstanding that every table needs some kind of sequential "ID" number as a primary key. The key you really care about is known as a natural key: one or more columns in the data that uniquely identify each row. In your case, it looks to me like that's _dateEffective, name, status, and location. At the very least, declare them unique to prevent accidental duplication.
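Declaring those columns unique could be sketched as follows, again with SQLite through Python's sqlite3 module. The trimmed-down column list and the NOT NULL constraints are assumptions for the sake of a short example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Data (
  _id            INTEGER PRIMARY KEY,
  _dateEffective TEXT NOT NULL,
  name           TEXT NOT NULL,
  status         TEXT NOT NULL,
  location       TEXT NOT NULL,
  -- natural key: no two rows may share all four values
  UNIQUE (_dateEffective, name, status, location)
)
""")

insert_sql = ("INSERT INTO Data (_dateEffective, name, status, location) "
              "VALUES (?, ?, ?, ?)")
row = ("2016-05-01", "Fred", "Online", "USA")
conn.execute(insert_sql, row)

# A second identical row violates the UNIQUE constraint.
try:
    conn.execute(insert_sql, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Data").fetchone()[0]
print(duplicate_rejected, row_count)
```

Note that this constraint only blocks rows with the same effective date; repeated dumps with new dates, as in the question, still need the reduction logic discussed above.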
