Compare 2 data frames with a tolerance

I have 2 data frames, each containing 4 columns and hundreds of rows. Although the columns are the same, the rows may appear in any order. I need to reconcile these data frames and ensure that everything in DF1 is also in DF2 and vice versa, and highlight any rows that do not appear in both data frames.
However, the final column, "Net Cash", will sometimes not be exactly the same figure in each data frame. I need a tolerance of $1 on it; that is, Net Cash can differ by up to $1 and still be accepted. I have attached 2 simplified data frames.
DF_Client
| Client | Broker | Stock | Qty | Net Cash |
| John | Bank 1 | FMG | 500 | 32,456.4532 |
| Charlie | Bank 2 | CBA | 1,200 | 37,783.3740 |
| Paul | Bank 3 | TLS | 780 | 210,237.9830 |
| Richard | Bank 3 | WOW | 4,921 | 25,119.9952 |
| John | Bank 1 | FMG | 1,595 | 545.8500 |
DF_Broker
| Client | Broker | Stock | Qty | Net Cash |
| Richard | Bank 3 | WOW | 4,921 | 25,119.8603 |
| John | Bank 1 | FMG | 1,595 | 546.0000 |
| Charlie | Bank 2 | CBA | 1,200 | 37,783.5892 |
| Paul | Bank 3 | TLS | 780 | 210,237.7521 |
| John | Bank 1 | FMG | 500 | 32,456.6600 |
I have tried to merge; however, because the Net Cash column is not always exact, it will not match the rows.
df_compare = pd.merge(df_Client, df_Broker, how="outer", indicator="Missing")
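Since a plain outer merge treats Net Cash as part of the key, one workaround (a sketch, with the frames rebuilt from the tables above and illustrative variable names) is to merge on the exact columns only, then apply the $1 tolerance to the two Net Cash columns:

```python
import pandas as pd

# The two frames from the question, reconstructed.
df_client = pd.DataFrame({
    "Client": ["John", "Charlie", "Paul", "Richard", "John"],
    "Broker": ["Bank 1", "Bank 2", "Bank 3", "Bank 3", "Bank 1"],
    "Stock":  ["FMG", "CBA", "TLS", "WOW", "FMG"],
    "Qty":    [500, 1200, 780, 4921, 1595],
    "Net Cash": [32456.4532, 37783.3740, 210237.9830, 25119.9952, 545.8500],
})
df_broker = pd.DataFrame({
    "Client": ["Richard", "John", "Charlie", "Paul", "John"],
    "Broker": ["Bank 3", "Bank 1", "Bank 2", "Bank 3", "Bank 1"],
    "Stock":  ["WOW", "FMG", "CBA", "TLS", "FMG"],
    "Qty":    [4921, 1595, 1200, 780, 500],
    "Net Cash": [25119.8603, 546.0000, 37783.5892, 210237.7521, 32456.6600],
})

# Merge on the columns that must match exactly; keep both cash figures.
keys = ["Client", "Broker", "Stock", "Qty"]
merged = df_client.merge(df_broker, on=keys, how="outer",
                         suffixes=("_client", "_broker"), indicator="Missing")

# A row reconciles if it exists on both sides AND the cash differs by <= $1.
merged["Matched"] = (
    (merged["Missing"] == "both")
    & ((merged["Net Cash_client"] - merged["Net Cash_broker"]).abs() <= 1.0)
)
mismatches = merged[~merged["Matched"]]  # rows to highlight
```

With the sample data above, all five rows reconcile and `mismatches` comes back empty; a row missing from one side would show up with `Missing` equal to `left_only` or `right_only`.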

Related

Teradata SQL Assistant - How can I pivot or transpose large tables with many columns and many rows?

I am using Teradata SQL Assistant Version TD 16.10.06.01 ...
I have seen a lot of people transpose data for smallish tables, but I am working with thousands of clients and need to break the columns up into line item values to compare orders and highlight differences between them. The problem is that it is all horizontally linked, and I need to transpose it to Id, Transaction id, Version, Line Item Value 1, Line Item Value 2..., then add another column comparing values to see if they changed.
example:
+----+------------+-----------+------------+----------------+--------+----------+----------+------+-------------+
| Id | First Name | Last Name | DOB | transaction id | Make | Location | Postcode | Year | Price |
+----+------------+-----------+------------+----------------+--------+----------+----------+------+-------------+
| 1 | John | Smith | 15/11/2001 | 1654654 | Audi | NSW | 2222 | 2019 | $ 10,000.00 |
| 2 | Mark | White | 11/02/2002 | 1661200 | BMW | WA | 8888 | 2016 | $ 8,999.00 |
| 3 | Bob | Grey | 10/05/2002 | 1667746 | Ford | QLD | 9999 | 2013 | $ 3,000.00 |
| 4 | Phil | Faux | 6/08/2002 | 1674292 | Holden | SA | 1111 | 2000 | $ 5,800.00 |
+----+------------+-----------+------------+----------------+--------+----------+----------+------+-------------+
hoping to change the data to :
+----+----------+----------+----------+----------------+----------+----------+----------------+---------+-----+
| id | trans_id | Vers_ord | Item Val | Ln_Itm_Dscrptn | Org_Val | Updt_Val | Amndd_Ord_chck | Lbl_Rnk | ... |
+----+----------+----------+----------+----------------+----------+----------+----------------+---------+-----+
| 1 | 1654654 | 2 | 11169 | Make | Audi BLK | Audi WHT | Yes | 1 | |
| 1 | 1654654 | 2 | 11189 | Location | NSW | WA | Yes | 2 | |
| 1 | 1654654 | 2 | 23689 | Postcode | 2222 | 6000 | Yes | 3 | |
+----+----------+----------+----------+----------------+----------+----------+----------------+---------+-----+
Recently, with smaller data, I created a table, added in the values, then used a case statement (when value 1 then xyz) with a product join, and the data warehouse admins didn't mention anything out of order. But that was only a 16-row by 200-column table to transpose (Sum, Avg, Count, Median x 4 subsets of clients), which was significantly smaller than the tables I now need to compare.
I am worried my prior method will probably slow the data warehouse down, plus take a significant amount of time to type the SQL.
Is there a better way to transpose large tables?
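I can't speak to Teradata version specifics, but the target shape is a classic unpivot (Teradata's TD_UNPIVOT table operator does this server-side, if your version has it). As a rough sketch of the reshaping itself, in pandas with made-up column names matching the example:

```python
import pandas as pd

# One order from the example; the line-item columns are illustrative.
orders = pd.DataFrame({
    "Id": [1],
    "transaction_id": [1654654],
    "Make": ["Audi"],
    "Location": ["NSW"],
    "Postcode": [2222],
})

# Unpivot the wide line-item columns into one row per (Id, transaction_id, item).
long = orders.melt(id_vars=["Id", "transaction_id"],
                   var_name="Ln_Itm_Dscrptn", value_name="Org_Val")
```

Comparing two order versions then becomes a join of two such long tables on (Id, transaction_id, Ln_Itm_Dscrptn) rather than a case statement per column.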

Logging for multiple tables

Let's say we have a client table for sports brands like Nike and Adidas.
+--------------+------------+
| Client Table | |
+--------------+------------+
| Id | ClientName |
| 1 | Nike |
| 2 | Adidas |
+--------------+------------+
We also record customer information and their preferred sport and fitness level. Sports and fitness level are used in dropdown lists.
+--------------+------------+
| Sports Table | |
+--------------+------------+
| Id | Name |
| 1 | Basketball |
| 2 | Volleyball |
+--------------+------------+
+------------------+---------------+
| Fitnesslvl Table | |
+------------------+---------------+
| Id | Fitness Level |
| 1 | Beginner |
| 2 | Intermediate |
| 3 | Advance |
+------------------+---------------+
+----------------+--------------+----------+----------------+
| Customer Table | | | |
+----------------+--------------+----------+----------------+
| Id | CustomerName | SportsId | FitnessLevelId |
| 1 | John | 1 | 1 |
| 2 | Doe | 2 | 3 |
+----------------+--------------+----------+----------------+
The sports brands then want to filter our customers by sport and fitness level. In this example, Nike wants all sports while Adidas only wants customers interested in basketball. Likewise, Nike wants customers at every fitness level while Adidas only wants the advanced fitness level.
+---------------+----------+----------+
| Sports Filter | | |
+---------------+----------+----------+
| Id | ClientId | SportsId |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 2 | 1 |
+---------------+----------+----------+
+-------------------+----------+--------------+
| Fitnesslvl Filter | | |
+-------------------+----------+--------------+
| Id | ClientId | FitnessLvlId |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 2 | 3 |
+-------------------+----------+--------------+
How can we handle logging in this case when we want to record failed filters for the sports and fitness level? I'm thinking of two options.
Option 1: Create a different table for each failed filter:
-Sports Failed Filter
-FitnessLevel Failed Filter
+----------------------+-------------+----------------+
| Sports Failed Filter | | |
+----------------------+-------------+----------------+
| Id | CustomerId | SportsFilterId |
| 1 | 1 | 2 |
| 2 | 1 | 3 |
+----------------------+-------------+----------------+
However, if we have 10 filters, this means we will also have 10 failed-filter tables. I think this is very difficult to maintain.
Option 2: Instead of a different table for each dropdown value like sports and fitness level, we can create a lookup table and a single FailedFilter table.
I think the tradeoff is that it's not as simple and there is no strict referential integrity.
Please let me know if you have a different solution for this.
EDIT:
These filters are used in a backend application and the filtering logic lives there. I don't plan to include this logic in the database, as the query would be very complex and hard to maintain.
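The lookup-table option could look roughly like this (a sketch in SQLite; all table and column names are my own invention): one generic lookup table keyed by category, and a single failed-filter table pointing into it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lookup (
    id       INTEGER PRIMARY KEY,
    category TEXT NOT NULL,   -- 'sport', 'fitness_level', ... one lookup table for all dropdowns
    value    TEXT NOT NULL
);
CREATE TABLE failed_filter (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    client_id   INTEGER NOT NULL,
    lookup_id   INTEGER NOT NULL REFERENCES lookup(id)
    -- note: no FK back to the original Sports/Fitnesslvl tables;
    -- this is the "no strict referential integrity" tradeoff
);
""")
con.execute("INSERT INTO lookup VALUES (1,'sport','Basketball'),"
            "(2,'sport','Volleyball'),(3,'fitness_level','Advance')")
# Illustrative: customer 1 failed client 2's (Adidas) sport filter on Volleyball.
con.execute("INSERT INTO failed_filter VALUES (1, 1, 2, 2)")

rows = con.execute("""
    SELECT f.customer_id, l.category, l.value
    FROM failed_filter f JOIN lookup l ON l.id = f.lookup_id
""").fetchall()
```

A tenth filter then only adds rows to `lookup`, not a tenth failed-filter table.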

Database Design For Course Registration & Pricing

How would you design the table(s) to handle a registration form and pricing for an upcoming event?
As seen in the table below:
+-----------------------------+-------------------+------------+------------------+
| Occupation/Level            | Optional Item     | Base Price | Early Bird Price |
+-----------------------------+-------------------+------------+------------------+
| Residents                   |                   | $1000      | $800             |
| Practitioners               | Exam Prep (+$500) | $1500      | $1300            |
| OBYGN Consultant - Friday   |                   | $800       | $900             |
| OBYGN Consultant - Saturday |                   | $800       | $900             |
| OBYGN Consultant - Sunday   |                   | $600       | $700             |
| OBYGN Consultant - All Days |                   | $1900      | $2100            |
+-----------------------------+-------------------+------------+------------------+
Different prices are charged based on the attendee's occupation.
Practitioners have access to an optional exam preparation session for an additional fee.
OBGYN Consultants have the option of registering for single or multiple days, each of which charges a different amount.
To further illustrate the details, here's a screenshot of the old registration form.
My initial idea is to keep it very simple and treat each line separately and store the amount with it. The days which an OBGYN Consultant may choose simply become another line in the table.
+----+---------------------------+----------------+------------------+
| ID | Occupation/Level | Base Price | Early Bird Price |
+----+---------------------------+----------------+------------------+
| 1 | Residents | $1000 | $800 |
| 2 | Practitioners | $1500 | $1300 |
| 4 | OBYGN_Consultant_Friday | $800 | $900 |
| 5 | OBYGN_Consultant_Saturday | $800 | $900 |
| 6 | OBYGN_Consultant_Sunday | $600 | $700 |
| 7 | OBYGN_Consultant_All_Days | $1900 | $2100 |
+----+---------------------------+----------------+------------------+
The optional_materials table would handle any courses that have additional options.
+----+----------+-----------+--------+
| ID | CourseID | Name | Amount |
+----+----------+-----------+--------+
| 1 | 2 | Exam Prep | $500 |
+----+----------+-----------+--------+
See any major issues with this design OR see a better way of handling it?
It depends on how flexible you want to be in the future. If, for example, you want to have a super-early-bird price next year, or you will support different currencies, then I would model it like this:
+----+----------+---------------------------+------------+-------+----------+
| ID | CourseID | Title | Price Type | Price | Currency |
+----+----------+---------------------------+------------+-------+----------+
| 1 | 1 | Residents | Base | 1000 | Dollar |
| 2 | 1 | Residents | Early | 800 | Dollar |
| 3 | 2 | Practitioners | Base | 1500 | Dollar |
| 4 | 2 | Practitioners | Early | 1300 | Dollar |
| 5 | 4 | OBYGN_Consultant_Friday | Base | 800 | Dollar |
| 6 | 4 | OBYGN_Consultant_Friday | Early | 900 | Dollar |
| 7 | 5 | OBYGN_Consultant_Saturday | Base | 800 | Dollar |
| 8 | 5 | OBYGN_Consultant_Saturday | Early | 900 | Dollar |
| 9 | 6 | OBYGN_Consultant_Sunday | Base | 600 | Dollar |
| 10 | 6 | OBYGN_Consultant_Sunday | Early | 700 | Dollar |
| 11 | 7 | OBYGN_Consultant_All_Days | Base | 1900 | Dollar |
| 12 | 7 | OBYGN_Consultant_All_Days | Early | 2100 | Dollar |
+----+----------+---------------------------+------------+-------+----------+
But overall your approach is totally valid.
Also consider adding "Created" and "Edited" date fields at the end. That makes it easier to keep transparency over data changes (maybe you want to highlight options that only became available, or have changed, over the last 14 days or so).
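As a concrete sketch of that shape (SQLite, with assumed names, including the suggested Created/Edited audit columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE course_price (
    id         INTEGER PRIMARY KEY,
    course_id  INTEGER NOT NULL,
    title      TEXT    NOT NULL,
    price_type TEXT    NOT NULL CHECK (price_type IN ('Base', 'Early')),
    price      INTEGER NOT NULL,
    currency   TEXT    NOT NULL DEFAULT 'Dollar',
    created    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- audit trail
    edited     TEXT                                         -- set on update
)""")
con.executemany(
    "INSERT INTO course_price (course_id, title, price_type, price)"
    " VALUES (?, ?, ?, ?)",
    [(1, "Residents", "Base", 1000), (1, "Residents", "Early", 800)])

row = con.execute(
    "SELECT price FROM course_price"
    " WHERE course_id = 1 AND price_type = 'Early'"
).fetchone()
```

Adding a "SuperEarly" tier later is just a new allowed `price_type` value and new rows, not a new column.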

Outer Join multiple tables keeping all rows in common columns

I'm quite new to SQL - hope you can help:
I have several tables that all have 3 columns in common: ObjNo, Date(year-month), Product.
Each table has 1 other column, that represents an economic value (sales, count, netsales, plan ..)
I need to join all the tables on the 3 common columns. The outcome must have one row for each existing combination of the 3 common columns; not every combination exists in every table.
If I do full outer joins, I get ObjNo, Date, etc. once for each table, but I only need them once.
How can I achieve this?
tblCount
+-------+--------+---------+-------+
| ObjNo | Date   | Product | count |
+-------+--------+---------+-------+
| 1     | 201601 | Snacks  | 22    |
| 2     | 201602 | Coffee  | 23    |
| 4     | 201605 | Tea     | 30    |
+-------+--------+---------+-------+

tblSalesPlan
+-------+--------+---------+-----------+
| ObjNo | Date   | Product | salesplan |
+-------+--------+---------+-----------+
| 1     | 201601 | Beer    | 2000      |
| 2     | 201602 | Snacks  | 2000      |
| 5     | 201605 | Tea     | 2000      |
+-------+--------+---------+-----------+

tblSales
+-------+--------+---------+-------+
| ObjNo | Date   | Product | Sales |
+-------+--------+---------+-------+
| 1     | 201601 | Beer    | 1000  |
| 2     | 201602 | Coffee  | 2000  |
| 3     | 201603 | Tea     | 3000  |
+-------+--------+---------+-------+
Thx
Devon
It sounds like you're using SELECT * FROM... which is giving you every field from every table. You probably only want to get the values from one table, so you should be explicit about which fields you want to include in the results.
If you're not sure which table is going to have a record for each case (i.e. there is not guaranteed to be a record in any particular table) you can use the COALESCE function to get the first non-null value in each case.
SELECT COALESCE(tbl1.ObjNo, tbl2.ObjNo, tbl3.ObjNo) AS ObjNo, ....
tbl1.Sales, tbl2.Count, tbl3.Netsales
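The same idea can be sketched in pandas for illustration (table contents taken from the question): merging on the shared key columns keeps only one copy of ObjNo/Date/Product, which is effectively the COALESCE the SQL needs.

```python
import pandas as pd

tbl_count = pd.DataFrame({"ObjNo": [1, 2, 4], "Date": [201601, 201602, 201605],
                          "Product": ["Snacks", "Coffee", "Tea"],
                          "count": [22, 23, 30]})
tbl_plan = pd.DataFrame({"ObjNo": [1, 2, 5], "Date": [201601, 201602, 201605],
                         "Product": ["Beer", "Snacks", "Tea"],
                         "salesplan": [2000, 2000, 2000]})
tbl_sales = pd.DataFrame({"ObjNo": [1, 2, 3], "Date": [201601, 201602, 201603],
                          "Product": ["Beer", "Coffee", "Tea"],
                          "Sales": [1000, 2000, 3000]})

# Full outer join on the three shared columns; each key combination
# appears exactly once, with NaN where a table has no matching row.
keys = ["ObjNo", "Date", "Product"]
result = (tbl_count.merge(tbl_plan, on=keys, how="outer")
                   .merge(tbl_sales, on=keys, how="outer"))
```

With the sample data, `result` has 7 rows, one per distinct (ObjNo, Date, Product) combination across the three tables.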

Joining data from two result rows on a numerical range

I am trying to create a custom interface for a system that tracks tickets.
I have got tickets in a table of the form:
+---------+-----+------+
| Section | Row | Seat |
+---------+-----+------+
| 15      | A   | 100  |
| 15      | A   | 102  |
| 15      | A   | 103  |
| 15      | A   | 110  |
| 15      | A   | 111  |
| 15      | B   | 102  |
| 15      | B   | 103  |
| 15      | B   | 104  |
| 15      | C   | 99   |
| 15      | C   | 100  |
| 15      | C   | 101  |
| 15      | C   | 102  |
| 15      | C   | 103  |
| 15      | C   | 104  |
+---------+-----+------+
I am trying to display the ticket 'blocks' where seats behind each other are marked as such. i.e. I'd like to be able to display:
+---------+-----+------------+-------------------+
| Section | Row | Seat Range | Overlaps Previous |
+---------+-----+------------+-------------------+
| 15      | A   | 100 - 103  | No                |
| 15      | B   | 102 - 104  | Yes               |
| 15      | C   | 99 - 104   | Yes               |
| 15      | A   | 110 - 111  | No                |
+---------+-----+------------+-------------------+
Any thoughts?
You could have an additional relation that assigns all neighbouring seats to a given one. That will also work better than any purely numerical scheme for any sort of physical separation of your seats, and it lets you allow for a neighbourhood across rows. From there you could iteratively define any block of free seats.
If this is about supporting a cashier, I tend to think I would not address this solely in the database, but would seek an integration with the GUI that identifies the blocks via some backtracking when a first free seat is clicked.
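For the Seat Range column alone, the classic gaps-and-islands trick (seat number minus its index is constant within a consecutive run) can be sketched in Python. Note it treats seat 100 and seats 102-103 in row A as separate blocks, since 101 is absent from the data, whereas the sample output shows them merged:

```python
from itertools import groupby

# Ticket rows from the question: (section, row, seat), grouped by row.
seats = [("15", "A", 100), ("15", "A", 102), ("15", "A", 103),
         ("15", "A", 110), ("15", "A", 111),
         ("15", "B", 102), ("15", "B", 103), ("15", "B", 104),
         ("15", "C", 99), ("15", "C", 100), ("15", "C", 101),
         ("15", "C", 102), ("15", "C", 103), ("15", "C", 104)]

blocks = []
for (section, row), grp in groupby(seats, key=lambda s: (s[0], s[1])):
    nums = sorted(n for _, _, n in grp)
    # seat - index is constant within each consecutive run ("island")
    for _, run in groupby(enumerate(nums), key=lambda t: t[1] - t[0]):
        run = [n for _, n in run]
        blocks.append((section, row, f"{run[0]} - {run[-1]}"))
```

The "Overlaps Previous" column would then be a second pass comparing each block's range against the blocks of the preceding row.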