Multi-level index using only certain values from 2 columns - pandas

Good morning!
I am very new to pandas/Python. I mainly use SQL and SSIS for my current ETL, but a new data source requires tedious manual reformatting in Excel. I am trying to learn Python to save hours of manual work. The data on the report is extremely redundant. I have spent days trying to phrase my question in a way that returns the information I need, but to no avail.
I can't use my actual data because it contains PHI, so I will give an analogous example using Clients and Orders. An external system generates a 'MonthlyOrders.xls' report. There is pretty much ZERO flexibility in the export format. The .xls file extension gives you an idea about how dated the source environment is. First, I loaded the data to a data frame and split it down into smaller data frames by "Group". So each df represents one group. This is what it looks like after that:
General Format:
index | Name/Date       | ID/Item   | Price/'P'  | Billed/'NaaN' | PaidOn/Seller   | Total/Dept
1     | ClientName      | Client ID | 'P'        | Date Billed   | Pmt_received_On | Order Total
2     | Order Date      | item name | item price | 'NaaN'        | sold by         | dept
3     | Same order date | 2nd item  | price      | 'Naan'        | sold by         | dept
4     | NextClientName  | NextID    | 'P'        | Date Billed   | Pmt_received_On | Order Total
Example of Data:
Index | Name/Date    | ID/Item  | Price/'P' | Billed/'NaaN' | PaidOn/Seller | Total/Dept
1     | Victim, One  | VO100    | 'P'       | 08/12/2021    | 08/13/2021    | 78
2     | 08/11/2021   | books    | 12        | 'NaaN'        | Mrs. White    | The Study
3     | 08/11/2021   | Rope     | 56        | 'Naan'        | Mrs. White    | The Study
4     | 08/11/2021   | Pens     | 10        | 'NaaN'        | Mrs. White    | The Study
5     | Second, Dead | SD123    | 'P'       | 08/18/2021    | 08/20/2021    | 250
6     | 08/17/2021   | Pool Cue | 198       | 'NaaN'        | Mr. Green     | Billiard Room
7     | 08/17/2021   | Knife    | 52        | 'Naan'        | Mr. Green     | Billiard Room
What I want to do is create a multi-level index using Client Name, Client ID, and OrderDate.
Maybe I could put the Name:ID pairs in a dictionary and use that as the first level of the index, with the date as the next level. I am not sure if I can do that.
Alternatively, I want to split the first and second columns into four columns (Name, ID, OrderDate, Item). I do not use the 'Order Total' column. The data goes into a Billing_Import staging table, and then it is further manipulated and transformed in the Data Warehouse. The destination table has the following structure:
RecID | Group | ClientName | OrderDate | ClientID | ItemID | desc | ChgAmt | pmtAmt | seller | dept
The 'RecID' is added in SSIS, and 'Item ID' is split from the 'Desc' column after import with SQL. I plan to add a "Group" column back into each data frame so I know which data belongs to which group. Right now the groups are in separate data frames.
The 'Department' will always be the same for an order. There is almost always only one 'Seller', but if there were 2 sellers on one order, another record would be added.
The format I want would look like this:
Group  | ID   | Name   | OrderDate | ItemDesc | Charge | Pmt | Seller       | Department
Group1 | ID1a | Name1a | 1/1/2021  | item1    | $x     | $y  | Ms. Scarlet  | The Lounge
       |      |        |           | item2    | $x     | $y  | Ms. Scarlet  |
       |      |        |           | item3    | $x     | $y  | Ms. Scarlet  |
       | ID2a | Name2b | 1/15/2021 | item1    | $x     | $y  | Mrs. Peacock | The Kitchen
       |      |        |           | item2    | $x     | $y  | Mrs. Peacock |
Group2 | ID2a | Name2a | 1/22/2021 | item1    | $x     | $y  | Wadsworth    | The Cellar
       |      |        |           | item2    | $x     | $y  | Wadsworth    |
       | ID2a | Name2a | 1/22/2021 | item1    | $x     | $y  | Col. Mustard | The Cellar
Any and all suggestions are greatly appreciated!
Kindest Regards,
Cori

The first step would be to separate the columns into a format similar to:
index | name | date     | ID   | item  | price | billed   | paid_on  | seller | total | dept
1     | John | 10/20/21 | 1234 | socks | 10    | 10/12/21 | 10/20/21 | James  | 10    | garments
Once this step is complete, you can create your Multi-Index with:
df_multi_index = df.set_index(['ID', 'name', 'date'])
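One possible way to do the separation step is sketched below, using the sample rows from the question. It rests on two assumptions that are not confirmed by the question: that a client header row can be recognised by the literal 'P' in the third column, and that every item row belongs to the nearest client row above it. The raw column names (name_or_date, id_or_item, and so on) are placeholders made up for this sketch.

```python
import pandas as pd

# Sample rows copied from the question; the column names are placeholders.
raw = pd.DataFrame(
    [
        ["Victim, One", "VO100", "P", "08/12/2021", "08/13/2021", 78],
        ["08/11/2021", "books", 12, None, "Mrs. White", "The Study"],
        ["08/11/2021", "Rope", 56, None, "Mrs. White", "The Study"],
        ["08/11/2021", "Pens", 10, None, "Mrs. White", "The Study"],
        ["Second, Dead", "SD123", "P", "08/18/2021", "08/20/2021", 250],
        ["08/17/2021", "Pool Cue", 198, None, "Mr. Green", "Billiard Room"],
        ["08/17/2021", "Knife", 52, None, "Mr. Green", "Billiard Room"],
    ],
    columns=["name_or_date", "id_or_item", "price_or_flag",
             "billed", "paid_or_seller", "total_or_dept"],
)

# Assumption: a client header row is any row whose third column is the literal "P".
is_client = raw["price_or_flag"].eq("P")

# Take the client fields from the header rows and forward-fill them
# down over the item rows that follow each header.
client = (
    raw.loc[is_client, ["name_or_date", "id_or_item", "billed", "paid_or_seller"]]
    .rename(columns={"name_or_date": "name", "id_or_item": "ID",
                     "paid_or_seller": "paid_on"})
    .reindex(raw.index)
    .ffill()
)

# Keep only the item rows and give their columns unambiguous names.
items = raw.loc[
    ~is_client,
    ["name_or_date", "id_or_item", "price_or_flag", "paid_or_seller", "total_or_dept"],
].rename(columns={"name_or_date": "date", "id_or_item": "item",
                  "price_or_flag": "price", "paid_or_seller": "seller",
                  "total_or_dept": "dept"})

# One row per item, carrying the client fields it inherited.
tidy = client.loc[~is_client].join(items)
tidy["date"] = pd.to_datetime(tidy["date"], format="%m/%d/%Y")
tidy["price"] = tidy["price"].astype(int)

df_multi_index = tidy.set_index(["ID", "name", "date"])
print(df_multi_index)
```

If the flat layout is what the staging table needs, tidy can be used directly, and a "Group" column can be added to each group's frame before concatenating them.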

Related

sum not calculating correct no. of units in SQL command

I have the following SQL script (of which the result is displayed under the script). The issue I am having is that I need to add up the quantity on the invoice. The quantity works fine when all the products on the invoice are different. When a product appears twice on the invoice, the result is incorrect. Any help appreciated.

The DISTINCT keyword acts on all columns you select.
A new product introduces a difference which makes it no longer distinct. Hence the extra row(s).
Where you had:
Order Product Total
1 Toaster $10
2 Chair $20
And another item is added to order 1:
Order Product Total
1 Toaster $99
1 Balloon $99 -- Yes that's a $89 balloon!
2 Chair $20
The new row (balloon) is distinct and isn't reduced into the previous row (toaster).
To make it distinct again, don't select the product name:
Order Total
1 $99
2 $20
Uniqueness kicks in and everyone's happy!
If you can remove the column from the select list that's "different", you should get the results you need.
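The effect can be reproduced with a small sketch using Python's built-in sqlite3 module. The table name order_lines and its column names are made up for this illustration, and the rows are the toaster, balloon, and chair from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding one row per product line, with the order total repeated.
conn.execute(
    "CREATE TABLE order_lines (order_id INTEGER, product TEXT, order_total INTEGER)"
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [(1, "Toaster", 99), (1, "Balloon", 99), (2, "Chair", 20)],
)

# DISTINCT compares every selected column, so the product name keeps
# both rows of order 1 apart.
with_product = conn.execute(
    "SELECT DISTINCT order_id, product, order_total "
    "FROM order_lines ORDER BY order_id, product"
).fetchall()

# Without the product column the two rows of order 1 are identical and collapse.
without_product = conn.execute(
    "SELECT DISTINCT order_id, order_total FROM order_lines ORDER BY order_id"
).fetchall()

print(with_product)     # three rows
print(without_product)  # two rows
```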

MS Access Small Equivalent

I have this working in Excel, however it really needs to be moved into Access as that's where the rest of the database resides.
It's simply one table that contains Unique_ID, Seller and Fruit...
1 Chris Orange
2 Chris Apple
3 Chris Apple
4 Sarah Kiwi
5 Chris Pear
6 Sarah Orange
The end result should be displayed by Seller, followed by a list of each fruit sold. (In the following example Robert has not sold any fruit. I do have a list of all sellers' names, but this can be ignored in this example as I believe that will be easy to integrate.) Each seller will only sell a maximum of 20 fruit.
Seller 1st 2nd 3rd 4th
Chris Orange Apple Apple Pear
Sarah Kiwi Orange
Robert
At the moment Excel uses Index, Match and Small to return results. Small is simply used on the Unique_ID to find the 1st, 2nd, 3rd, etc. smallest entries, and is matched to each seller's name to build the above results.
As Access doesn't have a Small function I am at a loss! In reality there are over 100,000 records (minimum) with over 4000 sellers....they are also not fruit :)

TRANSFORM First(Sales.Fruit) AS FirstOfFruit
SELECT Sales.Seller
FROM Sales
GROUP BY Sales.Seller
PIVOT DCount([id],"sales","seller='" & [seller] & "' and id<=" & [id]);
Where the table name is "Sales" and the columns are "ID", "Seller" and "Fruit"

To understand DCount better, use it in a SELECT query instead of a crosstab:
SELECT Sales.ID, Sales.Seller, Sales.Fruit, DCount([id],"sales","seller='" & [seller] & "' and id<=" & [id]) AS N
FROM Sales;
On each row, the last column is the DCount result. The syntax is DCount (field, source, expression) so what it does is count the IDs (field) in the Sales table (source) that match the expression - in other words, has the same seller as that row's record and an ID <= the current row's ID. So for Chris's sales, it numbers them 1 through 4, even though Sarah had a sale in the middle.
From this result, it's easy to take a Crosstab query that makes a table with seller in the row and N in the column - putting the sales in order for each seller the way you wanted to see them. The "First" function finds the first fruit for the combination of seller and N for each row and column of the result. You could just as easily use "Max" or "Min" here - any text function. Of course, there is only one record matching the seller row and the N column, but Crosstab queries require a function to evaluate and cannot use "Group by" for the field selected as a Value.
My 1st answer combines these steps - the select and the crosstab queries - in one query.
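For readers who want to check the logic outside Access, the same two steps (number each seller's sales, then pivot on that number) can be sketched in pandas. This is only an illustration of the idea, not a replacement for the Access query; the data is the six-row fruit table from the question.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "Seller": ["Chris", "Chris", "Chris", "Sarah", "Chris", "Sarah"],
        "Fruit": ["Orange", "Apple", "Apple", "Kiwi", "Pear", "Orange"],
    }
)

# Step 1: running count per seller in ID order. This plays the role of the
# DCount expression: how many rows of the same seller have an ID <= this ID.
sales = sales.sort_values("ID")
sales["N"] = sales.groupby("Seller").cumcount() + 1

# Step 2: the crosstab, with one row per seller and one column per N.
wide = sales.pivot(index="Seller", columns="N", values="Fruit")
print(wide)
```

Sellers with fewer sales than the widest row get missing values in the remaining columns, which corresponds to the empty cells in the crosstab result.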
Hope this helps.

Multiple entries with the same reference in a table with SQL

In a single table, I have multiple lines with the same reference information (ID). For the same day, customers had drinks, and the Appreciation is either 1 (yes) or 0 (no).
Table
ID DAY Drink Appreciation
1 1 Coffee 1
1 1 Tea 0
1 1 Soda 1
2 1 Coffee 1
2 1 Tea 1
3 1 Coffee 0
3 1 Tea 0
3 1 Iced Tea 1
I first tried to see who appreciated a certain drink, which is obviously very simple
Select ID, max(appreciation)
from table
where (day=1 and drink='coffee' and appreciation=1)
or (day=1 and drink='tea' and appreciation=1)
Since I am not even interested in the drink, I used max to remove duplicates and keep only the line with the highest appreciation.
But what I want to do now is to see who in fact appreciated every drink they had. Again, I am not interested in every line in the end, but only the ID and the appreciation. How can I modify my WHERE clause so that it is applied to every single ID? Adding the ID to the condition is also not an option. I tried switching OR for AND, but it doesn't return any value. How could I do this?

This should do the trick:
SELECT ID
FROM table
WHERE DRINK IN ('coffee','tea') -- or whatever else filter you want.
group by ID
HAVING MIN(appreciation) > 0
What it does is:
It finds the minimum appreciation in each group and checks that it is greater than 0, which means every line in the group has an appreciation above 0. The group is the ID, as defined in the GROUP BY clause.
As you can see, I'm using the HAVING clause, because you can't use aggregate functions in the WHERE clause.
Of course you can join other tables into the query as you like. Just be careful not to add an unwanted filter by joining, which might reduce your dataset in this query.
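A quick way to check the query against the sample data is a sketch with Python's built-in sqlite3 module. The table is named drinks here because table is a reserved word, and the drink filter is left out so that every drink counts; both choices are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE drinks (id INTEGER, day INTEGER, drink TEXT, appreciation INTEGER)"
)
conn.executemany(
    "INSERT INTO drinks VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Coffee", 1), (1, 1, "Tea", 0), (1, 1, "Soda", 1),
        (2, 1, "Coffee", 1), (2, 1, "Tea", 1),
        (3, 1, "Coffee", 0), (3, 1, "Tea", 0), (3, 1, "Iced Tea", 1),
    ],
)

# An ID qualifies only if its lowest appreciation is above 0,
# i.e. no drink in its group was rated 0.
liked_everything = conn.execute(
    "SELECT id FROM drinks WHERE day = 1 "
    "GROUP BY id HAVING MIN(appreciation) > 0 ORDER BY id"
).fetchall()

print(liked_everything)
```

With this data only ID 2 is returned, since IDs 1 and 3 each have at least one drink rated 0.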

Storing a set of criteria in another table

I have a large table with sales data, useful data below:
RowID Date Customer Salesperson Product_Type Manufacturer Quantity Value
1 01-06-2004 James Ian Taps Tap Ltd 200 £850
2 02-06-2004 Apple Fran Hats Hats Inc 30 £350
3 04-06-2004 James Lawrence Pencils ABC Ltd 2000 £980
...
Many rows later...
...
185352 03-09-2012 Apple Ian Washers Tap Ltd 600 £80
I need to calculate a large set of targets of different types from this table. The target table is under my control and so far looks like this:
TargetID Year Month Salesperson Target_Type Quantity
1 2012 7 Ian 1 6000
2 2012 8 James 2 2000
3 2012 9 Ian 2 6500
At present I am working out target types using a view of the first table which has a lot of extra columns:
SELECT YEAR(Date)
, MONTH(Date)
, Salesperson
, Quantity
, CASE WHEN Manufacturer IN ('Tap Ltd','Hats Inc') AND Product_Type = 'Hats' THEN True ELSE False END AS IsType1
, CASE WHEN Manufacturer = 'Hats Inc' AND Product_Type IN ('Hats','Coats') THEN True ELSE False END AS IsType2
...
...
, CASE WHEN Manufacturer IN ('Tap Ltd','Hats Inc') AND Product_Type = 'Hats' THEN True ELSE False END AS IsType24
, CASE WHEN Manufacturer IN ('Tap Ltd','Hats Inc') AND Product_Type = 'Hats' THEN True ELSE False END AS IsType25
FROM SalesTable
WHERE [some stuff here]
This is horrible to read/debug and I hate it!!
I've tried a few different ways of simplifying this but have been unable to get it to work.
The closest I have come is to have a third table holding the definition of the types with the values for each field and the type number, this can be joined to the tables to give me the full values but I can't work out a way to cope with multiple values for each field.
Finally the question:
Is there a standard way this can be done or an easier/neater method other than one column for each type of target?
I know this is a complex problem so if anything is unclear please let me know.
Edit - What I need to get:
At the very end of the process I need to have targets displayed with actual sales:
Type Year Month Salesperson TargetQty ActualQty
2 2012 8 James 2000 2809
2 2012 9 Ian 6500 6251
Each row of the sales table could potentially satisfy 8 of the types.
Some more points:
1. I have 5 different columns that need to be defined against the targets (or set to NULL to include any value).
2. I have between 30 and 40 different types that need to be defined, and several of the columns could contain as many as 10 different values.
For point 2, if I use a row for each permutation of values, 2 columns with 10 values each would give me 100 rows for each salesperson for each month, which is a lot, but if this is the only way to define multiple values I will have to do this.
Sorry if this makes no sense!

If I am correct that the "Target_Type" field in the Target Table is based on the Manufacturer and the Product_Type, then you can create a TargetType table that looks like what's below and JOIN on Manufacturer and the Product_Type to get your Target_Type_Value:
ID Product_Type Manufacturer Target_Type_Value
1 Taps Tap Ltd 1
2 Hats Hats Inc 2
3 Coats Hats Inc 2
4 Hats Caps Inc 3
5 Pencils ABC Ltd 6
This should address the "multiple values for each field" problem by having a row for each possibility.
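A minimal sketch of that join, using Python's built-in sqlite3 module, might look like the following. The table and column names (sales, target_type, and so on) are assumptions, the target_type rows are the ones listed above, and the 'Coats' sale is an invented row added only to show two definitions feeding the same type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesperson TEXT, product_type TEXT, "
    "manufacturer TEXT, quantity INTEGER)"
)
conn.execute(
    "CREATE TABLE target_type (product_type TEXT, manufacturer TEXT, "
    "target_type_value INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Ian", "Taps", "Tap Ltd", 200),
        ("Fran", "Hats", "Hats Inc", 30),
        ("Fran", "Coats", "Hats Inc", 20),  # invented row for illustration
        ("Lawrence", "Pencils", "ABC Ltd", 2000),
    ],
)
conn.executemany(
    "INSERT INTO target_type VALUES (?, ?, ?)",
    [
        ("Taps", "Tap Ltd", 1),
        ("Hats", "Hats Inc", 2),
        ("Coats", "Hats Inc", 2),
        ("Hats", "Caps Inc", 3),
        ("Pencils", "ABC Ltd", 6),
    ],
)

# Each sale picks up its type(s) from the lookup table instead of from
# one CASE column per type; quantities are then summed per type and person.
actuals = conn.execute(
    "SELECT tt.target_type_value, s.salesperson, SUM(s.quantity) "
    "FROM sales AS s "
    "JOIN target_type AS tt "
    "  ON tt.product_type = s.product_type AND tt.manufacturer = s.manufacturer "
    "GROUP BY tt.target_type_value, s.salesperson "
    "ORDER BY tt.target_type_value, s.salesperson"
).fetchall()

print(actuals)
```

Because a sale can match several lookup rows with different type values, the same sale may count toward more than one type, which fits the note that one sales row could satisfy several types.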

SELECT datafields with multiple groups and sums

I can't seem to group by multiple data fields and sum a particular grouped column.
I want to group Person to customer and then group customer to price and then sum price. The person with the highest combined sum(price) should be listed in ascending order.
Example:
table customer
-----------
customer | common_id
green 2
blue 2
orange 1
table invoice
----------
person | price | common_id
bob 2330 1
greg 360 2
greg 170 2
SELECT DISTINCT
min(person) As person,min(customer) AS customer, sum(price) as price
FROM invoice a LEFT JOIN customer b ON a.common_id = b.common_id
GROUP BY customer,price
ORDER BY person
The results I desire are:
**BOB:**
Orange, $2230
**GREG:**
green, $360
blue,$170
The colors are the customer, that GREG and Bob handle. Each color has a price.

There are two issues that I can see. One is a bit picky, and one is quite fundamental.
Presentation of data in SQL
SQL returns tabular data sets. It's not able to return subsets with headings, looking something like a pivot table.
This means that this is not possible...
**BOB:**
Orange, $2230
**GREG:**
green, $360
blue, $170
But that this is possible...
Bob, Orange, $2230
Greg, Green, $360
Greg, Blue, $170
Relating data
I can visually see how you relate the data together...
table customer            table invoice
--------------            -------------
customer | common_id      person | price | common_id
green      2              greg     360     2
blue       2              greg     170     2
orange     1              bob      2330    1
But SQL doesn't have any implied ordering. Things can only be related if an expression can state that they are related. For example, the following is equally possible...
table customer            table invoice
--------------            -------------
customer | common_id      person | price | common_id
green      2              greg     170     2    \ These two have
blue       2              greg     360     2    / been swapped
orange     1              bob      2330    1
This means that you need rules (and likely additional fields) that explicitly state which customer record matches which invoice record, especially when there are multiples in both with the same common_id.
An example of a rule could be, the lowest price always matches with the first customer alphabetically. But then, what happens if you have three records in customer for common_id = 2, but only two records in invoice for common_id = 2? Or do the number of records always match, and do you enforce that?
Most likely you need an extra piece (or pieces) of information to know which records relate to each other.
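The ambiguity shows up directly if the two tables from the question are joined on common_id alone. The sketch below uses Python's built-in sqlite3 module with the question's data; it only illustrates the problem and does not resolve it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer TEXT, common_id INTEGER)")
conn.execute("CREATE TABLE invoice (person TEXT, price INTEGER, common_id INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("green", 2), ("blue", 2), ("orange", 1)],
)
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?)",
    [("bob", 2330, 1), ("greg", 360, 2), ("greg", 170, 2)],
)

# With only common_id to go on, every greg invoice matches every
# customer that shares common_id = 2.
rows = conn.execute(
    "SELECT a.person, b.customer, a.price "
    "FROM invoice AS a "
    "JOIN customer AS b ON a.common_id = b.common_id "
    "ORDER BY a.person, b.customer, a.price"
).fetchall()

for row in rows:
    print(row)
```

Greg comes back with four rows (both customers paired with both prices) rather than two, so nothing in the data says which price belongs to green and which to blue.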

You should group by all of your selected fields except the sum. Then the GROUP_CONCAT function (MySQL) may help you concatenate the resulting rows of each group.

I'm not sure how you could possibly do this. Greg has 2 colors AND 2 prices, so how do you determine which goes with which?
Greg Blue 170 or Greg Blue 360? Or should Green be attached to either price?
I think the colors need to have unique identifiers, separate from the person's unique identifiers.
Just a thought.
