Pandas - Pivot and Rearrange Table With Multiple Labels in Same Header

I have an xlsx file with tabs for multiple years of data. Each tab contains a table with many columns and the table is structured like this:
+-----------+-------+-------------------------+----------------------+
| City | State | Number of Drivers, 2019 | Number of Cars, 2019 |
+-----------+-------+-------------------------+----------------------+
| LA | CA | 123 | 10.0 |
| San Diego | CA | 456 | 2345 |
+-----------+-------+-------------------------+----------------------+
I would like to rearrange the table to look like this, and do it for each tab in the xlsx:
+-----------+-------+------+-------------------+---------------+
| City | State | Year | Measure Name | Measure Value |
+-----------+-------+------+-------------------+---------------+
| LA | CA | 2019 | Number of Drivers | 123 |
| San Diego | CA | 2019 | Number of Drivers | 456 |
| LA | CA | 2019 | Number of Cars | 10 |
| San Diego | CA | 2019 | Number of Cars | 2345 |
+-----------+-------+------+-------------------+---------------+
There are a lot of moving pieces to this, and it has been a little tricky to get the final formatting correct.

Use melt, then join the result of str.split on the melted header column:
s = df.melt(['City', 'State'])
s = s.join(s.variable.str.split(',', expand=True))
Out[120]:
City State variable value 0 1
0 LA CA NumberofDrivers,2019 123.0 NumberofDrivers 2019
1 SanDiego CA NumberofDrivers,2019 456.0 NumberofDrivers 2019
2 LA CA NumberofCars,2019 10.0 NumberofCars 2019
3 SanDiego CA NumberofCars,2019 2345.0 NumberofCars 2019
# If you need to change the column names, add .rename(columns={...}) at the end.
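To get from the melted frame to the exact layout asked for in the question, the split pieces still need names and the year should become a number. The following is a minimal sketch, assuming the headers always follow the "Measure Name, Year" pattern; the sample frame and the names long_df, parts and result are made up for illustration, and rsplit on the last comma is a choice of this sketch, not part of the answer above.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "City": ["LA", "San Diego"],
    "State": ["CA", "CA"],
    "Number of Drivers, 2019": [123, 456],
    "Number of Cars, 2019": [10.0, 2345],
})

# Wide to long: one row per (City, State, original header).
long_df = df.melt(id_vars=["City", "State"], value_name="Measure Value")

# Split each header on its last comma into the measure name and the year.
parts = long_df["variable"].str.rsplit(",", n=1, expand=True)
long_df["Measure Name"] = parts[0].str.strip()
long_df["Year"] = parts[1].str.strip().astype(int)

# Keep only the columns from the desired layout, in the desired order.
result = long_df[["City", "State", "Year", "Measure Name", "Measure Value"]]
print(result)
```

Splitting on the last comma keeps a measure name intact even if it contains a comma of its own.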

This is how I was able to apply Yoben's solution to every tab in the xlsx file, combine the results, and write the full table to a .csv:
import pandas as pd

# sheet_name=None returns a dict of {sheet name: DataFrame}, one entry per tab
sheets_dict = pd.read_excel(r'file.xlsx', sheet_name=None)
frames = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    # keep 'sheet' as an id column so it is not melted along with the measures
    sheet = sheet.melt(['City', 'State', 'sheet'])
    sheet = sheet.join(sheet.variable.str.split(',', expand=True))
    frames.append(sheet)
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
full_table = pd.concat(frames, ignore_index=True)
full_table.to_csv('Full Table.csv')

Related

PowerBI / SQL Query to verify records

I am working on a PowerBI report that is grabbing information from SQL and I cannot find a way to solve my problem using PowerBI or how to write the required code. My first table, Certifications, includes a list of certifications and required trainings that must be obtained in order to have an active certification.
My second table, UserCertifications, includes a list of UserIDs, certifications, and the trainings associated with a certification.
How can I write SQL code or a PowerBI measure to tell if a user has all required trainings for a certification? For example, if UserID 1 has the A certification, how can I verify that they have the TrainingIDs of 1, 10, and 150 associated with it?
Certifications:
CertificationsTable
UserCertifications:
UserCertificationsTable
This is a DAX pattern to test whether one set of values contains at least all the values of another.
Certifications
| Certification | TrainingID |
|---------------|------------|
| A | 1 |
| A | 10 |
| A | 150 |
| B | 7 |
| B | 9 |
UserCertifications
| UserID | Certification | Training |
|--------|---------------|----------|
| 1 | A | 1 |
| 1 | A | 10 |
| 1 | A | 300 |
| 2 | A | 150 |
| 2 | B | 9 |
| 2 | B | 90 |
| 3 | A | 7 |
| 4 | A | 1 |
| 4 | A | 10 |
| 4 | A | 150 |
| 4 | A | 1000 |
In the above scenario, DAX needs to find out whether the mandatory trainings (Certifications[TrainingID]) for each Certifications[Certification] have been completed within each UserCertifications[UserID] and UserCertifications[Certification] partition.
DAX should return true only for UserCertifications[UserID] = 4, as it is the only user that completed at least all the mandatory trainings.
The way to achieve this is through the following measure:
areAllMandatoryTrainingCompleted =
VAR _alreadyCompleted =
    CONCATENATEX (
        UserCertifications,
        UserCertifications[Training],
        "-",
        UserCertifications[Training]
    ) // what is completed in the fact table; the fourth argument is very important as it decides the sort order
VAR _0 =
    MAX ( UserCertifications[Certification] )
VAR _supposedToComplete =
    CONCATENATEX (
        FILTER ( Certifications, Certifications[Certification] = _0 ),
        Certifications[TrainingID],
        "-",
        Certifications[TrainingID]
    ) // what is required in the Certifications table; the fourth argument is very important as it decides the sort order
VAR _isMandatoryTrainingCompleted =
    CONTAINSSTRING ( _alreadyCompleted, _supposedToComplete ) // CONTAINSSTRING ( <within_text>, <find_text> ); returns TRUE or FALSE
RETURN
    _isMandatoryTrainingCompleted
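For checking the expected result outside PowerBI, the same question can be phrased as a subset test per (UserID, Certification) pair. This is a small Python sketch over the sample rows above, not a translation of the measure; the names required, user_rows, completed and status are made up for illustration.

```python
from collections import defaultdict

# Required trainings per certification (the Certifications table).
required = {"A": {1, 10, 150}, "B": {7, 9}}

# (UserID, Certification, Training) rows from the UserCertifications table.
user_rows = [
    (1, "A", 1), (1, "A", 10), (1, "A", 300),
    (2, "A", 150), (2, "B", 9), (2, "B", 90),
    (3, "A", 7),
    (4, "A", 1), (4, "A", 10), (4, "A", 150), (4, "A", 1000),
]

# Collect the trainings each user holds for each certification.
completed = defaultdict(set)
for user_id, cert, training in user_rows:
    completed[(user_id, cert)].add(training)

# A certification is satisfied when the required set is a subset of what was completed.
status = {key: required[key[1]] <= done for key, done in completed.items()}

for (user_id, cert), ok in sorted(status.items()):
    print(user_id, cert, ok)
```

A subset test does not depend on sort order or adjacency. With string containment, a user holding trainings 1, 10, 20 and 150 would produce "1-10-20-150", which does not contain "1-10-150", so it is worth testing the measure against data shaped like that.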

How to search using a delimited string as array in query

I am trying to search for records whose columns match a value within a delimited string.
I have two tables that look like this
Vehicles
| Id | Make | Model |
|----|------|-------|
| 1 | Ford | Focus |
| 2 | Ford | GT |
| 3 | Ford | Kuga |
| 4 | Audi | R8 |
Monitor
| Id | Makes | Models |
|----|-------|----------|
| 1 | Ford | GT,Focus |
| 2 | Audi | R8 |
What I'm trying to achieve is the following:
| Id | Makes | Models | Matched_Count |
|----|-------|----------|---------------|
| 2 | Audi | R8 | 1 |
| 1 | Ford | GT,Focus | 2 |
Using the following query I can get matches on singular strings, but I'm not sure how I can split the commas to search for individual models.
select Id, Makes, Models, (select count(id) from Vehicles va where UPPER(sa.Makes) = UPPER(va.Make) AND UPPER(sa.Models) = UPPER(va.Model)) as Matched_Count
from Monitor sa
(I am using SQL Server 2016; however, I do not have access to create custom functions or variables.)
If you are stuck with this data model, you can use string_split():
select m.*, v.matched_count
from monitor m outer apply
     (select count(*) as matched_count
      from string_split(m.models, ',') s join
           vehicles v
           -- the Vehicles column is named Make, not Makes
           on s.value = v.model and m.makes = v.make
     ) v;
I would advise you to put your efforts into fixing the data model, though.
Here is a db<>fiddle.
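The split-then-count logic can be checked against the sample rows in plain Python. This is only an illustration of what the query is expected to return, assuming the case-insensitive comparison from the question; the variable names are made up.

```python
from collections import Counter

# (Id, Make, Model) rows from the Vehicles table.
vehicles = [(1, "Ford", "Focus"), (2, "Ford", "GT"), (3, "Ford", "Kuga"), (4, "Audi", "R8")]

# (Id, Makes, Models) rows from the Monitor table; Models is comma-delimited.
monitor = [(1, "Ford", "GT,Focus"), (2, "Audi", "R8")]

# Count vehicles per upper-cased (make, model) pair.
vehicle_keys = Counter((make.upper(), model.upper()) for _, make, model in vehicles)

matched = []
for mon_id, makes, models in monitor:
    # Split the delimited string and add up the matches for each model.
    count = sum(
        vehicle_keys[(makes.upper(), model.strip().upper())]
        for model in models.split(",")
    )
    matched.append((mon_id, makes, models, count))

for row in matched:
    print(row)
```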

SQL Group by Client Location

Sample of Data I am trying to manipulate
Order | OrderDate | ClientName| ClientAddress | City | State| Zip |
-------|-----------|-----------|---------------|--------|------|-------|
CO101 | 1/5/2015 | Client ABC| 101 Park Drive| Boston | MA | 02134 |
C0102 | 2/6/2015 | Client ABC| 101 Park Drive| Boston | MA | 02134 |
C0103 | 1/7/2015 | Client ABC| 354 Foo Pkwy | Dallas | TX | 75001 |
C0104 | 3/7/2015 | Client ABC| 354 Foo Pkwy | Dallas | TX | 75001 |
C0105 | 5/7/2015 | Client XYZ| 1 Binary Road | Austin | TX | 73301 |
C0106 | 1/8/2015 | Client XYZ| 1 Binary Road | Austin | TX | 73301 |
C0107 | 7/9/2015 | Client XYZ| 51 Testing Rd | Austin | TX | 73301 |
I have a database set up in MS SQL Server with all client orders for the past two-year period. Some clients only have one location; others have multiple locations. I would like to write a script that will show me the number of orders a customer placed by location over the total number of weeks in which there was at least one order.
Based on the results of this script, I would like to be able to deduce every customer location's summary of unique orders (placed at various times). For example:
Client ABC has placed 45 orders over 35 total weeks at location A
Client ABC has placed 35 orders over 15 total weeks at location B
Client ABC has placed 15 orders over 15 total weeks at location C
I would like to see this information for each unique location for each client. I am not sure how to aggregate the data in such a way. Here is where I am at with my script:
SELECT t1.ClientName, (SELECT DISTINCT t2.ClientAddress), COUNT(DISTINCT t2.Orders) AS TotalOrders,
DATEPART(week, t1.OrderDate) AS Week
FROM database t1
INNER JOIN database t2 on t1.Orders = t2.Orders
GROUP BY DATEPART(week, t1.OrderDate), t1.ClientAddress, t2.ClientAddress
HAVING COUNT(DISTINCT t2.SalesOrder) > 1
ORDER BY TotalOrders DESC
The results that I get show me the unique orders by location by week, but I'm not sure how to count the number of weeks in the way that I need; I have tried writing subqueries but I keep running into issues. I realize that in this script I am showing the number of orders by location for each individual week; I would like to count the total number of weeks within the time frame in which there is at least one order.
The results structure is as follows:
| ClientName| ClientAddress | TotalOrders | Week |
|-----------|---------------|--------------|------|
|Client ABC |101 Park Drive | 30 | 21 |
|Client ABC |101 Park Drive | 29 | 13 |
|Client ABC |101 Park Drive | 28 | 10 |
|Client XYZ |1 Binary Road | 27 | 19 |
|Client XYZ |1 Binary Road | 25 | 7 |
|Client XYZ |51 Testing Rd | 22 | 9 |
Any and all help would be greatly appreciated; thank you in advance.
Isn't this what you want?
SELECT t1.ClientName, t1.ClientAddress, COUNT(DISTINCT t1.Orders) AS TotalOrders,
       COUNT(DISTINCT DATEPART(week, t1.OrderDate)) AS Weeks
FROM database t1
INNER JOIN database t2 ON t1.Orders = t2.Orders
GROUP BY t1.ClientName, t1.ClientAddress
HAVING COUNT(DISTINCT t2.SalesOrder) > 1
ORDER BY TotalOrders DESC
I don't really follow why you're doing a self-join. Seems useless to me, but I left it in, just in case, and to focus only on the change I made to get your result.
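One thing to watch with a two-year window: a bare week number repeats every year, so week 10 of one year and week 10 of the next would count as a single week. The sketch below shows the per-location aggregation in Python while counting (year, week) pairs instead. It assumes the sample dates are month/day/year and uses ISO week numbering, which is not the same numbering as DATEPART(week, ...); the variable names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# (Order, OrderDate, ClientName, ClientAddress) rows from the sample data.
orders = [
    ("CO101", "1/5/2015", "Client ABC", "101 Park Drive"),
    ("C0102", "2/6/2015", "Client ABC", "101 Park Drive"),
    ("C0103", "1/7/2015", "Client ABC", "354 Foo Pkwy"),
    ("C0104", "3/7/2015", "Client ABC", "354 Foo Pkwy"),
    ("C0105", "5/7/2015", "Client XYZ", "1 Binary Road"),
    ("C0106", "1/8/2015", "Client XYZ", "1 Binary Road"),
    ("C0107", "7/9/2015", "Client XYZ", "51 Testing Rd"),
]

orders_by_loc = defaultdict(set)
weeks_by_loc = defaultdict(set)
for order_id, order_date, client, address in orders:
    # Assumption: dates are month/day/year.
    day = datetime.strptime(order_date, "%m/%d/%Y").date()
    iso = day.isocalendar()
    key = (client, address)
    orders_by_loc[key].add(order_id)
    # Store (ISO year, ISO week) so the same week number in two years stays distinct.
    weeks_by_loc[key].add((iso[0], iso[1]))

# (distinct orders, distinct weeks with at least one order) per client location.
summary = {key: (len(orders_by_loc[key]), len(weeks_by_loc[key])) for key in orders_by_loc}

for (client, address), (n_orders, n_weeks) in summary.items():
    print(f"{client} has placed {n_orders} orders over {n_weeks} total weeks at {address}")
```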

Can I merge SQL Server tables if they have not exactly the same structure?

I have two tables, source and target.
source:
+--------+------+-------------+
| Name | Year | City |
+--------+------+-------------+
| Toyota | 2002 | Los Angeles |
| Seat | 2012 | Madrid |
+--------+------+-------------+
target:
+----+---------+------+----------+
| ID | Name | Year | City |
+----+---------+------+----------+
| 1 | Bentley | 1969 | Budapest |
| 2 | Toyota | 1988 | New York |
| 3 | Ford | 2001 | Tokyo |
| 4 | Seat | 1995 | Madrid |
| 5 | Bugatti | 1995 | London |
+----+---------+------+----------+
I want to merge source into target. I know the MERGE command, it's fine. The issue is that the source has no column ID so that it won't match.
Since the Name column is unique in both tables, I only need to match on Name: if the name does not exist in target, insert the row; if it does exist, update the target row.
I could do it using a NOT EXISTS statement, but we are talking about billions of rows so MERGE would be a much quicker solution.
So can I somehow set the MERGE command to take only that column into account when matching?
Yes, you can:
MERGE target t
USING source s
ON t.Name = s.Name
WHEN NOT MATCHED THEN
    INSERT (Name, Year, City)
    VALUES (s.Name, s.Year, s.City)
WHEN MATCHED THEN
    UPDATE SET Year = s.Year,
               City = s.City;
If the ID column in target is not an IDENTITY column, you can create a sequence to populate it.
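The matching rule itself (update when the Name exists, insert with a fresh ID when it does not) can be sketched in a few lines of Python to check the expected outcome on the sample rows. This is an illustration of the semantics only, not of how SQL Server executes a merge; the third source row ("Kia") is hypothetical and added purely to exercise the insert branch.

```python
# Rows from the target table in the question.
target = [
    {"ID": 1, "Name": "Bentley", "Year": 1969, "City": "Budapest"},
    {"ID": 2, "Name": "Toyota", "Year": 1988, "City": "New York"},
    {"ID": 3, "Name": "Ford", "Year": 2001, "City": "Tokyo"},
    {"ID": 4, "Name": "Seat", "Year": 1995, "City": "Madrid"},
    {"ID": 5, "Name": "Bugatti", "Year": 1995, "City": "London"},
]

# Rows from the source table; the last one is a hypothetical addition.
source = [
    {"Name": "Toyota", "Year": 2002, "City": "Los Angeles"},
    {"Name": "Seat", "Year": 2012, "City": "Madrid"},
    {"Name": "Kia", "Year": 2015, "City": "Seoul"},  # hypothetical, not in the question
]

# Index target rows by the unique Name column, the only match key.
by_name = {row["Name"]: row for row in target}
next_id = max(row["ID"] for row in target) + 1  # stands in for IDENTITY or a sequence

for src in source:
    match = by_name.get(src["Name"])
    if match is not None:
        # Matched: update the existing target row in place.
        match.update(Year=src["Year"], City=src["City"])
    else:
        # Not matched: insert a new row with a generated ID.
        new_row = {"ID": next_id, **src}
        target.append(new_row)
        by_name[src["Name"]] = new_row
        next_id += 1

for row in target:
    print(row)
```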

SQL query for many-to-many self-join

I have a database table that has a companion many-to-many self-join table alongside it. The primary table is part and the other table is alternate_part (basically, alternate parts are identical to their main part with different #s). Every record in the alternate_part table is also in the part table. To illustrate:
`part`
| part_id | part_number | description |
|---------|-------------|-------------|
| 1 | 00001 | wheel |
| 2 | 00002 | tire |
| 3 | 00003 | window |
| 4 | 00004 | seat |
| 5 | 00005 | wheel |
| 6 | 00006 | tire |
| 7 | 00007 | window |
| 8 | 00008 | seat |
| 9 | 00009 | wheel |
| 10 | 00010 | tire |
| 11 | 00011 | window |
| 12 | 00012 | seat |
`alternate_part`
| main_part_id | alt_part_id |
|--------------|-------------|
| 1 | 5 | // Wheel
| 5 | 1 | // |
| 5 | 9 | // |
| 9 | 5 | // |
| 2 | 6 | // Tire
| 6 | 2 | // |
| ... | ... | // |
I am trying to produce a simple SQL query that will give me a list of all alternates for a main part. The tricky part is: some alternates are only listed as alternates of alternates, it is not guaranteed that every viable alternate for a part is listed as a direct alternate. e.g., if 'Part 3' is an alternate of 'Part 2' which is an alternate of 'Part 1', then Part 3 is an alternate of Part 1 (even if the alternate_part table doesn't list a direct link). The reverse is also true (Part 1 is an alternate of Part 3).
Basically, right now I'm pulling alternates and iterating through them
SELECT p.*, ap.*
FROM part p
INNER JOIN alternate_part ap ON p.part_id = ap.main_part_id
And then going back and doing the same again on those alternates. But, I think there's got to be a better way.
The SQL query I'm looking for will basically give me:
| part_id | alt_part_id |
|---------|-------------|
| 1 | 5 |
| 1 | 9 |
For part_id = 1, even when 1 & 9 are not explicitly linked in the alternates table.
Note: I have no control whatever over the structure of the DB, it is a distributed software solution.
Note 2: It is an Oracle platform, if that affects syntax.
You have to build a hierarchical tree, probably with a CONNECT BY PRIOR ... NOCYCLE query, something like this:
select distinct p.part_id, p.part_number, p.description, c.main_part_id
from part p
left join (
    select main_part_id, connect_by_root(main_part_id) real_part_id
    from alternate_part
    -- the column in alternate_part is alt_part_id
    connect by nocycle prior main_part_id = alt_part_id
) c
on p.part_id = c.real_part_id and p.part_id != c.main_part_id
order by p.part_id
You can read full documentation about Hierarchical queries at http://docs.oracle.com/cd/B28359_01/server.111/b28286/queries003.htm
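To sanity-check what the hierarchical query should return, the alternates of a part can be computed as everything reachable through the links in either direction. The sketch below does this with a breadth-first search in Python over the rows listed in the question; it is a different technique from CONNECT BY, and the function name all_alternates is made up for illustration.

```python
from collections import defaultdict, deque

# (main_part_id, alt_part_id) rows listed in the question.
links = [(1, 5), (5, 1), (5, 9), (9, 5), (2, 6), (6, 2)]


def all_alternates(part_id, links):
    """Return every part reachable from part_id through alternate links."""
    # Treat each link as two-way, since the relation is symmetric in the question.
    neighbours = defaultdict(set)
    for main_id, alt_id in links:
        neighbours[main_id].add(alt_id)
        neighbours[alt_id].add(main_id)

    seen = {part_id}
    queue = deque([part_id])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:  # the seen set prevents cycling, like NOCYCLE
                seen.add(nxt)
                queue.append(nxt)

    seen.discard(part_id)  # a part is not its own alternate
    return sorted(seen)


print(all_alternates(1, links))
```

For part_id = 1 this yields parts 5 and 9, matching the result table in the question even though 1 and 9 are not directly linked.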