How to use dictonary or other method effecintly to clean data

How to use dictonary or other method effecintly to clean data - pandas

Iam working on a dataset with lots of code for the department. I have other paper that decodes the department is there efficient way to combine or replace.
First Dataset:
Location DeptCode
Delhi 12B
Gurgoun 12D
Hydrabad 13A
Punjab 20A
Jhansi 31B
Below is the code:
Department DeptCode
Electronics [12A, 12B, 12C, 12D, 12E ........12Z]
Electronics [13A,13B,.......13Z]
Grocery 20A
Grocery [31A,31B,31C,.........31Z]
Expected:
Department DeptCode Location
Electronics 12B Delhi
Electronics 12D Gurgoun
Electronics 13A Hydrabad
Grocery 20A Punjab
Grocery 31B Jhansi

Let us try explode then merge
Out = df1.merge(df2.explode('DeptCode'), on='Deptcode', how='left')

Related

pandas: how to create new columns based on two columns and aggregate the results

I am trying to perform a sort of aggregation, but with the creation of new columns.
Let's take the example of the dataframe below:
df = pd.DataFrame({'City':['Los Angeles', 'Denver','Denver','Los Angeles'],
'Car Maker': ['Ford','Toyota','Ford','Toyota'],
'Qty': [50000,100000,80000,70000]})
That generates this:
City
Car Maker
Qty
0
Los Angeles
Ford
50000
1
Denver
Toyota
100000
2
Denver
Ford
80000
3
Los Angeles
Toyota
70000
I would like to have one line per city and the Car Maker as a new column with the Qty related to that City:
City
Car Maker
Ford
Toyota
0
Los Angeles
Ford
50000
70000
1
Denver
Toyota
80000
100000
Any hints on how to achieve that?
I've tried some options with transforming it on a dictionary and compressing on a function, but I am looking for a more pandas' like solution.

df.pivot(index='City', columns='Car Maker', values='Qty').reset_index()

Try dataframe.pivot_table()
df.pivot_table(values='Qty', index=['City', 'Car Maker'], columns='Car Maker').reset_index()

Get name(s) from JSON format column, that not in 2 other columns with names

I need to create column with name(s) (Supervisors - can be multiple supervisors at the same time, but also there might not be supervisor at all) from JSON format column, that not in 2 other column with names (Employee and Client).
Id
Employee
Client
AllParticipants
1
Justin Bieber
Ariana Grande
[{"ParticipantName":"Justin Bieber"},{"ParticipantName":"Ariana Grande"}]
2
Lionel Messi
Christiano Ronaldo
[{"ParticipantName":"Christiano Ronaldo"},{"ParticipantName":"Lionel Messi"}]
3
Nicolas Cage
Robert De Niro
[{"ParticipantName":"Robert De Niro"},{"ParticipantName":"Nicolas Cage"},{"ParticipantName":"Brad Pitt"}]
4
Harry Potter
Ron Weasley
[{"ParticipantName":"Ron Weasley"},{"ParticipantName":"Albus Dumbldor"},{"ParticipantName":"Harry Potter"},{"ParticipantName":"Lord Voldemort"}]
5
Tom Holland
Henry Cavill
[{"ParticipantName":"Henry Cavill"},{"ParticipantName":"Tom Holland"}]
6
Spider Man
Venom
[{"ParticipantName":"Venom"},{"ParticipantName":"Iron Man"},{"ParticipantName":"Superman"},{"ParticipantName":"Spider Man"}]
7
Andrew Garfield
Leonardo DiCaprio
[{"ParticipantName":"Tom Cruise"},{"ParticipantName":"Andrew Garfield"},{"ParticipantName":"Leonardo DiCaprio"}]
8
Dwayne Johnson
Jennifer Lawrence
[{"ParticipantName":"Jennifer Lawrence"},{"ParticipantName":"Dwayne Johnson"}]
The output column I need:
Supervisors
NULL
NULL
Brad Pitt
Albus Dumbldor, Lord Voldemort
NULL
Iron Man, Superman
Tom Cruise
NULL
I've tried to create extra columns to use Case expression after that, but it seems too complex.
SELECT *,
JSON_VALUE(w.AllParticipants,'$[0].ParticipantName') AS ParticipantName1,
JSON_VALUE(w.AllParticipants,'$[1].ParticipantName') AS ParticipantName2,
JSON_VALUE(w.AllParticipants,'$[2].ParticipantName') AS ParticipantName3,
JSON_VALUE(w.AllParticipants,'$[3].ParticipantName') AS ParticipantName4
FROM Work AS w
I'm wondering if there is an easy way to compare values and extract only unique ones.

Presenting Data uniformly between two different table presentations with SQL

Hello Everyone I have a problem…
Table 1 (sorted) is laid out like this:
User ID Producer ID Company Number
JWROSE 23401 234
KXPEAR 23903 239
LMWEEM 27902 279
KJMORS 18301 183
Table 2 (unsorted) looks like this:
Client Name City Company Number
Rajat Smith London JWROSE
Robert Singh Cleveland KXPEAR
Alberto Johnson New York City LMWEEM
Betty Lee Dallas KJMORS
Chase Galvez Houston 23401
Hassan Jackson Seattle 23903
Tooti Fruity Boise 27902
Joe Trump Tokyo 18301
Donald Biden Cairo 234
Mike Harris Rome 239
Kamala Pence Moscow 279
Adolf Washington Bangkok 183
Now… Table 1 has all of the User IDs and Producer IDs properly rowed with the Company Number.
I want to pull all the data and correctly sorted….
Client Name City User ID Producer ID Company Number
Rajat Smith London JWROSE 23401 234
Robert Singh Cleveland KXPEAR 23903 239
Alberto Johnson New York City LMWEEM 27902 279
Betty Lee Dallas KJMORS 18301 183
Chase Galvez Houston JWROSE 23401 234
Hassan Jackson Seattle KXPEAR 23903 239
Tooti Fruity Boise LMWEEM 27902 279
Joe Trump Tokyo KJMORS 18301 183
Donald Biden Cairo JWROSE 23401 234
Mike Harris Rome KXPEAR 23903 239
Kamala Pence Moscow LMWEEM 27902 279
Adolf Washington Bangkok KJMORS 18301 183
Query:
Select
b.client_name,
b.city.,
a.user_id,
a.producer_id,
a.company_number
From Table 1 A
Left Join Table 2 B On a.company….
And this is where I don’t know what do to….because both tables have all the same variables, but Company Number in Table 2 is mixed with User IDs and Producer IDs... however we know what company Number those ID's are associated to.

As I mention in the comments, and others do, the real problem is your design. "The fact that UserID is clearly a varchar, while the other 2 columns are an int really does not make this any better", and makes this not simple (and certainly not SARGable).
To get the data in the correct order, as well, you need a column to order it on which the data lacks. I have therefore added a pseudo column, MissingIDColumn, to represent this missing column you need to add to your data; which you can do when you fix the design:
SELECT T2.ClientName,
T2.City,
T1.UserID,
T1.ProducerID,
T1.CompanyNumber
FROM (VALUES('JWROSE',23401,234),
('KXPEAR',23903,239),
('LMWEEM',27902,279),
('KJMORS',18301,183))T1(UserID,ProducerID,CompanyNumber)
JOIN (VALUES(1,'Rajat Smith ','London ','JWROSE'),
(2,'Robert Singh ','Cleveland ','KXPEAR'),
(3,'Alberto Johnson ','New York City','LMWEEM'),
(4,'Betty Lee ','Dallas ','KJMORS'),
(5,'Chase Galvez ','Houston ','23401'),
(6,'Hassan Jackson ','Seattle ','23903'),
(7,'Tooti Fruity ','Boise ','27902'),
(8,'Joe Trump ','Tokyo ','18301'),
(9,'Donald Biden ','Cairo ','234'),
(10,'Mike Harris ','Rome ','239'),
(11,'Kamala Pence ','Moscow ','279'),
(12,'Adolf Washington','Bangkok ','183'))T2(MissingIDColumn,ClientName,City,CompanyNumber) ON T2.CompanyNumber IN (T1.UserID,CONVERT(varchar(6),T1.ProducerID),CONVERT(varchar(6),T1.CompanyNumber))
ORDER BY MissingIDColumn;

Pandas difficult to add new column with condition?

I was trying to do multiple group and also adding count to new column.
My input file
OrderDate Region Rep Item Units Unit Cost Total
----------------------------------------------------------
1/6/18 East Jones Pencil 95 1.99 189.05
1/23/18 Central Kivell Binder 50 19.99 999.50
2/9/18 Central Jardine Pencil 36 4.99 179.64
2/26/18 Central Gill Pen 27 19.99 539.73
3/15/18 West Sorvino Pencil 56 2.99 167.44
4/1/18 East Jones Binder 60 4.99 299.40
4/18/18 Central Andrews Pencil 75 1.99 149.25
4/18/18 West Jones Pencil 75 1.99 149.25
I am trying to do like
Region Rep Count same/diff
-------------------------------
east jones 2 2-same
jones
central Kivell 4 >3 differnce
Jardine
Gill
Andrews
West Sorvino 2 2-different
West jones1
My code:
df1 = pd.read_excel(excel_path, sheet_name = 'SalesOrders', index_col=0)
df3 = (df1.groupby('Region')['Rep'].value_counts())
print(df3)
Please help me to do this. Thanks
In rep column, based on Region i have done group by to know Rep values. if Rep member are same then 2 same people, consider central region has 4 different people working so it i greater than 3 .

shop that has served more than 3 people query sql

I have a table called frequents that has two columns name and pizzeria. Each name is linked to a pizzeria. Some names are mentioned more than once since they are linked to different pizzerias. I need help writing a query that shows all the pizzerias that have served more than 3 people. Thank you.
Name Pizzareia
Amy Pizza Hut
Ben Pizza Hut
Ben Chicago Pizza
Cal Straw Hat
Cal New York Pizza
Dan Straw Hat
Dan New York Pizza
Eli Straw Hat
Eli Chicago Pizza
Fay Dominos
Fay Little Caesars
Gus Chicago Pizza
Gus Pizza Hut
Hil Dominos
Hil Straw Hat
Hil Pizza Hut
Ian New York Pizza
Ian Straw Hat
Ian Dominos
And the query:
SELECT name, count(pizzeria)
FROM frequency
GROUP BY name
HAVING COUNT(pizzeria) >= 3
The result is supposed to show the pizzeria where its name has come up more than 3 times

I need help writing a query that shows all the pizzerias that have served more than 3 people
You need to GROUP BY pizzeria, not by name:
SELECT pizzeria FROM frequency GROUP BY pizzeria HAVING COUNT(*) >= 3

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to use dictonary or other method effecintly to clean data - pandas

Let us try explode then merge Out = df1.merge(df2.explode('DeptCode'), on='Deptcode', how='left')

Related

pandas: how to create new columns based on two columns and aggregate the results

Get name(s) from JSON format column, that not in 2 other columns with names

Presenting Data uniformly between two different table presentations with SQL

Pandas difficult to add new column with condition?

shop that has served more than 3 people query sql

Categories

Resources