Joining dataframes on one string column

I have two dataframes. I want the main one (vehicle_data_all) to get a new column with the vehicle type, which I have in another dataframe:
vehicle_data_all
vehicle type
Both dataframes have the column "Manufacturer", and I would like to do this with a join.
Manufacturer  model  year    price  transmission  mileage  fuelType  tax
VW            Eos    2020.0  5990   Manual        74000    Diesel    125.0
VW            Fox    2020.0  1799   Manual        88102    Petrol    145.0
VW            Fox    2020.0  1590   Manual        70000    Petrol    200.0
VW            Fox    2020.0  1250   Manual        82704    Petrol    150.0
VW            Fox    2017.0  2295   Manual        74000    Petrol    145.0
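One way to do this is a left merge on "Manufacturer". The following is a minimal sketch, not a definitive answer: the question does not show the second dataframe, so its variable name (vehicle_type), its column name ("vehicle_type"), and its values are assumptions made for illustration.

```python
import pandas as pd

# A few rows of the main dataframe from the question.
vehicle_data_all = pd.DataFrame({
    "Manufacturer": ["VW", "VW", "VW"],
    "model": ["Eos", "Fox", "Fox"],
    "price": [5990, 1799, 1590],
})

# Hypothetical lookup table: its name, the "vehicle_type" column,
# and the values are assumptions, since the question does not show it.
vehicle_type = pd.DataFrame({
    "Manufacturer": ["VW", "Ducati"],
    "vehicle_type": ["car", "motorcycle"],
})

# A left join keeps every row of the main dataframe. Manufacturers that are
# missing from the lookup table would get NaN in the new column.
# validate="m:1" raises an error if a manufacturer appears more than once
# in the lookup table, which would otherwise duplicate rows.
merged = vehicle_data_all.merge(
    vehicle_type, on="Manufacturer", how="left", validate="m:1"
)
print(merged)
```

If the manufacturer strings differ in case or surrounding whitespace between the two dataframes, they would need to be normalized first (for example with .str.strip()), because the merge matches on exact string equality.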

Related

Is there an easy way to use pandas to read data from the clipboard when string columns contain varying whitespace?

Suppose I have the following data
   StadiumType    Turf                   Temperature  Humidity  GameId
0  Outdoor        Field Turf             63.0         77.0      2017090700
1  Outdoors       A-Turf Titan           65.0         53.0      2017091000
2  Green Outdoor  Grass                  64.0         57.0      2017091001
3  Red Outdoor    UBU Sports Speed S5-M  68.0         43.0      2017091002
4  Outdoor        Grass                  63.0         53.0      2017091003
I found that pd.read_clipboard() doesn't work well because of the whitespace inside the string columns. I ended up with an awkward split and join inside a with context manager. Is there an easier way than this?
Thanks very much.
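One possible approach is to split on runs of two or more whitespace characters, so single spaces inside a value stay intact. This is a sketch under a clear assumption: the columns in the copied text are separated by at least two spaces and no value contains two consecutive spaces. To keep the example runnable without a clipboard, it parses a string with pd.read_csv, and the index column is left out for simplicity.

```python
import io
import pandas as pd

# Stand-in for the clipboard contents. Assumption: columns are separated
# by two or more spaces, values contain only single spaces.
text = """\
StadiumType    Turf                   Temperature  Humidity  GameId
Outdoor        Field Turf             63.0         77.0      2017090700
Outdoors       A-Turf Titan           65.0         53.0      2017091000
Green Outdoor  Grass                  64.0         57.0      2017091001
Red Outdoor    UBU Sports Speed S5-M  68.0         43.0      2017091002
Outdoor        Grass                  63.0         53.0      2017091003
"""

# A regular-expression separator requires the Python parsing engine.
df = pd.read_csv(io.StringIO(text), sep=r"\s{2,}", engine="python")
print(df)
```

Since pd.read_clipboard passes its keyword arguments on to the CSV parser, the same separator should also work as pd.read_clipboard(sep=r"\s{2,}", engine="python"), provided the copied text really has the wider gaps between columns.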

Filtering a pandas dataframe by aggregating on two columns

I have a pandas dataframe. Here are the first five rows:
   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  2.55       17850.0     United Kingdom
1  536365     71053      WHITE METAL LANTERN                  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  2.75       17850.0     United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
I would like to group by StockCode and CustomerID, and sum Quantity. Then, I'd like to throw out all of the StockCode/CustomerID pairs where this sum is negative. The desired final product is the original dataframe with the rows corresponding to these StockCode/CustomerID pairs removed.
I have a working solution:
retail_df.groupby(['CustomerID','StockCode']).filter(lambda x: x['Quantity'].sum() >= 0)
However, it takes my laptop four minutes to run it. There are 406829 rows. Is there a faster way?
This should do the trick:
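# True for each (CustomerID, StockCode) pair whose total Quantity is non-negative
df2 = retail_df.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().ge(0)
# keep only the rows belonging to those pairs, then turn the keys back into columns
retail_df = retail_df.set_index(['CustomerID', 'StockCode']).loc[df2.loc[df2].index].reset_index(drop=False)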
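An alternative that keeps the original row order and column layout is a boolean mask built with transform. This is a sketch on a small made-up frame (the values are invented for illustration, only the column names come from the question). It avoids calling a Python lambda once per group, which is usually where groupby(...).filter spends its time, though the actual speedup on the full data is not measured here.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up.
retail_df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "StockCode": ["A", "A", "B", "A", "A"],
    "Quantity": [5, -8, 3, 4, -1],
})

# transform("sum") returns one value per original row: the total Quantity
# of that row's (CustomerID, StockCode) pair.
totals = retail_df.groupby(["CustomerID", "StockCode"])["Quantity"].transform("sum")

# Keep only rows whose pair has a non-negative total.
filtered = retail_df[totals >= 0]
print(filtered)
```

Here the pair (1, "A") sums to -3 and is dropped, while (1, "B") and (2, "A") are kept.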

Pandas: how to add a new column based on a condition?

I am trying to group by multiple columns and add a count as a new column.
My input file:
OrderDate  Region   Rep      Item    Units  Unit Cost  Total
------------------------------------------------------------
1/6/18     East     Jones    Pencil  95     1.99       189.05
1/23/18    Central  Kivell   Binder  50     19.99      999.50
2/9/18     Central  Jardine  Pencil  36     4.99       179.64
2/26/18    Central  Gill     Pen     27     19.99      539.73
3/15/18    West     Sorvino  Pencil  56     2.99       167.44
4/1/18     East     Jones    Binder  60     4.99       299.40
4/18/18    Central  Andrews  Pencil  75     1.99       149.25
4/18/18    West     Jones    Pencil  75     1.99       149.25
I am trying to get output like this:
Region   Rep      Count  same/diff
-----------------------------------
east     jones    2      2-same
         jones
central  Kivell   4      >3 different
         Jardine
         Gill
         Andrews
West     Sorvino  2      2-different
West     jones1
My code:
import pandas as pd
df1 = pd.read_excel(excel_path, sheet_name='SalesOrders', index_col=0)
# count how often each Rep appears within each Region
df3 = df1.groupby('Region')['Rep'].value_counts()
print(df3)
Please help me to do this. Thanks.
In the Rep column, I grouped by Region to see the Rep values. If the Rep members are the same person, that counts as 2 of the same (East). The Central region has 4 different people working, so it is greater than 3.
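One possible reading of the desired output is: per Region, the number of rows, the number of distinct reps, and a text label derived from both. The sketch below follows that reading. The labelling rules in label() and the column name "Distinct" are assumptions inferred from the example output, not something the question states precisely.

```python
import pandas as pd

# Region and Rep columns from the input table in the question.
df = pd.DataFrame({
    "Region": ["East", "Central", "Central", "Central",
               "West", "East", "Central", "West"],
    "Rep": ["Jones", "Kivell", "Jardine", "Gill",
            "Sorvino", "Jones", "Andrews", "Jones"],
})

# Named aggregation: rows per region and distinct reps per region.
summary = (
    df.groupby("Region")["Rep"]
      .agg(Count="count", Distinct="nunique")
      .reset_index()
)

def label(row):
    """Assumed labelling rules, inferred from the example output."""
    if row["Distinct"] == 1:
        return f"{row['Count']}-same"
    if row["Distinct"] > 3:
        return ">3 different"
    return f"{row['Distinct']}-different"

summary["same/diff"] = summary.apply(label, axis=1)
print(summary)
```

With the data above this labels Central as ">3 different", East as "2-same", and West as "2-different". If rep names differ only in capitalization, normalizing them first (for example with .str.title()) would be needed for nunique to treat them as the same person.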

Mathematical calculation in SQL (Oracle 11g)

I have a table like this:
COST   RETAIL  DISCOUNT  CATEGORY
18.75  30.95             FITNESS
14.2   22                FAMILY LIFE
37.8   59.95   3         CHILDREN
31.4   55.95             COMPUTER
12.5   19.95             COOKING
47.25  75.95   3.8       COMPUTER
21.8   25                COMPUTER
37.9   54.5              COMPUTER
48     89.95   4.5       FAMILY LIFE
19     28.75             COOKING
5.32   8.95              CHILDREN
17.85  29.95   1.5       SELF HELP
15.4   31.95             BUSINESS
21.85  39.95             LITERATURE
And my problem is:
Display the book Category and the average retail price after discount in all categories where average discounted price is less than the highest average retail price of books for all the categories.
I tried the following query but can't get the solution, and I think I did not understand the question properly.
select category, retail-discount as A, avg(A) as B, avg(retail) as C from books
where B<max(C)
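The attempt above cannot work as written: a column alias such as A or B cannot be reused in the same SELECT list or in WHERE, and aggregates such as max() are not allowed in WHERE at all. Per-category averages need GROUP BY, and a condition on an aggregate belongs in HAVING. The sketch below shows one interpretation of the task. It runs the query through Python's built-in sqlite3 module rather than Oracle 11g, purely so the example is self-contained. Treating a missing discount as 0 (via COALESCE) and reading "highest average retail price" as the maximum of the per-category average retail prices are both assumptions.

```python
import sqlite3

# Rows from the table in the question; None stands for a missing discount.
rows = [
    (18.75, 30.95, None, "FITNESS"),
    (14.2, 22, None, "FAMILY LIFE"),
    (37.8, 59.95, 3, "CHILDREN"),
    (31.4, 55.95, None, "COMPUTER"),
    (12.5, 19.95, None, "COOKING"),
    (47.25, 75.95, 3.8, "COMPUTER"),
    (21.8, 25, None, "COMPUTER"),
    (37.9, 54.5, None, "COMPUTER"),
    (48, 89.95, 4.5, "FAMILY LIFE"),
    (19, 28.75, None, "COOKING"),
    (5.32, 8.95, None, "CHILDREN"),
    (17.85, 29.95, 1.5, "SELF HELP"),
    (15.4, 31.95, None, "BUSINESS"),
    (21.85, 39.95, None, "LITERATURE"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (cost REAL, retail REAL, discount REAL, category TEXT)"
)
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", rows)

# GROUP BY produces one row per category; HAVING filters on the aggregate.
# The subquery computes the highest per-category average retail price.
query = """
SELECT category,
       AVG(retail - COALESCE(discount, 0)) AS avg_discounted
FROM books
GROUP BY category
HAVING AVG(retail - COALESCE(discount, 0)) <
       (SELECT MAX(avg_retail)
        FROM (SELECT AVG(retail) AS avg_retail
              FROM books
              GROUP BY category))
ORDER BY category
"""
result = conn.execute(query).fetchall()
for category, avg_discounted in result:
    print(category, round(avg_discounted, 3))
```

With this data the highest per-category average retail price belongs to FAMILY LIFE, and every category's average discounted price falls below it, so all eight categories are returned.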

What is the effect of this order_by clause?

I don't understand what this order_by clause is doing and whether I need it or not:
select c.customerid, c.firstname, c.lastname, i.order_date, i.item, i.price from
items_ordered i, customers c
where i.customerid = c.customerid
group by c.customerid, i.item, i.order_date
order by i.order_date desc;
This produces this data:
10330 Shawn Dalton 30-Jun-1999 Pogo stick 28.00
10101 John Gray 30-Jun-1999 Raft 58.00
10410 Mary Ann Howell 30-Jan-2000 Unicycle 192.50
10101 John Gray 30-Dec-1999 Hoola Hoop 14.75
10449 Isabela Moore 29-Feb-2000 Flashlight 4.50
10410 Mary Ann Howell 28-Oct-1999 Sleeping Bag 89.22
10339 Anthony Sanchez 27-Jul-1999 Umbrella 4.50
10449 Isabela Moore 22-Dec-1999 Canoe 280.00
10298 Leroy Brown 19-Sep-1999 Lantern 29.00
10449 Isabela Moore 19-Mar-2000 Canoe paddle 40.00
10413 Donald Davids 19-Jan-2000 Lawnchair 32.00
10330 Shawn Dalton 19-Apr-2000 Shovel 16.75
10439 Conrad Giles 18-Sep-1999 Tent 88.00
10298 Leroy Brown 18-Mar-2000 Pocket Knife 22.38
10299 Elroy Keller 18-Jan-2000 Inflatable Mattress 38.00
10438 Kevin Smith 18-Jan-2000 Tent 79.99
10101 John Gray 18-Aug-1999 Rain Coat 18.30
10449 Isabela Moore 15-Dec-1999 Bicycle 380.50
10439 Conrad Giles 14-Aug-1999 Ski Poles 25.50
10449 Isabela Moore 13-Aug-1999 Unicycle 180.79
10101 John Gray 08-Mar-2000 Sleeping Bag 88.70
10299 Elroy Keller 06-Jul-1999 Parachute 1250.00
10438 Kevin Smith 02-Nov-1999 Pillow 8.50
10101 John Gray 02-Jan-2000 Lantern 16.00
10315 Lisa Jones 02-Feb-2000 Compass 8.00
10449 Isabela Moore 01-Sep-1999 Snow Shoes 45.00
10438 Kevin Smith 01-Nov-1999 Umbrella 6.75
10298 Leroy Brown 01-Jul-1999 Skateboard 33.00
10101 John Gray 01-Jul-1999 Life Vest 125.00
10330 Shawn Dalton 01-Jan-2000 Flashlight 28.00
10298 Leroy Brown 01-Dec-1999 Helmet 22.00
10298 Leroy Brown 01-Apr-2000 Ear Muffs 12.50
While if I remove the order_by clause completely, as in this query:
select c.customerid, c.firstname, c.lastname, i.order_date, i.item, i.price from
items_ordered i, customers c
where i.customerid = c.customerid
group by c.customerid, i.item, i.order_date;
I get these results:
10101 John Gray 30-Dec-1999 Hoola Hoop 14.75
10101 John Gray 02-Jan-2000 Lantern 16.00
10101 John Gray 01-Jul-1999 Life Vest 125.00
10101 John Gray 30-Jun-1999 Raft 58.00
10101 John Gray 18-Aug-1999 Rain Coat 18.30
10101 John Gray 08-Mar-2000 Sleeping Bag 88.70
10298 Leroy Brown 01-Apr-2000 Ear Muffs 12.50
10298 Leroy Brown 01-Dec-1999 Helmet 22.00
10298 Leroy Brown 19-Sep-1999 Lantern 29.00
10298 Leroy Brown 18-Mar-2000 Pocket Knife 22.38
10298 Leroy Brown 01-Jul-1999 Skateboard 33.00
10299 Elroy Keller 18-Jan-2000 Inflatable Mattress 38.00
10299 Elroy Keller 06-Jul-1999 Parachute 1250.00
10315 Lisa Jones 02-Feb-2000 Compass 8.00
10330 Shawn Dalton 01-Jan-2000 Flashlight 28.00
10330 Shawn Dalton 30-Jun-1999 Pogo stick 28.00
10330 Shawn Dalton 19-Apr-2000 Shovel 16.75
10339 Anthony Sanchez 27-Jul-1999 Umbrella 4.50
10410 Mary Ann Howell 28-Oct-1999 Sleeping Bag 89.22
10410 Mary Ann Howell 30-Jan-2000 Unicycle 192.50
10413 Donald Davids 19-Jan-2000 Lawnchair 32.00
10438 Kevin Smith 02-Nov-1999 Pillow 8.50
10438 Kevin Smith 18-Jan-2000 Tent 79.99
10438 Kevin Smith 01-Nov-1999 Umbrella 6.75
10439 Conrad Giles 14-Aug-1999 Ski Poles 25.50
10439 Conrad Giles 18-Sep-1999 Tent 88.00
10449 Isabela Moore 15-Dec-1999 Bicycle 380.50
10449 Isabela Moore 22-Dec-1999 Canoe 280.00
10449 Isabela Moore 19-Mar-2000 Canoe paddle 40.00
10449 Isabela Moore 29-Feb-2000 Flashlight 4.50
10449 Isabela Moore 01-Sep-1999 Snow Shoes 45.00
10449 Isabela Moore 13-Aug-1999 Unicycle 180.79
I'm not sure what the order_by is doing here and if it's having the intended effects.
It looks like it is ordering on i.order_date, but using string comparison rather than date comparison, which is why 30-Jun-1999 is placed before 29-Feb-2000. As strings, "30-Jun-1999" > "29-Feb-2000", but as dates the reverse is true.
Check the type of i.order_date in the items_ordered table. It should be a date or datetime type. If it's varchar, you will need to either change it to a date type or cast the value to a date in the ORDER BY clause, e.g.
order by CAST(i.order_date AS DATE) desc
You should always use a proper DATETIME data type to store dates.
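The difference between the two orderings can be reproduced outside the database. The short sketch below sorts a few of the dates from the question in descending order, once as plain strings and once as parsed dates. It is only an illustration in Python of the comparison behaviour, not of how any particular database engine stores or sorts the column. The format string "%d-%b-%Y" assumes English month abbreviations.

```python
from datetime import datetime

# A few order dates taken from the query output in the question.
dates = ["30-Jun-1999", "29-Feb-2000", "30-Jan-2000", "30-Dec-1999"]

# Descending sort as plain strings: compares character by character,
# so the day-of-month digits dominate.
as_strings = sorted(dates, reverse=True)

# Descending sort as real dates: parse each string first.
as_dates = sorted(
    dates,
    key=lambda s: datetime.strptime(s, "%d-%b-%Y"),
    reverse=True,
)

print(as_strings)  # ['30-Jun-1999', '30-Jan-2000', '30-Dec-1999', '29-Feb-2000']
print(as_dates)    # ['29-Feb-2000', '30-Jan-2000', '30-Dec-1999', '30-Jun-1999']
```

The string ordering matches the first rows of the query output in the question, which supports the diagnosis that the column is being compared as text.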