Suppose I have the following data:
StadiumType Turf Temperature Humidity GameId
0 Outdoor Field Turf 63.0 77.0 2017090700
1 Outdoors A-Turf Titan 65.0 53.0 2017091000
2 Green Outdoor Grass 64.0 57.0 2017091001
3 Red Outdoor UBU Sports Speed S5-M 68.0 43.0 2017091002
4 Outdoor Grass 63.0 53.0 2017091003
I found that pd.read_clipboard() doesn't work well because of the whitespace: some of the values contain spaces themselves. I ended up with an awkward split and join using a context manager. Is there an easier way than this?
Thanks very much.
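One approach that may be easier, assuming the copied table separates its columns by at least two spaces: pd.read_clipboard() forwards its keyword arguments to pd.read_csv(), so a regex separator can keep multi-word values such as "Field Turf" in one column. A minimal sketch:

import pandas as pd

# split only on runs of two or more spaces, so single spaces inside values
# like "A-Turf Titan" don't break the columns (a regex separator requires
# the python parsing engine)
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
print(df.head())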
I have a pandas dataframe. Here are the first five rows:
   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  2.55       17850.0     United Kingdom
1  536365     71053      WHITE METAL LANTERN                  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  2.75       17850.0     United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
I would like to group by StockCode and CustomerID, and sum Quantity. Then, I'd like to throw out all of the StockCode/CustomerID pairs where this sum is negative. The desired final product is the original dataframe with the rows corresponding to these StockCode/CustomerID pairs removed.
I have a working solution:
retail_df.groupby(['CustomerID','StockCode']).filter(lambda x: x['Quantity'].sum() >= 0)
However, it takes four minutes to run on my laptop, and there are 406,829 rows. Is there a faster way?
This should do the trick:
# boolean Series indexed by (CustomerID, StockCode): True where the group's Quantity sum is >= 0
df2 = retail_df.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().ge(0)
# keep only the rows whose (CustomerID, StockCode) pair passed the check
retail_df = retail_df.set_index(['CustomerID', 'StockCode']).loc[df2.loc[df2].index].reset_index(drop=False)
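An alternative that is often faster than groupby().filter() is to avoid the per-group Python lambda entirely: transform('sum') broadcasts each group's sum back to every row, so a single boolean mask does the filtering. A sketch, untested on the asker's data:

# per-row sum of Quantity for the row's (CustomerID, StockCode) pair
group_sums = retail_df.groupby(['CustomerID', 'StockCode'])['Quantity'].transform('sum')
# keep only rows whose pair has a non-negative total
retail_df = retail_df[group_sums >= 0]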
I am trying to group by multiple columns and also add a count as a new column.
My input file
OrderDate  Region   Rep      Item    Units  Unit Cost  Total
-------------------------------------------------------------
1/6/18     East     Jones    Pencil  95     1.99       189.05
1/23/18    Central  Kivell   Binder  50     19.99      999.50
2/9/18     Central  Jardine  Pencil  36     4.99       179.64
2/26/18    Central  Gill     Pen     27     19.99      539.73
3/15/18    West     Sorvino  Pencil  56     2.99       167.44
4/1/18     East     Jones    Binder  60     4.99       299.40
4/18/18    Central  Andrews  Pencil  75     1.99       149.25
4/18/18    West     Jones    Pencil  75     1.99       149.25
I am trying to produce output like this:

Region   Rep      Count  same/diff
-----------------------------------
East     Jones    2      2-same
         Jones
Central  Kivell   4      >3 different
         Jardine
         Gill
         Andrews
West     Sorvino  2      2-different
         Jones
My code:
import pandas as pd

df1 = pd.read_excel(excel_path, sheet_name='SalesOrders', index_col=0)
df3 = df1.groupby('Region')['Rep'].value_counts()
print(df3)
Please help me do this. Thanks.
To clarify: I grouped the Rep column by Region to see the Rep values in each region. If all the reps in a region are the same person, the label should be like "2-same"; the Central region has 4 different people working, so it is labelled as greater than 3.
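A hedged sketch of one way to build that summary; the label() helper and the Count/Unique column names are my own illustration, not from the question:

import pandas as pd

df1 = pd.read_excel(excel_path, sheet_name='SalesOrders', index_col=0)

# per region: number of rows (Count) and number of distinct reps (Unique)
summary = df1.groupby('Region')['Rep'].agg(Count='size', Unique='nunique')

def label(row):
    if row['Unique'] == 1:   # every row in the region belongs to the same rep
        return f"{row['Count']}-same"
    if row['Unique'] > 3:    # more than three distinct reps
        return '>3 different'
    return f"{row['Count']}-different"

summary['same/diff'] = summary.apply(label, axis=1)
print(summary)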
I have a table like:
COST   RETAIL  DISCOUNT  CATEGORY
18.75  30.95             FITNESS
14.2   22                FAMILY LIFE
37.8   59.95   3         CHILDREN
31.4   55.95             COMPUTER
12.5   19.95             COOKING
47.25  75.95   3.8       COMPUTER
21.8   25                COMPUTER
37.9   54.5              COMPUTER
48     89.95   4.5       FAMILY LIFE
19     28.75             COOKING
5.32   8.95              CHILDREN
17.85  29.95   1.5       SELF HELP
15.4   31.95             BUSINESS
21.85  39.95             LITERATURE
And my problem is:
Display the book Category and the average retail price after discount in all categories where average discounted price is less than the highest average retail price of books for all the categories.
I tried the following query, but I can't get the solution; I think I did not understand the question properly.
select category, retail-discount as A, avg(A) as B, avg(retail) as C from books
where B<max(C)
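Two reasons that query cannot run: a column alias such as A or B cannot be referenced in the same SELECT list or in WHERE, and aggregates like AVG and MAX cannot appear in WHERE at all; they belong in HAVING after a GROUP BY. A hedged sketch of one reading of the assignment, assuming a blank DISCOUNT means no discount:

SELECT category,
       AVG(retail - COALESCE(discount, 0)) AS avg_discounted_price
FROM books
GROUP BY category
HAVING AVG(retail - COALESCE(discount, 0)) <
       (SELECT MAX(avg_retail)                    -- highest per-category average retail
          FROM (SELECT AVG(retail) AS avg_retail
                  FROM books
                 GROUP BY category) category_averages);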
I don't understand what this ORDER BY clause is doing and whether I need it or not:
select c.customerid, c.firstname, c.lastname, i.order_date, i.item, i.price from
items_ordered i, customers c
where i.customerid = c.customerid
group by c.customerid, i.item, i.order_date
order by i.order_date desc;
This produces this data:
10330 Shawn Dalton 30-Jun-1999 Pogo stick 28.00
10101 John Gray 30-Jun-1999 Raft 58.00
10410 Mary Ann Howell 30-Jan-2000 Unicycle 192.50
10101 John Gray 30-Dec-1999 Hoola Hoop 14.75
10449 Isabela Moore 29-Feb-2000 Flashlight 4.50
10410 Mary Ann Howell 28-Oct-1999 Sleeping Bag 89.22
10339 Anthony Sanchez 27-Jul-1999 Umbrella 4.50
10449 Isabela Moore 22-Dec-1999 Canoe 280.00
10298 Leroy Brown 19-Sep-1999 Lantern 29.00
10449 Isabela Moore 19-Mar-2000 Canoe paddle 40.00
10413 Donald Davids 19-Jan-2000 Lawnchair 32.00
10330 Shawn Dalton 19-Apr-2000 Shovel 16.75
10439 Conrad Giles 18-Sep-1999 Tent 88.00
10298 Leroy Brown 18-Mar-2000 Pocket Knife 22.38
10299 Elroy Keller 18-Jan-2000 Inflatable Mattress 38.00
10438 Kevin Smith 18-Jan-2000 Tent 79.99
10101 John Gray 18-Aug-1999 Rain Coat 18.30
10449 Isabela Moore 15-Dec-1999 Bicycle 380.50
10439 Conrad Giles 14-Aug-1999 Ski Poles 25.50
10449 Isabela Moore 13-Aug-1999 Unicycle 180.79
10101 John Gray 08-Mar-2000 Sleeping Bag 88.70
10299 Elroy Keller 06-Jul-1999 Parachute 1250.00
10438 Kevin Smith 02-Nov-1999 Pillow 8.50
10101 John Gray 02-Jan-2000 Lantern 16.00
10315 Lisa Jones 02-Feb-2000 Compass 8.00
10449 Isabela Moore 01-Sep-1999 Snow Shoes 45.00
10438 Kevin Smith 01-Nov-1999 Umbrella 6.75
10298 Leroy Brown 01-Jul-1999 Skateboard 33.00
10101 John Gray 01-Jul-1999 Life Vest 125.00
10330 Shawn Dalton 01-Jan-2000 Flashlight 28.00
10298 Leroy Brown 01-Dec-1999 Helmet 22.00
10298 Leroy Brown 01-Apr-2000 Ear Muffs 12.50
Whereas if I remove the ORDER BY clause completely, as in this query:
select c.customerid, c.firstname, c.lastname, i.order_date, i.item, i.price from
items_ordered i, customers c
where i.customerid = c.customerid
group by c.customerid, i.item, i.order_date;
I get these results:
10101 John Gray 30-Dec-1999 Hoola Hoop 14.75
10101 John Gray 02-Jan-2000 Lantern 16.00
10101 John Gray 01-Jul-1999 Life Vest 125.00
10101 John Gray 30-Jun-1999 Raft 58.00
10101 John Gray 18-Aug-1999 Rain Coat 18.30
10101 John Gray 08-Mar-2000 Sleeping Bag 88.70
10298 Leroy Brown 01-Apr-2000 Ear Muffs 12.50
10298 Leroy Brown 01-Dec-1999 Helmet 22.00
10298 Leroy Brown 19-Sep-1999 Lantern 29.00
10298 Leroy Brown 18-Mar-2000 Pocket Knife 22.38
10298 Leroy Brown 01-Jul-1999 Skateboard 33.00
10299 Elroy Keller 18-Jan-2000 Inflatable Mattress 38.00
10299 Elroy Keller 06-Jul-1999 Parachute 1250.00
10315 Lisa Jones 02-Feb-2000 Compass 8.00
10330 Shawn Dalton 01-Jan-2000 Flashlight 28.00
10330 Shawn Dalton 30-Jun-1999 Pogo stick 28.00
10330 Shawn Dalton 19-Apr-2000 Shovel 16.75
10339 Anthony Sanchez 27-Jul-1999 Umbrella 4.50
10410 Mary Ann Howell 28-Oct-1999 Sleeping Bag 89.22
10410 Mary Ann Howell 30-Jan-2000 Unicycle 192.50
10413 Donald Davids 19-Jan-2000 Lawnchair 32.00
10438 Kevin Smith 02-Nov-1999 Pillow 8.50
10438 Kevin Smith 18-Jan-2000 Tent 79.99
10438 Kevin Smith 01-Nov-1999 Umbrella 6.75
10439 Conrad Giles 14-Aug-1999 Ski Poles 25.50
10439 Conrad Giles 18-Sep-1999 Tent 88.00
10449 Isabela Moore 15-Dec-1999 Bicycle 380.50
10449 Isabela Moore 22-Dec-1999 Canoe 280.00
10449 Isabela Moore 19-Mar-2000 Canoe paddle 40.00
10449 Isabela Moore 29-Feb-2000 Flashlight 4.50
10449 Isabela Moore 01-Sep-1999 Snow Shoes 45.00
10449 Isabela Moore 13-Aug-1999 Unicycle 180.79
I'm not sure what the ORDER BY is doing here or whether it's having the intended effect.
It looks like it is ordering on i.order_date, but using string comparison rather than date comparison, which is why 30-Jun-1999 is placed before 29-Feb-2000. As a string, "30-Jun-1999" > "29-Feb-2000", but as dates, the reverse is true.
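You can see the difference by comparing the values both ways. A quick demonstration, assuming MySQL (other databases have their own date-parsing functions):

-- string comparison: '3' > '2' on the first character, so this is true (1)
SELECT '30-Jun-1999' > '29-Feb-2000';
-- date comparison: June 1999 is before February 2000, so this is false (0)
SELECT STR_TO_DATE('30-Jun-1999', '%d-%b-%Y') > STR_TO_DATE('29-Feb-2000', '%d-%b-%Y');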
Check the type of i.order_date in the items_ordered table - it should be DATE, DATETIME, or similar. If it's VARCHAR, you will need to either change it to a date type or cast the value to a date in the ORDER BY clause, e.g.
order by CAST(i.order_date AS DATE) desc
You should always use a proper DATE or DATETIME datatype to store dates.
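If changing the column type isn't possible and the dates are stored as 'DD-Mon-YYYY' strings, parsing them inside the ORDER BY also works. A minimal sketch, again assuming MySQL, and dropping the GROUP BY since it performs no aggregation here:

SELECT c.customerid, c.firstname, c.lastname, i.order_date, i.item, i.price
FROM items_ordered i
JOIN customers c ON i.customerid = c.customerid
-- parse the string into a real date so the sort is chronological
ORDER BY STR_TO_DATE(i.order_date, '%d-%b-%Y') DESC;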