pandas: how to create new columns based on two columns and aggregate the results

I am trying to perform a sort of aggregation, but with the creation of new columns.
Let's take the example of the dataframe below:
import pandas as pd
df = pd.DataFrame({'City': ['Los Angeles', 'Denver', 'Denver', 'Los Angeles'],
                   'Car Maker': ['Ford', 'Toyota', 'Ford', 'Toyota'],
                   'Qty': [50000, 100000, 80000, 70000]})
That generates this:
   City         Car Maker  Qty
0  Los Angeles  Ford       50000
1  Denver       Toyota     100000
2  Denver       Ford       80000
3  Los Angeles  Toyota     70000
I would like to have one row per city, with each Car Maker as a new column holding the Qty for that City:
   City         Car Maker  Ford   Toyota
0  Los Angeles  Ford       50000  70000
1  Denver       Toyota     80000  100000
Any hints on how to achieve that?
I've tried converting it to a dictionary and collapsing it with a function, but I am looking for a more pandas-like solution.

df.pivot(index='City', columns='Car Maker', values='Qty').reset_index()

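Try DataFrame.pivot_table(). Unlike pivot(), it aggregates duplicate City/Car Maker pairs (the default aggfunc is the mean) instead of raising an error. Use City alone as the index to get one row per city:
df.pivot_table(values='Qty', index='City', columns='Car Maker').reset_index()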
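As a self-contained sketch of the pivot approach on the question's data: the variable name `wide` and the choice of `aggfunc='sum'` are assumptions made for illustration (the aggregation only matters if a City/Car Maker pair appears more than once).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'City': ['Los Angeles', 'Denver', 'Denver', 'Los Angeles'],
    'Car Maker': ['Ford', 'Toyota', 'Ford', 'Toyota'],
    'Qty': [50000, 100000, 80000, 70000],
})

# One row per City, one column per Car Maker; duplicate pairs would be summed.
wide = (
    df.pivot_table(index='City', columns='Car Maker', values='Qty', aggfunc='sum')
      .reset_index()
      .rename_axis(columns=None)  # drop the leftover 'Car Maker' axis label
)
print(wide)
```

The `rename_axis(columns=None)` step is optional; it only removes the `Car Maker` label that the pivot leaves on the column axis.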

Related

Query just one row that meets three conditions in SQL

I'd like to make a query that returns just one row when it meets 3 conditions. I have a database that looks like this:
Location     Date        Item  Price
Chicago      2021-06-10  1     150
New York     2021-06-10  2     130
Chicago      2021-06-10  1     150
Los Angeles  2021-06-10  3     100
Atlanta      2021-06-10  4     120
New York     2021-06-09  2     125
Chicago      2021-06-09  1     155
Los Angeles  2021-06-09  3     99
Atlanta      2021-06-09  4     140
This database contains the prices of different items by date and location. The price changes each day, and the price of the same item need not be the same in every location. Because the database records every sale made in a day, for each item, I'd like a query that returns only one observation per Location, Date and Item. I want something like a time series of the price of each item in each location. The resulting table should look like this:
Location     Date        Item  Price
Chicago      2021-06-10  1     150
New York     2021-06-10  2     130
Los Angeles  2021-06-10  3     100
Atlanta      2021-06-10  4     120
New York     2021-06-09  2     125
Chicago      2021-06-09  1     155
Los Angeles  2021-06-09  3     99
Atlanta      2021-06-09  4     140
Hope someone can help me, thanks.
To elaborate on the comments, this will give exactly what you have specified.
SELECT DISTINCT
  *
FROM
  yourTable
The DISTINCT keyword compares all columns in each row and eliminates any row that exactly duplicates another row, so only one copy of each is returned.
If the price can vary within a day, but you want the maximum value, for example, use a GROUP BY...
SELECT
  location,
  date,
  item,
  MAX(price) AS max_price
FROM
  yourTable
GROUP BY
  location,
  date,
  item
That will ensure you get one row per unique combination of location, date, item, and then you can pick which price to include using aggregate functions.
Note: Using keywords such as date as column names is a bad idea. Depending on your database, you may need to "quote"/"escape" such column names, and even then they make the code harder for others to read.
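To see the difference between the two queries on the question's rows, here is a minimal sketch using Python's built-in sqlite3 module. The table name `prices` and the use of SQLite are assumptions for illustration; the asker's database is not stated, and quoting rules for `date` vary between databases.

```python
import sqlite3

# Rows copied from the question; "prices" is a hypothetical table name.
rows = [
    ("Chicago", "2021-06-10", 1, 150),
    ("New York", "2021-06-10", 2, 130),
    ("Chicago", "2021-06-10", 1, 150),
    ("Los Angeles", "2021-06-10", 3, 100),
    ("Atlanta", "2021-06-10", 4, 120),
    ("New York", "2021-06-09", 2, 125),
    ("Chicago", "2021-06-09", 1, 155),
    ("Los Angeles", "2021-06-09", 3, 99),
    ("Atlanta", "2021-06-09", 4, 140),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE prices (location TEXT, "date" TEXT, item INTEGER, price INTEGER)'
)
conn.executemany("INSERT INTO prices VALUES (?, ?, ?, ?)", rows)

# DISTINCT drops the exact duplicate Chicago row.
distinct_rows = conn.execute("SELECT DISTINCT * FROM prices").fetchall()

# GROUP BY keeps one row per (location, date, item) even if prices differ.
grouped_rows = conn.execute(
    'SELECT location, "date", item, MAX(price) AS max_price '
    'FROM prices GROUP BY location, "date", item'
).fetchall()

print(len(rows), len(distinct_rows), len(grouped_rows))
conn.close()
```

On this sample both queries return the same eight rows. They would differ only if two sales of the same item, on the same day, in the same location, had different prices: DISTINCT would keep both rows, while GROUP BY would keep one.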

How to use a dictionary or another method to clean data efficiently

I am working on a dataset with many department codes. I have a second table that decodes the codes into departments. Is there an efficient way to combine the two, or to replace the codes?
First Dataset:
Location DeptCode
Delhi 12B
Gurgoun 12D
Hydrabad 13A
Punjab 20A
Jhansi 31B
Below is the table that decodes the codes:
Department DeptCode
Electronics [12A, 12B, 12C, 12D, 12E ........12Z]
Electronics [13A,13B,.......13Z]
Grocery 20A
Grocery [31A,31B,31C,.........31Z]
Expected:
Department DeptCode Location
Electronics 12B Delhi
Electronics 12D Gurgoun
Electronics 13A Hydrabad
Grocery 20A Punjab
Grocery 31B Jhansi
Let us try explode then merge
out = df1.merge(df2.explode('DeptCode'), on='DeptCode', how='left')
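A runnable sketch of the explode-then-merge idea follows. It assumes the DeptCode column of the lookup table holds real Python lists (not strings that look like lists), and the lists are shortened to the codes actually shown in the question because the full ranges are elided there.

```python
import pandas as pd

# First dataset from the question.
df1 = pd.DataFrame({
    'Location': ['Delhi', 'Gurgoun', 'Hydrabad', 'Punjab', 'Jhansi'],
    'DeptCode': ['12B', '12D', '13A', '20A', '31B'],
})

# Lookup table; lists are truncated to the codes visible in the question.
df2 = pd.DataFrame({
    'Department': ['Electronics', 'Electronics', 'Grocery', 'Grocery'],
    'DeptCode': [
        ['12A', '12B', '12C', '12D', '12E'],
        ['13A', '13B'],
        '20A',
        ['31A', '31B', '31C'],
    ],
})

# explode() gives each list element its own row; scalars pass through unchanged.
lookup = df2.explode('DeptCode')

# A left merge keeps every row of df1, in its original order.
out = df1.merge(lookup, on='DeptCode', how='left')
out = out[['Department', 'DeptCode', 'Location']]
print(out)
```

If the codes are stored as strings such as "[12A, 12B]", they would need to be parsed into lists before `explode` can split them.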

Pandas: how to add a new column with a condition?

I am trying to group by multiple columns and add the count as a new column.
My input file:
OrderDate  Region   Rep      Item    Units  Unit Cost  Total
-------------------------------------------------------------
1/6/18     East     Jones    Pencil  95     1.99       189.05
1/23/18    Central  Kivell   Binder  50     19.99      999.50
2/9/18     Central  Jardine  Pencil  36     4.99       179.64
2/26/18    Central  Gill     Pen     27     19.99      539.73
3/15/18    West     Sorvino  Pencil  56     2.99       167.44
4/1/18     East     Jones    Binder  60     4.99       299.40
4/18/18    Central  Andrews  Pencil  75     1.99       149.25
4/18/18    West     Jones    Pencil  75     1.99       149.25
I am trying to get something like this:
Region   Rep      Count  same/diff
----------------------------------
east     jones    2      2-same
         jones
central  Kivell   4      >3 difference
         Jardine
         Gill
         Andrews
West     Sorvino  2      2-different
West     jones1
My code:
df1 = pd.read_excel(excel_path, sheet_name = 'SalesOrders', index_col=0)
df3 = (df1.groupby('Region')['Rep'].value_counts())
print(df3)
Please help me to do this. Thanks
I grouped by Region to see the Rep values in each region. If the Rep values in a region are all the same person, the label should say so (for example, 2-same). The Central region has 4 different people working in it, so it is greater than 3.
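One possible reading of that requirement, sketched under assumptions: Count is the number of rows per Region, and the label depends on how many distinct Rep values the region has. The `label` helper, the `Distinct` column, and the exact label strings are hypothetical, inferred from the desired output rather than stated in the question.

```python
import pandas as pd

# Region and Rep columns copied from the question's sample rows.
df1 = pd.DataFrame({
    'Region': ['East', 'Central', 'Central', 'Central',
               'West', 'East', 'Central', 'West'],
    'Rep': ['Jones', 'Kivell', 'Jardine', 'Gill',
            'Sorvino', 'Jones', 'Andrews', 'Jones'],
})

# Rows per region (Count) and number of distinct reps per region (Distinct).
summary = (
    df1.groupby('Region')['Rep']
       .agg(Count='size', Distinct='nunique')
       .reset_index()
)

def label(row):
    # Assumed labelling rule, inferred from the desired output.
    if row['Distinct'] == 1:
        return f"{row['Count']}-same"
    if row['Distinct'] > 3:
        return '>3 different'
    return f"{row['Distinct']}-different"

summary['same/diff'] = summary.apply(label, axis=1)
print(summary)
```

If the individual Rep names are also needed per region, `df1.groupby('Region')['Rep'].unique()` lists them alongside this summary.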

Hive SQL: pack arrays based on a column

I have multiple columns listed below:
state sport size color name
florida football 1 red Max
nevada football 1 red Max
ohio football 1 red Max
texas football 1 red Max
florida hockey 1 red Max
nevada hockey 1 red Max
ohio hockey 1 red Max
texas hockey 1 red Max
florida tennis 2 green Max
nevada tennis 2 green Max
ohio tennis 2 green Max
texas tennis 2 green Max
Is there a way to combine these into arrays, like the desired output below, based on one column (in this case name)? For Max, the result would be a single record instead of repeated rows, and the values would be contained in arrays.
state sport
[florida, nevada, ohio,texas] [football, hockey, tennis]
size color
[1,2] [red, green]
You can use collect_set.
select name,collect_set(state),collect_set(sport),collect_set(size),collect_set(color)
from tbl
group by name
You need to use collect_set. Hope this helps. Thanks.
query:
select collect_set(state),
collect_set(sport),
collect_set(size),
collect_set(color)
from myTable
where name = 'Max';
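To make the semantics concrete, the following plain-Python sketch mimics what GROUP BY name combined with collect_set does: it gathers the distinct values of each column per name. This is an illustration only, not Hive code; the `collected` structure is hypothetical and only a subset of the question's rows is used. As with collect_set, the order of elements in each set should not be relied on.

```python
from collections import defaultdict

# A few rows from the question as (state, sport, size, color, name) tuples.
rows = [
    ("florida", "football", 1, "red", "Max"),
    ("nevada", "football", 1, "red", "Max"),
    ("florida", "hockey", 1, "red", "Max"),
    ("ohio", "tennis", 2, "green", "Max"),
    ("texas", "tennis", 2, "green", "Max"),
]
columns = ("state", "sport", "size", "color")

# name -> column -> set of distinct values (duplicates are dropped by the set).
collected = defaultdict(lambda: {col: set() for col in columns})
for *values, name in rows:
    for col, value in zip(columns, values):
        collected[name][col].add(value)

print(collected["Max"])
```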

SAS: Assign Numbers To Contents of a Variable

I have this variable called city, and within the variable are names of cities:
City
New York
Chicago
Paris
London
Boston
Hamburg
**New York
London**
I want to create another variable called cityNumber, and this variable should go through the City variable and assign the numbers 1,2, 3 etc.
For example:
City CityNumber
New York 1
Chicago 2
Paris 3
London 4
Boston 5
Hamburg 6
**New York 1
London 4**
etc.
There are several cities, and they are not always in the same order.
Thank you
Sort data by city, then create the cityNumber with the by groups. You want an if statement that increments the cityNumber by one at the beginning of each group. The easiest way to accomplish this is with a sum statement:
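/* BY-group processing requires the data to be sorted by city first. */
proc sort data=have;
  by city;
run;
data want;
  set have;
  by city;
  if first.city then cityNumber+1;
run;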
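Note that numbering after a sort follows the sorted order of the city names, whereas the question's example numbers cities in order of first appearance. The plain-Python sketch below contrasts the two schemes on the question's values; it is an illustration of the logic, not SAS code, and the variable names are hypothetical.

```python
# City values in the order they appear in the question.
cities = ["New York", "Chicago", "Paris", "London",
          "Boston", "Hamburg", "New York", "London"]

# Numbering by sorted order, which is what sort-then-BY-group numbering yields.
sorted_number = {
    city: i for i, city in enumerate(sorted(set(cities)), start=1)
}

# Numbering by first appearance, as in the question's example table.
first_seen_number = {}
for city in cities:
    # setdefault only assigns a new number the first time a city is seen.
    first_seen_number.setdefault(city, len(first_seen_number) + 1)

print([sorted_number[c] for c in cities])
print([first_seen_number[c] for c in cities])
```

Either scheme gives repeated cities the same number; they differ only in which number each city receives.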