How to split a value into extra rows, keeping only one value per new row, in pandas [duplicate]

This question already has answers here:
Split (explode) pandas dataframe string entry to separate rows
(27 answers)
Closed 11 months ago.
I have a dataframe that looks like this:
ID Number  Description Code      Total Cost  Store ID
A33        Ice cream; Chocolate  20          5
B44        Chips; Milk           15          6
C66        Cheese; Ice cream     10          6
V77        Pasta; Rice           8           8
I want to split the value of Description Code on the ";" symbol, generating a new row for each part.
The output should be like this:
ID Number  Description Code  Total Cost  Store ID
A33        Ice cream         20          5
A33        Chocolate         20          5
B44        Chips             15          6
B44        Milk              15          6
C66        Cheese            10          6
C66        Ice cream         10          6
V77        Pasta             8           8
V77        Rice              8           8

Use str.split to turn Description Code into lists, then explode your dataframe:
out = (df.assign(**{'Description Code': lambda x: x['Description Code'].str.split('; ')})
         .explode('Description Code', ignore_index=True))
print(out)
# Output
   ID Number  Description Code  Total Cost  Store ID
0        A33         Ice cream          20         5
1        A33         Chocolate          20         5
2        B44             Chips          15         6
3        B44              Milk          15         6
4        C66            Cheese          10         6
5        C66         Ice cream          10         6
6        V77             Pasta           8         8
7        V77              Rice           8         8
Note: since Description Code is not a valid Python identifier, you have to use **{key: value} keyword expansion to pass it to assign.
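For reference, the same result can be reached without assign by overwriting the column first; a minimal runnable sketch rebuilding the example frame from the question:

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'ID Number': ['A33', 'B44', 'C66', 'V77'],
    'Description Code': ['Ice cream; Chocolate', 'Chips; Milk',
                         'Cheese; Ice cream', 'Pasta; Rice'],
    'Total Cost': [20, 15, 10, 8],
    'Store ID': [5, 6, 6, 8],
})

# Turn the column into lists, then emit one row per list element
df['Description Code'] = df['Description Code'].str.split('; ')
out = df.explode('Description Code', ignore_index=True)
```

Since the in-place assignment is an ordinary column write, no **-expansion is needed here; assign is only required when you want to keep the original frame untouched in a method chain.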


How to use Pandas to find the relative share of rooms on individual floors [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 12 days ago.
I have a dataframe df with the following structure
   Floor         Room  Area
0      1  Living room    18
1      1      Kitchen    18
2      1      Bedroom    24
3      2     Bathroom    10
4      2      Bedroom    20
and I want to add a series floor_share with the relative share/ratio of the given floor, so that the dataframe becomes
   Floor         Room  Area  floor_share
0      1  Living room    18         0.30
1      1      Kitchen    18         0.30
2      1      Bedroom    24         0.40
3      2     Bathroom    10         0.33
4      2      Bedroom    20         0.67
If it is possible to do this with a one-liner (or any other idiomatic manner), I'll be very happy to learn how.
Current workaround
What I have done that produces the correct results is to first find the total floor areas by
floor_area_sums = df.groupby('Floor')['Area'].sum()
which gives
Floor
1    60
2    30
Name: Area, dtype: int64
I then initialize a new series to 0, and find the correct values while iterating through the dataframe rows.
df["floor_share"] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'floor_share'] = df.loc[idx, 'Area'] / floor_area_sums[row.Floor]
IIUC use:
df["floor_share"] = df['Area'].div(df.groupby('Floor')['Area'].transform('sum'))
print(df)
   Floor         Room  Area  floor_share
0      1  Living room    18     0.300000
1      1      Kitchen    18     0.300000
2      1      Bedroom    24     0.400000
3      2     Bathroom    10     0.333333
4      2      Bedroom    20     0.666667
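The same ratio can also be computed entirely inside transform; a minimal runnable sketch on the example data, equivalent to the div version above:

```python
import pandas as pd

df = pd.DataFrame({
    'Floor': [1, 1, 1, 2, 2],
    'Room': ['Living room', 'Kitchen', 'Bedroom', 'Bathroom', 'Bedroom'],
    'Area': [18, 18, 24, 10, 20],
})

# Divide each Area by the total Area of its own floor
df['floor_share'] = df.groupby('Floor')['Area'].transform(lambda s: s / s.sum())
```

On large frames the lambda form is usually slower than transform('sum') followed by div, since the latter stays fully vectorized.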

Compare values in two consecutive rows of a column in pandas

I have a pandas df something like this:
    color  pct  days  text
1     red    5     7  good
2     red   10    30  good
3     red   11    60   bad
4    blue    6     7   bad
5    blue   15    30  good
6    blue   21    60   bad
7  yellow    2     7  good
8  yellow    5    30   bad
9  yellow    7    60   bad
So basically, for each color, I have percentage values for 7 days, 30 days and 60 days. Please note that these are not always in the correct order shown in the example above. My task is to look at the change in percentage for each color between the consecutive days values, and if the change is greater than or equal to 5%, write "NA" in the "text" column. The text in the 7-days category is the default and cannot be overwritten.
Desired result:
    color  pct  days  text
1     red    5     7  good
2     red   10    30    NA
3     red   11    60   bad
4    blue    6     7   bad
5    blue   15    30    NA
6    blue   21    60    NA
7  yellow    2     7  good
8  yellow    5    30   bad
9  yellow    7    60   bad
I am able to achieve this by a very long process that I am sure is not efficient. I am sure there is a much better way of doing this, but I am new to Python, so I am struggling. Can someone please help me with this? Many thanks in advance!
A variation on a (now-deleted) answer suggested in a comment:
# ensure numeric data
df['pct'] = pd.to_numeric(df['pct'], errors='coerce')
df['days'] = pd.to_numeric(df['days'], errors='coerce')
# update in place
df.loc[df.sort_values(['color', 'days'])
         .groupby('color')['pct']
         .diff().ge(5), 'text'] = 'NA'
Output:
    color  pct  days  text
1     red    5     7  good
2     red   10    30    NA
3     red   11    60   bad
4    blue    6     7   bad
5    blue   15    30    NA
6    blue   21    60    NA
7  yellow    2     7  good
8  yellow    5    30   bad
9  yellow    7    60   bad
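This works because of index alignment: the boolean mask is computed on a sorted copy, but it keeps the original row labels, so .loc maps it back onto the unsorted frame. A self-contained sketch reproducing the example, with the data rebuilt by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red'] * 3 + ['blue'] * 3 + ['yellow'] * 3,
    'pct':   [5, 10, 11, 6, 15, 21, 2, 5, 7],
    'days':  [7, 30, 60] * 3,
    'text':  ['good', 'good', 'bad', 'bad', 'good', 'bad', 'good', 'bad', 'bad'],
})

# diff() of pct within each color after sorting by days; the mask's
# index labels still refer to the original rows, so .loc aligns it
mask = df.sort_values(['color', 'days']).groupby('color')['pct'].diff().ge(5)
df.loc[mask, 'text'] = 'NA'
```

The 7-days rows are never touched because diff() yields NaN for the first row of each group, and NaN.ge(5) is False.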
In the code below I'm reading your example table into a pandas dataframe using io; you don't need to do this, as you already have your pandas table.
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
    """ color pct days text
    1 red 5 7 good
    2 red 10 30 good
    3 red 11 60 bad
    4 blue 6 7 bad
    5 blue 15 30 good
    6 blue 21 60 bad
    7 yellow 2 7 good
    8 yellow 5 30 bad
    9 yellow 7 60 bad"""
), delim_whitespace=True)
not_seven_rows = df['days'].ne(7)
good_rows = df['pct'].lt(5)
# Set the rows which are < 5 and not 7 days to be 'good'
df.loc[good_rows & not_seven_rows, 'text'] = 'good'
# Set the rows which are >= 5 and not 7 days to be 'NA'
df.loc[(~good_rows) & not_seven_rows, 'text'] = 'NA'
df
import numpy as np

def function1(dd: pd.DataFrame):
    dd1 = dd.sort_values("days")
    return dd1.assign(text=np.where(dd1.pct.diff() >= 5, "NA", dd1.text))

df.groupby('color', sort=False).apply(function1).reset_index(drop=True)
Output:
    color  pct  days  text
0     red    5     7  good
1     red   10    30    NA
2     red   11    60   bad
3    blue    6     7   bad
4    blue   15    30    NA
5    blue   21    60    NA
6  yellow    2     7  good
7  yellow    5    30   bad
8  yellow    7    60   bad

Divide dataframe in different bins based on condition

I have a pandas dataframe:
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
I want to divide the ids into 5 different bins based on their number of rows,
such that every bin has an approximately equal total number of rows,
and assign the bin as a new column bin.
One approach I thought of was:
df.no_of_rows.sum() == 20247
div_factor = 20247 // 5 == 4049
If we add the 1st and 2nd rows, their sum = 2689 + 1515 = 4204 > div_factor.
Therefore assign bin = 1 where id = 1.
Now look for the next ones:
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method proved wrong.
Is there a way to create 5 bins such that every bin has an approximately equal total number of rows?
You can use an approach based on percentiles.
n_bins = 5
dfa = df.sort_values(by='no_of_rows').cumsum()
df['bin'] = dfa.no_of_rows.apply(lambda x: int(n_bins*x/dfa.no_of_rows.max()))
And then you can check with
df.groupby('bin').sum()
The more records you have, the fairer the result will be in terms of dispersion.
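A runnable sketch of this cumulative-sum idea on the question's data. One caveat: with the formula exactly as given above, the row holding the final cumulative total lands in its own extra bin, so the bin index is clipped here (a small tweak to the snippet above, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 13),
    'no_of_rows': [2689, 1515, 3826, 814, 1650, 2292,
                   1867, 2096, 1618, 923, 766, 191],
})

n_bins = 5
# Cumulative total after sorting by size (index labels are preserved)
cum = df.sort_values('no_of_rows')['no_of_rows'].cumsum()
# Map each cumulative share of the total onto a bin 0..n_bins-1;
# clip so the final row does not get an extra bin of its own
df['bin'] = (n_bins * cum / cum.max()).astype(int).clip(upper=n_bins - 1)
```

Because cum keeps the original index labels, the assignment aligns back to the unsorted frame; df.groupby('bin')['no_of_rows'].sum() then shows how even the split is.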

Pandas: sorting by integer notation value

In this dataframe, the values in column key correspond to the integer notation of each song's key.
df
track key
0 Last Resort 4
1 Casimir Pulaski Day 8
2 Glass Eyes 8
3 Ohio - Live At Massey Hall 1971 7
4 Ballad of a Thin Man 11
5 Can You Forgive Her? 11
6 The Only Thing 3
7 Goodbye Baby (Baby Goodbye) 4
8 Heart Of Stone 0
9 Ohio 0
10 the gate 2
11 Clampdown 2
12 Cry, Cry, Cry 4
13 What's Happening Brother 8
14 Stupid Girl 11
15 I Don't Wanna Play House 7
16 Inner City Blues (Make Me Wanna Holler) 11
17 The Lonesome Death of Hattie Carroll 4
18 Paint It, Black - (Original Single Mono Version) 5
19 Let Him Run Wild 11
20 Undercover (Of The Night) - Remastered 5
21 Between the Bars 7
22 Like a Rolling Stone 0
23 Once 2
24 Pale Blue Eyes 5
25 The Way You Make Me Feel - 2012 Remaster 1
26 Jeremy 2
27 The Entertainer 7
28 Pressure 9
29 Play With Fire - Mono Version / Remastered 2002 2
30 D-I-V-O-R-C-E 9
31 Big Shot 0
32 What's Going On 1
33 Folsom Prison Blues - Live 0
34 American Woman 1
35 Cocaine Blues - Live 8
36 Jesus, etc. 5
the notation is as follows:
'C' --> 0
'C#'--> 1
'D' --> 2
'Eb'--> 3
'E' --> 4
'F' --> 5
'F#'--> 6
'G' --> 7
'Ab'--> 8
'A' --> 9
'Bb'--> 10
'B' --> 11
What is specific about this notation is that 11 is closer to 0 than 2 is, for instance.
GOAL:
given an input_notation = 0, I would like to sort according to closeness to key 0, or 'C'.
You can get the closest value by doing:
closest_key = (input_notation -1) % 12
so I would like to sort according to this logic, having on top input_notation values and then closest matches, like so:
8 Heart Of Stone 0
9 Ohio 0
22 Like a Rolling Stone 0
31 Big Shot 0
33 Folsom Prison Blues - Live 0
(...)
I have tried:
v = df[['key']].values
df = df.iloc[np.lexsort(np.abs(v - (input_notation - 1) %12 ).T)]
but this does not work.
Any clues?
You can define the closeness first and then use argsort with iloc to sort the data frame:
input_notation = 0
# define the closeness or distance
diff = (df.key - input_notation).abs()
closeness = np.minimum(diff, 12 - diff)
# use argsort to calculate the sorting index, and iloc to reorder the data frame
closest_to_input = df.iloc[closeness.argsort(kind='mergesort')]
closest_to_input.head()
# track key
#8 Heart Of Stone 0
#9 Ohio 0
#22 Like a Rolling Stone 0
#31 Big Shot 0
#33 Folsom Prison Blues - Live 0
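On pandas 1.1 or later, the same circular distance can be handed straight to sort_values via its key parameter; a sketch using a small subset of the rows from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'track': ['Last Resort', 'Glass Eyes', 'Heart Of Stone',
              'the gate', 'Let Him Run Wild'],
    'key': [4, 8, 0, 2, 11],
})

input_notation = 0
# Circular distance on the 12-key wheel: 11 is as close to 0 as 1 is
out = df.sort_values(
    'key',
    key=lambda s: np.minimum((s - input_notation) % 12,
                             (input_notation - s) % 12),
    kind='mergesort',  # stable sort: ties keep their original order
)
```

This avoids building the intermediate closeness series by hand and works for any input_notation.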

Create new ID based on cumulative sum in Excel VBA

I need to create a new transport ID based on the cumulative sum of the volume being transported. Let's say that originally everything was transported in truck A, with a capacity of 25. Now I want to assign these items to shipments with truck B (capacity 15).
The only real constraint is amt shipped cannot exceed capacity.
I can't post a picture because of the restrictions... but the overall setup would be like this:
Old Trans # Volume New Trans # Cumulative Volume for Trans
1 1
1 9
1 3
1 7
1 4
2 9
2 10
3 8
3 5
3 9
4 4
4 6
4 8
5 9
5 1
5 5
5 8
6 3
6 4
6 3
6 4
6 4
6 7
7 7
7 10
7 4
8 10
8 6
8 7
9 4
9 9
9 6
10 7
10 4
10 1
10 1
10 5
10 2
11 9
11 3
11 9
12 8
12 5
12 9
13 9
The expected output would be that the first three entries result in a new shipment ID of 1; the next two entries result in a new shipment ID of 2; and so on. I've tried everything that I know (excluding VBA): Index/lookup/IF functions. My VBA skills are very limited though. Any tips? Thanks!
I think I see what you're trying to do here, and just using an IF formula (and inserting a new column to keep track):
In the Columns C and D, insert these formulas in row 3 and copy down (changing 15 for whatever you want your new volume capacity to be):
Column C: =IF(B3+C2<15,B3+C2,B3)
Column D: =IF(B3+C2<15,D2,D2+1)
And for the cells C2 and D2:
C2: = B2
D2: = A2
Is this what you're looking to do?
A simple formula could be written that 'floats' the range totals for each successive load ID.
In the following, I've typed 25 and 15 in D1:E1 and used a custom number format of I\D 0. In this way, the column is identified and the cell can be referenced as a true numeric load limit. You can hard-code the limits into the formula if you prefer by overwriting D$1, but you will not have a one-size-fits-all formula that can be copied right for alternate load limits, as I have in my example.
The formula in D2 is,
=IF(ROW()=2, 1, (SUM(INDEX($B:$B, MATCH(D1, D1:D$1, 0)):$B2)>D$1)+ D1)
Fill right to E2 then down as necessary.
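For comparison, the greedy rule these formulas implement (open a new load ID whenever adding the next volume would push the running total past the capacity) can be sketched outside Excel; a minimal Python version, assuming items are packed in sheet order:

```python
def assign_shipments(volumes, capacity=15):
    """Greedy packing in order: start a new shipment ID whenever the
    running total would exceed the capacity."""
    ids = []
    current_id, running = 1, 0
    for v in volumes:
        if running + v > capacity:
            current_id += 1
            running = 0
        running += v
        ids.append(current_id)
    return ids

# First five volumes from the question's table (old transport 1)
ids = assign_shipments([1, 9, 3, 7, 4], capacity=15)
```

With capacity 15 the first three entries (1 + 9 + 3 = 13) stay on shipment 1 and the next two (7 + 4 = 11) go to shipment 2, matching the expected output described in the question.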