Add rows based on condition and also create and update new columns - pandas

I have a pandas dataframe with a few thousand rows; a subset of it is below:
fr var
1.1 10px
2.9 12pz
Expected Output:
fr var vard varv
1.1 10px -5 xval
1.1 10px 5 zval
2.9 12pz -6 zval
2.9 12pz 6 xval
For rows - each row is to be split into two.
Conditions for new columns:
'vard' - divide the numeric part of the 'var' column by 2 and store it as two rows in 'vard', one negative and one positive value.
'varv' - if 'px' is in the 'var' column and 'vard' has a negative value, then 'varv' should be 'xval', else 'zval'.
Similarly, if 'pz' is in the 'var' column and 'vard' has a negative value, then 'varv' should be 'zval', else 'xval'.
I have read through various answers to almost similar problems and tried many options like 'iterrows', 'shift', and 'explode', but I am not able to get the expected output.

First use Series.str.extract to split 'var' into its numeric and non-numeric parts, convert the numeric part to integers and divide by 2, then use concat to append a copy of the frame with 'vard' multiplied by -1, sort the index, and finally use numpy.where to set the new values according to the conditions:
import numpy as np
import pandas as pd

# extract the numeric and non-numeric parts of 'var'
df[['vard','varv']] = df['var'].str.extract(r'(\d+)(\D+)')
df['vard'] = df['vard'].astype(int).div(2)
# append a copy of each row with the sign of 'vard' flipped
df = pd.concat([df, df.assign(vard=df['vard']*-1)]).sort_index().reset_index(drop=True)
# (px and negative) or (pz and positive) -> 'xval', otherwise 'zval'
m = (df['varv'].eq('px') & df['vard'].lt(0)) | (df['varv'].eq('pz') & df['vard'].gt(0))
df['varv'] = np.where(m, 'xval', 'zval')
print(df)
    fr   var  vard  varv
0  1.1  10px   5.0  zval
1  1.1  10px  -5.0  xval
2  2.9  12pz   6.0  xval
3  2.9  12pz  -6.0  zval
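Since explode was one of the options mentioned in the question, here is a minimal sketch of an equivalent approach using Series.explode (pandas >= 0.25); the intermediate column names num and suffix are just for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'fr': [1.1, 2.9], 'var': ['10px', '12pz']})

# split 'var' into its numeric and non-numeric parts
df[['num', 'suffix']] = df['var'].str.extract(r'(\d+)(\D+)')
half = df['num'].astype(int).div(2)
# build a two-element list per row, then explode it into two rows
df['vard'] = [[-h, h] for h in half]
df = df.explode('vard').reset_index(drop=True)
df['vard'] = df['vard'].astype(float)
# same conditions as above: (px and negative) or (pz and positive) -> 'xval'
m = (df['suffix'].eq('px') & df['vard'].lt(0)) | (df['suffix'].eq('pz') & df['vard'].gt(0))
df['varv'] = np.where(m, 'xval', 'zval')
df = df.drop(columns=['num', 'suffix'])
print(df)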

This can easily be done using the melt function.
import pandas as pd

# recreate your dataframe
df = pd.DataFrame(columns=['fr', 'var'])
df['fr'] = [1.1, 2.9]
df['var'] = ['10px', '12pz']
# split 'var' into its two components by creating two new columns
df['vard_p'] = df['var'].str[:-2]
df['vard_p'] = df['vard_p'].astype(float) / 2
df['vard_n'] = -df['vard_p']
# get 'varv' from 'var' (I assumed it was simply the last character in the string)
df['varv'] = df['var'].str[-1] + 'val'
# melt on the two new vard columns to get the dataframe into the format you wanted
df = pd.melt(df, id_vars=['fr', 'var', 'varv'], value_vars=['vard_p', 'vard_n'])
# now rename or drop the new columns
df.rename(columns={'value': 'vard'}, inplace=True)
df.drop('variable', axis=1, inplace=True)
df
Output:
fr var varv vard
0 1.1 10px xval 5.0
1 2.9 12pz zval 6.0
2 1.1 10px xval -5.0
3 2.9 12pz zval -6.0
Hope it helped

Related

Changing a column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert the values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I managed to change the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
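As a quick illustration of rdiv (the reversed version of div): s.rdiv(c) computes c / s element-wise, which is exactly the mpg conversion:
import pandas as pd

s = pd.Series([18, 17])
print(s.rdiv(235.15))  # 235.15 / s -> 13.063889, 13.832353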
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
import pandas as pd

df = pd.DataFrame({'Mpg': [18, 17, 19, 21, 16, 15]})
cc = 235.214583  # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
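As a side note, the apply with a scalar division can be replaced by a vectorized expression, which is faster on large columns:
df2['litre per 100 km'] = cc / df['Mpg']  # equivalent to the apply above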

Create a new column based on another column in a dataframe

I have a df with multiple columns. One of my columns is extra_type. Now I want to create a new column based on the values of the extra_type column. For example:
extra_type
NaN
legbyes
wides
byes
Now I want to create a new column with 1 and 0: if extra_type is not equal to 'wides' then 1, else 0.
I tried this:
df1['ball_faced'] = df1[df1['extra_type'].apply(lambda x: 1 if [df1['extra_type']!= 'wides'] else 0)]
It is not working this way. Any help on how to make this work is appreciated.
Expected output is like below:
extra_type ball_faced
NaN 1
legbyes 1
wides 0
byes 1
Note that there's no need to use apply() or a lambda as in the original question, since comparison of a pandas Series and a string value can be done in a vectorized manner as follows:
df1['ball_faced'] = df1.extra_type.ne('wides').astype(int)
Output:
extra_type ball_faced
0 NaN 1
1 legbyes 1
2 wides 0
3 byes 1
Here are links to docs for ne() and astype().
For some useful insights on when to use apply (and when not to), see this SO question and its answers. TL;DR from the accepted answer: "If you're not sure whether you should be using apply, you probably shouldn't."
df['ball_faced'] = df.extra_type.apply(lambda x: x != 'wides').astype(int)
  extra_type  ball_faced
0        NaN           1
1    legbyes           1
2      wides           0
3       byes           1

Python - Looping through dataframe using methods other than .iterrows()

Here is the simplified dataset:
Character x0 x1
0 T 0.0 1.0
1 h 1.1 2.1
2 i 2.2 3.2
3 s 3.3 4.3
5 i 5.5 6.5
6 s 6.6 7.6
8 a 8.8 9.8
10 s 11.0 12.0
11 a 12.1 13.1
12 m 13.2 14.2
13 p 14.3 15.3
14 l 15.4 16.4
15 e 16.5 17.5
16 . 17.6 18.6
The simplified dataset is generated by the following code:
import pandas as pd

ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1]+0.1, 1))
    x1.append(round(x0[-1]+1, 1))
df = pd.DataFrame(list(zip(ch, x0, x1)), columns=['Character', 'x0', 'x1'])
df = df.drop(df.loc[df['Character'] == ' '].index)
x0 and x1 represent the starting and ending position of each Character, respectively. Assume that the distance between any two adjacent characters equals 0.1. In other words, if the difference between x0 of a character and x1 of the previous character is 0.1, the two characters belong to the same string. If such a difference is larger than 0.1, the character is the start of a new string, and so on. I need to produce a dataframe of strings and their respective x0 and x1, which is done by looping through the dataframe using .iterrows():
string = []
x0 = []
x1 = []
for index, row in df.iterrows():
    if index == 0:
        string.append(row['Character'])
        x0.append(row['x0'])
        x1.append(row['x1'])
    else:
        if round(row['x0']-x1[-1], 1) == 0.1:
            string[-1] += row['Character']
            x1[-1] = row['x1']
        else:
            string.append(row['Character'])
            x0.append(row['x0'])
            x1.append(row['x1'])
df_string = pd.DataFrame(list(zip(string, x0, x1)), columns=['String', 'x0', 'x1'])
Here is the result:
String x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
Is there any other faster way to achieve this?
You could use groupby + agg:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df.at[0, 'x0'])).abs()
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
# group and aggregate accordingly
res = df.groupby(grouper).agg({ 'Character' : ''.join, 'x0' : 'first', 'x1' : 'last' })
print(res)
Output
Character x0 x1
0 This 0.0 4.3
1 is 5.5 7.6
2 a 8.8 9.8
3 sample. 11.0 18.6
The tricky part is this one:
# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()
The idea is to convert the column of diffs (same) into a True or False column, where every True marks the start of a new group. The cumsum then assigns the same id to every row within a group.
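To make that concrete, here is a toy illustration (values invented for the example):
import pandas as pd

flags = pd.Series([False, False, True, False, True])  # True marks the start of a new group
print(flags.cumsum().tolist())  # [0, 0, 1, 1, 2] -> three groups: rows 0-1, rows 2-3, row 4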
As suggested by @ShubhamSharma, you could do:
# create diff column
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)
# create grouper column, had to use this because of problems with floating point
grouper = same.cumsum()
The other part remains the same.
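Putting that suggestion together with the same aggregation as above:
grouper = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1).cumsum()
res = df.groupby(grouper).agg({'Character': ''.join, 'x0': 'first', 'x1': 'last'})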

Append same randint string to Parent as well as child in data frame using list comprehension

I have a function that generates a unique 6 digit string using randint for a given 14 digit string as follows:
import random as r
from random import randint

srng = '60817409470000'  # This is just an example of the input string

def un_Gen(srng):
    '''Takes in a 14 digit string, assigns a unique 6 digit string using randint and concatenates it with the 11th index.'''
    if len(srng) >= 13:
        unid = str(randint(100000, 999999))  # generate a unique 6 digit string
        ud = unid + '-00' + srng[11]  # concatenate the 6 digit string with the child rank at the 11th index
    else:
        ud = 'NA'  # exception handling for invalid strings
    return ud
So for example:
un_Gen('60817408440000')
Out[288]:
'217417-000'
I would like to apply this function to the ['UWI'] column in my df in the following way:
I want to generate a new column in my df such that it assigns the new string generated by my un_Gen function to the rows assigned a Parent value (as indicated by the ['Parent'] column).
When it comes to any child values (which share the first 10 digits), I want the same 6 digit randint number assigned to the child, but the string after the dash should have the child rank, which is indicated by the ['Rank'] column.
So for example, in the attached df image I want a new column where the highlighted rows have the same random 6 digits but the last 3 reference the child rank.
'60817408440000'--> '217417-000'
'60817408440100'--> '217417-001'
'60817408440200'--> '217417-002'
'60817408440300'--> '217417-003'
'60817408440400'--> '217417-004'
Currently I'm trying to achieve this using list comprehension and conditional statements in the following manner:
dfcat['BUI'] = [un_Gen(str(i)) for i in dfcat['UWI'] if dfcat['Parent'][i] == True]
The issues that I'm having are:
1) I'm getting the following error:
IndexError: index out of bounds
2) Right now my code generates randint 6 digit strings for ALL rows in the df. How do I code it in such a way that the children end up with the same 6-digit string as the parent? I already have a column that displays the first 10 digits as ['Family']; I want the code to use the same random 6 digits for all the family members.
I look forward to any suggestions/solutions/help and thank you for taking the time to look over this.
This is my approach; it's not ideal, nor does it directly provide what you are looking for.
First, note that if you have control over un_Gen() then I would suggest a rewrite within it. The answer below assumes you do not have control:
import pandas as pd
import numpy as np
import random as r
from random import randint

def un_Gen(srng):
    '''Takes in a 14 digit string, assigns a unique 6 digit string using randint and concatenates it with the 11th index.'''
    if len(srng) >= 13:
        unid = str(randint(100000, 999999))  # generate a unique 6 digit string
        ud = unid + '-00' + srng[11]  # concatenate the 6 digit string with the child rank at the 11th index
    else:
        ud = 'NA'  # exception handling for invalid strings
    return ud

# first some sample data:
df = pd.DataFrame([60817408440000, 60817408440100, 60817408440200, 60817408440300,
                   60817408440400, 70817408440000, 70817408440100], columns=['base'])

# now a process using helper columns to attain the result; these can be dropped at the end.
df['family'] = np.floor(df['base'] / 1000) * 1000
df['family'] = df['family'].astype(np.int64).astype(str)
df['rnk'] = df.groupby('family')['base'].rank()
base family rnk
0 60817408440000 60817408440000 1.0
1 60817408440100 60817408440000 2.0
2 60817408440200 60817408440000 3.0
3 60817408440300 60817408440000 4.0
4 60817408440400 60817408440000 5.0
5 70817408440000 70817408440000 1.0
6 70817408440100 70817408440000 2.0
df['family_mapped'] = df[['rnk','family']].apply(lambda x: np.nan if x['rnk'] !=1 else un_Gen(x['family']),axis=1)
df['family_mapped'] = df.groupby('family')['family_mapped'].ffill()
df[['parent','temp_child']] = df['family_mapped'].str.split('-',expand=True)
df['temp_child'] = df['base'].map(lambda x: str(x)[-3:])
df['BUI'] = df[['parent','temp_child']].apply(lambda x: str(x['parent']) + '-' + str(x['temp_child']),axis=1)
base family rnk family_mapped parent temp_child BUI
0 60817408440000 60817408440000 1.0 217100-000 217100 000 217100-000
1 60817408440100 60817408440000 2.0 217100-000 217100 100 217100-100
2 60817408440200 60817408440000 3.0 217100-000 217100 200 217100-200
3 60817408440300 60817408440000 4.0 217100-000 217100 300 217100-300
4 60817408440400 60817408440000 5.0 217100-000 217100 400 217100-400
5 70817408440000 70817408440000 1.0 834075-000 834075 000 834075-000
6 70817408440100 70817408440000 2.0 834075-000 834075 100 834075-100
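As noted above, the helper columns can be dropped once 'BUI' is built:
df = df.drop(columns=['family', 'rnk', 'family_mapped', 'parent', 'temp_child'])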

Using value_counts in pandas with conditions

I have a column with around 20k values. I've used the following function in pandas to display their counts:
weather_data["snowfall"].value_counts()
weather_data is the dataframe and snowfall is the column.
My results are:
0.0 12683
M 7224
T 311
0.2 32
0.1 31
0.5 20
0.3 18
1.0 14
0.4 13
etc.
Is there a way to:
Display the counts of only a single variable or number
Use an if condition to display the counts of only those values which satisfy the condition?
I'll be as clear as possible without a full example, which piRSquared suggested you provide.
value_counts returns a Series, so the values from your original Series become the index of the result. Displaying the count for a single value is then just slicing that Series:
my_value_count = weather_data["snowfall"].value_counts()
my_value_count.loc['0.0']
output:
12683
If you want to display only for a list of variables:
my_value_count.loc[my_value_count.index.isin(['0.0','0.2','0.1'])]
output:
0.0 12683
0.2 32
0.1 31
As you have M and T in your values, I suspect the other values will be treated as strings and not floats. Otherwise you could use:
my_value_count.loc[my_value_count.index < 0.4]
output:
0.0 12683
0.2 32
0.1 31
0.3 18
Use an if condition to display the counts of only those values which satisfy the condition?
First create a new column based on the condition you want, then use groupby with transform('sum') to count per group.
For example, suppose you want to count the frequency only when a column has a non-null value; in my case, when there is a non-null ACTUAL_COMPLETION_DATE:
import numpy as np

dataset['Has_actual_completion_date'] = np.where(dataset['ACTUAL_COMPLETION_DATE'].isnull(), 0, 1)
dataset['Mitigation_Plans_in_progress'] = dataset['Has_actual_completion_date'].groupby(dataset['HAZARD_ID']).transform('sum')
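For the original snowfall column, a related pattern is to filter the Series with a boolean condition before calling value_counts (a sketch, assuming the values are stored as strings, as suspected above):
# count only the values that satisfy a condition
mask = weather_data['snowfall'].isin(['0.1', '0.2', '0.5'])
print(weather_data.loc[mask, 'snowfall'].value_counts())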