Numbering rows in a pandas DataFrame

I have a DataFrame that looks like:
import pandas as pd
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'number': [0, 29, 0, 52, 0, 0]})
I am working on a solution to number the 0 values in the number column. My code below is not working right:
for i, row in df.iterrows():
    df.loc[df['number'] == 0, 'number'] = i + 1
df
This code replaces every 0 with 1, but it should replace the first 0 with 1, the second 0 with 2, and so on. I would like a solution based on an iteration method.
Note: the non-zero numbers (29, 52, etc.) must not be changed.

Try np.where with a Boolean mask of the 0 values in df, then replace them with the cumsum of the mask to enumerate:
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
Or use Series.mask
m = df['number'].eq(0)
df['number'] = df['number'].mask(m, m.cumsum())
df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
m.cumsum():
0 1
1 1
2 2
3 2
4 3
5 4
Name: number, dtype: int32
Complete Working Example:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'number': [0, 29, 0, 52, 0, 0]
})
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
print(df)

Try boolean masking with the loc accessor:
mask = df['number'] == 0  # boolean mask of the zero values
df.loc[mask, 'number'] = mask.cumsum()
OR
via the where() method:
df['number'] = df.where(~mask, mask.cumsum(), axis=0)['number']
OR
via boolean masking and the assign() method:
df[mask] = df[mask].assign(number=mask.cumsum())
Output of df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
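Note that these three one-liners are alternatives: each mutates df, so apply just one to a fresh copy. A self-contained run of the loc variant as a sanity check:
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'number': [0, 29, 0, 52, 0, 0]})

mask = df['number'] == 0                # True where a value needs numbering
df.loc[mask, 'number'] = mask.cumsum()  # running count of Trues: 1, 2, 3, ...
print(df)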

Alternative via replace and fillna:
df.number = df.number.replace(0, np.nan).fillna(df.number.eq(0).cumsum())
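A runnable version of this approach on the same df (the NaN round-trip upcasts the column to float, so it is cast back to int at the end):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'number': [0, 29, 0, 52, 0, 0]})

# mark the zeros as missing, fill them with the running count of zeros,
# then restore the integer dtype
df.number = (df.number.replace(0, np.nan)
                      .fillna(df.number.eq(0).cumsum())
                      .astype(int))
print(df)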

Related

How to expand a nested dictionary in pandas column?

I have a dataframe with the following nested dictionary in one of the columns:
ID dict
1 {Comp A: {Street: 123 Street}, Comp B: {Street: 456 Street}}
2 {Comp C: {Street: 749 Street}}
3 {Comp D: {Street: }}
I want to expand out the dictionary with the resulting data frame
ID company_name street
1 Comp A 123 Street
1 Comp B 456 Street
2 Comp C 749 Street
3 Comp D
I have tried the following
dft['dict'] = df.dict.apply(eval)
dft = dft.explode('dict')
Which gives me the ID and company_name column correctly, though I haven't been able to figure out how to expand out the street column as well.
This is the data in dictionary form, for reproducibility:
data = [
    {'ID': 1, 'entity_details': "{'comp a': {'street_address': '123 street'}}"},
    {'ID': 2, 'entity_details': "{'comp b': {'street_address': '456 street'}}"},
    {'ID': 3, 'entity_details': "{'comp c': {'street_address': '555 street'}, 'comp d': {'street_address': '585 street'}, 'comp e': {'street_address': '873 street'}}"},
    {'ID': 4, 'entity_details': "{'comp f': {'street_address': '898 street'}}"}]
A for loop should suffice and be efficient for your use case. The key is to get the data into a list of dictionaries, which you've done already with the df.to_dict() code, and then iterate based on the logic. If you are on Python 3.10, the pattern matching syntax can make this a bit simpler.
out = []
for entry in data:
    for key, value in entry.items():
        if key == "entity_details":
            val = eval(value)
            for k, v in val.items():
                result = (entry['ID'], k, v['street_address'])
                out.append(result)
pd.DataFrame(out, columns=['ID', 'company_name', 'street_address'])
ID company_name street_address
0 1 comp a 123 street
1 2 comp b 456 street
2 3 comp c 555 street
3 3 comp d 585 street
4 3 comp e 873 street
5 4 comp f 898 street
Below is a possible structural pattern matching option; however, I feel the for loop above is more explicit:
out = []
for entry in data:
    match entry:
        case {"entity_details": other}:
            output = eval(other)
            output = [(entry['ID'], key, value['street_address'])
                      for key, value in output.items()]
            out.extend(output)
pd.DataFrame(out, columns=['ID', 'company_name', 'street_address'])
ID company_name street_address
0 1 comp a 123 street
1 2 comp b 456 street
2 3 comp c 555 street
3 3 comp d 585 street
4 3 comp e 873 street
5 4 comp f 898 street
Of course, for more complex destructuring, pattern matching can come in handy.
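For instance, a hypothetical record with an extra nesting level (the city key is invented for illustration) can be pulled apart in a single mapping pattern:
# Hypothetical nested record; the 'city' key is invented for illustration.
record = {'ID': 9,
          'entity_details': {'comp z': {'street_address': '12 lane',
                                        'city': 'Springfield'}}}

match record:
    case {'ID': int(id_), 'entity_details': {**companies}}:
        for name, details in companies.items():
            match details:
                case {'street_address': street, 'city': city}:
                    print(id_, name, street, city)
                case {'street_address': street}:
                    print(id_, name, street)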
Initial data
Since the original data wasn't provided, I'll suppose we have this:
data = {
    1: "{'Comp A': {'Street': '123 Street'}, 'Comp B': {'Street': '456 Street'}}",
    2: "{'Comp C': {'Street': '749 Street'}}",
    3: "{'Comp D': {'Street': ''}}",
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['dict'])
At least with these data, the use of an eval-type function is justified.
The main idea
To transform them into the format Company_name, Street, we can use DataFrame.from_dict and concat together with literal_eval, like this:
f = partial(pd.DataFrame.from_dict, orient='index')
df_transformed = pd.concat(map(f, df['dict'].map(literal_eval)))
Here
f converts a dictionary into a DataFrame as if its keys were index labels;
.map(literal_eval) converts the dict-like strings into actual dictionaries;
map(f, ...) supplies the resulting DataFrames to pd.concat.
The final touch could be setting the index and renaming the columns, which we can do inside pd.concat like this:
pd.concat(..., keys=df.index, names=['id', 'Company']).reset_index('Company')
The code
import pandas as pd
from functools import partial
from ast import literal_eval
data = {
    1: "{'Comp A': {'Street': '123 Street'}, 'Comp B': {'Street': '456 Street'}}",
    2: "{'Comp C': {'Street': '749 Street'}}",
    3: "{'Comp D': {'Street': ''}}",
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['dict'])
f = partial(pd.DataFrame.from_dict, orient='index')
dft = pd.concat(
    map(f, df['dict'].map(literal_eval)),
    keys=df.index,  # use the original index to identify where each record comes from
    names=['id', 'Company']
).reset_index('Company')
print(dft)
The output:
Company Street
id
1 Comp A 123 Street
1 Comp B 456 Street
2 Comp C 749 Street
3 Comp D
P.S.
Let's say, that:
data = [
    {'ID': 1, 'entity_details': "{'comp a': {'street_address': '123 street'}}"},
    {'ID': 2, 'entity_details': "{'comp b': {'street_address': '456 street'}}"},
    {'ID': 3, 'entity_details': "{'comp c': {'street_address': '555 street'}, 'comp d': {'street_address': '585 street'}, 'comp e': {'street_address': '873 street'}}"},
    {'ID': 4, 'entity_details': "{'comp f': {'street_address': '898 street'}}"}]
df = pd.DataFrame(data).set_index('ID')
In this case the only thing we should change in the code is the initial column name. It was dict, and now it's entity_details:
pd.concat(
    map(f, df['entity_details'].map(literal_eval)),
    keys=df.index,
    names=['id', 'Company']
).reset_index('Company')

How to create a multiIndex (hierarchical index) dataframe object from another df's column's unique values?

I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.
Is there an easier way to have this information summarized besides creating this dataframe?
Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...
Original Dataframe:
data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'],
                     'B': [2001, 2002, 2003, 2004, 2005],
                     'C': ['F', 'M', 'F', 'F', 'M'],
                     'D': [0, 0, 0, 1, 0],
                     'E': [1, 0, 1, 0, 1],
                     'F': [1, 1, 0, 0, 0]})
A B C D E F
0 case 2001 F 0 1 1
1 case 2002 M 0 0 1
2 case 2003 F 0 1 0
3 case 2004 F 1 0 0
4 case 2005 M 1 1 0
Desired outcome:
unique percent
A case 100
B 2001 20
2002 20
2003 20
2004 20
2005 20
C F 60
M 40
D 0 80
1 20
E 0 40
1 60
F 0 60
1 40
My failed for loop attempt:
def unique_values(df):
    values = {}
    columns = []
    df = pd.DataFrame(values, columns=columns)
    for col in data:
        df2 = data[col].value_counts(normalize=True)*100
        values = values.update(df2.to_dict)
        columns = columns.append(col*len(df2))
    return df

unique_values(data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
11
12
---> 13 unique_values(data)
<ipython-input-84-a341284fb859> in unique_values(df)
5 for col in data:
6 df2 = data[col].value_counts(normalize=True)*100
----> 7 values = values.update(df2.to_dict)
8 columns = columns.append(col*len(df2))
9 return df
TypeError: 'method' object is not iterable
Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.
This is a fairly straightforward application of .melt:
data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)
output
index
variable value
A case 1.0
B 2001 0.2
2002 0.2
2003 0.2
2004 0.2
2005 0.2
C F 0.6
M 0.4
D 0 0.8
1 0.2
E 0 0.4
1 0.6
F 0 0.6
1 0.4
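Two notes on this: multiply by 100 if you want percentages as in the desired outcome, and the TypeError in the original attempt comes from dict.update and list.append returning None (and from df2.to_dict being passed without being called). A repaired sketch of the iteration approach that builds the MultiIndex explicitly:
import pandas as pd

def unique_values(df):
    pieces = []
    for col in df:
        s = df[col].value_counts(normalize=True) * 100
        # prefix each value with its column name to form the outer index level
        s.index = pd.MultiIndex.from_product([[col], s.index])
        pieces.append(s)
    out = pd.concat(pieces).to_frame(name='percent')
    out.index.names = ['column', 'unique']
    return out.sort_index()

unique_values(data)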
I'm sorry! I've written an answer, but it's in JavaScript. I thought I had clicked on a JavaScript question and started coding, and only on posting did I see that you're coding in Python.
I will post it anyway; maybe it will help you. Python is not that much different from JavaScript ;-)
const data = {
  A: ["case", "case", "case", "case", "case"],
  B: [2001, 2002, 2003, 2004, 2005],
  C: ["F", "M", "F", "F", "M"],
  D: [0, 0, 0, 1, 0],
  E: [1, 0, 1, 0, 1],
  F: [1, 1, 0, 0, 0]
};

const getUniqueStats = (_data) => {
  const results = [];
  for (let row in _data) {
    // create list of unique values
    const s = [...new Set(_data[row])];
    // count each unique value's occurrences for its percentage, then push
    results.push({
      index: row,
      values: s.map((x) => ({
        unique: x,
        percentage: (_data[row].filter((y) => y === x).length / _data[row].length) * 100
      }))
    });
  }
  return results;
};

const results = getUniqueStats(data);
results.forEach((row) =>
  row.values.forEach((value) =>
    console.log(`${row.index}\t${value.unique}\t${value.percentage}%`)
  )
);
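For reference, a rough Python equivalent of the JavaScript above, using plain dicts and lists rather than pandas (a sketch, not a drop-in answer):
data = {
    'A': ['case', 'case', 'case', 'case', 'case'],
    'B': [2001, 2002, 2003, 2004, 2005],
    'C': ['F', 'M', 'F', 'F', 'M'],
    'D': [0, 0, 0, 1, 0],
    'E': [1, 0, 1, 0, 1],
    'F': [1, 1, 0, 0, 0],
}

def get_unique_stats(data):
    results = []
    for col, values in data.items():
        # unique values in first-appearance order
        for unique in sorted(set(values), key=values.index):
            pct = values.count(unique) / len(values) * 100
            results.append((col, unique, pct))
    return results

for col, unique, pct in get_unique_stats(data):
    print(f"{col}\t{unique}\t{pct}%")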

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the AGE column with integers between two values. So, for example, I wish to replace each value with the age range '35-44' by a random integer between 35 and 44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care how the values are distributed, as long as they are in the range that I assign:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
produces a single random value, making the statement analogous to:
df.loc[df["AGE"] == '35-44', 'AGE'] = 38  # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead:
import numpy as np
m = df["AGE"] == '35-44'
# np.random.randint's upper bound is exclusive, so use 45 to allow 44 to be drawn
df.loc[m, 'AGE'] = np.random.randint(35, 45, m.sum())
(Series.sum is used to "count" the number of True values in the Series since True is 1 and False is 0)
df:
Prod Line AGE
0 abc Revenues 40
1 qrt EBT 20-34
2 xyz Expenses 41
3 xam Revenues 35
4 asc EBT 45-70
5 yat Expenses 36
*Illustrative output; the draws are random, so your values will vary.
Naturally, using the filter on both sides of the expression with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
Prod Line AGE
0 abc Revenues 36
1 qrt EBT 20-34
2 xyz Expenses 37
3 xam Revenues 43
4 asc EBT 45-70
5 yat Expenses 44
*Reproducible with random.seed(28)
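If you ever need to randomize every range-valued cell, not just '35-44', a sketch that parses the bounds from the string (assuming all ranges look like 'lo-hi' and should be inclusive on both ends):
import numpy as np

rng = np.random.default_rng()
for age_range in ['35-44']:  # extend this list to cover other ranges
    lo, hi = map(int, age_range.split('-'))
    m = df['AGE'] == age_range
    # endpoint=True makes the upper bound inclusive, matching random.randint
    df.loc[m, 'AGE'] = rng.integers(lo, hi, size=m.sum(), endpoint=True)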

pandas row wise comparison and apply condition

This is my dataframe:
df = pd.DataFrame(
    {
        "name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
        "score": [3, 5, 6, 2, 4, 1],
    }
)
I want to compare the score of bob_x with bob_y and retain the row with the lowest, and do the same for jay_x and jay_y. No change is required for mad and joe.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
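If you need to keep whole rows (including any other columns) rather than collapsing to name/score pairs, a variant with idxmin that groups on the stripped name without overwriting the name column (starting from the original df):
key = df['name'].str.split('_').str[0]     # 'bob_x' -> 'bob', 'mad' -> 'mad'
df.loc[df.groupby(key)['score'].idxmin()]  # full rows with the lowest score per key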

How to merge columns using mask

I am trying to merge two columns (phone1 and phone2).
Here is my fake data:
import pandas as pd
employee = {'EmployeeID': [0, 1, 2, 3, 4, 5, 6, 7],
            'LastName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
            'Name': ['w', 'x', 'y', 'z', None, None, None, None],
            'phone1': [1, 1, 2, 2, 4, 5, 6, 6],
            'phone2': [None, None, 3, 3, None, None, 7, 7],
            'level_15': [0, 1, 0, 1, 0, 0, 0, 1]}
df2 = pd.DataFrame(employee)
and I want the 'phone' column to be
'phone' : [1, 2, 3, 4, 5, 7, 9, 10]
At the beginning of my code, I split the names on '/'; the code below creates a column of 0s and 1s, which I use as a mask for other tasks throughout my code.
df2 = (df2.set_index(cols)['name']
          .str.split('/', expand=True)
          .stack()
          .reset_index(name='Name'))
m = df2['level_15'].eq(0)
print(m)
# remove the level_15 column
df2 = df2.drop(['level_15'], axis=1)
# add last name: take the first letters where the condition holds, replace NaNs by forward fill
df2['last_name'] = df2['name'].str[:2].where(m).ffill()
df2['name'] = df2['name'].mask(m, df2['name'].str[2:])
I feel like there is a way to merge phone1 and phone2 using the 0s and 1s, but I can't figure it out. Thank you.
First, fill in the NaNs:
df2['phone2'] = df2.phone2.fillna(df2.phone1)
# Alternatively, based on your latest update
# df2['phone2'] = df2.phone2.mask(df2.phone2.eq(0)).fillna(df2.phone1)
You can just use np.where to merge columns on odd/even indices:
df2['phone'] = np.where(np.arange(len(df2)) % 2 == 0, df2.phone1, df2.phone2)
df2 = df2.drop(['phone1', 'phone2'], axis=1)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
Or, with Series.where/mask:
df2['phone'] = df2.pop('phone1').where(
    np.arange(len(df2)) % 2 == 0, df2.pop('phone2')
)
Or,
df2['phone'] = df2.pop('phone1').mask(
    np.arange(len(df2)) % 2 != 0, df2.pop('phone2')
)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
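Since the pop-based variants above mutate df2, run only one of them. For reference, a self-contained run of the np.where variant on the question's data (the resulting numbers depend on the actual phone values, so they may differ from the table above):
import numpy as np
import pandas as pd

employee = {'EmployeeID': [0, 1, 2, 3, 4, 5, 6, 7],
            'LastName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
            'Name': ['w', 'x', 'y', 'z', None, None, None, None],
            'phone1': [1, 1, 2, 2, 4, 5, 6, 6],
            'phone2': [None, None, 3, 3, None, None, 7, 7]}
df2 = pd.DataFrame(employee)

df2['phone2'] = df2['phone2'].fillna(df2['phone1'])    # backfill missing phone2 from phone1
df2['phone'] = np.where(np.arange(len(df2)) % 2 == 0,  # even rows take phone1,
                        df2['phone1'], df2['phone2'])  # odd rows take phone2
df2 = df2.drop(['phone1', 'phone2'], axis=1)
print(df2)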