Update Column Value for previous entries of row value, based on new entry - pandas

I have a dataframe that is updated monthly as such, with a new row for each employee.
If an employee decides to change their gender (for example, employee 20215 changed from M to F in April 2022), I want all previous entries for that employee number 20215 to be switched to F as well.
This is for a database with roughly 15 million entries, and multiple such changes every month, so I was hoping for a scalable solution (I cannot simply put df['Gender'] = 'F', for example).

Since we didn't receive a df or any code from you, I needed to generate something myself in order to test it. Please provide enough code and a sample next time as well.
Here is the generated df, in case someone comes up with a better answer:
import pandas as pd, numpy as np
length = 100
df = pd.DataFrame({'ID': np.random.randint(1001, 1020, length),
                   'Ticket': np.random.randint(length),
                   'salary_grade': np.random.randint(0, 10, size=length),
                   'date': np.arange(length),
                   'genre': 'M'})
df['date'] = pd.to_numeric(df['date'])
df['date'] = pd.to_datetime(df['date'], dayfirst=True, unit='D', origin='15.04.2022')
That is the base DF; now I needed to simulate some gender changes:
test_id = df.groupby(['ID'])['genre'].count().idxmax()  # gives me the employee with the most entries
test_id
df[df['ID'] == test_id].loc[:, 'genre']  # getting all indexes from test_id, for a test change / later for checking
df[df['ID'] == test_id]  # getting indexes of test_id for gender change
id_lst = []
for idx in df[df['ID'] == test_id].index:
    if idx > 28:  # <-- change this value for your generated df, middle of list
        id_lst.append(idx)  # returns a list of indexes where the gender change will happen
df.loc[id_lst, 'genre'] = 'F'  # applying a gender change
Answer:
Finally, to your answer:
finder = df.groupby(['ID']).agg({'genre': lambda x: len(list(pd.unique(x))) > 1, 'date': 'min'})  # will return True for every ID with more than one genre
finder[finder['genre']] # will return IDs from above condition.
Next steps...
Now with the ID you just need to discover whether it's M-->F or F-->M (new_genre) and assign the new genre for the ID_found (int or list).
df.loc[ID_found,'genre']=new_genre
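To keep it scalable (one vectorized pass instead of a loop over IDs), here is a minimal sketch of the same idea, assuming the column names from the generated df above and that the most recent entry per employee is the one that should win:
# sort so that 'last' really is the newest entry per ID,
# then broadcast that value back onto every row of the same ID
df = df.sort_values('date')
df['genre'] = df.groupby('ID')['genre'].transform('last')
This overwrites all earlier rows of each ID with that ID's latest genre in a single groupby, which should hold up reasonably well even on ~15 million rows.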

Related

How can I always choose the last column in a csv table that's updated monthly?

Automating small business reporting from my Quickbooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the csv file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second right most column, or is this a stupid way to try and get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)
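If you want the whole second-to-last column rather than a single cell, a short sketch of the same idea:
# second-to-last column as a Series, whatever it happens to be called this month
x = df.iloc[:, -2]
# or, by name
x = df[df.columns[-2]]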
If you plan to use the full file, @VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second right most column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
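If the month name can't be inferred reliably, a hedged variation is to peek at the header first and then pick the second-to-last column by position (the file path here is just a placeholder):
# read only the header row to discover the column names
header_cols = pd.read_csv("path/to/file.csv", nrows=0).columns
# then read just the second-to-last column
df = pd.read_csv("path/to/file.csv", usecols=[header_cols[-2]])
print(df.iloc[-1, 0])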
References
pd.read_csv
strftime cheat-sheet

Selecting Rows Based On Specific Condition In Python Pandas Dataframe

So I am new to using Python Pandas dataframes.
I have a dataframe with one column representing customer ids and the other holding flavors and satisfaction scores that looks something like this.
Although each customer should have 6 rows dedicated to them, Customer 1 only has 5. How do I create a new dataframe that will only print out customers who have 6 rows?
I tried doing: df['Customer No'].value_counts() == 6 but it is not working.
Here is one way to do it.
If you post the data as code (preferably) or text, I would be able to share the result.
# create a temporary column 'c' by grouping on Customer No
# and assigning count to it using transform
# finally, using loc to select rows that have a count eq 6
(df.loc[df.assign(
    c=df.groupby(['Customer No'])['Customer No']
    .transform('count'))['c'].eq(6)]
)
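Since no sample data was posted, here is a small self-contained sketch with made-up customers (the Score column is invented; 'Customer No' is taken from the question) showing the same count-and-filter idea:
import pandas as pd
# made-up sample: customer 1 has only 5 rows, customer 2 has the full 6
df = pd.DataFrame({'Customer No': [1] * 5 + [2] * 6,
                   'Score': [3, 4, 5, 4, 3, 5, 4, 4, 3, 5, 4]})
# keep only the rows of customers that appear exactly 6 times
counts = df.groupby('Customer No')['Customer No'].transform('count')
print(df.loc[counts.eq(6)])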

Pandas aggregate by unique occurrence per group

In pandas, I'd like to analyze groups if there is a single occurrence of a conditional value. I've included a sample dataframe with a first step attempt at identifying such groups below. So, let's say, in the data frame below, I want to filter the original data frame only for species of iris that ever had a sepal length greater than 6. In the last command, I'm counting the number of unique species groups that had a sepal length greater than 6 (so, at least I can count them).
But, what I really want is the original dataframe where I analyze rows only if the species had a sepal length greater than 6 (so, it would be a dataframe without the species "setosa" since they never have one).
The longer explanation is that I have a real dataset of users. Each user will have values in certain columns that may exceed a threshold value of interest. I haven't figured out how to analyze users who have these threshold values.
Perhaps a loop would be better. I might loop through each unique user name, check whether any row for that user ever exceeds a certain value, and mark it in some kind of new column (though I know loops are frowned upon in pandas, so I'm posting here to see if there's a well-known method of identifying groups by occurrence).
Thanks and let me know if I can make this question any more clear!
import pandas as pd
import seaborn as sns
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
iris = sns.load_dataset('iris')
iris['longsepal'] = iris['sepal_length'] > 7
iris['longpetal'] = iris['petal_length'] > 5
iris.groupby(['longsepal'])['species'].nunique()
Consider groupby().transform() to calculate inline max aggregates per species, which can then be filtered on. Technically, the > 7 threshold only returns one species, because versicolor's max reaches exactly 7.0. Below shows the operator and functional forms of the inequality logic.
iris['longsepal'] = iris.groupby(['species'])['sepal_length'].transform('max')
iris['longpetal'] = iris.groupby(['species'])['petal_length'].transform('max')
# DATA FILTERS
longsepal_iris = iris.loc[iris['longsepal'] > 7] # GREATER THAN OPERATOR FORM: >
longsepal_iris = iris.loc[iris['longsepal'].gt(7)] # GREATER THAN FUNCTIONAL FORM: gt()
longpetal_iris = iris.loc[iris['longpetal'] > 5] # GREATER THAN OPERATOR FORM: >
longpetal_iris = iris.loc[iris['longpetal'].gt(5)] # GREATER THAN FUNCTIONAL FORM: gt()
# SPECIES
longsepal_iris['species'].unique()
# ['virginica']
longpetal_iris['species'].unique()
# ['versicolor' 'virginica']
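If you'd rather get the filtered original dataframe in one step (only the rows of species that ever exceed the threshold), a hedged alternative is groupby().filter(), which keeps or drops whole groups:
# keep every row of a species if that species ever has a sepal_length > 6
long_sepal_iris = iris.groupby('species').filter(lambda g: g['sepal_length'].gt(6).any())
long_sepal_iris['species'].unique()
# ['versicolor' 'virginica']  (setosa never exceeds 6, so all of its rows are dropped)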

Generating Percentages from Pandas

I am working with a data set from SQL currently -
import pandas as pd
df = spark.sql("select * from donor_counts_2015")
df_info = df.toPandas()
print(df_info)
The output looks like this (I can't include the actual output for privacy reasons).
As you can see, it's a data set that has the name of a fund and then the number of people who have donated to that fund. What I am trying to do now is calculate what percent of funds have only 1 donation, what percent have 2, 3, 4, etc. I am wondering if there is an easy way to do this with pandas? I would also appreciate being able to see the percentage for a range of funds too, like what percentage of funds have between 50-100 donations, 500-1000, etc. Thanks!
You can make a histogram of the donations to visualize the distribution. np.histogram might help. Or you can also sort the data and count manually.
For the first task, to get the percentages for the column 'number_of_donations', you can do:
df['number_of_donations'].value_counts(normalize=True) * 100
For the second task, you need to create a new column with categories, and then do the same:
# Create a Series with the categories
New_Serie = pd.cut(df.number_of_donations, bins=[0, 100, 200, 500, 99999999], labels=['Few', 'Medium', 'Many', 'Too Many'])
# Change the name of the column
New_Serie.name = 'Category'
# Concat df and New_Serie
df = pd.concat([df, New_Serie], axis=1)
# Get the percentage of the categories
df['Category'].value_counts(normalize=True) * 100
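Since the real output can't be shared, here is a small self-contained sketch with made-up donation counts (the column name number_of_donations is taken from the answer above) that runs both steps end to end:
import pandas as pd
# made-up data: one row per fund with its donation count
df = pd.DataFrame({'fund': list('ABCDEFGH'),
                   'number_of_donations': [1, 1, 2, 2, 2, 75, 600, 1500]})
# percentage of funds per exact donation count
print(df['number_of_donations'].value_counts(normalize=True) * 100)
# percentage of funds per donation range
bins = [0, 1, 2, 100, 1000, float('inf')]
labels = ['1', '2', '3-100', '101-1000', '1000+']
df['Category'] = pd.cut(df['number_of_donations'], bins=bins, labels=labels)
print(df['Category'].value_counts(normalize=True) * 100)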

pandas mapping function to transform data using dictionary

Please help, friends.
I want to use mapping to match the age of each student and identify them in the category adult or child by comparing it with a dictionary 'dlist' containing ages 1 to 18 as child and ages 19 to 60 as adult.
import numpy as np
import pandas as pd
# making Data Frame
age = np.random.randint(1, 50, 5, int)
name = ['kashif', 'dawood', 'ali', 'zain', 'hamza']
df5 = pd.DataFrame({'name': name,
                    'age': age})
# making dictionary
dlist = {range(1, 18): 'child', range(19, 50): 'adult'}
# now mapping the dictionary onto the data frame 'age' column to add status adult if age is greater than 18
df5['Status'] = df5.age.map(dlist)
But it returns the data frame with a column named 'Status' containing NaN values (instead of adult or child).
Kindly ignore my English if there are mistakes; I am not a native speaker.
In Python 3, you are allowed to use ranges as dict-keys, but it does not work the way you seem to think. For example
print(dlist[1])
will give you a key-error, since the key 1 does not exist in dlist, however
print(dlist[range(1,18)])
will work, since you have a key that is range(1,18). This means that you cannot use your dlist the way you want with the map function.
To make use of your dict, with ranges as keys, you should instead use apply
df5['Status'] = df5['age'].apply(
    lambda x: next((v for k, v in dlist.items() if x in k), 'NA')
)
Where [v for k, v in dlist.items() if x in k] gives you a list of all values in your dict where x is in k (which is a range). The next() function gets the next value (i.e. the first value) in that list (but it also works on iterators, and that's why the [] can be omitted). 'NA' is the default value for next() if no next element exists. See https://docs.python.org/3/library/functions.html#next
You should however note that range(1,18) does NOT include 18, so with this code the age 18 will give you Status = 'NA'.
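A small tweak of the same sketch with that gap closed (the upper bound of a range is exclusive, so it has to be one past the last age you want in the bucket):
# ranges that actually cover ages 1-18 (child) and 19-49 (adult)
dlist = {range(1, 19): 'child', range(19, 50): 'adult'}
df5['Status'] = df5['age'].apply(
    lambda x: next((v for k, v in dlist.items() if x in k), 'NA')
)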
You can achieve this by using np.where:
df5['status'] = np.where((df5['age']>=1) & (df5['age']<=18), 'child', 'adult')
print(df5)
name age status
kashif 15 child
dawood 11 child
ali 33 adult
zain 21 adult
hamza 31 adult
That's my personal preference when working with pandas: I always use the cut() method of pandas with a list of labels and bins to create a categorical variable:
import numpy as np
import pandas as pd
# making Data Frame
np.random.seed(41)
age=np.random.randint(1,50,5,int)
name=['kashif', 'dawood', 'ali', 'zain', 'hamza']
df=pd.DataFrame({'name':name, 'age':age})
# create the bins
bins = [0, 18, 50]
# create the bin labels
label_list = ['child', 'adult']
# create a new column with bin and label
df['status'] = pd.cut(df.age, bins, labels=label_list)
Use np.select:
# specify conditions
conditions = [(df5['age'] <= 18),
              (df5['age'] > 18) & (df5['age'] <= 50)]
# specify column output based on conditions
choices = ['child', 'adult']  # you can also specify numbers here
# create status column based on conditions
df5["status"] = np.select(conditions, choices)