Converting pandas value_counts to a specific type

I am trying to display the values of a particular dataframe column as percentages of its grand total. I have a constraint that the result must be a specific data type (a numpy float does not fly).
My code is quite simple:
dict(df['marital_status'].value_counts().transform(lambda x: x/sum(x)))
I tried astype() and casting the values within the transform function itself, but no joy.

Instead of your function, use normalize=True in Series.value_counts, then multiply by 100 for percentages and, if you need integers, cast with astype after rounding:
print (df)
marital_status
0 1
1 0
2 1
3 1
4 1
5 0
6 0
d = df['marital_status'].value_counts(normalize=True).mul(100).round().astype(int).to_dict()
print (d)
{1: 57, 0: 43}
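For reference, here is a self-contained sketch of the whole pipeline (the sample values are rebuilt from the question, and the final dict-comprehension cast is an extra safeguard, not something the answer requires):
import pandas as pd

df = pd.DataFrame({'marital_status': [1, 0, 1, 1, 1, 0, 0]})

# normalize=True already returns proportions, so no manual transform
# is needed; mul(100) turns them into percentages.
d = (df['marital_status']
     .value_counts(normalize=True)
     .mul(100)
     .round()
     .astype(int)
     .to_dict())

# If the consumer strictly requires plain Python ints rather than
# numpy scalars, an explicit cast removes any doubt:
d = {k: int(v) for k, v in d.items()}
print(d)  # {1: 57, 0: 43}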

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (a small sample is shown below; the real one has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but it only returns one column name even when several columns share the minimum. How can I change it to return multiple column names when more than one column holds the minimum value?
Please note, I want row-wise results, i.e. the minimum column names for each row.
Thanks!
Try the code below and see if the output format is what you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
The question "Get column name where value is something in pandas dataframe" might also be helpful.
EDIT: adding the full code context and the output below.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the grand minimum, then compare the values to it and keep the columns that contain a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
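Note that the snippet above looks for the grand minimum of the whole frame. For the row-wise lists the asker described, a minimal vectorized sketch (reusing the same assumed sample frame) could be:
import pandas as pd

df = pd.DataFrame({'A': [5, 0, 6, 5, 4],
                   'B': [8, 0, 9, 2, 7],
                   'C': [9, 1, 2, 4, 7],
                   'D': [5, 7, 4, 2, 9]})

# Compare each cell to its own row minimum (broadcast along axis=0),
# then collect the matching column names per row.
is_min = df.eq(df.min(axis=1), axis=0)
mins = is_min.apply(lambda row: list(row.index[row]), axis=1)
print(mins)
# 0    [A, D]
# 1    [A, B]
# 2       [C]
# 3    [B, D]
# 4       [A]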

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas Series using a user-defined function and to write the output into a new column. I figured out the individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x, y):, something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB".
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
import numpy as np
import pandas as pd

df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0
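If you prefer to keep a named function, as the asker originally tried, a sketch along these lines also works (common_chars is a hypothetical helper name, not part of the original code):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})

def common_chars(prev, cur):
    # The first row has no predecessor, so shift() hands us NaN there.
    if pd.isna(prev):
        return np.nan
    # Count the characters of the current code that appear in the previous one.
    return sum(ch in prev for ch in cur)

# Pair each code with its predecessor and apply the helper row by row.
df['R1SB'] = [common_chars(p, c)
              for p, c in zip(df['Code'].shift(), df['Code'])]
print(df)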

Pandas Series: Decrement DateTime by 100 Years

I have a pandas series as follows...
0 2039-03-16
1 2056-01-21
2 2051-11-18
3 2064-03-05
4 2048-06-05
Name: BIRTH, dtype: datetime64[ns]
It was created from string data as follows
s = data['BIRTH']
s = pd.to_datetime(s)
s
I want to convert all dates after the year 2040 back to the 1900s (e.g. 2056 should become 1956).
I can do this for a single record as follows
s.iloc[0].replace(year=s.iloc[0].year - 100)
but I really want to just run it over the whole series. I can't work it out. Help!??
PS - I know there's ways outside of pandas using Python's DT module but I'd like to learn how to do this within Pandas please
Using DateOffset is the obvious choice here:
df['date'] - pd.offsets.DateOffset(years=100)
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Assign it back:
df['date'] -= pd.offsets.DateOffset(years=100)
df
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
We have the offsets module to deal with non-fixed frequencies; it comes in handy in situations like these.
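Note that the question only asked to shift dates falling after 2040. A conditional sketch using Series.mask (the cutoff timestamp is an assumption read off the question) could be:
import pandas as pd

s = pd.to_datetime(pd.Series(['2039-03-16', '2056-01-21', '2051-11-18',
                              '2064-03-05', '2048-06-05'], name='BIRTH'))

# Shift only the entries at or beyond the cutoff; earlier dates stay put.
cutoff = pd.Timestamp('2040-01-01')
s = s.mask(s >= cutoff, s - pd.offsets.DateOffset(years=100))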
To fix your code, you'd want to apply datetime.replace row-wise using apply (not recommended):
df['date'].apply(lambda x: x.replace(year=x.year-100))
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Or using a list comprehension,
df.assign(date=[x.replace(year=x.year-100) for x in df['date']])
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Neither of these handles NaT entries very well.
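If NaT values can occur, one guarded sketch of the element-wise version (an assumption; the sample data shows none) is:
import pandas as pd

# Continuing with the same df as above: skip NaT entries so .year is
# only read from valid timestamps.
df['date'] = df['date'].map(
    lambda x: x.replace(year=x.year - 100) if pd.notna(x) else x)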

Division between two numbers in a Dataframe

I am trying to calculate a percent change between 2 numbers in one column when a signal from another column is triggered.
The trigger can be found with np.where(), but what I am having trouble with is the percent change. .pct_change(-5) does not work because it gives 16.03/20.35, and I want the ratio the other way around: 20.35/16.03. See the table below. I have tried feeding the index array returned by np.where into .iloc on the 'Close' column, but it says I can't use that array as an .iloc position. Can anyone help me solve this problem? Thank you.
IdxNum | Close | Signal (1s)
==============================
0 21.45 0
1 21.41 0
2 21.52 0
3 21.71 0
4 20.8 0
5 20.35 0
6 20.44 0
7 16.99 0
8 17.02 0
9 16.69 0
10 16.03 1<< 26.9% <<< 20.35/16.03-1 (df.Close[5]/df.Close[10]-1)
11 15.67 0
12 15.6 0
You can try this code block:
# Create DataFrame
df = pd.DataFrame({'IdxNum': range(13),
                   'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
                             16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
                   'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1  # .ix is removed in modern pandas; use .loc

# Create a function that calculates the required diff
def cal_diff(row):
    if row['Signal'] == 1:
        signal_index = int(row['IdxNum'])
        row['diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
    return row

# Create a column and apply that difference
df['diff'] = 0
df = df.apply(cal_diff, axis=1)
In case you don't have an IdxNum column, you can use the index to calculate the difference:
# Create DataFrame
df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
              16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# Calculate the required difference
df['diff'] = 0
signal_index = df[df['Signal'] == 1].index[0]
df.loc[signal_index, 'diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
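Since the asker mentioned np.where(), a fully vectorized sketch is also possible (the 5-row lookback is an assumption read off the example table):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
              16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# shift(5) aligns the close from 5 rows earlier with the current row,
# so the ratio runs the way the asker wanted: older / newer - 1.
df['diff'] = np.where(df['Signal'] == 1,
                      df['Close'].shift(5) / df['Close'] - 1,
                      0)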

grouping by column and then doing a boxplot by the index in pandas

I have a large dataframe which I would like to group by some column and examine graphically the distribution per group using a boxplot. I found that df.boxplot() will do it for each column of the dataframe and put it in one plot, just as I need.
The problem is that after a groupby operation, my data is all in one column with the group labels in the index, so I can't call boxplot on the result.
Here is an example:
from numpy.random import rand
from pandas import DataFrame

df = DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})
df
a b
0 0.273548 0
1 0.378765 1
2 0.190848 0
3 0.646606 1
4 0.562591 0
5 0.409250 1
6 0.637074 0
7 0.946864 1
8 0.203656 0
9 0.276929 1
Now I want to group by column b and draw one boxplot showing the distribution of each group side by side. How can I do that?
You can use the by argument of boxplot. Is that what you are looking for?
df.boxplot(column='a', by='b')
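If you do want to go through groupby explicitly, here is a sketch of that route (equivalent to by='b' here; the reset_index is needed because the groups keep their original row labels):
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import rand

df = pd.DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})

# Pivot each group's values into its own column, then DataFrame.boxplot
# draws one box per column, i.e. one box per group.
grouped = pd.DataFrame({key: grp.reset_index(drop=True)
                        for key, grp in df.groupby('b')['a']})
grouped.boxplot()
plt.show()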