pandas: add a '*' for values less than .05

I have a pandas dataframe of p values.
          disorder  p value(group)  p value(cluster)  p value(interaction)
0  Specific phobia           0.108             0.022                 0.075
1    Social phobia           0.848             0.001                 0.690
2       Depression           0.923             0.034                 0.016
3             PTSD           0.519             0.039                 0.004
4              ODD           0.013             0.053                 0.003
5             ADHD           0.876             0.062                 0.012
How can I add '*' to those values that are less than .05?

You can use mask to replace the qualifying cells with their string form plus '*':
df.iloc[:, 1:] = df.iloc[:, 1:].mask(
    df.iloc[:, 1:].le(0.05),
    df.astype(str).apply(lambda x: x.str[:5]).add('*')
)

Try something like this -- note the comparison must use float (not int) and the threshold 0.05 (not 0.5), and it has to be applied elementwise:
df = df.astype(str)
specific_columns = ['Column Name you want to check on']
df[specific_columns] = df[specific_columns].applymap(lambda x: x + "*" if float(x) < 0.05 else x)

I have just figured out a way to add '*' and '**' to values less than .05 and .01, respectively. Both masks are built from the untouched copy report2, so the second assignment can overwrite the first for values below .01:
report2 = report.copy()
report[report2.iloc[:, 1:].le(0.05)] = report[report2.iloc[:, 1:].le(0.05)].astype(str).apply(lambda x: x.str[:5]).add('*')
report[report2.iloc[:, 1:].le(0.01)] = report[report2.iloc[:, 1:].le(0.01)].astype(str).apply(lambda x: x.str[:5]).add('**')
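As a cross-check, here is a minimal self-contained sketch of the same idea using numpy.select (an alternative not used in the answers above; column names are taken from the question, data shortened to two rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "disorder": ["Specific phobia", "ODD"],
    "p value(group)": [0.108, 0.013],
    "p value(cluster)": [0.022, 0.053],
})

# For each p-value column: append '**' below .01, '*' below .05,
# otherwise keep the plain string representation.
for col in df.columns[1:]:
    p = df[col]
    df[col] = np.select(
        [p.le(0.01), p.le(0.05)],
        [p.astype(str) + "**", p.astype(str) + "*"],
        default=p.astype(str),
    )

print(df)
```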

Related

Add a column with the highest correlation to the previously ranked variables in pandas

Let's say I have a dataframe (sorted on rank) that looks as follows:
   Num  score  rank
0    1  1.937   6.0
0    2  1.819   5.0
0    3  1.704   4.0
0    6  1.522   3.0
0    4  1.396   2.0
0    5  1.249   1.0
and I want to add a column that displays the highest correlation of that variable to the variables that were ranked above.
With the said correlation matrix being:
      Num1   Num2   Num3   Num4   Num5   Num6
Num1  1.000  0.976  0.758  0.045  0.137  0.084
Num2  0.976  1.000  0.749  0.061  0.154  0.096
Num3  0.758  0.749  1.000 -0.102  0.076 -0.047
Num4  0.045  0.061 -0.102  1.000  0.917  0.893
Num5  0.137  0.154  0.076  0.917  1.000  0.863
Num6  0.084  0.096 -0.047  0.893  0.863  1.000
I would expect to get:
   Num  score  rank  highestcor
0    1  1.937   6.0         NaN
0    2  1.819   5.0       0.976
0    3  1.704   4.0       0.758
0    6  1.522   3.0       0.096
0    4  1.396   2.0       0.893
0    5  1.249   1.0       0.917
How would I go about this in an efficient way?
Here's one way to do it in numpy:
# Convert the correlation dataframe to numpy array
corr = corr_df.to_numpy()
# Fill the diagonal with negative infinity
np.fill_diagonal(corr, -np.inf)
# Rearrange the correlation matrix in Rank order. I assume
# df["Num"] column contains number 1 to n
num = df["Num"] - 1
corr = corr[num, :]
corr = corr[:, num]
# Mask out the upper triangle with -np.inf because these
# columns rank lower than the current row. `triu` = triangle
# upper
triu_index = np.triu_indices_from(corr)
corr[triu_index] = -np.inf
# And the highest correlation is simply the max of the
# remaining columns
highest_corr = corr.max(axis=1)
highest_corr[0] = np.nan
df["highest_corr"] = highest_corr
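To sanity-check the recipe above, here is a self-contained run on a small made-up 4-variable correlation matrix (not the data from the question):

```python
import numpy as np
import pandas as pd

# Toy frame, already sorted by descending rank
df = pd.DataFrame({"Num": [2, 4, 1, 3],
                   "score": [1.9, 1.5, 1.2, 1.0],
                   "rank": [4.0, 3.0, 2.0, 1.0]})
corr_df = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.4],
     [0.9, 1.0, 0.3, 0.5],
     [0.2, 0.3, 1.0, 0.8],
     [0.4, 0.5, 0.8, 1.0]],
    index=["Num1", "Num2", "Num3", "Num4"],
    columns=["Num1", "Num2", "Num3", "Num4"])

corr = corr_df.to_numpy()
np.fill_diagonal(corr, -np.inf)
num = df["Num"].to_numpy() - 1               # 0-based positions
corr = corr[num, :][:, num]                  # reorder into rank order
corr[np.triu_indices_from(corr)] = -np.inf   # mask lower-ranked columns
highest_corr = corr.max(axis=1)
highest_corr[0] = np.nan                     # top rank has no predecessor
df["highest_corr"] = highest_corr
print(df)
```

For Num1 (third row) the variables ranked above are Num2 and Num4, so its entry is max(0.9, 0.4) = 0.9.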

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on the quantile percentage. So if one group has q=0.8, I want the lowest 80% of values given 1, and the upper 20% of values given 0.
So, given the data like this:
I want objects 1, 2 and 5 to get result 1 and the other three result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but for that I need a constant quantile measure, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000
qua = np.around(np.random.uniform(size=grp_num), 2)
df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
Answer: (This is basically a for loop, so for performance you can try applying numba to it)
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True).astype(int)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
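A usually faster variant computes one scalar threshold per group and maps it back onto the rows, instead of letting apply build a per-row result. A minimal sketch on a tiny made-up frame (column names as in the setup above):

```python
import pandas as pd

df = pd.DataFrame({
    "Group":    [0, 0, 0, 1, 1, 1],
    "Quantile": [0.5, 0.5, 0.5, 0.8, 0.8, 0.8],
    "Value":    [10, 20, 30, 1, 2, 3],
})

# One threshold per group (each group carries a single Quantile value),
# then broadcast back onto the rows via map.
thresholds = {g: sub["Value"].quantile(sub["Quantile"].iloc[0])
              for g, sub in df.groupby("Group")}
df["result"] = (df["Value"] < df["Group"].map(thresholds)).astype(int)
print(df)
```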

Calculate Compound Interest in Pandas

I have been trying to work out how to calculate the future value of a savings account where each month I must deposit $100.
import pandas as pd
# deposit an extra $100 per month
deposit = [100] * 4
# unbelievable rate of 10%!
rate = [0.1] * 4
df = pd.DataFrame({ 'deposit':deposit, 'rate':rate})
df['interest'] = df.deposit * df.rate
df['total'] = df.deposit.cumsum() + df.interest.cumsum()
This gives the incorrect total of $440 when it should be $464.10 due to compound interest.
total = 0
r = 0.1
d = 100
for i in range(4):
    total = total * r + total + d
    print(total)
100.0
210.0
331.0
464.1
What is the correct way to do this in Pandas?
If I understand correctly, interest is compounded at the end of each period. Using pd.Series's shift and cumprod:
df['total'] = (df['deposit'] * df['rate'].shift().add(1).cumprod().fillna(1)).cumsum()
print(df)
Output:
deposit rate interest total
0 100 0.1 10.0 100.0
1 100 0.1 10.0 210.0
2 100 0.1 10.0 331.0
3 100 0.1 10.0 464.1
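A self-contained cross-check of the shift/cumprod answer against the explicit loop from the question:

```python
import pandas as pd

df = pd.DataFrame({"deposit": [100] * 4, "rate": [0.1] * 4})

# Cumulative growth factors: 1, 1.1, 1.21, 1.331.
# For a constant deposit and rate, the cumsum of deposit*growth
# reproduces the compounded running total.
growth = df["rate"].shift().add(1).cumprod().fillna(1)
df["total"] = (df["deposit"] * growth).cumsum()

# Explicit loop for comparison
total = 0
for d, r in zip(df["deposit"], df["rate"]):
    total = total * r + total + d

print(df["total"].iloc[-1], total)
```

Both should land on 464.1 (up to floating-point noise).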

How to access results from extractall on a dataframe

I have a dataframe df in which the column df.Type has dimension information about physical objects. The numbers appear inside a text string, which I have successfully extracted using this code:
dftemp = df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
But now, the problem is that the results appear with a MultiIndex (Unit, match):
                 0
Unit match
5    0        0.02
     1        0.03
6    0        0.02
     1        0.02
7    0        0.02
...
How can I multiply these successive numbers (e.g. 0.02 * 0.03 = 0.006) and insert the result into the original dataframe df as a new column, say df.Area for each value of df.Type?
Thanks for your ideas!
I think you can do it with unstack and then prod along axis=1:
print(dftemp.unstack().prod(axis=1))
Then, if I'm not mistaken, Unit is the name of the index in df, so
df['Area'] = dftemp.unstack().prod(axis=1)
should create the column you are looking for.
With an example:
df = pd.DataFrame({'Type': ['bla 0.03 dddd 0.02 jjk', 'bli 0.02 kjhg 0.02 wait']},
                  index=pd.Index([5, 6], name='Unit'))
df['Area'] = (df.Type.str.extractall(r"([-+]?\d*\.\d+|\d+)").astype(float)
                .unstack().prod(axis=1))
print(df)
Type Area
Unit
5 bla 0.03 dddd 0.02 jjk 0.0006
6 bli 0.02 kjhg 0.02 wait 0.0004

Gnuplot conditional plotting

I need to use gnuplot to plot wind direction values (y) against time (x) in a 2D plot using lines and points. This works fine if successive values are close together. If the values are, e.g., separated by 250 degrees, then I need a condition that checks the previous y value and does not draw a line joining the two points. This situation occurs when the wind direction is in the 280-to-20-degree sector, and the plots get messy, e.g. with a north wind. As the data is time dependent I cannot use polar plots except at a specific point in time; I need to show the change in direction over time.
Basically the problem is:
plot y against x ; when (y2-y1)>= 180 then break/erase line joining successive points
Can anyone give me an example of how to do this?
A sample from the data file is:
2014-06-16 16:00:00 0.000 990.081 0.001 0.001 0.001 0.001 0.002 0.001 11.868 308 002.54 292 004.46 00
2014-06-16 16:10:00 0.000 990.047 0.001 0.001 0.001 0.001 0.002 0.001 11.870 303 001.57 300 002.48 00
2014-06-16 16:20:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.961 334 001.04 314 002.07 00
2014-06-16 16:30:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.818 005 001.18 020 002.14 00
2014-06-16 16:40:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.725 332 001.14 337 002.26 00
and I want to plot column 12 vs time.
You can insert a filtering condition in the using statement and use a value of 1/0 if the condition is not fulfilled. In that case the point is not connected to the others:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
y1 = y2 = 0
plot 'data.dat' using 1:(y1 = y2, y2 = $12, ($0 == 0 || y2 - y1 < 180) ? $12 : 1/0) with lines,\
'data.dat' using 1:12 with points
With your data sample and gnuplot version 4.6.5 I get the plot:
Unfortunately, with this approach you can only filter individual points, not whole line segments, and the line segment following a 1/0 point isn't drawn either.
A better approach would be to use awk to insert an empty line when a jump occurs. In a 2D-plot points from different data blocks (separated by a single new line) aren't connected:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n"); print}'' data.dat' using 1:12 with linespoints
In order to break the joining lines, two conditional statements must be evaluated and BOTH must include the newline statement printf("\n"):
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n") ; if (NR > 1 && y2 -y1 <= 0) printf("\n"); print}'' /Desktop/plotdata.txt' using 1:12 with linespoints
There is absolutely no need for awk. You can simply "interrupt" the lines by using variable color for the line. For gnuplot<5.0.0 you can use 0xffffff=white (or whatever your background color is) as linecolor and the line will hardly be visible. For gnuplot>=5.0.0 you can use any transparent color, e.g. 0xff123456, i.e. the line is really invisible.
Data: SO24425910.dat
2014-06-16 16:00:00 330
2014-06-16 16:10:00 320
2014-06-16 16:20:00 310
2014-06-16 16:30:00 325
2014-06-16 16:40:00 090
2014-06-16 16:50:00 060
2014-06-16 17:00:00 070
2014-06-16 17:10:00 280
2014-06-16 17:20:00 290
2014-06-16 17:30:00 300
Script: (works for gnuplot>=4.4.0, March 2010)
### conditional interruption of line
reset
FILE = "SO24425910.dat"
set key noautotitle
set yrange[0:360]
set ytics 90
set grid x
set grid y
set xdata time
set timefmt "%Y-%m-%d %H:%M"
set format x "%H:%M"
set multiplot layout 2,1
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var, \
'' u 1:3 w p pt 7 lc rgb "red"
unset multiplot
### end of script
Result: