Pandas: take only a cell from the describe function

I am using the pandas describe function and get the result below:
dt_d=dt.describe()
print(dt_d)
count 120.00000 120.000000 120.000000 120.000000
mean 5.89000 3.060000 3.795833 1.190833
std 0.84589 0.441807 1.792861 0.757372
min 4.30000 2.000000 1.000000 0.100000
25% 5.17500 2.800000 1.575000 0.300000
50% 5.80000 3.000000 4.450000 1.400000
75% 6.40000 3.325000 5.100000 1.800000
max 7.90000 4.400000 6.900000 2.500000
If I want to take a cell from the describe function, for example, from the mean row, the mean in the third column, how will I be able to call it on its own?

df.describe() returns a DataFrame so you can just index it as you would any other DataFrame, using .loc.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1,10,(10,3)))
df.describe()
# 0 1 2
#count 10.00000 10.000000 10.000000
#mean 4.30000 2.400000 5.400000
#std 2.58414 1.429841 2.458545
#min 1.00000 1.000000 1.000000
#25% 2.25000 1.000000 4.250000
#50% 4.00000 2.500000 6.000000
#75% 5.00000 3.000000 7.000000
#max 9.00000 5.000000 8.000000
df.describe().loc['mean', 2]
#5.4
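As an aside beyond the original answer, if you specifically want a scalar you can also use the label-based scalar accessor .at, which avoids any chance of getting a one-element frame back:
df.describe().at['mean', 2]
#5.4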


How to use a dask dataframe instead of pandas to make a calculation faster

demo csv file:
label1 label2 m1
0 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0000_1 0.000000
1 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0001_1 1.000000
2 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0002_1 1.000000
3 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0003_1 1.414214
4 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0004_1 2.000000
5 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0005_1 2.000000
6 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0006_1 3.000000
7 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0007_1 3.162278
8 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0008_1 4.000000
9 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0009_1 5.000000
10 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0010_1 5.000000
11 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0011_1 6.000000
12 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0012_1 6.000000
13 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0013_1 6.000000
14 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0014_1 6.000000
From this CSV file I will do some comparison operations. I will have a function that makes the comparison and returns the minimum of the combinations.
There are 160,000 rows. Using pandas with a for loop takes a lot of time. Can I make it faster using dask? I tried a dask dataframe created from pandas, but when I use to_list (which I can use on a pandas column), it gives me an error. I have a core i7 machine with 128 GB of RAM. Below is my code:
"""
#the purpose of this function is to calculate different rows...
#values for the m1 column of data frame. there could be two
#combinations and inside combination it needs to get m1 value for the row
#suppose first comb1 will calucalte sum of m1 value of #row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_1) and
#row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_2)
a more details of this function could be found here:
(https://stackoverflow.com/questions/72663618/writing-a-python-function-to-get-desired-value-from-csv/72677299#72677299)
def compute(img1,img2):
comb1=(img1_1,img2_1)+(img1_1,img2_2)
comb2=(img1_2,img2_1)+(img1_2,img2_2)
return minimum(comb1,comb2)
"""
def min_4line(key1, key2, list1, list2, df):
    k = ['1', '2', '3', '4']
    indice_list = []
    key1_line1 = key1 + '_' + k[0]
    key1_line2 = key1 + '_' + k[1]
    key1_line3 = key1 + '_' + k[2]
    key1_line4 = key1 + '_' + k[3]
    key2_line1 = key2 + '_' + k[0]
    key2_line2 = key2 + '_' + k[1]
    key2_line3 = key2 + '_' + k[2]
    key2_line4 = key2 + '_' + k[3]
    ind1 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line4)].tolist()
    comb1 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1']) + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    ind1 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line4)].tolist()
    comb2 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1']) + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    return min(comb1, comb2)
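As an aside that is not part of the original post: the repeated boolean masks above rescan the whole frame for every pair, so a lookup table built once could serve the same purpose. A minimal sketch, assuming every (label1, label2) pair occurs exactly once in df3:
# hypothetical helper: a MultiIndex Series turns each pair lookup into a dict-like access
m1_lookup = df3.set_index(['label1', 'label2'])['m1']

def min_4line_fast(key1, key2, lookup):
    lines1 = [key1 + '_' + i for i in ['1', '2', '3', '4']]
    lines2 = [key2 + '_' + i for i in ['1', '2', '3', '4']]
    comb1 = sum(int(lookup[(a, b)]) for a, b in zip(lines1, lines2))
    # the second combination pairs key1 lines shifted by one, mirroring min_4line above
    comb2 = sum(int(lookup[(a, b)]) for a, b in zip(lines1[1:] + lines1[:1], lines2))
    return min(comb1, comb2)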
Now, I have to create a unique list of labels to do the comparison:
list_line=list(df3['label1'].unique())
string_test=[a[:-2] for a in list_line]
#above list comprehension is done as we will get unique label like animebook0000_1,animebook0001_1
list_key=sorted(list(set(string_test)))
print(len(list_key))
#making list of those two column
lable1_list=df3['label1'].to_list()
lable2_list=df3['label2'].to_list()
Next, I write the output of the comparison function to a CSV file:
%%time
file = open("content\\dummy_metric.csv", "a")
file.write("label1,label2,m1\n")
c = 0
for i in range(len(list_key)):
    for j in range(i + 1, len(list_key)):
        a = min_4line(list_key[i], list_key[j], lable1_list, lable2_list, df3)
        #print(a)
        file.write(str(list_key[i]) + "," + str(list_key[j]) + "," + str(a) + "\n")
        c += 1
        if c > 20000:
            print('20k done')
my expected output:
label1 label2 m1
0 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0001 2
1 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0002 2
2 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0003 2
3 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0004 4
4 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0005 5
5 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0006 7
6 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0007 9
7 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0008 13
For dask I was proceeding like this:
import pandas as pd
import dask.dataframe as dd
csv_gb=pd.read_csv("content\\four_metric.csv")
dda = dd.from_pandas(csv_gb, npartitions=10)
Up to that line it is fine, but when I want to make the list of labels like this:
lable1_list=df3['label1'].to_list()
it's showing me this error:
2022-07-05 16:31:17,530 - distributed.worker - WARNING - Compute Failed
Key: ('unique-combine-5ce843b510d3da88b71287e6839d3aa3', 0, 1)
Function: execute_task
args: ((<function pipe at 0x0000022E39F18160>, [0 KeyT1_L1_1_animebook0000_1
.....
25 KeyT1_L1_1_animebook_002
kwargs: {}
Exception: 'TypeError("\'Serialize\' object is not callable")'
Is there any better way to perform the above-mentioned code with dask? I am also curious about using the dask distributed Client like this for my task:
from dask.distributed import Client
client = Client()
client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='40GB')
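One hedged workaround, assuming the goal is simply to materialize the label columns on the client side, is to compute the dask column first and then use the familiar pandas methods on the result:
# bring the dask column back to pandas before calling pandas-only conveniences
lable1_list = dda['label1'].compute().to_list()
# or just the unique labels, computed in parallel and returned as a pandas object
list_line = dda['label1'].unique().compute().tolist()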

pandas: Selecting a single value from a df using .loc is producing a df instead of a numeric

I have two dataframes, sarc and non. After running describe() on both, I want to compare the mean value of a particular column in the two dataframes. I used .loc and tried saving the value as a float, but it is saved as a DataFrame, which prevents me from comparing the two values with the > operator. Here's my code:
sarc.describe()
label c_len c_s_l_len score
count 5092.0 5092.000000 5092.000000 5092.000000
mean 1.0 54.876277 33.123527 6.919874
std 0.0 37.536986 22.566558 43.616977
min 1.0 0.000000 0.000000 -96.000000
25% 1.0 29.000000 18.000000 1.000000
50% 1.0 47.000000 28.000000 2.000000
75% 1.0 71.000000 43.000000 5.000000
max 1.0 466.000000 307.000000 2381.000000
non.describe()
label c_len c_s_l_len score
count 4960.0 4960.000000 4960.000000 4960.000000
mean 0.0 55.044153 33.100806 6.912298
std 0.0 47.873732 28.738776 39.216049
min 0.0 0.000000 0.000000 -119.000000
25% 0.0 23.000000 14.000000 1.000000
50% 0.0 43.000000 26.000000 2.000000
75% 0.0 74.000000 44.000000 4.000000
max 0.0 594.000000 363.000000 1534.000000
import numpy as np

non_c_len_mean = non.describe().loc[['mean'], ['c_len']].astype(np.float64)
sarc_c_len_mean = sarc.describe().loc[['mean'], ['c_len']].astype(np.float64)
if sarc_c_len_mean > non_c_len_mean:
    # do stuff
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The variables are indeed of <class 'pandas.core.frame.DataFrame'> type, and each prints as a labeled 1-row, 1-col df instead of just the value. How can I select only the numeric value as a float?
Remove the [] in .loc when you pick the column and the index:
non.describe().loc['mean', 'c_len']
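Alternatively, as an aside beyond the original answer, you can keep the double-bracket selection and extract the scalar afterwards:
non.describe().loc[['mean'], ['c_len']].iloc[0, 0]
# or label-based scalar access:
non.describe().at['mean', 'c_len']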

How can I transform two data frames into another one?

I have a df1 that looks like:
Shady Slim Eminem
Date
2011-01-10 HI Yes 1500
2011-01-13 HI No 1500
2011-01-13 BYBY Yes 4000
2011-01-26 OKDO Yes 1000
I have df2 that looks like this:
HI BYBY OKDO INT
Date
2011-01-10 340.99 143.41 614.21 1.0
2011-01-13 344.20 144.55 616.69 1.0
2011-01-13 344.20 144.55 616.69 1.0
2011-01-26 342.38 156.42 616.50 1.0
I want to save Eminem as a Series. I also want each column in df2 to be a Series. I want to multiply Eminem by these values at the corresponding elements given by Shady and fill up df3.
I want a df3 that looks like the one below.
I also want the INT column to be the sum of the row, for each row in df3.
I want to do this in a vectorized way.
Also, based on the Slim column: if it's Yes then I want to add Eminem * value, otherwise I want the negation of it.
Here are the values I want:
HI BYBY OKDO INT
Date
2011-01-10 511,485 0 0 sum(row 1)
2011-01-13 -516300 578200 0 sum(row 2)
2011-01-13 0 578200 0 sum(row 3)
2011-01-26 0 0 616500 sum(row 4)
Option 1
Use the pd.DataFrame.mul method for multiplying and provide an axis parameter in order to specify that you want the series you are multiplying by to be lined up along the index.
df2.mul(df1.Eminem, axis=0)
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1500.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 1500.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 4000.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1000.0
Option 2
If, by chance, the series you are multiplying by is already ordered the way you would like, you can forgo the index and access the values attribute.
df2.mul(df1.Eminem.values, 0)
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1500.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 1500.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 4000.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1000.0
Option 3
If the index proves difficult, you can append a level that makes it unique:
unique_me = lambda d: d.set_index(d.groupby(level=0).cumcount(), append=True)
df2.pipe(unique_me).mul(df1.pipe(unique_me).Eminem, axis=0).reset_index(-1, drop=True)
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1500.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 1500.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 4000.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1000.0
With Slim Factor
df2.drop('INT', axis=1, errors='ignore').mul(df1.Eminem.values, 0).assign(
INT=lambda d: (lambda s: s.mask(df1.Slim.eq('No'), -s))(d.sum(1)))
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1940730.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 -1955280.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 5214080.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1317470.0
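As a follow-up sketch that is not part of the original answer, the same Slim logic can be written more explicitly with np.where, assuming df1 and df2 share the same row order:
import numpy as np

# multiply every df2 column by Eminem, then sign the row sum by the Slim flag
prod = df2.drop('INT', axis=1, errors='ignore').mul(df1['Eminem'].values, axis=0)
sign = np.where(df1['Slim'].values == 'Yes', 1, -1)
prod['INT'] = prod.sum(axis=1) * sign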

Applying function to Pandas with GroupBy along direction of the grouping variable

I have a cohort of N people and I computed a correlation matrix of some quantities (q1_score, ..., q5_score):
df.groupby('participant_id').corr()
Out[130]:
q1_score q2_score q3_score q4_score q5_score
participant_id
11.0 q1_score 1.000000 -0.748887 -0.546893 -0.213635 -0.231169
q2_score -0.748887 1.000000 0.639649 0.324976 0.335596
q3_score -0.546893 0.639649 1.000000 0.154539 0.151233
q4_score -0.213635 0.324976 0.154539 1.000000 0.998752
q5_score -0.231169 0.335596 0.151233 0.998752 1.000000
14.0 q1_score 1.000000 -0.668781 -0.124614 -0.352075 -0.244251
q2_score -0.668781 1.000000 -0.175432 0.360183 0.184585
q3_score -0.124614 -0.175432 1.000000 -0.137993 -0.125115
q4_score -0.352075 0.360183 -0.137993 1.000000 0.968564
q5_score -0.244251 0.184585 -0.125115 0.968564 1.000000
17.0 q1_score 1.000000 -0.799223 -0.814424 -0.790587 -0.777318
q2_score -0.799223 1.000000 0.787238 0.658524 0.640786
q3_score -0.814424 0.787238 1.000000 0.702570 0.701440
q4_score -0.790587 0.658524 0.702570 1.000000 0.998996
q5_score -0.777318 0.640786 0.701440 0.998996 1.000000
18.0 q1_score 1.000000 -0.595545 -0.617691 -0.472409 -0.477523
q2_score -0.595545 1.000000 0.386705 0.148761 0.115068
q3_score -0.617691 0.386705 1.000000 0.806637 0.782345
q4_score -0.472409 0.148761 0.806637 1.000000 0.982617
q5_score -0.477523 0.115068 0.782345 0.982617 1.000000
I need to compute the median values of the correlations across all participants. What I mean: I need to take the correlation between item J and item K for all participants and find their median value.
I am sure it is one line of code, but I'm struggling to work it out (still learning pandas by example).
Stack your data, and do another groupby:
df.groupby('participant_id').corr().stack().groupby(level = [1,2]).median()
Edit: Actually, you don't need to stack if you don't want to:
df.groupby('participant_id').corr().groupby(level = [1]).median()
works too.
IIUC, you want the average correlation for each participant across all questions:
df.where(df != 1).mean(axis=1).mean(level=0)
We get rid of the self-correlations (the 1.0 values against the same question) with where, take the mean across questions with axis=1, and then take the per-participant mean with level=0.
Output:
participant_id
11.0 0.086416
14.0 -0.031493
17.0 0.130800
18.0 0.105896
dtype: float64
Edit: I used mean instead of median; we can apply the same logic with median.
df.where(df != 1).median(axis=1).median(level=0)
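A note beyond the original answer: recent pandas versions removed the level argument of mean/median, so under that assumption the same idea is spelled with an explicit groupby:
df.where(df != 1).median(axis=1).groupby(level=0).median()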

Pandastic way of growing a dataframe

So, I have a year-indexed dataframe that I would like to extend beyond its end year (2013) by some logic, say, grow the last value by n percent for 10 years; but the logic could also be to just add a constant, or a slightly growing number. I will leave that to a function and just stuff the logic there.
I can't think of a neat vectorized way to do that with an arbitrary length of time and arbitrary logic, ending up with a longer dataframe with the extra increments appended, and I would prefer not to loop it.
The particular calculation matters. In general you would have to compute the values in a loop. Some NumPy ufuncs (such as np.add, np.multiply, np.minimum, np.maximum) have an accumulate method, however, which may be useful depending on the calculation.
For example, to calculate values given a constant growth rate, you could use np.multiply.accumulate (or cumprod):
import numpy as np
import pandas as pd
N = 10
index = pd.date_range(end='2013-12-31', periods=N, freq='D')
df = pd.DataFrame({'val':np.arange(N)}, index=index)
last = df['val'][-1]
# val
# 2013-12-22 0
# 2013-12-23 1
# 2013-12-24 2
# 2013-12-25 3
# 2013-12-26 4
# 2013-12-27 5
# 2013-12-28 6
# 2013-12-29 7
# 2013-12-30 8
# 2013-12-31 9
# expand df
index = pd.date_range(start='2014-1-1', periods=N, freq='D')
df = df.reindex(df.index.union(index))
# compute new values
rate = 1.1
df['val'][-N:] = last*np.multiply.accumulate(np.full(N, fill_value=rate))
yields
val
2013-12-22 0.000000
2013-12-23 1.000000
2013-12-24 2.000000
2013-12-25 3.000000
2013-12-26 4.000000
2013-12-27 5.000000
2013-12-28 6.000000
2013-12-29 7.000000
2013-12-30 8.000000
2013-12-31 9.000000
2014-01-01 9.900000
2014-01-02 10.890000
2014-01-03 11.979000
2014-01-04 13.176900
2014-01-05 14.494590
2014-01-06 15.944049
2014-01-07 17.538454
2014-01-08 19.292299
2014-01-09 21.221529
2014-01-10 23.343682
To increment by a constant value you could simply use np.arange:
step=2
df['val'][-N:] = np.arange(last+step, last+(N+1)*step, step)
or cumsum:
step=2
df['val'][-N:] = last + np.full(N, fill_value=step).cumsum()
Some linear recurrence relations can be expressed using scipy.signal.lfilter. See for example,
Trying to vectorize iterative calculation with numpy and Recursive definitions in Pandas
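For example, here is a hedged sketch (not taken from the linked answers) of the recurrence y[n] = rate * y[n-1] + step evaluated with lfilter, seeded with the last observed value from the example above:
from scipy.signal import lfilter
import numpy as np

rate, step, last, N = 1.1, 0.0, 9.0, 10
# lfilter with a = [1, -rate] implements y[n] = x[n] + rate * y[n-1]
x = np.r_[last, np.full(N, fill_value=step)]
y = lfilter([1.0], [1.0, -rate], x)[1:]  # the N extrapolated values (9.9, 10.89, ...)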