Pandas DataFrame Transpose and Matrix Multiplication - pandas
I am looking for a way to perform a matrix multiplication on two sets of columns in a DataFrame. One set of columns needs to be transposed and then multiplied by the other set. Then I need to take the resulting matrix, do an element-wise product with a matrix of scalar weights, and add everything up. Below is an example:
Data for testing:
import pandas as pd
import numpy as np
dftest = pd.DataFrame(data=[['A',0.18,0.25,0.36,0.21,0,0.16,0.16,0.64,0.04,0,0],['B',0,0,0.5,0.5,0,0,0,0.25,0.75,0,0]],columns = ['Ticker','f1','f2','f3','f4','f5','p1','p2','p3','p4','p5','multiplier'])
Starting dataframe with data for Tickers. f1 through f5 represent one set of categories and p1 through p5 represent another.
dftest
Out[276]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 0
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 0
For each row, I need to transpose columns p1 through p5 and then multiply them by columns f1 through f5. I think I have found a solution with the statement below.
dftest.groupby('Ticker')[['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']].apply(lambda x: x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]))
Out[408]:
f1 f2 f3 f4 f5
Ticker
A p1 0.0288 0.04 0.0576 0.0336 0.0
p2 0.0288 0.04 0.0576 0.0336 0.0
p3 0.1152 0.16 0.2304 0.1344 0.0
p4 0.0072 0.01 0.0144 0.0084 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
B p1 0.0000 0.00 0.0000 0.0000 0.0
p2 0.0000 0.00 0.0000 0.0000 0.0
p3 0.0000 0.00 0.1250 0.1250 0.0
p4 0.0000 0.00 0.3750 0.3750 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
Next I need to do an element-wise product of the above matrix with another 5x5 matrix, m, that sits in another DataFrame, and then add up the columns or rows (you get the same result either way). If I extend the above statement as below, I get the result I want.
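The matrix m itself is not shown here. The placeholder below is purely hypothetical, only so that the statements that follow can be run; the outputs shown further down come from the real m, whose values will differ:

# hypothetical stand-in for the real 5x5 weight matrix m
m = pd.DataFrame(np.ones((5, 5)),
                 index=['p1', 'p2', 'p3', 'p4', 'p5'],
                 columns=['f1', 'f2', 'f3', 'f4', 'f5'])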
dftest.groupby('Ticker')[['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns=m.columns, index=m.index).sum().sum())
Out[409]:
Ticker
A 2.7476
B 1.6250
dtype: float64
So far so good, I think; I would be happy to know a better and faster way to do this. The next question is where I am stuck.
How do I take this and update the "multiplier" column on my original dataFrame?
If I try to do the following:
dftest['multiplier'] = dftest.groupby('Ticker')[['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns=m.columns, index=m.index).sum().sum())
I get NaNs in the multiplier column.
dftest
Out[407]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 NaN
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 NaN
I suspect it has to do with indexing and whether the indices after grouping translate back to the original DataFrame. Second, do I need a groupby statement for this at all? Since it is a row-by-row calculation, can't I just do it without grouping, or group by the index? Any suggestions on that?
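One way to read the NaN result: the groupby/apply output is a Series indexed by Ticker, while dftest has a plain 0/1 RangeIndex, so the assignment cannot align the two. A minimal sketch of one way to line the values up again, reusing the statement above (this assumes Ticker is the grouping key and m is the weight matrix):

# s is the Ticker-indexed Series produced by the groupby/apply statement above
s = dftest.groupby('Ticker')[['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']].apply(
        lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values,
                               columns=m.columns, index=m.index).sum().sum())
# map() looks each row's Ticker up in s, so the values land on the matching rows
dftest['multiplier'] = dftest['Ticker'].map(s)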
I need to do this without iterating row by row, because the whole code will itself be run repeatedly as part of an optimization I have to do: I run this whole process, look at the results, and if they are outside some constraints, calculate new f1 through f5 and p1 through p5 values and run the whole thing again.
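Since the per-row quantity reduces to the sum over i and j of m[i, j] * p[i] * f[j], one possible fully vectorized sketch (again assuming m is a 5x5 weight DataFrame laid out with p1-p5 rows and f1-f5 columns) would avoid both the groupby and any row iteration:

P = dftest[['p1','p2','p3','p4','p5']].values
F = dftest[['f1','f2','f3','f4','f5']].values
# for every row r at once: sum over i, j of m[i, j] * P[r, i] * F[r, j]
dftest['multiplier'] = np.einsum('ij,ri,rj->r', m.values, P, F)

This produces one number per row, so it aligns with dftest directly and no index translation is needed.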
I posted a question on this earlier but it was confusing, so this is a second attempt. I hope it makes sense.
Thanks in advance for all your help.
Related
How to remove rows so that the values in a column match a sequence
I'm looking for a more efficient method to deal with the following problem. I have a DataFrame with a column filled with values that randomly range from 1 to 4, and I need to remove all the rows that do not follow the sequence (1-2-3-4-1-2-3-...). This is what I have:

              A  B
12/2/2022  0.02  2
14/2/2022  0.01  1
15/2/2022  0.04  4
16/2/2022 -0.02  3
18/2/2022 -0.01  2
20/2/2022  0.04  1
21/2/2022  0.02  3
22/2/2022 -0.01  1
24/2/2022  0.04  4
26/2/2022 -0.02  2
27/2/2022  0.01  3
28/2/2022  0.04  1
01/3/2022 -0.02  3
03/3/2022 -0.01  2
05/3/2022  0.04  1
06/3/2022  0.02  3
08/3/2022 -0.01  1
10/3/2022  0.04  4
12/3/2022 -0.02  2
13/3/2022  0.01  3
15/3/2022  0.04  1
...

This is what I need:

              A  B
14/2/2022  0.01  1
18/2/2022 -0.01  2
21/2/2022  0.02  3
24/2/2022  0.04  4
28/2/2022  0.04  1
03/3/2022 -0.01  2
06/3/2022  0.02  3
10/3/2022  0.04  4
15/3/2022  0.04  1
...

Since the DataFrame is quite big I need some sort of NumPy-based operation to accomplish this, the more efficient the better. My solution is very ugly and inefficient; basically, I made 4 loops like the following to check every part of the sequence (4-1, 1-2, 2-3, 3-4):

df_len = len(df)
df_len2 = 0
while df_len != df_len2:
    df_len = len(df)
    df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
    df = df[df['B'] != 0]
    df_len2 = len(df)
By means of itertools.cycle (to define a cycled range):

from itertools import cycle

c_rng = cycle(range(1, 5))   # cycled range
start = next(c_rng)          # starting point

df[[(v == start) and bool(start := next(c_rng)) for v in df.B]]

              A  B
14/2/2022  0.01  1
18/2/2022 -0.01  2
21/2/2022  0.02  3
24/2/2022  0.04  4
28/2/2022  0.04  1
03/3/2022 -0.01  2
06/3/2022  0.02  3
10/3/2022  0.04  4
15/3/2022  0.04  1
A simple improvement to speed this up is to not touch the dataframe within the loop, but just iterate over the values of B to construct a Boolean index, like this:

is_in_sequence = []
next_target = 1
for b in df.B:
    if b == next_target:
        is_in_sequence.append(True)
        next_target = next_target % 4 + 1
    else:
        is_in_sequence.append(False)

print(df[is_in_sequence])

              A  B
14/2/2022  0.01  1
18/2/2022 -0.01  2
21/2/2022  0.02  3
24/2/2022  0.04  4
28/2/2022  0.04  1
03/3/2022 -0.01  2
06/3/2022  0.02  3
10/3/2022  0.04  4
15/3/2022  0.04  1
Find first and last positive value of every season over 50 years
I've seen some similar questions but can't figure out how to handle my problem. I have a dataset with daily total snow values from 1970 to 2015. Now I want to find out when the first and the last day with snow occurred, and I want to do this for every season. One season runs, for example, from 01.06.2000 to 30.5.2001; that season is then Season 2000/2001. I have already set my date column as the index (format year-month-day, e.g. 2006-04-24).

When I select a specific range with

df_s = df["2006-04-04":"2006-04-15"]

I am able to find the first and last day with snow in this period with

firstsnow = df_s[df_s['Height'] > 0].head(1)
lastsnow = df_s[df_s['Height'] > 0].tail(1)

I now want to do this for the whole dataset, so that I can compare the seasons and see how the time of first snow has changed. My dataframe looks like this (here you see a selected period with values); Height is the snow height and Diff is the difference to the previous day. Height and Diff are float64.

            Height    Diff
Date
2006-04-04   0.000     NaN
2006-04-05   0.000   0.000
2006-04-06   0.000   0.000
2006-04-07  16.000  16.000
2006-04-08   6.000 -10.000
2006-04-09   0.001  -5.999
2006-04-10   0.000  -0.001
2006-04-11   0.000   0.000
2006-04-12   0.000   0.000
2006-04-13   0.000   0.000
2006-04-14   0.000   0.000
2006-04-15   0.000   0.000

(12, 2)
<class 'pandas.core.frame.DataFrame'>

I think I have to work with the groupby function, but I don't know how to apply it in this case.
You can use the trick of creating a new column that keeps only the positive values and is None otherwise, then use ffill and bfill to get the head and tail.

Sample data:

df = pd.DataFrame({'name': ['a1','a2','a3','a4','a5','b1','b2','b3','b4','b5'],
                   'gr': [1]*5 + [2]*5,
                   'val1': [None,-1,2,1,None,-1,4,7,3,-2]})

Input:

  name  gr  val1
0   a1   1   NaN
1   a2   1  -1.0
2   a3   1   2.0
3   a4   1   1.0
4   a5   1   NaN
5   b1   2  -1.0
6   b2   2   4.0
7   b3   2   7.0
8   b4   2   3.0
9   b5   2  -2.0

Set the positive column, then ffill and bfill:

df['positive'] = np.where(df['val1'] > 0, df['val1'], None)
df['positive'] = df.groupby('gr')['positive'].apply(lambda g: g.ffill())
df['positive'] = df.groupby('gr')['positive'].apply(lambda g: g.bfill())

Check the result:

df.groupby('gr').head(1)
df.groupby('gr').tail(1)

  name  gr  val1  positive
0   a1   1   NaN       2.0
5   b1   2  -1.0       4.0

  name  gr  val1  positive
4   a5   1   NaN       1.0
9   b5   2  -2.0       3.0
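Applied to the seasonal snow data from the question above, one hedged sketch of building the grouping key directly from the DatetimeIndex (assuming a season runs from June through May, so shifting by five months maps every date of season 2000/2001 to the year 2000):

# label each day with its season's starting year (June 2000 - May 2001 -> 2000)
df['season'] = (df.index - pd.DateOffset(months=5)).year
snow_days = df[df['Height'] > 0]
firstsnow = snow_days.groupby('season').head(1)   # first snow day per season
lastsnow = snow_days.groupby('season').tail(1)    # last snow day per season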
Convert value counts of multiple columns to pandas dataframe
I have a dataset in this form:

 Name  Batch  DXYR  Emp Lateral  GDX  MMT  CN
  Joe      2     0            2    2    2   0
 Alan      0     1            1    2    0   0
 Josh      1     1            2    1    1   2
  Max      0     1            0    0    0   2

These columns can have only three distinct values, i.e. 0, 1 and 2. So I need the percentages of the value counts for each column in the pandas dataframe. I have simply made a loop like:

for i in df.columns:
    (df[i].value_counts()/df[i].count())*100

I am getting output like:

0    90.608831
1     0.391169
2     9.6787899
Name: Batch, dtype: float64

0    95.545455
1     2.235422
2     2.6243553
Name: MX, dtype: float64

and so on... These outputs are correct, but I need them in a pandas dataframe like this:

                 Batch  DXYR  Emp Lateral   GDX   MMT    CN
Count_0_percent  98.32  52.5           22  54.5  44.2  53.4  76.01
Count_1_percent   0.44  34.5           43  43.5  44.5  46.5  22.44
Count_2_percent    1.3  64.3           44  2.87  12.6  1.88  2.567

Can someone please suggest how to get it?
You can melt the data, then use pd.crosstab:

melt = df.melt('Name')
pd.crosstab(melt['value'], melt['variable'], normalize='columns')

Or a bit faster (yet more verbose) with melt and groupby().value_counts():

(df.melt('Name')
   .groupby('variable')['value'].value_counts(normalize=True)
   .unstack('variable', fill_value=0)
)

Output:

variable  Batch   CN  DXYR  Emp Lateral   GDX   MMT
value
0          0.50  0.5  0.25         0.25  0.25  0.50
1          0.25  0.0  0.75         0.25  0.25  0.25
2          0.25  0.5  0.00         0.50  0.50  0.25

Update: apply also works:

df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True)
Fill missing values in DataFrame
I have a dataframe that is either missing two values in two columns, or one value in one column.

          Date    30   45    60    90
0   2004-01-02  0.88  0.0   0.0  0.93
1   2004-01-05  0.88  0.0   0.0  0.91
...
20  2019-12-24  1.55    0  1.58  1.58
21  2019-12-26  1.59    0  1.60  1.58

I would like to compute all the zero values in the dataframe by some simple linear method. Here is the thing: if there is a value in the 60 column, use the average of the 60 and the 30 for the 45. Otherwise use some simple method to compute both the 45 and the 60. What is the pandas way to do this? [Prefer no loops]

EDIT 1

As per the suggestions in the comments, I tried

df.replace(0, np.nan, inplace=True)
df = df.interpolate(method='linear', limit_direction='forward', axis=0)

But the df still contains all the np.nan.
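One possible reading of why the interpolate attempt above leaves the NaNs in place is that axis=0 interpolates down each column, while the missing values here sit between the columns of a single row. A minimal sketch interpolating across the numeric columns instead (assuming the 30/45/60/90 labels are strings; adjust if they are integers):

df = df.replace(0, np.nan)
cols = ['30', '45', '60', '90']
# linear interpolation across each row: a lone missing 45 becomes the mean of 30 and 60,
# and a missing 45/60 pair is spread evenly between the 30 and 90 values
df[cols] = df[cols].interpolate(method='linear', axis=1, limit_direction='forward')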
Lag the date by considering data from another table
I have a question on whether the following can be done without having to do a for loop. I have a ctry table that looks like the below:

CTRY  LAG
  AU    2
  US    3

My data table looks like this:

CTRY        DATE     A     B     C
  AU  1960-01-31   0.3   0.4   0.5
  US  1960-03-31   0.3   0.4   0.5
  US  1960-04-30  0.35  0.42  0.54

What I would like to do is update the date column to the month-end date for each country by the given lag:

CTRY        DATE     A     B     C
  AU  1960-03-31   0.3   0.4   0.5
  US  1960-06-30   0.3   0.4   0.5
  US  1960-07-31  0.35  0.42  0.54

I am currently using a for loop, but I am sure there is a better and more efficient way to do this. Thanks so much.
You can use merge first, then use pd.DateOffset to convert your LAG column to months.

#df.DATE=pd.to_datetime(df.DATE)
s = df.merge(ctry)
s['DATE'] = s['DATE'] + s['LAG'].apply(lambda x: pd.DateOffset(months=x))
s
Out[452]:
  CTRY       DATE     A     B     C  LAG
0   AU 1960-03-31  0.30  0.40  0.50    2
1   US 1960-06-30  0.30  0.40  0.50    3
2   US 1960-07-30  0.35  0.42  0.54    3