Lag the date by considering data from another table - pandas

I have a question on whether the following can be done without having to use a for loop.
I have a ctry table that looks like the below:
CTRY LAG
AU 2
US 3
My data table looks like this:
CTRY DATE A B C
AU 1960-01-31 0.3 0.4 0.5
US 1960-03-31 0.3 0.4 0.5
US 1960-04-30 0.35 0.42 0.54
What I would like to do is shift the DATE column forward by the given lag for each country, keeping month-end dates:
CTRY DATE A B C
AU 1960-03-31 0.3 0.4 0.5
US 1960-06-30 0.3 0.4 0.5
US 1960-07-31 0.35 0.42 0.54
I am currently using a for loop, but I am sure there is a better and more efficient way to do this.
Thanks so much!

You can use merge first, then convert your LAG column to a month offset with pd.DateOffset:
# df.DATE = pd.to_datetime(df.DATE)    # make sure DATE is a datetime column first
s = df.merge(ctry)                     # attach each country's LAG
s['DATE'] = s['DATE'] + s['LAG'].apply(lambda x: pd.DateOffset(months=x))
s
Out[452]:
CTRY DATE A B C LAG
0 AU 1960-03-31 0.30 0.40 0.50 2
1 US 1960-06-30 0.30 0.40 0.50 3
2 US 1960-07-30 0.35 0.42 0.54 3
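Note that pd.DateOffset(months=x) keeps the day of month, so 1960-04-30 becomes 1960-07-30 rather than the month end 1960-07-31 shown in the desired output. A minimal sketch that stays on month ends, assuming DATE already holds month-end dates:
# pd.offsets.MonthEnd(n) rolls a month-end date forward by n month ends,
# e.g. 1960-04-30 + MonthEnd(3) -> 1960-07-31
s['DATE'] = s['DATE'] + s['LAG'].apply(lambda x: pd.offsets.MonthEnd(x))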

Related

Find first and last positive value of every season over 50 years

I've seen some similar questions but can't figure out how to handle my problem.
I have a dataset with everyday total snow values from 1970 until 2015.
Now I want to find out when the first and the last day with snow occurred.
I want to do this for every season.
One season runs, for example, from 01.06.2000 to 30.05.2001; that season is then season 2000/2001.
I have already set my date column as the index (format year-month-day, e.g. 2006-04-24).
When I select a specific range with
df_s = df["2006-04-04" : "2006-04-15"]
I am able to find out the first and last day with snow in this period with
firstsnow = df_s[df_s['Height'] > 0].head(1)
lastsnow = df_s[df_s['Height'] > 0].tail(1)
I want to do this now for the whole dataset, so that I'm able to compare each season and see how the time of first snow changed.
My dataframe looks like this (here you see a selected period with values). Height is the snow height and Diff is the difference to the previous day; both Height and Diff are float64.
Height Diff
Date
2006-04-04 0.000 NaN
2006-04-05 0.000 0.000
2006-04-06 0.000 0.000
2006-04-07 16.000 16.000
2006-04-08 6.000 -10.000
2006-04-09 0.001 -5.999
2006-04-10 0.000 -0.001
2006-04-11 0.000 0.000
2006-04-12 0.000 0.000
2006-04-13 0.000 0.000
2006-04-14 0.000 0.000
2006-04-15 0.000 0.000
(12, 2)
<class 'pandas.core.frame.DataFrame'>
I think I have to work with the groupby function, but I don't know how to apply it in this case.
You can use the trick of creating a new column that keeps only positive values and is None otherwise, then use ffill and bfill so the first and last rows of each group carry the first and last positive value.
Sample data:
df = pd.DataFrame({'name': ['a1','a2','a3','a4','a5','b1','b2','b3','b4','b5'],
                   'gr': [1]*5 + [2]*5,
                   'val1': [None, -1, 2, 1, None, -1, 4, 7, 3, -2]})
Input:
name gr val1
0 a1 1 NaN
1 a2 1 -1.0
2 a3 1 2.0
3 a4 1 1.0
4 a5 1 NaN
5 b1 2 -1.0
6 b2 2 4.0
7 b3 2 7.0
8 b4 2 3.0
9 b5 2 -2.0
Set the positive column, then ffill and bfill within each group:
import numpy as np

df['positive'] = np.where(df['val1'] > 0, df['val1'], None)                # keep only positive values
df['positive'] = df.groupby('gr')['positive'].apply(lambda g: g.ffill())   # forward-fill within each group
df['positive'] = df.groupby('gr')['positive'].apply(lambda g: g.bfill())   # backward-fill within each group
Check result:
df.groupby('gr').head(1)
df.groupby('gr').tail(1)
name gr val1 positive
0 a1 1 NaN 2.0
5 b1 2 -1.0 4.0
name gr val1 positive
4 a5 1 NaN 1.0
9 b5 2 -2.0 3.0
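Applied to the snow question itself, a minimal sketch of the same head/tail idea per season (assuming the DatetimeIndex from the question, rows in date order, and a season running from June to the following May):
import numpy as np

# label each day with its season: June-December keep the year, January-May belong to the previous year's season
df['season'] = np.where(df.index.month >= 6, df.index.year, df.index.year - 1)

snowy = df[df['Height'] > 0]                   # only days with snow
firstsnow = snowy.groupby('season').head(1)    # first snowy day of each season
lastsnow = snowy.groupby('season').tail(1)     # last snowy day of each season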

Convert value counts of multiple columns to pandas dataframe

I have a dataset in this form:
Name Batch DXYR Emp Lateral GDX MMT CN
Joe 2 0 2 2 2 0
Alan 0 1 1 2 0 0
Josh 1 1 2 1 1 2
Max 0 1 0 0 0 2
These columns can have only three distinct values, i.e. 0, 1 and 2.
So I need the percentage of value counts for each column of the pandas dataframe.
I have simply made a loop like:
for i in df.columns:
    print((df[i].value_counts() / df[i].count()) * 100)
I am getting output like this:
0 90.608831
1 0.391169
2 9.6787899
Name: Batch, dtype: float64
0 95.545455
1 2.235422
2 2.6243553
Name: MX, dtype: float64
and so on...
These outputs are correct, but I need them in a pandas dataframe like this:
Batch DXYR Emp Lateral GDX MMT CN
Count_0_percent 98.32 52.5 22 54.5 44.2 53.4 76.01
Count_1_percent 0.44 34.5 43 43.5 44.5 46.5 22.44
Count_2_percent 1.3 64.3 44 2.87 12.6 1.88 2.567
Can someone please suggest how to get it?
You can melt the data, then use pd.crosstab:
melt = df.melt('Name')
pd.crosstab(melt['value'], melt['variable'], normalize='columns')
Or a bit faster (yet more verbose) with melt and groupby().value_counts():
(df.melt('Name')
   .groupby('variable')['value'].value_counts(normalize=True)
   .unstack('variable', fill_value=0)
)
Output:
variable Batch CN DXYR Emp Lateral GDX MMT
value
0 0.50 0.5 0.25 0.25 0.25 0.50
1 0.25 0.0 0.75 0.25 0.25 0.25
2 0.25 0.5 0.00 0.50 0.50 0.25
Update: apply also works:
df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True)
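To match the layout requested in the question exactly (percentages and Count_x_percent row labels), a small follow-up sketch on top of the apply result:
pct = df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True).fillna(0) * 100
pct = pct.sort_index()
pct.index = ['Count_%d_percent' % int(v) for v in pct.index]   # rows 0/1/2 -> Count_0_percent, ...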

Sum over columns using different lengths

I have a pandas DataFrame. The table looks like this:
df
lifetime 0 1 2 3 4 5 .... 30
0 2 0.12 0.14 0.18 0.12 0.13 0.14 .... 0.14
1 3 0.12 0.14 0.18 0.12 0.13 0.14 .... 0.14
2 4 0.12 0.14 0.18 0.12 0.13 0.14 .... 0.14
I want to sum the columns from 0 to 30 based on the "lifetime" column value, so the result looks like:
df
lifetime Total
0 2 sum(0.12+0.14) # sum columns 0 and 1
1 3 sum(0.12+0.14+0.18) # sum columns 0 to 2
2 4 sum(0.12+0.14+0.18+0.12) # sum columns 0 to 3
How can I do it? Thank you for your help!
You can use where with broadcasting:
import numpy as np

s = df.iloc[:, 1:]    # the value columns 0..30
s.where(df.lifetime.to_numpy()[:, None] > np.arange(s.shape[1])).sum(1)
Output:
0 0.26
1 0.44
2 0.56
dtype: float64
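For completeness, a sketch that wraps the same masking idea into the two-column frame shown in the question (assuming lifetime is the first column, as above):
import numpy as np

vals = df.iloc[:, 1:]                                             # the 0..30 value columns
mask = df['lifetime'].to_numpy()[:, None] > np.arange(vals.shape[1])
result = df[['lifetime']].assign(Total=vals.where(mask).sum(axis=1))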
Define the following function:
def mySum(row):
    uLim = int(row.lifetime) + 1       # sum columns 0 .. lifetime-1 (row positions 1 .. uLim-1)
    return row.iloc[1:uLim].sum()
Then apply it and join the result with the lifetime column:
df = df.lifetime.to_frame().join(df.apply(mySum, axis=1).rename('Total'))
The advantage over the other solution is that my solution creates
the target DataFrame, not only the new column.

Pandas: 1 dataframe comparing rows to create new column

I have a problem which I cannot seem to get my head round.
df1 is as follows:
Group item Quarter price quantity
1 A 2017Q3 0.10 1000
1 A 2017Q4 0.11 1000
1 A 2018Q1 0.11 1000
1 A 2018Q2 0.12 1000
1 A 2018Q3 0.11 1000
The desired result is a new dataframe, call it df2, with an additional column:
Group item Quarter price quantity savings/lost
1 A 2017Q3 0.10 1000 0.00
1 A 2017Q4 0.11 1000 0.00
1 A 2018Q1 0.11 1000 0.00
1 A 2018Q2 0.12 1000 0.00
1 A 2018Q3 0.11 1000 10.00
1 A 2018Q4 0.13 1000 -20.00
Essentially, I want to go down each row, look at the quarter, find the same quarter of the previous year and do a calculation: (price this quarter - price in the same quarter last year) * quantity. If there is no previous-year quarter to compare against, just put 0 in the last column.
And to complete the picture, there are more groups and items in there, and even more quarters like 2016Q1, 2017Q1, 2018Q1, although I only need to compare against the year before. Quarters are in string format.
Use pandas.DataFrame.shift
The code below assumes that your Quarter column is sorted and there are no missing quarters. You can try the code below:
# Input dataframe
Group item Quarter price quantity
0 1 A 2017Q3 0.10 1000
1 1 A 2017Q4 0.11 1000
2 1 A 2018Q1 0.11 1000
3 1 A 2018Q2 0.12 1000
4 1 A 2018Q3 0.11 1000
5 1 A 2018Q4 0.13 1000
# Code to generate your new column 'savings/lost'
df['savings/lost'] = df['price'] * df['quantity'] - df['price'].shift(4) * df['quantity'].shift(4)
# Output dataframe
Group item Quarter price quantity savings/lost
0 1 A 2017Q3 0.10 1000 NaN
1 1 A 2017Q4 0.11 1000 NaN
2 1 A 2018Q1 0.11 1000 NaN
3 1 A 2018Q2 0.12 1000 NaN
4 1 A 2018Q3 0.11 1000 10.0
5 1 A 2018Q4 0.13 1000 20.0
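Since the question mentions several groups and items, the same shift would have to happen within each (Group, item) pair; a minimal sketch under the same sorted, no-missing-quarters assumption:
g = df.groupby(['Group', 'item'])
df['savings/lost'] = (df['price'] * df['quantity']
                      - g['price'].shift(4) * g['quantity'].shift(4)).fillna(0)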
Update
I have updated my code to handle two things: first, sorting by Quarter, and second, handling the missing-quarter scenario. For grouping based on columns, refer to pandas.DataFrame.groupby and the many groupby-related questions already answered on this site.
#Input dataframe
Group item Quarter price quantity
0 1 A 2014Q3 0.10 100
1 1 A 2017Q2 0.16 800
2 1 A 2017Q3 0.17 700
3 1 A 2015Q4 0.13 400
4 1 A 2016Q1 0.14 500
5 1 A 2014Q4 0.11 200
6 1 A 2015Q2 0.12 300
7 1 A 2016Q4 0.15 600
8 1 A 2018Q1 0.18 600
9 1 A 2018Q2 0.19 500
# Code to do the operations
import numpy as np

df.index = pd.PeriodIndex(df.Quarter, freq='Q')   # quarterly PeriodIndex
df.sort_index(inplace=True)                       # sort chronologically
df2 = df.reset_index(drop=True)

# difference against the same quarter one year earlier (df.index - 4)
df2['Profit'] = (df.price * df.quantity) - \
                (df.reindex(df.index - 4).price * df.reindex(df.index - 4).quantity).values
# where the previous-year quarter is missing from the index, fall back to the previous row
df2['Profit'] = np.where(np.in1d(df.index - 4, df.index.values),
                         df2.Profit,
                         (df.price * df.quantity) - (df.price.shift(1) * df.quantity.shift(1)))
df2.Profit.fillna(0, inplace=True)
#Output dataframe
Group item Quarter price quantity Profit
0 1 A 2014Q3 0.10 100 0.0
1 1 A 2014Q4 0.11 200 12.0
2 1 A 2015Q2 0.12 300 14.0
3 1 A 2015Q4 0.13 400 0.0
4 1 A 2016Q1 0.14 500 18.0
5 1 A 2016Q4 0.15 600 0.0
6 1 A 2017Q2 0.16 800 38.0
7 1 A 2017Q3 0.17 700 -9.0
8 1 A 2018Q1 0.18 600 -11.0
9 1 A 2018Q2 0.19 500 0.0
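With several groups and items plus missing quarters, another option (not from the answer above, just a hedged sketch using the question's column names; prev_quarter is a helper column introduced for illustration) is a self-merge on the same quarter of the previous year:
q = pd.PeriodIndex(df['Quarter'], freq='Q')
df['prev_quarter'] = (q - 4).astype(str)        # e.g. '2018Q3' -> '2017Q3', same string format as Quarter

prev = df[['Group', 'item', 'Quarter', 'price', 'quantity']].rename(
    columns={'Quarter': 'prev_quarter', 'price': 'prev_price', 'quantity': 'prev_quantity'})

out = df.merge(prev, on=['Group', 'item', 'prev_quarter'], how='left')
out['savings/lost'] = (out['price'] * out['quantity']
                       - out['prev_price'] * out['prev_quantity']).fillna(0)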

Transpose some columns to row

I know that the same kind of question has been asked before; however, I didn't succeed in doing what I need, so I'm asking you.
I have a table with Client_ID and, for each client, the probability of purchasing each product category.
Client_ID | Prob_CategoryA | Prob_CategoryB | Prob_CategoryC
1 0.2 0.3 0.2
2 0.4 0.6 0.7
3 0.3 0.7 0.4
Now what I would like to do is transform the above table into this:
Client_ID | Category Name | Probability
1 A 0.2
1 B 0.3
1 C 0.2
2 A 0.4
2 B 0.6
2 C 0.7
3 A 0.3
3 B 0.7
3 C 0.4
Thank you very much
Simple UNPIVOT:
SELECT Client_Id, SUBSTRING(Cat, 14, 1) [Category Name], Probability
FROM Src
UNPIVOT (Probability FOR Cat IN (Prob_CategoryA, Prob_CategoryB, Prob_CategoryC)) UP
Result
Client_Id Category Name Probability
----------- ------------- -----------
1 A 0.2
1 B 0.3
1 C 0.2
2 A 0.4
2 B 0.6
2 C 0.7
3 A 0.3
3 B 0.7
3 C 0.4
Use UNPIVOT:
SELECT Client_ID, Cats, Probability
FROM
(SELECT Client_ID, Prob_CategoryA, Prob_CategoryB, Prob_CategoryC
FROM yourTable) t
UNPIVOT
(Probability FOR Cats IN (Prob_CategoryA, Prob_CategoryB, Prob_CategoryC)
) AS c
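Since the rest of this page is pandas, for reference the same unpivot can be written with melt (assuming the table has been loaded into a DataFrame df with the column names shown above):
out = df.melt(id_vars='Client_ID', var_name='Category Name', value_name='Probability')
out['Category Name'] = out['Category Name'].str[-1]          # 'Prob_CategoryA' -> 'A'
out = out.sort_values(['Client_ID', 'Category Name']).reset_index(drop=True)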