I have a dataframe consisting of accountId, date, the return value of that account on that date, and the inflation rate on that date.
The date column shows how long an account has been in the system; for example, accountId 1 entered the system in 2016-01 and left in 2019-11.
formula:
df["Inflation"] = ((1+ df["Inflation"]).cumprod() - 1) * 100
I want to apply this formula to all accounts, but here is the problem.
When the dataframe contains only one account it's easy to apply the formula, but when I create a dataframe containing all accounts (as indicated in the image) I can't apply it naively, because every account has a different date interval: some entered the system in 2016, some in 2017.
You can think of it like this: suppose I had one dataframe per account, e.g. df1 for account1, df2 for account2, and so on. I want to apply the formula to each dataframe individually and finally merge them all into one dataframe containing every account.
df["Inflation2"] = ((1+df.groupby(["AccountId","Inflation"])).cumprod()-1) * 100
I tried this code, but it gives me an error: "unsupported operand type(s) for +: 'int' and 'DataFrameGroupBy'".
Thanks in advance...
I solved it as follows:
df["Inflation"] = df.groupby(["AccountId"]).Inflation.apply(lambda x: (x + 1).cumprod()-1) * 100
Imagine I have a dataset that is like so:
         ID  birthyear     weight
0    619040       1962  0.1231231
1    600161       1963   0.981742
2  25602033       1963  1.3123124
3    624870       1987     10,000
and I want to get the mean of the weight column, but the obvious outlier of 10,000 is skewing the actual mean. In this situation I cannot change the value but must work around it. This is what I've got so far, but obviously it's still including that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
My dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity', so how do I compute the mean while working around that value?
Since you added SQL to your tags: in SQL you'd want to exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace the value with NaN (this needs numpy) and skip it in the mean:
import numpy as np

avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data ("10,000" is a string), filter out the values above the threshold, and take the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
output: 0.8057258333333334
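For completeness, a quick self-contained check of that chain, using just the weight values from the sample table:

import pandas as pd

# 'weight' mixes floats with the string "10,000"
df = pd.DataFrame({'weight': [0.1231231, 0.981742, 1.3123124, '10,000']})

# "10,000" fails to parse and becomes NaN; the filter also drops anything
# that did parse as a number but sits above the threshold
mean = (pd.to_numeric(df['weight'], errors='coerce')
        .loc[lambda x: x < 10000]
        .mean())
print(mean)  # 0.8057258333333334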
I am pulling historical price data for the S&P 500 index components with yfinance and would now like to convert the Close & Volume columns from USD into EUR.
This is what I tried:
data = yf.download(set(components), group_by="Ticker", start=get_data.start_date)
where start="2020-11-04" and components is a list of yfinance tickers of the S&P 500 members plus "EURUSD=X", the symbol for the conversion rate.
#Group by Ticker and Date
df = data.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)
df = df.sort_values(by='Ticker',axis='index',kind='stable')
After adding columns for the name, sector & name of the currency (I need this because in my application I am appending several dataframes with tickers in different currencies) and dropping columns I don't need, I have a dataframe that looks like this:
I now want to convert the Close & Volume columns into EUR. I have found a way that works on most of the data except the S&P 500 and other US stocks, which is why I am posting the question here.
# Check again if Currency is not EUR
if currency != "EUR":
    df['Close in EUR'] = df.groupby('Ticker')['Close'].apply(lambda group: group.iloc[::]/df[df['Ticker']==currency]['Close'])
    df['Volume in Mio. EUR'] = df['Volume']*df['Close in EUR']/1000000
else:
    df['Volume in Mio. EUR'] = df['Volume']*df['Close']/1000000
Not only does this take a lot of time (~46 seconds), it also produces NaN values in the "Close in EUR" and "Volume in Mio. EUR" columns. Do you have any idea why?
I have found that df[df['Ticker']==currency] has more rows than the stock tickers do, due to the differing public holidays of the stock exchanges, and even after deleting the unmatched rows I am left with NaN values. Doing the whole process for other index members, e.g. ^JKLQ45 (Indonesia Stock Exchange index), works, which is surprising.
Any help, or even an idea of how to do this more efficiently, is highly appreciated!
If you want to get a sense of my final project - check out: https://equityanalysis.herokuapp.com/
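One possible vectorized idea, sketched under the assumption that df is indexed by Date with 'Ticker', 'Close' and 'Volume' columns as built above (untested against the real download): align the EURUSD=X close to every stock row by date and divide once, instead of looping per group. EURUSD=X quotes USD per EUR, so dividing the USD close by the rate yields EUR.

import pandas as pd

# FX series: one close per date, sorted so reindex can forward-fill
fx = df.loc[df['Ticker'] == 'EURUSD=X', 'Close'].sort_index()

stocks = df[df['Ticker'] != 'EURUSD=X'].copy()

# Align the rate to every stock row's date; ffill carries the last known
# rate over public holidays where the two trading calendars differ
rate = fx.reindex(stocks.index, method='ffill')

stocks['Close in EUR'] = stocks['Close'] / rate.to_numpy()
stocks['Volume in Mio. EUR'] = stocks['Volume'] * stocks['Close in EUR'] / 1_000_000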
I have a large dataset pertaining to customer churn, where every customer has a unique identifier (encoded key). The dataset is a time series: every customer has one row for every month they have been a customer, so both the date and the customer-identifier columns naturally contain duplicates. What I am trying to do is add a new column (called 'churn') and set it to 0 or 1 based on whether it is that specific customer's last month as a customer or not.
I have tried numerous methods to do this, but each and every one fails, either due to tracebacks or because they just don't work as intended. It should be noted that I am very new to both Python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives a traceback because a DataFrameGroupBy object has no attribute 'assign'.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, this time for the attribute 'loc'.
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives a traceback due to the DataFrameGroupBy,
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give a traceback but does not work as intended: it sets churn to 1 for the global max date across all rows, instead of the max date for each specific customerid, which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!
IIUC, use GroupBy.transform with 'max' to return the maximal value per group, compare it with the date column, and finally set 1/0 values from the mask:
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
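# or, equivalently: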
df2['churn'] = mask.astype(int)
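A quick check on invented data (two customers with monthly rows; ids and dates are made up for illustration):

import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'customerid': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01',
                            '2021-01-01', '2021-02-01']),
})

# Flag each customer's last month: compare every row's date with the
# per-customer maximum broadcast back by transform
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
print(df2)
#   customerid       date  churn
# 0          a 2021-01-01      0
# 1          a 2021-02-01      0
# 2          a 2021-03-01      1
# 3          b 2021-01-01      0
# 4          b 2021-02-01      1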
I have a table where the column names are not really organized; they hold different years of data across a varying number of columns.
So I have to access each piece of data through a specific column name.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it returns the column name along with the numeric data.
How can I extract just the number in each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
Assuming you have the following data frame df imported from your csv:
Closing Date 2014/12 2015/12 2016/12 2017/12 2018/12
0 Net Sales 31,634 49,924 62,051 68,137 72,590
1 Net increase -17,909 -16,962 -34,714 -26,220 -29,721
2 Net Received - - - - -
3 Net Paid -328 -6,038 -9,499 -9,375 -10,661
then by doing df = df[["2018/12"]] you create a new data frame with one column, and df.iloc[0,0] will work perfectly well here, returning 72,590. If you wrote df = df["2018/12"] you'd create a new Series, and here df.iloc[0,0] would throw the error 'too many indexers', because it's a one-dimensional Series.
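A tiny sketch of that difference on a one-column toy frame:

import pandas as pd

df = pd.DataFrame({'2018/12': ['72,590']})

sub = df[['2018/12']]     # double brackets -> still a DataFrame
print(sub.iloc[0, 0])     # 72,590

ser = df['2018/12']       # single brackets -> a Series
print(ser.iloc[0])        # 72,590; ser.iloc[0, 0] raises 'too many indexers'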
Anyway, if you need the values of a Series, use the values attribute (or to_numpy() for version 0.24 or later) to get the data as an array, or to_list() to get them as a list.
But I guess what you really want is to have your table transposed:
df = df.set_index('Closing Date').T
to the following more logical form:
Closing Date Net Sales Net increase Net Received Net Paid
2014/12 31,634 -17,909 - -328
2015/12 49,924 -16,962 - -6,038
2016/12 62,051 -34,714 - -9,499
2017/12 68,137 -26,220 - -9,375
2018/12 72,590 -29,721 - -10,661
Here, df.loc['2018/12','Net Sales'] gives you 72,590 etc.
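As a self-contained sketch, here is the whole round trip, pasting the question's CSV into a string (the blank Trend column is dropped and the stray empty line omitted):

import io
import pandas as pd

csv = '''Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
'''

df = pd.read_csv(io.StringIO(csv)).drop(columns=['Trend'])
df = df.set_index('Closing Date').T
print(df.loc['2018/12', 'Net Sales'])   # 72,590 (note: still a string with a comma)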
It's in reference to the Google Finance function in Google Sheets: https://support.google.com/docs/answer/3093281?hl=en
I would like to obtain the "all time LOW" (ATL) and "all time HIGH" (ATH) for a specific ticker (e.g. ABBV or GOOG), but only in 1 cell for each. Basically: "What's the ATL/ATH value for this ticker?"
I've tried both formulas, for ATL and ATH, but only the ATL one gives the expected result so far.
To get the ATL, you can use
=GOOGLEFINANCE("ABBV","low","01/12/1980",TODAY(),7)
and to get the ATH you can use:
=GOOGLEFINANCE("ABBV","high","01/12/1980",TODAY(),7)
The output of this is 2 columns of data:
Please note that column A, containing the timestamp, will be the one causing trouble when it comes to computing the MAX function, as it translates into some weird figures.
In order to get the ATL, I'll be using the MIN function which works perfectly fine:
=MIN(GOOGLEFINANCE("ABBV","low","01/01/1980",TODAY(),7))
as it will just scan the 2 columns of data and grab the lowest value, which is 32.51 (in USD).
BUT when I try to do the same with MAX or MAXA for the ATH, using for example
=MAX(GOOGLEFINANCE("ABBV","high","01/12/1980",TODAY(),7))
the result that comes out is 43616.66667, which seems to be a random computation of column A containing the timestamp.
The expected result of the ATH should be 125.86 in USD.
I've tried using FILTER to exclude values > 1000, but FILTER doesn't let me search in column B, so then I tried VLOOKUP using this formula:
=VLOOKUP(MAX(GOOGLEFINANCE("ABBV","high","01/12/1980",TODAY(),7)),GOOGLEFINANCE("ABBV","high","01/12/1980",TODAY(),7),2,FALSE)
but again it returns the value from column B based on the MAX value of column A, which ends up giving me 80.1 and not the expected 125.86.
use:
=MAX(INDEX(GOOGLEFINANCE("ABBV", "high", "01/12/1980", TODAY(), 7), , 2))
43616.66667 is not a "random computation": it's the date 31/05/2019 16:00:00 converted into a serial date value.
The MAX and MIN functions return a single result from all the cells in the given range, which in your case spans two columns. A date is stored as a number too, so maxing over both columns outputs the largest value whether it comes from the 1st or the 2nd column. By introducing INDEX you can skip the 1st column and look for the max value only in the 2nd one.
=MAX(INDEX(GOOGLEFINANCE("BTCSGD", "price", "01/12/1980", TODAY(), 7), , 2))
Replace BTCSGD with any symbol you want to look up. You can put ABCXYZ, where ABC is the stock/ETF/crypto and XYZ is the currency.