Generating Percentages from Pandas - sql

I am working with a data set from SQL currently -
import pandas as pd
df = spark.sql("select * from donor_counts_2015")
df_info = df.toPandas()
print(df_info)
The output looks like this (I can't include the actual output for privacy reasons):
As you can see, it's a data set that has the name of a fund and then the number of people who have donated to that fund. What I am trying to do now is calculate what percent of funds have only 1 donation, what percent have 2, 3, 4, etc. I am wondering if there is an easy way to do this with pandas? I would also appreciate being able to see the percentage for a range of donation counts too, like what percentage of funds have between 50-100 donations, 500-1000, etc. Thanks!

You can make a histogram of the donations to visualize the distribution. np.histogram might help. Or you can also sort the data and count manually.
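For example, a minimal sketch of that idea (the pandas frame df_info comes from the question; the column name 'number_of_donations' and the bin edges are assumptions for illustration):

import numpy as np

# Bucket the donation counts and show what share of the binned funds falls in each bucket.
counts, edges = np.histogram(df_info['number_of_donations'],
                             bins=[1, 2, 5, 10, 50, 100, 500, 1000])
percentages = counts / counts.sum() * 100
for low, high, pct in zip(edges[:-1], edges[1:], percentages):
    print(f"{low}-{high}: {pct:.1f}%")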

For the first task, to get the percentage of each value in the column 'number_of_donations', you can do:
df['number_of_donations'].value_counts(normalize=True) * 100
For the second task, you need to create a new column with categories, and then do the same:
# Create a Series with categories
New_Serie = pd.cut(df.number_of_donations, bins=[0, 100, 200, 500, 99999999],
                   labels=['Few', 'Medium', 'Many', 'Too Many'])
# Change the name of the column
New_Serie.name = 'Category'
# Concat df and New_Serie
df = pd.concat([df, New_Serie], axis=1)
# Get the percentage of the categories
df['Category'].value_counts(normalize=True) * 100
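If you only need the percentages for donation ranges and not the extra column, the binning and the percentage step can also be combined into one expression (a sketch; the bin edges and labels below are just examples matching the ranges in the question):

pd.cut(df['number_of_donations'],
       bins=[0, 50, 100, 500, 1000, float('inf')],
       labels=['1-50', '50-100', '100-500', '500-1000', '1000+']).value_counts(normalize=True) * 100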

Related

How to work with multiple `item_id`'s of varying length for timeseries prediction in autogluon?

I'm currently building a model to predict daily stock prices based on daily data for thousands of stocks. In the data, I've got daily data for all stocks, however it covers different lengths of time. E.g., for some stocks I have daily data from 2000 to 2022, and for others I have data from 2010 to 2022.
Many dates are also obviously repeated for all stocks.
While I was learning autogluon, I used the following function to format timeseries data so it can work with .fit():
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
    original_index = ts_dataframe.index.get_level_values("timestamp")
    start = original_index[0]
    end = original_index[-1]
    filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")

ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
This worked; however, I was trying this with data for just one item_id, and now I have thousands.
When I use this now, I get the following error: ValueError: cannot reindex from a duplicate axis
It's important to note that I have already forward-filled my data with pandas, and the ts_dataframe shouldn't have any missing dates or values, but when I try to use it with .fit() I get the following error:
ValueError: Frequency not provided and cannot be inferred. This is often due to the time index of the data being irregularly sampled. Please ensure that the data set used has a uniform time index, or create the TimeSeriesPredictor setting ignore_time_index=True.
I assume that this is because I have only filled in missing data and dates, but not taken into account the varying number of days available for every stock individually.
For reference, here's how I have formatted the data with pandas:
df = pd.read_csv(
    "/content/drive/MyDrive/stock_data/training_data.csv",
    parse_dates=["Date"],
)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df.fillna(method='ffill', inplace=True)
df = df.drop("Unnamed: 0", axis=1)
df[:11]
How can I format the data so I can use it with .fit()?
Thanks!
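Not a verified fix, but one thing worth checking: "cannot reindex from a duplicate axis" usually means a group still contains duplicate timestamps after the item_id level is dropped, and the frequency error suggests the rebuilt index is not strictly regular for every item. Below is a sketch that deduplicates per item and keeps the item_id level when the groups are stitched back together, treating the data as an ordinary pandas frame indexed by (item_id, timestamp); the function and variable names are illustrative:

import pandas as pd

def fill_regular_daily(group, freq="D"):
    # Work on one item at a time.
    group = group.droplevel("item_id")
    # Drop duplicate dates within this item before reindexing.
    group = group[~group.index.duplicated(keep="last")]
    full_range = pd.date_range(group.index.min(), group.index.max(),
                               freq=freq, name="timestamp")
    return group.reindex(full_range, method="ffill")

# group_keys=True re-attaches item_id as the outer index level,
# so every item keeps its own regular daily range.
filled = ts_dataframe.groupby("item_id", group_keys=True).apply(fill_regular_daily)
# filled may still need to be wrapped back into a TimeSeriesDataFrame before .fit().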

Plotting a graph of the top 15 highest values

I am working on a dataset which shows the budget spent on movies. I want to make a plot which contains the top 15 highest-budget movies.
# sort the 'budget' column in descending order and store it in the new dataframe
info = pd.DataFrame(dp['budget'].sort_values(ascending = False))
info['original_title'] = dp['original_title']
data = list(map(str,(info['original_title'])))
#extract the top 10 budget movies data from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the output I got:
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to the data analysis scene, so I'm confused about how else to go about the problem.
A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.
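For example, rotating the labels and resizing the figure (a sketch; rot, figsize and legend are standard DataFrame.plot.bar arguments, not specific to this dataset):

ax = df.nlargest(n=15, columns="budget").plot.bar(
    x="movie_title", y="budget", rot=45, figsize=(12, 6), legend=False
)
ax.set_ylabel("budget")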

How can I always choose the last column in a csv table that's updated monthly?

Automating small business reporting from my Quickbooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the csv file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second-rightmost column, or is this a stupid way to try and get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)
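If you ever need the whole second-to-last column rather than a single cell, the same idea works through the column labels (a small sketch based on the example frame above):

second_to_last = df.columns[-2]     # 'Feb' in this example
x = df[second_to_last].iloc[-1]     # same value as df.iloc[-1, -2], i.e. 70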
If you plan to use the full file, #VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second-rightmost column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
References
pd.read_csv
strftime cheat-sheet

Update Column Value for previous entries of row value, based on new entry

[screenshot of the dataframe omitted: one row per employee per month, with an employee number and a Gender column]
I have a dataframe that is updated monthly as such, with a new row for each employee.
If an employee decides to change their gender (for example here, employee 20215 changed from M to F in April 2022), I want all previous entries for that employee number 20215 to be switched to F as well.
This is for a database with roughly 15 million entries, and multiple such changes every month, so I was hoping for a scalable solution (I cannot simply put df['Gender'] = 'F' for example)
Since we didn't receive a df from you or any code, I needed to generate something myself in order to test it. Please provide enough code and give us a sample next time as well.
Here is the generated df, in case someone comes up with a better answer:
import pandas as pd, numpy as np

length = 100
df = pd.DataFrame({'ID': np.random.randint(1001, 1020, length),
                   'Ticket': np.random.randint(length),
                   'salary_grade': np.random.randint(0, 10, size=length),
                   'date': np.arange(length),
                   'genre': 'M'})
df['date'] = pd.to_numeric(df['date'])
df['date'] = pd.to_datetime(df['date'], dayfirst=True, unit='D', origin='15.04.2022')
That is the base DF; now I needed to simulate some gender changes:
test_id = df.groupby(['ID'])['genre'].count().idxmax()  # gives me the employee with the most entries
test_id
df[df['ID'] == test_id].loc[:, 'genre']  # getting all indexes from test_id, for a test change / later for checking
df[df['ID'] == test_id]  # getting indexes of test_id for the gender change
id_lst = []
for idx in df[df['ID'] == test_id].index:
    if idx > 28:  # <-- change this value for your generated df, middle of the list
        id_lst.append(idx)  # returns a list of indexes where the gender change will happen
df.loc[id_lst, 'genre'] = 'F'  # applying a gender change
Answer:
Finally, to your answer:
finder = df.groupby(['ID']).agg({'genre': lambda x: len(list(pd.unique(x))) > 1, 'date': 'min'})  # will return True for every ID with more than one genre
finder[finder['genre']] # will return IDs from above condition.
Next steps...
Now with the ID you just need to discover whether it's M-->F or F-->M, set new_genre, and assign the new genre for the ID_found (int or list).
df.loc[ID_found,'genre']=new_genre
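As a possible shortcut for the scalable part (a sketch, not tested on 15 million rows): instead of collecting indexes per ID, you can take each ID's most recent genre by date and broadcast it back onto all of that ID's rows:

latest = (
    df.sort_values('date')
      .groupby('ID')['genre']
      .last()                        # most recent genre per ID
)
df['genre'] = df['ID'].map(latest)   # overwrite every row with that ID's latest genre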

how to convert income from one currency into another using a historical fx table in python with pandas

I have two dataframes, one is an income df and the other is an fx df. My income df shows income from different accounts on different dates, but it also shows extra income in a different currency. My fx df shows the fx rates for certain currency pairs on the same dates the extra income came into the accounts.
I want to convert the currency of the extra income into the same currency as the account. So for example, account HP on 23/3 has extra income = 35 GBP; I want to convert that into EUR as that's the currency of the account. Please note it has to use the fx table, as I have a long history of data points to fill and other accounts, so I do not want to manually code 35 * the fx rate. Finally, I then want to create another column for the income df that will sum the daily income + extra income in the same currency together.
I'm not sure how to bring both dfs together so I can get the correct fx rate for that specific date to convert the currency of the extra income into the currency of the account.
My code is below:
import pandas as pd
income_data = {'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'],
               'account': ['HP', 'HP', 'JJ', 'JJ'],
               'daily_income': [1000, 1000, 2000, 2000],
               'ccy of account': ['EUR', 'EUR', 'USD', 'USD'],
               'extra_income': [50, 35, 10, 12.5],
               'ccy of extra_income': ['EUR', 'GBP', 'EUR', 'USD']}
income_df = pd.DataFrame(income_data)
fx_data = {'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'],
           'EUR/GBP': [0.833522, 0.833522, 0.833621, 0.833066],
           'USD/EUR': [0.90874, 0.90874, 0.91006, 0.90991]}
fx_df = pd.DataFrame(fx_data)
The final df should look like this (I flipped the fx rate, so 1/0.833522, to get some of the values).
Would really appreciate it if someone could help me with this. My initial thought was merge, but I don't have a common column, and I'm not sure if the map function would work either as I don't have a dictionary. Apologies in advance if any of my code is not great - I am still self-learning. Thanks!
Consider creating a common column for merging in both data frames. The code below uses assign to add columns, and Series methods (add, mul, rdiv) rather than the arithmetic operators (+, -, *, /).
# ADD NEW COLUMN AS CONCAT OF CCY COLUMNS
# (key is 'ccy of extra_income / ccy of account', i.e. the quote needed to
#  convert the extra income into the account's currency)
income_df = income_df.assign(
    currency_ratio = lambda df: df["ccy of extra_income"] + "/" + df["ccy of account"]
)

# ADD REVERSED CURRENCY RATIOS (reciprocals) AND RESHAPE WIDE TO LONG FORMAT
fx_long = pd.melt(
    fx_df.assign(**{
        "GBP/EUR": lambda df: df["EUR/GBP"].rdiv(1),
        "EUR/USD": lambda df: df["USD/EUR"].rdiv(1)
    }),
    id_vars = "date",
    var_name = "currency_ratio",
    value_name = "fx_rate"
)

# MERGE AND CALCULATE
# (rows whose extra income is already in the account currency get no fx row,
#  so treat their rate as 1)
income_df = (
    income_df.merge(
        fx_long,
        on = ["date", "currency_ratio"],
        how = "left"
    ).assign(
        total_income = lambda df: df["daily_income"].add(
            df["extra_income"].mul(df["fx_rate"].fillna(1))
        )
    )
)
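As a quick check on the sample data above (assuming the code ran as shown), you can print the relevant columns of the merged frame:

print(income_df[["date", "account", "extra_income", "fx_rate", "total_income"]])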