I'm performing a multidimensional lookup to assign a value in a new column.
I have a table that has some historical employee data by month. There are two unique people in this example, and they can have multiple jobs within a month.
I want to create a new column that tells me if each unique person has an eligible job based on the conditions below. The challenge is each row has to be considered by month/year.
import pandas as pd
import numpy as np
data = {'Month': ["January", "January", "January", "February", "February", "February", "March", "March", "March", "March"],
'Year': [2015,2015,2015,2015,2015,2015,2016,2016,2016,2016],
'Job #': [1,1,2,1,2,1,1,1,2,3],
'Pay Group': ["Excluded","Included","Excluded","Excluded","Included","Included","Excluded","Exclcuded","Excluded","Included"],
'Name': ["John","Bill","Bill","John","John","Bill","John","Bill","Bill","Bill"]}
df = pd.DataFrame(data, columns=['Month', 'Year', 'Job #', 'Pay Group', 'Name'])
df
Eligible Jobs Conditions:
A job is eligible if Job # = 1 AND Pay Group = Included. If that condition is false, look for the next-higher Job # within the given month/year where Pay Group = Included.
IIUC:
For each person, within each month/year, you want to grab the smallest Job # such that Pay Group == Included.
Filter to only the Included rows. Sort by Job #. Group by year, month, and name, taking the index of the minimum observation. Use that index to assign the new column.
# Keep only the Included rows and sort by Job #
dfi = df[df['Pay Group'] == 'Included'].sort_values('Job #')

# Index of the smallest Included Job # per (Year, Month, Name)
gc = ['Year', 'Month', 'Name']
idx = dfi.groupby(gc)['Job #'].idxmin()

# Default everything to Not Eligible, then flag only the selected rows
df['Eligible Job'] = 'Not Eligible'
df.loc[idx, 'Eligible Job'] = 'Eligible'
df
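An equivalent formulation (a sketch, assuming the df built above): compute the smallest Included Job # per group with transform and compare it back against each row.
included = df[df['Pay Group'] == 'Included']
min_job = included.groupby(['Year', 'Month', 'Name'])['Job #'].transform('min')
df['Eligible Job'] = np.where(
    df.index.isin(min_job[min_job == included['Job #']].index),
    'Eligible', 'Not Eligible')
df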
The title almost says it already. I have a pyspark.sql.dataframe.DataFrame with an "ID", "TIMESTAMP", "CONSUMPTION" and "TEMPERATURE" column. I need the "TIMESTAMP" column to be resampled to daily intervals (from 15-minute intervals) and the "CONSUMPTION" and "TEMPERATURE" columns aggregated by summation. However, this needs to be performed for each unique id in the "ID" column. How do I do this?
Efficiency/speed is of importance to me. I have a huge dataframe to start with, which is why I would like to avoid .toPandas() and for loops.
Any help would be greatly appreciated!
The following code will build a Spark df to play around with. input_spark_df represents the input Spark dataframe; the desired output looks like desired_outcome_spark_df.
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
df_list = []
for unique_id in ['012', '345', '678']:
    date_range = pd.date_range(pd.Timestamp('2022-12-28 00:00'), pd.Timestamp('2022-12-30 23:00'), freq='H')
    df = pd.DataFrame()
    df['TIMESTAMP'] = date_range
    df['ID'] = unique_id
    df['TEMPERATURE'] = np.random.randint(1, 10, df.shape[0])
    df['CONSUMPTION'] = np.random.randint(1, 10, df.shape[0])
    df = df[['ID', 'TIMESTAMP', 'TEMPERATURE', 'CONSUMPTION']]
    df_list.append(df)
pandas_df = pd.concat(df_list)
spark = SparkSession.builder.getOrCreate()
input_spark_df = spark.createDataFrame(pandas_df)
desired_outcome_spark_df = spark.createDataFrame(pandas_df.set_index('TIMESTAMP').groupby('ID').resample('1d').sum().reset_index())
To condense the question: how do I go from input_spark_df to desired_outcome_spark_df as efficiently as possible?
I found the answer to my own question. I first change the timestamp to "date only" using pyspark.sql.functions.to_date. Then I group by both "ID" and "TIMESTAMP" and perform the aggregation.
from pyspark.sql.functions import to_date, sum, avg, col

# Truncate the timestamp to a date, then group by ID and day and aggregate
desired_outcome = (input_spark_df
    .withColumn('TIMESTAMP', to_date(col('TIMESTAMP')))
    .groupBy("ID", 'TIMESTAMP')
    .agg(
        sum(col("CONSUMPTION")).alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg(col('TEMPERATURE')).alias("AVERAGE_DAILY_TEMPERATURE")
    ))
desired_outcome.show()
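A sketch of an alternative that generalizes beyond daily buckets, using a tumbling window over the raw timestamps (this assumes the same input_spark_df and column names as above; the output column names are just reused from the answer).
from pyspark.sql.functions import window, col, avg, sum as spark_sum

windowed = (input_spark_df
    .groupBy("ID", window("TIMESTAMP", "1 day"))   # tumbling 1-day window
    .agg(
        spark_sum("CONSUMPTION").alias("CUMULATIVE_DAILY_POWER_CONSUMPTION"),
        avg("TEMPERATURE").alias("AVERAGE_DAILY_TEMPERATURE"))
    .withColumn("TIMESTAMP", col("window.start"))  # keep the window start as the day
    .drop("window"))
windowed.show()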
I'm trying to convert a financial income statement spreadsheet into a table that I can easily plug into Power BI. The spreadsheet columns are as follows:
['Date', 'Budget Flag', 'Department', 'Account 1', 'Account 2', ..., 'Account N']
There are 78 columns of Accounts, and I need to consolidate. When I'm done, the spreadsheet columns need to be arranged like this:
['Date', 'Budget Flag', 'Department', 'Account Name', 'Value']
Here's what I've tried:
import pandas as pd
statement = pd.read_excel('Financial Dashboard Dataset.xlsx', sheet_name="Income Statement", header=0)
appended_data = []
for i in range(3,81):
    value = statement.columns[i]
    pivot = pd.pivot_table(statement, values=value, index=['Date', 'Budg', 'Depa'])
    pivot = pivot.assign(Account=value)
    appended_data.append(pivot)
appended_data = pd.concat(appended_data)
pivot.to_excel("final_product.xlsx", index=True, header=True)
So I'm essentially trying to pivot the values of each Account column against the three index columns, assign the Account column header as a value for each cell in a new column, and then union all of the pivot tables together.
My problem is that the resultant spreadsheet only contains the pivot table from the last account column. What gives?
The resultant spreadsheet only contains the pivot table from the last account column because that is the dataframe that I chose to write to excel.
I solved my problem by changing the last line to:
appended_data.to_excel("final_product.xlsx", index=True, header=True)
Sorry, all. It's been a long day. Hope this helps someone else!
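For what it's worth, the same wide-to-long reshape can be done in one step with pd.melt, so the loop over account columns isn't needed. A sketch, assuming the same spreadsheet and the truncated index column names ('Budg', 'Depa') used in the code above:
import pandas as pd

statement = pd.read_excel('Financial Dashboard Dataset.xlsx', sheet_name="Income Statement", header=0)
long_format = statement.melt(
    id_vars=['Date', 'Budg', 'Depa'],        # the three index columns
    value_vars=statement.columns[3:81],      # the 78 account columns
    var_name='Account Name',
    value_name='Value')
long_format.to_excel("final_product_melt.xlsx", index=False, header=True)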
I have two dataframes: one is an income df and the other is an fx df. My income df shows income from different accounts on different dates, but it also shows extra income in a different currency. My fx df shows the fx rates for certain currency pairs on the same dates the extra income came into the accounts.
I want to convert the currency of the extra income into the same currency as the account. For example, account HP on 23/3 has extra income = 35 GBP, and I want to convert that into EUR, as that's the currency of the account. Please note it has to use the fx table, as I have a long history of data points and other accounts to fill, so I don't want to manually code 35 * the fx rate. Finally, I want to create another column in the income df that sums the daily income and the extra income together in the same currency.
I'm not sure how to bring both dfs together so I can get the correct fx rate for that specific date to convert the currency of the extra income into the currency of the account.
My code is below:
import pandas as pd
income_data = {'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'], 'account': ['HP', 'HP', 'JJ', 'JJ'],
'daily_income': [1000, 1000, 2000, 2000], 'ccy of account': ['EUR', 'EUR', 'USD', 'USD'],
'extra_income': [50, 35, 10, 12.5], 'ccy of extra_income': ['EUR', 'GBP', 'EUR', 'USD']}
income_df = pd.DataFrame(income_data)
fx_data = {'date': ['23/3/22', '23/3/22', '24/3/22', '25/3/22'], 'EUR/GBP': [0.833522, 0.833522, 0.833621, 0.833066],
'USD/EUR': [0.90874, 0.90874, 0.91006, 0.90991]}
fx_df = pd.DataFrame(fx_data)
The final df should look like this (I flipped the fx rate, i.e. 1/0.833522, to get some of the values):
I would really appreciate it if someone could help me with this. My initial thought was merge, but I don't have a common column, and I'm not sure the map function would work either, as I don't have a dictionary. Apologies in advance if any of my code is not great; I am still self-learning. Thanks!
Consider creating a common key column for merging in both data frames: the currency pair that converts the extra income's currency into the account's currency. Below uses assign to add columns and Series operators (over arithmetic ones: +, -, *, /).
# ADD A COMMON KEY: "ccy of extra_income / ccy of account", so that
# extra_income * fx_rate is expressed in the account's currency
income_df = income_df.assign(
    currency_ratio = lambda df: df["ccy of extra_income"] + "/" + df["ccy of account"]
)

# ADD REVERSED (RECIPROCAL) CURRENCY RATIOS, THEN
# RESHAPE WIDE TO LONG FORMAT
fx_df_long = pd.melt(
    fx_df.assign(**{
        "GBP/EUR": lambda df: df["EUR/GBP"].rdiv(1),   # 1 / (EUR/GBP)
        "EUR/USD": lambda df: df["USD/EUR"].rdiv(1)    # 1 / (USD/EUR)
    }),
    id_vars = "date",
    var_name = "currency_ratio",
    value_name = "fx_rate"
)

# MERGE AND CALCULATE
# (rows where the extra income is already in the account currency have no
# matching pair, so their rate defaults to 1)
income_df = (
    income_df.merge(
        fx_df_long,
        on = ["date", "currency_ratio"],
        how = "left"
    ).assign(
        total_income = lambda df: df["daily_income"].add(
            df["extra_income"].mul(df["fx_rate"].fillna(1))
        )
    )
)
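As a quick sanity check (a sketch using the sample data above), the HP row on 23/3/22 should end up with its 35 GBP converted at roughly 1/0.833522, i.e. about 41.99 EUR of extra income:
print(income_df[["date", "account", "daily_income", "extra_income", "fx_rate", "total_income"]])
# The 23/3/22 HP/GBP row should show total_income of roughly 1041.99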
How do I qualify time series data as year-wise data using pd.infer_freq()? pd.infer_freq() only gives frequency checks on months and days, but not years. How can I automate it?
import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])
Output:
pd.infer_freq(df.Date)
'MS'
pd.infer_freq(df.Date1)
'D'
pd.infer_freq(df.Date2)
'AS-JAN'
But how do I get Year?
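A minimal sketch of one way to bucket the inferred frequency string into daily/monthly/yearly terms (the freq_category helper and its mapping are my own illustrative convention, not a pandas API; it assumes the df built above):
def freq_category(dates):
    freq = pd.infer_freq(dates)
    if freq is None:
        return None
    if freq.startswith(("A", "Y")):   # e.g. 'AS-JAN', 'A-DEC', 'YS'
        return "yearly"
    if freq.startswith("M"):          # e.g. 'MS', 'M'
        return "monthly"
    if freq.startswith("D"):          # 'D'
        return "daily"
    return freq                       # fall back to the raw string

print(freq_category(df["Date"]))    # 'monthly'
print(freq_category(df["Date1"]))   # 'daily'
print(freq_category(df["Date2"]))   # 'yearly'
print(freq_category(df["Date3"]))   # irregular spacing, likely None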
I have a list of transactions in a data frame and want to group by Symbols and take the sum of one of the columns. Additionally, I want the first instance of this column (per symbol).
My code:
import pandas as pd

# Raw string so the backslashes in the Windows path are kept literally
local_filename = r'C:\Users\nshah\Desktop\Naman\TEMPLATE.xlsx'
data_from_local_file = pd.read_excel(local_filename, sheet_name='JP_A')
data_from_local_file = data_from_local_file[['Symbol', 'Security Name', 'Counterparty', 'Manager', 'Rate', 'LocatedAmt']]
data_grouped = data_from_local_file.groupby(['Symbol'])
pivoted = data_grouped['LocatedAmt'].sum().reset_index()
Next, I want the first instance of, say, Rate for the same Symbol.
Thank you in advance!
You can achieve the sum and first observed instance as follows:
data_grouped = data_from_local_file.groupby(['Symbol'], as_index=False).agg({'LocatedAmt': ['sum', 'first']})
To accomplish this for the other columns as well, apply the aggregations across all of the value columns (leaving Symbol as the grouping key):
value_cols = ['Security Name', 'Counterparty', 'Manager', 'Rate', 'LocatedAmt']
data_grouped_all = (data_from_local_file
                    .groupby('Symbol')[value_cols]
                    .agg(['sum', 'first'])
                    .reset_index())
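If you only need a couple of specific outputs, named aggregation gives flat, readable column names. A sketch, assuming the same data_from_local_file as above:
data_summary = data_from_local_file.groupby('Symbol', as_index=False).agg(
    LocatedAmt_sum=('LocatedAmt', 'sum'),   # total located amount per symbol
    Rate_first=('Rate', 'first'),           # first observed rate per symbol
)
print(data_summary.head())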