How can I always choose the last column in a csv table that's updated monthly? - pandas

Automating small business reporting from my Quickbooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the csv file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second right most column, or is this a stupid way to try and get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)

Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)

If you plan to use the full file, #VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second right most column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
References
pd.read_csv
strftime cheat-sheet

Related

Interpolate values based in date in pandas

I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet2")
df2.dropna(inplace = True)
For each group of values on the first df X-Axis Value, Y-Axis Value, where the first one is the date and the second one is a value, I would like to create rows with the same date. For instance, df.iloc[0,0] the timestamp is Timestamp('2020-08-25 23:14:12'). However, in the following columns of the same row maybe there is other dates with different Y-Axis Value associated. The first one in that specific row being X-Axis Value NCVE-064 HPNDE with a timestap 2020-08-25 23:04:12 and a Y-Axis Value associated of value 0.952.
What I want to accomplish is to interpolate those values for a time interval, maybe 10 minutes, and then merge those results to have the same date for each row.
For the df2 is moreless the same, interpolate the values in a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as seconds elapsed with respect to some time.
Without further context part the hardest things is to decide at what times you wants to have the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.read_excel(
"https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]
# What time do we want to align the columsn to?
# You can use anything else here or define equally spaced time points
# or something else.
target_times = df[x_columns].min(axis=1)
def interpolate_column(target_times, x_times, y_values):
ref_time = x_times.min()
# For interpolation we need to represent the values as floats. One options is to
# compute the delta in seconds between a reference time and the "current" time.
deltas = (x_times - ref_time).dt.total_seconds()
# repeat for our target times
target_times_seconds = (target_times - ref_time).dt.total_seconds()
return interp1d(deltas, y_values, bounds_error=False,fill_value="extrapolate" )(target_times_seconds)
output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
target_times,
df["X-Axis Value NCVE-063 VPNDE"],
df["Y-Axis Value NCVE-063 VPNDE"],
)
# repeat for the other columns, better in a loop

How to categorize a range of hours in Pandas?

In my project I am trying to create a new column to categorize records by range of hours, let me explain, I have a column in the dataframe called 'TowedTime' with time series data, I want another column to categorize by full hour without minutes, for example if the value in the 'TowedTime' column is 09:32:10 I want it to be categorized as 9 AM, if says 12:45:10 it should be categorized as 12 PM and so on with all the other values. I've read about the .cut and bins function but I can't get the result I want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
df = pd.read_excel("Baltimore Towing Division.xlsx",sheet_name="TowingData")
df['Month'] = pd.DatetimeIndex(df['TowedDate']).strftime("%b")
df['Week day'] = pd.DatetimeIndex(df['TowedDate']).strftime("%a")
monthOrder = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dayOrder = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
pivotHours = pd.pivot_table(df, values='TowedDate',index='TowedTime',
columns='Week day',
fill_value=0,
aggfunc= 'count',
margins = False, margins_name='Total').reindex(dayOrder,axis=1)
print(pivotHours)
First, make sure the type of the column 'TowedTime' is datetime. Second, you can easily extract the hour from this data type.
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S')
df['hour'] = df['TowedTime'].dt.hour
hope it answers your question
With the help of #Fabien C I was able to solve the problem.
First, I had to check the data type of values in the 'TowedTime' column with dtypes function. I found that were a Object.
I proceed to try convert 'TowedTime' to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.time
Then to create a new column in the df, for only the hours:
df['Hour'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.hour
And the result was this:
You can notice in the image that 'TowedTime' column remains as an object, but the new 'Hour' column correctly returns the hour value.
Originally, the dataset already had the date and time separated into different columns, I think they used some method to separate date and time in excel and this created the time ('TowedTime') to be an object, I could not convert it, Or at least that's what the dtypes function shows me.
I tried all this Pandas methods for converting the Object to Datetime :
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = df['TowedTime'].astype('datetime64[ns]')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')

Pandas Dataframe Flatten values to cell based on column value

I have a simple data frame that I am reading in from Excel. To process further, I need to combine the "store" values into a list in one cell that corresponds to the given zone. To clarify, I want to have only one row per zone. In the corresponding "store" column will be a list of all the corresponding stores in one cell.
Current State
Desired State
I have tried to implement melt with no success.
store_df = pd.read_excel("Zones_by_Store.xlsx")
store_df.groupby(store_df['Price Zone Name'])
pd.melt(store_df, id_vars=['Price Zone Name'], value_vars=['Store No'])
store_df.to_csv('Stores.csv')
Try this:
import pandas as pd
price_zone = ['CA2', 'CA2', 'CA2']
store_num = [112, 162, 726]
df = pd.DataFrame(price_zone, columns=['Price Zone'])
df['Store No'] = store_num
df = (df.groupby(['Price Zone']).agg({'Store No': lambda x: x.tolist()}).reset_index())
print(df)

if str-dtype, convert. Date triggers value change in another column

Trying to figure out what I'm doing wrong, new to python and not sure how to solve my AssertionError. If the date in column "sa_death_date" is less than or =to 2020, column "Final Abstraction? Change to 2 if yes." needs to be changed to 2, is greater than 2021, should be 0
Thoughts?
import pandas as pd
import datetime
df=pd.read_csv("C:\\Users\\nm236211\\Desktop\OctoberConsented.DI.REDCap_API.csv")
df.dropna(subset = ["sa_death_date"], inplace=True)
df["sa_death_date"] = pd.to_datetime(df['sa_death_date'])
df.sa_death_date.apply(str)
df.loc[df["sa_death_date"]<= pd.to_datetime(2020,1,1),"Final Abstraction? Change to 2 if yes."] = 2
df.loc[df["sa_death_date"]>= pd.to_datetime(2020,12,31),"Final Abstraction? Change to 2 if yes."] = 0
print(df)
Change the two places of pd.to_datetime(y,m,d) to
pd.datetime(y,m,d)
which is the correct datetime object constructor.

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns, which constantly have more rows appended to it on a weekly basis.
What i'm trying to do is to find the arithmetic mean for the last column for the last 1000 rows, every week. (So when new rows are added to it weekly, it'll just take the average of most recent 1000 rows)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv(fds.csv, index_col=False, header=0)
df_1 = df['Results']
#How should I write the next line of codes to get the average for the most 1000 rows?
I'm on a different machine than what my pandas is installed on so I'm going on memory, but I think what you'll want to do is...
df = pd.read_csv(fds.csv, index_col=False, header=0)
df_1 = df['Results']
#Let's pretend your 5th column has a name (header) of `Stuff`
last_thousand = df_1.tail(1000)
np.mean(last_thousand.Stuff)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
Results will contain the mean for each column within the last 1000 rows. If you want more statistics, you can also use describe():
resutls = df.tail(1000).describe().unstack()
So basically I needed to use the pandas tail function. My Code below works.
df = pd.read_csv(fds.csv, index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))