How to populate a column in a dataframe with a function - pandas

I'm very new to Python and pandas, so my question is very basic.
I have a simple dataframe with an index and a column 'Years':
Years = pd.DataFrame({'Years': range(1900,2000,1)})
To this dataframe, I need to add a column that performs a specific calculation for each year, say: Year i = X*i
The year itself (e.g. 1900, 1991, etc.) doesn't matter as such; what matters is that each "i" belongs to a specific year.
I hope you can help me resolve this. Thanks very much.

Does this solve your problem?
Years = pd.DataFrame({'Years': range(1900,2000,1)})
Years['calculation'] = 0
for row in Years.index:
    Years.loc[row, 'calculation'] = 10**row
You can also use the previous row's index in the calculation, e.g. like this:
for row in Years.index:
    Years.loc[row, 'calculation'] = 10 * (row - 1)
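A loop usually isn't necessary for this, though. A vectorized sketch of the question's X*i idea (using X = 10 purely as an illustrative stand-in) would be:
import pandas as pd

Years = pd.DataFrame({'Years': range(1900, 2000, 1)})

# compute the value for every row at once from the index, no loop needed
Years['calculation'] = 10 * Years.index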

df = pd.DataFrame({'Years': range(1900,2000,1)})
df.head()
To do calculations, all you have to do is this:
df['new_column'] = df['Years'] * 3
df.head()
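If the per-year rule is an arbitrary Python function rather than simple arithmetic, a minimal sketch with Series.apply (the calculate function below is hypothetical; substitute whatever the real rule is):
import pandas as pd

df = pd.DataFrame({'Years': range(1900, 2000, 1)})

# hypothetical per-year calculation; replace with the real rule
def calculate(year):
    return (year - 1900) * 2

# apply the function to every value in the 'Years' column
df['new_column'] = df['Years'].apply(calculate)
df.head()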

Related

How can I always choose the last column in a csv table that's updated monthly?

I'm automating small-business reporting from my QuickBooks P&L. I'm trying to get the net income value for the current month from a specific cell in a dataframe, but that cell moves one column to the right every month when I update the CSV file.
For example, for the code below, this month I want the value from Nov[0], but next month I'll want the value from Dec[0], even though that column doesn't exist yet.
Is there a graceful way to always select the second-rightmost column, or is this a stupid way to try to get this information?
import numpy as np
import pandas as pd
nov = -810
dec = 14958
total = 8693
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
Sure, you can reference the last or second-to-last row or column.
d = {'Jan': [50], 'Feb': [70], 'Total':[120]}
df = pd.DataFrame(data=d)
x = df.iloc[-1,-2]
This will select the value in the last row for the second-to-last column, in this case 70. :)
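If you'd rather select by name so the code reads more clearly, one sketch (assuming the month columns keep their order and the 'Total' column stays last) is to look up the second-to-last column label first:
import pandas as pd

d = {'Jan': [50], 'Feb': [70], 'Total': [120]}
df = pd.DataFrame(data=d)

# second-to-last column label, e.g. 'Feb' here
latest_month = df.columns[-2]

# value in the last row of that column
x = df[latest_month].iloc[-1]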
If you plan to use the full file, @VincentRupp's answer will get you what you want.
But if you only plan to use the values in the second-rightmost column and you can infer what it will be called, you can tell pd.read_csv that's all you want.
import pandas as pd # 1.5.1
# assuming we want this month's name
# can modify to use some other month
abbreviated_month_name = pd.to_datetime("today").strftime("%b")
df = pd.read_csv("path/to/file.csv", usecols=[abbreviated_month_name])
print(df.iloc[-1, 0])
References
pd.read_csv
strftime cheat-sheet

Pyspark: how to fix 'could not parse datatype: interval' error

I'm trying to add a new column to a pyspark df by subtracting the values of two existing columns.
I already had a date_of_birth column available, so I inserted a current_date column with the following code:
import datetime
from pyspark.sql.functions import lit

currentdate = "14-12-2021"
day, month, year = currentdate.split('-')
today = datetime.date(int(year), int(month), int(day))
df = df.withColumn("current_date", lit(today))
Displaying my df confirms that it worked. Looks a little something like this:
id    date_of_birth    current_date
01    1995-01-01       2021-12-14
02    1987-02-16       2021-12-14
I inserted the age column by subtracting the values of date_of_birth and current_date.
df = df.withColumn('age', (df['current_date'] - df['date_of_birth']))
Cell runs without a problem.
Here's where I'm stuck:
Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:
'could not parse datatype: interval'
I used df.dtypes to check what's happening, and apparently my newly inserted age column is of interval type.
How can I fix this?
Is there a way to display the age in years (int) in this particular scenario?
PS: both the date_of_birth and current_date cols have a date type.
Solved it. Mike's comment helped tons. Thank you!
Here's how I solved it:
from pyspark.sql import functions as f
from pyspark.sql.functions import lit

# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))
# update data with current_date() function
df = df.withColumn("current_date", f.current_date())
# insert new column age with dummy data (in this case, 1s)
df = df.withColumn("age", lit(1))
# update data with months_between() function, divide by 12 to obtain years
df = df.withColumn("age", f.months_between(df.current_date, df.date_of_birth) / 12)
# round and cast as integer to get rid of decimals
df = df.withColumn("age", f.round(df["age"]).cast('integer'))
I would use one of the pyspark functions for calculating the difference between dates.
pyspark.sql.functions.datediff
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.datediff.html
pyspark.sql.functions.months_between
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.months_between.html
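A minimal sketch of the months_between route (assuming date_of_birth already has a date type and that whole years are wanted, as in the accepted approach above):
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("01", "1995-01-01"), ("02", "1987-02-16")],
    ["id", "date_of_birth"],
).withColumn("date_of_birth", f.to_date("date_of_birth"))

# age in whole years: months between today and date_of_birth, divided by 12 and floored
df = df.withColumn(
    "age",
    f.floor(f.months_between(f.current_date(), f.col("date_of_birth")) / 12).cast("integer"),
)
df.show()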

Delete All Rows with Year != 2021 in Pandas

I have a huge pandas df with hourly data from the years 1991-2021, and I need to drop all rows with year != 2021 (or the current year). The dataframe has a column "year" with values ranging from 1991 to 2021. I am using the line of code below, but it does not seem to be doing anything to dataframe df1. Is there a better way to delete all rows where the year is not 2021?
trimmed_df1 = df1.drop(df1[df1.year != '2021'].index)
My data is a 4532472 X 10 column df in this format:
df1.columns.values
Out[20]:
array(['plant_name', 'business_name', 'business_code',
'maint_region_name', 'power_kwh', 'wind_speed_ms', 'mos_time',
'dataset', 'month', 'year'], dtype=object)
This should do the job:
>>> trimmed_df1 = df1.query('year == 2021').reset_index()
Maybe you don't even need to reset the index - it's up to you.
Instead of deleting rows, why not use a .loc[] call to select the rows you do want?
trimmed_df1 = df1.loc[df1.year == '2021']
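If you also want this to keep working once "the current year" moves past 2021, a small sketch (assuming the year column is numeric, and with made-up sample data for df1) could derive the value instead of hard-coding it:
import pandas as pd

df1 = pd.DataFrame({'power_kwh': [1.2, 3.4, 5.6], 'year': [2019, 2021, 2021]})

# 2021 at the time of the question; pd.Timestamp.now().year would track the current year instead
current_year = 2021
trimmed_df1 = df1.loc[df1.year == current_year]
print(trimmed_df1)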

Excluding specific columns in Pandas for column-based computations

Year  A  B  C  D
1900  1  2  3  4
1901  2  3  4  5
I have a dataset which aligns with the above format.
When I want to perform calculations on the column values, the year gets added to the column values and distorts the result. For example:
df['mean'] = df.mean(axis='columns')
In the above example I just want to exclude the year from the calculations. I have 100-plus columns in my dataframe and I cannot manually list each of the columns. 'Year' is also the index of my dataframe.
I realized the problem and the solution.
df.set_index(['Year'])
df['mean'] = df.mean(axis='columns')
This did not work.
But when I added inplace=True, it worked:
df.set_index(['Year'], inplace=True)
df['mean'] = df.mean(axis='columns')
You can also drop the Year column into a new dataframe, apply the mean across the remaining columns, and then add the Year column back.
df2 = df.drop('Year', axis='columns')
df2['Mean'] = df2.mean(axis='columns')
df2 = pd.concat([df['Year'], df2], axis='columns')
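An equivalent sketch that needs no second dataframe (assuming 'Year' is still an ordinary column rather than the index):
import pandas as pd

df = pd.DataFrame({'Year': [1900, 1901], 'A': [1, 2], 'B': [2, 3], 'C': [3, 4], 'D': [4, 5]})

# average every column except 'Year', row by row
df['mean'] = df.drop(columns='Year').mean(axis='columns')
print(df)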

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1.
I want to extract an equal number of rows that have 0's and 1's in the "target" column. I was thinking of using the pandas sampling function, but I'm not sure how to declare the equal number of samples I want from both classes for the dataframe, based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.randn(12000), 'target': np.random.randint(low=0, high=2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop=True)
new_df.target.value_counts()
1    5000
0    5000
Edit: Use DataFrame.sample
You get similar results using DataFrame.sample
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method:
df['weights'] = np.where(df['target'] == 1, .5, .5)
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the percent of data you want back from the original dataframe.
You will have to run df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 with df.filter() and some logic. If you provide sample data, I can help you construct that logic.
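A rough sketch of that idea, using boolean-mask filtering in place of df.filter() (the 12000-row dataframe here is made-up sample data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.randn(12000),
                   'target': np.random.randint(low=0, high=2, size=12000)})

# split by class, take 5000 rows from each, then recombine and shuffle
df0 = df[df['target'] == 0].sample(n=5000, random_state=1)
df1 = df[df['target'] == 1].sample(n=5000, random_state=1)
dfsample = pd.concat([df0, df1]).sample(frac=1, random_state=1)

print(dfsample['target'].value_counts())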