How to select data from multiple dataframes - numpy

I'm a beginner with pandas. I have two dataframes. The first, called
DATA_DF, contains many fields; I'm interested in DATA_DF['Date effet'], which is of type datetime.
The other dataframe, called TAUX_DF, contains years, and every year has a value in each of two rate columns:
TAUX_DF =
Année  <10 ans  >10 ans
1987   2,8168%  3,4664%
1988   2,8168%  3,4664%
1989   2,8168%  3,4664%
1990   2,8168%  3,4664%
I want to create a new column DATA_DF['Taux technique']:
take the year from DATA_DF['Date effet'].dt.year, match it against the year in TAUX_DF['Année'], and fill in the value the way this Excel formula does (SI, RECHERCHEV and ANNEE are the French Excel names for IF, VLOOKUP and YEAR):
=SI(G5>120;RECHERCHEV(ANNEE(C5);Taux!$A$2:$C$29;3;FAUX);RECHERCHEV(ANNEE(C5);Taux!$A$2:$C$29;2;FAUX))
That is: if the duration in G5 is greater than 120, look the year up in the third column of the rate table (>10 ans), otherwise in the second (<10 ans).

DATA_DF['Année'] = DATA_DF['Date effet'].dt.year  # year column, so DATA_DF can be merged with TAUX_DF
DATA_DF = pd.merge(DATA_DF, TAUX_DF, on='Année', how='left')
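The merge above brings both rate columns onto every row; what the Excel formula still adds is the conditional choice between them. A minimal sketch of that remaining step, assuming the duration tested in Excel's G5 lives in a column named 'Durée' (a hypothetical name; adjust to your data) and that the rates are strings like '2,8168%':
import numpy as np

# Convert the French-formatted percent strings to floats, e.g. '2,8168%' -> 0.028168.
for col in ['<10 ans', '>10 ans']:
    DATA_DF[col] = (DATA_DF[col].str.rstrip('%')
                                .str.replace(',', '.')
                                .astype(float) / 100)

# Equivalent of the Excel SI(...): take the >10-year rate when the duration
# exceeds 120 (presumably months, i.e. 10 years), otherwise the <10-year rate.
DATA_DF['Taux technique'] = np.where(DATA_DF['Durée'] > 120,
                                     DATA_DF['>10 ans'],
                                     DATA_DF['<10 ans'])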

Related

How to change all values in a column (pandas), based on values in same row of different column

I have a CSV as a data frame in Python. An example row of ages and year built:
AGE  BUILT
82   2016
How can I change all the values in the age column of the df to equal the current year - the year the house was built (from built column)?
import pandas as pd

print('\n***Data Analysis for Housing CSV***')
housing_df = pd.read_csv('Housing.csv', header=0)
current_year = 2021
# Broken attempt: this chained assignment rebinds housing_df to the
# right-hand Series instead of updating the 'AGE' column:
housing_df = housing_df.loc[housing_df['AGE'] != current_year -
    housing_df['BUILT'], 'AGE'] = current_year - housing_df['BUILT']
A plain column assignment does the whole job at once:
housing_df["AGE"] = current_year - housing_df["BUILT"]
Edit: pandas broadcasts these arithmetic operations element-wise, whether between a column and a scalar or between two columns, even when they look illogical at first glance. For example:
df["rectArea"] = df["height"] * df['width']

Delete All Rows with Year != Pandas

I have a huge pandas df with hourly data from the years 1991-2021, and I need to drop all rows whose year is not 2021, the current year. In my dataframe there is a column "year" with years ranging from 1991-2021. I am using the line of code below, but it does not seem to be doing anything to dataframe df1. Is there a better way to delete all rows that do not have year == 2021?
trimmed_df1 = df1.drop(df1[df1.year != '2021'].index)
My data is a 4532472-row by 10-column df in this format:
df1.columns.values
Out[20]:
array(['plant_name', 'business_name', 'business_code',
'maint_region_name', 'power_kwh', 'wind_speed_ms', 'mos_time',
'dataset', 'month', 'year'], dtype=object)
This should do the job; note the condition is year == 2021, since query keeps the rows that satisfy it:
>>> trimmed_df1 = df1.query('year == 2021').reset_index(drop=True)
Maybe you don't even need to reset the index - it's up to you.
Instead of deleting rows, why not use a .loc[] call to select the rows you do want?
trimmed_df1 = df1.loc[df1.year == 2021]
Two likely culprits in the original attempt: drop returns a new dataframe and leaves df1 untouched unless you reassign or pass inplace=True, and the comparison value must match the column's dtype - if "year" holds integers, compare with the integer 2021, not the string '2021'.
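A minimal sketch of both points, with made-up data and an integer "year" column:
import pandas as pd

df1 = pd.DataFrame({'year': [2019, 2020, 2021, 2021],
                    'power_kwh': [1.0, 2.0, 3.0, 4.0]})
print(df1['year'].dtype)                 # check the dtype before comparing
trimmed_df1 = df1.loc[df1.year == 2021]  # keeps only the 2021 rows; df1 is untouched
print(trimmed_df1)
#    year  power_kwh
# 2  2021        3.0
# 3  2021        4.0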

Excluding specific columns in Pandas for column based computations

Year A B C D
1900 1 2 3 4
1901 2 3 4 5
I have a dataset which aligns with the above format.
When I want to perform calculations on column values, the year gets added into the calculation and distorts the result. For example:
df['mean'] = df.mean(axis='columns')
In the above example I just want to exclude Year from the calculation. I have 100-plus columns in my data frame and cannot list each of them manually. 'Year' is also meant to be the index of my dataframe.
I realized the problem and the solution. This did not work, because set_index returns a new dataframe by default and the result was thrown away:
df.set_index(['Year'])
df['mean'] = df.mean(axis='columns')
But when I added inplace = True, it worked:
df.set_index(['Year'], inplace=True)
df['mean'] = df.mean(axis='columns')
You can also drop the Year column into a new dataframe, apply the mean to the remaining columns, and then add Year back:
df2 = df.drop(columns='Year')
df2['Mean'] = df2.mean(axis='columns')
df2 = pd.concat([df['Year'], df2], axis=1)
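A minimal runnable sketch of the index-based fix, using the sample table above (reassignment instead of inplace=True comes to the same thing):
import pandas as pd

df = pd.DataFrame({'Year': [1900, 1901],
                   'A': [1, 2], 'B': [2, 3], 'C': [3, 4], 'D': [4, 5]})
df = df.set_index('Year')            # Year becomes the index, so row-wise math skips it
df['mean'] = df.mean(axis='columns')
print(df)
#       A  B  C  D  mean
# Year
# 1900  1  2  3  4   2.5
# 1901  2  3  4  5   3.5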

ArcPy & Python - Get Latest TWO dates, grouped by Value

I've looked around for the last week for an answer but have only found partial ones. Being new to Python, I could really use some assistance.
I have two fields in a table, [number] and [date]. The date format is date and time, e.g. 07/09/2018 3:30:30 PM. The [number] field is just an integer, but several rows may share the same number.
I have tried a few options to gain access to the LATEST date, and I can get these using Pandas:
myarray = arcpy.da.FeatureClassToNumPyArray(fc, ['number', 'date'])
mydf = pd.DataFrame(myarray)
# Mask marking, within each number, the rows whose date equals that group's maximum
date_index = mydf.groupby(['number'])['date'].transform('max') == mydf['date']
However, I need the latest TWO dates. I've moved on to trying an "IF" statement, because I feel arcpy.da.UpdateCursor is better suited to walk through the records, group by number, and update another field for the rows with the latest TWO dates.
End result would look like the following table, grouped by number with the latest two dates (as examples):
Number  Date
1       7/29/2018 4:30:44 PM
1       7/30/2018 5:55:34 PM
2       8/2/2018 5:45:23 PM
2       8/3/2018 6:34:32 PM
Try this.
import pandas as pd
import numpy as np
# Some data.
data = pd.DataFrame({'number': np.random.randint(3, size = 15), 'date': pd.date_range('2018-01-01', '2018-01-15')})
# Look at the data.
data
Which gives some random sample data: 15 consecutive dates, each tagged with a number from 0 to 2. With the draw used here, in our output we'd expect to see number 0 with the 5th and the 9th, 1 with the 14th and 15th, and 2 with the 6th and the 12th.
Then we group by number, grab the last two rows, and set and sort the index.
# Group and label the index.
last_2 = data.groupby('number').tail(2).set_index('number').sort_index()
last_2
Which gives us what we expect.
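One caveat: groupby(...).tail(2) takes the last two rows of each group in row order, which matches the latest two dates here only because the sample data is already sorted by date. A sketch that sorts first, so it holds for arbitrary row order:
# Sort by date so tail(2) really is the two most recent rows per number.
last_2 = (data.sort_values('date')
              .groupby('number').tail(2)
              .set_index('number').sort_index())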

Pandas: how do I group a Data Frame by a set of ordinal values?

I'm starting to learn about Python Pandas and want to generate a graph with the sum of arbitrary groupings of an ordinal value. It can be better explained with a simple example.
Suppose I have a table of food consumption data, one row per food and year, in a DataFrame with columns food, year and amount.
And I have two groups of foods defined as two lists:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
Now I want to plot a graph with the evolution of consumption of junk and healthy food. I believe I must first process my data to get a DataFrame with one column per group and one row per year, as in the output shown in the answer below.
Supposing the first table is already in a DataFrame called food, how do I transform it to get the second one?
I also welcome suggestions to reword my question to make it clearer, or for different approaches to generate the plot.
First create a dictionary from the lists, then swap keys with values.
Then group by the food column mapped through the dict and by year, aggregate the sum, and last reshape with unstack:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
d1 = {'healthy':healthy, 'junk':junk}
##http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'brocolli': 'healthy', 'cheetos': 'junk', 'apple': 'healthy', 'coke': 'junk'}
df1 = df.groupby([df.food.map(d), 'year'])['amount'].sum().unstack(0)
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
Another solution with pivot_table:
df1 = df.pivot_table(index='year', columns=df.food.map(d), values='amount', aggfunc='sum')
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
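Since the end goal was a plot of how each group evolves, either reshaped frame plots directly. A minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

df1.plot()  # one line per group (healthy, junk), with year on the x-axis
plt.ylabel('amount')
plt.show()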