I'm trying to write a program that computes the average pressure, temperature and humidity within a specified date and time range, but I'm not sure why I'm getting 'nan' values. Here is my code; any ideas?
import pandas as pd
import numpy as np

# pd.DataFrame.from_csv is deprecated; use pd.read_csv, and a raw string for the Windows path
df = pd.read_csv(r'C:\Users\Joey\Desktop\Python\CSV\TM4CVC.csv', index_col=None)
df2 = pd.DataFrame({'temp': df['Ch1_Value'],
                    'press': df['Ch2_Value'],
                    'humid': df['Ch3_Value'],
                    'Date': df['Date'],
                    'Time': df['Time']})
df2['DateTime'] = pd.to_datetime(df2.apply(lambda x: x['Date'] + ' ' + x['Time'], axis=1))
df2.index = pd.to_datetime(df2.pop('DateTime'))
df3 = df2.drop(['Date', 'Time'], axis=1)
#------------------------------------------------------------------------------
def TempPressHumid(datetime_i, datetime_e):
    index = df3[datetime_i:datetime_e]
    out = {'temp_avg': np.mean(index['temp']),
           'temp_std': np.std(index['temp']),
           'press_avg': np.mean(index['press']),
           'press_std': np.std(index['press']),
           'humid_avg': np.mean(index['humid']),
           'humid_std': np.std(index['humid'])}
    print(out)
TempPressHumid(datetime_i = '2012-06-25 08:27:19', datetime_e = '2012-01-25 10:59:33')
My output is:
{'humid_std': nan, 'press_std': nan, 'humid_avg': nan, 'temp_avg': nan, 'temp_std': nan, 'press_avg': nan}
print df3 gives me:
humid press temp
DateTime
2012-06-25 08:21:19 1004.0 21.2 26.0
2012-06-25 08:22:19 1004.0 21.2 26.0
2012-06-25 08:23:19 1004.1 21.3 26.0
-----------------------------------------
etc...
Note that in your call, datetime_e ('2012-01-25 10:59:33') is earlier than datetime_i ('2012-06-25 08:27:19'), so df3[datetime_i:datetime_e] is an empty slice, and the mean of an empty Series is NaN. With the bounds in order, you could try something like this:
a = pd.Series(np.random.random_sample(1000))
b = pd.Series(np.random.random_sample(1000))
c = pd.Series(np.random.random_sample(1000))
df = pd.DataFrame({"temp": a, "press": b, "humid": c})
i = pd.date_range('20120625', periods=1000, freq="h")
df.index = pd.to_datetime(i)
At this point data frame df looks like
humid press temp
2012-06-25 00:00:00 0.910517 0.588777 ...
2012-06-25 01:00:00 0.742219 0.501180
2012-06-25 02:00:00 0.810515 0.172370
2012-06-25 03:00:00 0.215735 0.046797
2012-06-25 04:00:00 0.094144 0.822310
2012-06-25 05:00:00 0.662934 0.629981
2012-06-25 06:00:00 0.876086 0.586799
...
Now let's calculate the mean and standard deviation of the desired date ranges
def TempPressHumid(start, end, df):
    values = {'temp_mean': np.mean(df['temp'][start:end]),
              'temp_std': np.std(df['temp'][start:end]),
              'press_mean': np.mean(df['press'][start:end]),
              'press_std': np.std(df['press'][start:end]),
              'humid_mean': np.mean(df['humid'][start:end]),
              'humid_std': np.std(df['humid'][start:end]),
              }
    print(values)
    return
So if you call TempPressHumid('2012-06-25 08:00:00', '2012-07-25 10:00:00', df) you should see the dictionary of desired values.
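The NaN values are exactly what an empty slice produces; a minimal sketch with made-up data showing that reversed bounds yield an empty slice whose mean is NaN:

```python
import numpy as np
import pandas as pd

# Made-up minute-frequency data standing in for df3.
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range('2012-06-25 08:21:19', periods=3, freq='min'))

# End bound earlier than start bound: the slice is empty...
empty = s['2012-06-26':'2012-06-25']

# ...and the mean of an empty Series is NaN.
print(len(empty), np.mean(empty))
```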
I have the following df:
I want to group this df on the first column (ID) and the second column (key), and from there build a cumsum for each day. The cumsum should be over the last column (Speed).
I tried this with the following code:
df = pd.read_csv('df.csv')
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['ID', 'key'])
grouped = df.groupby(['ID', 'key'])
test = pd.DataFrame()
test2 = pd.DataFrame()
for name, group in grouped:
    test = group.groupby(pd.Grouper(key='Time', freq='1d'))['Speed'].cumsum()
    test = test.reset_index()
    test['ID'] = ''
    test['ID'] = name[0]
    test['key'] = ''
    test['key'] = name[1]
    test2 = test2.append(test)
But the result seems off: there are more than the expected 5 rows. I want one row per day with the cumsum for each ID and key.
Does anyone see the reason for my problem?
Thanks in advance.
Friendly reminder: it's useful to include a runnable example.
import pandas as pd
data = [{"cid":33613,"key":14855,"ts":1550577600000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550579340000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550584800000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550682000000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550685900000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550773380000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550858400000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550941200000,"value":25.0},
{"cid":33613,"key":14855,"ts":1550978400000,"value":50.0}]
df = pd.DataFrame(data)
df['ts'] = pd.to_datetime(df['ts'], unit='ms')
I believe what you need can be accomplished as follows:
df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum().cumsum()
Result:
cid key ts
33613 14855 2019-02-19 150.0
2019-02-20 250.0
2019-02-21 300.0
2019-02-22 350.0
2019-02-23 375.0
2019-02-24 425.0
Name: value, dtype: float64
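One caveat with the one-liner above: the trailing .cumsum() runs over the whole resampled Series, so with more than one (cid, key) group the running total would spill across group boundaries. A sketch, with made-up two-group data, that keeps the cumulative sum inside each group:

```python
import pandas as pd

# Made-up data with two (cid, key) groups.
data = [
    {"cid": 1, "key": 1, "ts": "2019-02-19", "value": 10.0},
    {"cid": 1, "key": 1, "ts": "2019-02-20", "value": 10.0},
    {"cid": 2, "key": 2, "ts": "2019-02-19", "value": 5.0},
]
df = pd.DataFrame(data)
df["ts"] = pd.to_datetime(df["ts"])

daily = df.set_index("ts").groupby(["cid", "key"])["value"].resample("D").sum()
# Restrict the running total to each (cid, key) group instead of the whole Series.
out = daily.groupby(level=["cid", "key"]).cumsum()
print(out)
```

A plain .cumsum() here would carry 20.0 from group (1, 1) into group (2, 2), giving 25.0 instead of 5.0 for its first day.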
I am trying to create a new column for a data frame, but it seems to be giving an incorrect result in the new column. The data is below:
df = pd.DataFrame(np.random.randint(0,30,size=10),
columns=["Random"],
index=pd.date_range("20180101", periods=10))
df=df.reset_index()
df.loc[:,'Random'] = '20'
df['Recommandation']=['No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No']
df['diff']=[3,2,4,1,6,1,2,2,3,1]
df
I am trying to create another column 'new' using the following conditions:
If the 'index' is in the first three dates, then 'new' = 'Random',
elif the 'Recommandation' is Yes, then 'new' = value of the previous row of the 'new' column + 'diff',
else 'new' = value of the previous row of the 'new' column.
My code is below:
import numpy as np
df['new'] = 0
df['new'] = np.select([df['index'].isin(df['index'].iloc[:3]), df['Recommandation'].eq('Yes')],
                      [df['new'], df['diff'] + df['new'].shift(1)],
                      df['new'].shift(1))
# The expected output:
df['new'] = [20, 20, 20, 21, 27, 28, 28, 28, 31, 31]
df
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,30,size=10),
columns=["Random"],
index=pd.date_range("20180101", periods=10))
df = df.reset_index()
df.loc[:,'Random'] = 20
df['Recommandation'] = ['No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No']
df['diff'] = [3,2,4,1,6,1,2,2,3,1]
df.loc[5, 'index'] = pd.to_datetime('2018-01-02') # I modified this data
df['new'] = df['diff']
df['new'] = df['new'].where(df.Recommandation.eq('Yes'))
# the mask that 'index' is in the first three date
m = df['index'].isin(df['index'][:3])
df.loc[m, 'new'] = df.Random
idx = m[m].index.drop([df.index.min()], errors='ignore')
df['new'] = pd.concat(map(lambda x: x.cumsum().ffill(), np.split(df.new, idx)))
df
>>>
index Random Recommandation diff new
0 2018-01-01 20 No 3 20.0
1 2018-01-02 20 Yes 2 20.0
2 2018-01-03 20 No 4 20.0
3 2018-01-04 20 Yes 1 21.0
4 2018-01-05 20 Yes 6 27.0
5 2018-01-02 20 Yes 1 20.0
6 2018-01-07 20 No 2 20.0
7 2018-01-08 20 No 2 20.0
8 2018-01-09 20 Yes 3 23.0
9 2018-01-10 20 No 1 23.0
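For comparison, the three rules from the question can also be written as a plain row-by-row loop, which is easy to check against the expected output (a sketch assuming numeric Random = 20 and the sample columns from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "index": pd.date_range("20180101", periods=10),
    "Random": 20,
    "Recommandation": ['No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
    "diff": [3, 2, 4, 1, 6, 1, 2, 2, 3, 1],
})

new = []
first_three = set(df["index"].iloc[:3])
for _, row in df.iterrows():
    if row["index"] in first_three:
        new.append(row["Random"])            # first three dates: take Random
    elif row["Recommandation"] == "Yes":
        new.append(new[-1] + row["diff"])    # Yes: previous 'new' plus diff
    else:
        new.append(new[-1])                  # No: carry previous 'new' forward
df["new"] = new
print(df["new"].tolist())  # [20, 20, 20, 21, 27, 28, 28, 28, 31, 31]
```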
I am using the following code:
import pandas as pd
from yahoofinancials import YahooFinancials
mutual_funds = ['PRLAX', 'QASGX', 'HISFX']
yahoo_financials_mutualfunds = YahooFinancials(mutual_funds)
daily_mutualfund_prices = yahoo_financials_mutualfunds.get_historical_price_data('2015-01-01', '2021-01-30', 'daily')
I get a dictionary as the output. I would like to get a pandas dataframe with the columns date, PRLAX, QASGX, HISFX, where date is the formatted_date and the values are the Open price for each ticker.
What you can do is this:
df = pd.DataFrame({
a: {x['formatted_date']: x['adjclose'] for x in daily_mutualfund_prices[a]['prices']} for a in mutual_funds
})
which gives:
PRLAX QASGX HISFX
2015-01-02 19.694817 17.877445 11.852874
2015-01-05 19.203604 17.606575 11.665626
2015-01-06 19.444574 17.316357 11.450289
2015-01-07 19.963596 17.616247 11.525190
2015-01-08 20.260176 18.003208 11.665626
... ... ... ...
2021-01-25 21.799999 33.700001 14.350000
2021-01-26 22.000000 33.139999 14.090000
2021-01-27 21.620001 32.000000 13.590000
2021-01-28 22.120001 32.360001 13.990000
2021-01-29 21.379999 31.709999 13.590000
[1530 rows x 3 columns]
or any other of the values in the dict.
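Since the question asked for the Open price, the same nested comprehension works with the 'open' key instead of 'adjclose'; a self-contained sketch with a mocked stand-in for the dictionary that get_historical_price_data returns (the numbers here are made up):

```python
import pandas as pd

mutual_funds = ['PRLAX', 'QASGX']

# Mocked stand-in for the nested structure returned by get_historical_price_data.
daily_mutualfund_prices = {
    'PRLAX': {'prices': [
        {'formatted_date': '2015-01-02', 'open': 19.69},
        {'formatted_date': '2015-01-05', 'open': 19.20},
    ]},
    'QASGX': {'prices': [
        {'formatted_date': '2015-01-02', 'open': 17.88},
        {'formatted_date': '2015-01-05', 'open': 17.61},
    ]},
}

# One column per ticker, indexed by formatted_date.
df = pd.DataFrame({
    a: {x['formatted_date']: x['open'] for x in daily_mutualfund_prices[a]['prices']}
    for a in mutual_funds
})
df.index.name = 'date'
print(df)
```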
I am stuck understanding the method to use. I have the following dataframe:
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
df
I need to:
group by the same 'CODE',
check that the 'DESC' is not the same,
check that the 'TYPE' is the same,
calculate the month difference between the dates that satisfy the previous 2 conditions.
The expected output is the below:
The following code uses .drop_duplicates() and .duplicated() to keep or throw out rows from your dataframe that have duplicate values.
How would you calculate a month's difference? A month can be 28, 30 or 31 days. You could divide the end result by 30 and get an indication of the number of months difference. So I kept it in days for now.
import pandas as pd
df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
}
df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
# only keep rows that have the same code and type
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]
# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)
# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
# drop rows that don't have a previous date
df = df.dropna()
# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
This results in the following df:
CODE DATE TYPE DESC previous_date difference_in_dates
AACCBD 2020-07-21 PUB OK 2020-07-16 5 days
BBLGLC70M 2019-09-25 PRI OK 2019-05-16 132 days
BCCDN 2020-02-27 PUB OK 2020-02-13 14 days
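To turn those day counts into an approximate month difference, as suggested above, one option is dividing by the average month length; note this is only an approximation (dateutil's relativedelta gives calendar-exact month arithmetic):

```python
import pandas as pd

# The day differences from the result table, recreated as a standalone Series.
diffs = pd.to_timedelta(pd.Series(['5 days', '132 days', '14 days']))
approx_months = diffs.dt.days / 30.44  # 30.44 ~ average Gregorian month length
print(approx_months.round(1).tolist())  # [0.2, 4.3, 0.5]
```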
I wrote a function that explicitly defines which rows of a particular column ('UNEXPLAINED_Sq') to sum in a Pandas df based on two other columns ('REPORTING_DATE', 'WindowEnd').
def sum_scores(row):
    s = row['REPORTING_DATE']
    e = row['WindowEnd']
    s_sum = df.loc[df['REPORTING_DATE'] <= s, 'UNEXPLAINED_Sq'].sum()
    e_sum = df.loc[df['REPORTING_DATE'] < e, 'UNEXPLAINED_Sq'].sum()
    return s_sum - e_sum

df.loc[:, 'UNEXPLAINED_Sq_SUM'] = df.apply(sum_scores, axis=1)
Then I made it more generic so I could pass in a column variable to sum:
def sum_scores_5(row, c_Name):
    s = row['REPORTING_DATE']
    e = row['WindowEnd']
    s_sum = df.loc[df['REPORTING_DATE'] <= s, df[c_Name]].sum
    e_sum = df.loc[df['REPORTING_DATE'] < e, df[c_Name]].sum
    return s_sum - e_sum

df.loc[:, 'UNEXPLAINED_Sq_SUM'] = df.apply(sum_scores_5, 'UNEXPLAINED_Sq', axis=1)
But this returns: TypeError: apply() got multiple values for argument 'axis'
Then I thought I would use a lambda function to pass the multiple arguments:
df.loc[:, 'UNEXPLAINED_Sq_SUM'] = df.apply(lambda x: sum_scores_5(x, df['UNEXPLAINED_Sq']), axis=1)
But this returns
KeyError: ('[2.01018345e+13 ...\n 1.67234745e+ 3.02534089e+14] not in index', 'occurred at index 0')
which I think is because I'm attempting to pass the entire column in to be evaluated at each row.
Snippet of the data is listed below. How can I index the column variable to be summed?
*****Data Table*****
REPORTING_DATE UNEXPLAINED_Sq WindowEnd
2019-02-01 2.010183e+13 2018-08-01
2019-02-04 6.136327e+13 2018-08-04
2019-02-05 1.123688e+13 2018-08-05
2019-02-06 1.253237e+12 2018-08-06
2019-02-07 5.673673e+13 2018-08-07
def sum_scores_5(row, c_Name):
    s = row['REPORTING_DATE']
    e = row['WindowEnd']
    s_sum = df.loc[df['REPORTING_DATE'] <= s, c_Name].sum()
    e_sum = df.loc[df['REPORTING_DATE'] < e, c_Name].sum()
    return s_sum - e_sum

df.loc[:, 'UNEXPLAINED_Sq_SUM'] = df.apply(lambda x: sum_scores_5(x, 'UNEXPLAINED_Sq'), axis=1)
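As an aside, the earlier TypeError happened because 'UNEXPLAINED_Sq' was passed positionally and landed on apply's axis parameter. DataFrame.apply also forwards extra positional arguments to the function through its args= keyword, so the lambda is optional; a sketch with made-up data:

```python
import pandas as pd

# Made-up stand-in for the question's table.
df = pd.DataFrame({
    'REPORTING_DATE': pd.to_datetime(['2019-02-01', '2019-02-04', '2019-02-05']),
    'WindowEnd': pd.to_datetime(['2018-08-01', '2018-08-04', '2018-08-05']),
    'UNEXPLAINED_Sq': [1.0, 2.0, 3.0],
})

def sum_scores_5(row, c_Name):
    s_sum = df.loc[df['REPORTING_DATE'] <= row['REPORTING_DATE'], c_Name].sum()
    e_sum = df.loc[df['REPORTING_DATE'] < row['WindowEnd'], c_Name].sum()
    return s_sum - e_sum

# args=() forwards extra positional arguments to the applied function.
df['UNEXPLAINED_Sq_SUM'] = df.apply(sum_scores_5, axis=1, args=('UNEXPLAINED_Sq',))
print(df['UNEXPLAINED_Sq_SUM'].tolist())  # [1.0, 3.0, 6.0]
```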