col_A vi_B data_source index_as_date
2017-01-21 0.000000 0.199354 sat 2017-01-21
2017-01-22 0.000000 0.204250 NaN NaT
2017-01-23 0.000000 0.208077 NaN NaT
2017-01-27 0.000000 0.215081 NaN NaT
2017-01-28 0.000000 0.215300 NaN NaT
In the pandas DataFrame above, I want to insert a row for 24 January 2017 with the values 0.01, 0.4, sat, NaT. How do I do that? I could use iloc and insert it manually, but I would prefer an automated solution that takes the datetime index into account.
I think you need setting with enlargement, followed by sort_index:
#if necessary convert to datetime
df.index = pd.to_datetime(df.index)
df['index_as_date'] = pd.to_datetime(df['index_as_date'])
df.loc[pd.to_datetime('2017-01-24')] = [0.01,0.4,'sat', pd.NaT]
df = df.sort_index()
print (df)
col_A vi_B data_source index_as_date
2017-01-21 0.00 0.199354 sat 2017-01-21
2017-01-22 0.00 0.204250 NaN NaT
2017-01-23 0.00 0.208077 NaN NaT
2017-01-24 0.01 0.400000 sat NaT
2017-01-27 0.00 0.215081 NaN NaT
2017-01-28 0.00 0.215300 NaN NaT
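As a side note, the same row can also be added by concatenating a one-row DataFrame, which scales to inserting several dates at once (a minimal sketch reusing the column names above; pd.concat is not part of the original answer):
new_row = pd.DataFrame([[0.01, 0.4, 'sat', pd.NaT]],
                       columns=['col_A', 'vi_B', 'data_source', 'index_as_date'],
                       index=[pd.Timestamp('2017-01-24')])
df = pd.concat([df, new_row]).sort_index()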
Related
I am trying to obtain the minimum value of three consecutive cells in pandas. The calculation should take the cell above and the cell below into account.
I have tried scipy's argrelextrema, but I have a feeling it does not perform a rolling window.
Thanks
This is a wild approach but it did not perform as expected.
import numpy as np
import pandas as pd
def pivot_swing_low(df):
    data = df.copy()
    data['d1'] = data.Close.shift(-1)  # next close
    data['d3'] = data.Close.shift(0)   # current close
    data['d4'] = data.Close.shift(1)   # previous close
    data['minPL'] = data[['d1', 'd3', 'd4']].min(axis=1)
    data['PL'] = np.where(data['minPL'] == data['d3'], data['d3'], "NaN")
    data['recentPL'] = data.PL.shift(2).astype(float).fillna(method='ffill')
    data = data.drop(columns=['d1', 'd3', 'd4'])
    return data
It will always capture the row number 33, but to me row 31 is relevant as well.
38.78 1671068699999 2022-12-15 01:44:59.999 NaN NaN -0.37 0.00 0.37 0.023571 0.054286 0.023125 0.057698 0.400805 28.612474 NaN NaN 38.78 38.78 39.15
30 38.79 1671068999999 2022-12-15 01:49:59.999 NaN NaN 0.01 0.01 0.00 0.022857 0.054286 0.022188 0.053576 0.414137 29.285496 NaN NaN 38.48 NaN 39.15
31 38.48 1671069299999 2022-12-15 01:54:59.999 NaN NaN -0.31 0.00 0.31 0.021429 0.076429 0.020603 0.071892 0.286583 22.274722 22.274722 NaN 38.48 38.48 38.78
32 38.67 1671069599999 2022-12-15 01:59:59.999 NaN NaN 0.19 0.19 0.00 0.035000 0.074286 0.032703 0.066757 0.489878 32.880419 NaN NaN 38.37 NaN 38.78
33 38.37 1671069899999 2022-12-15 02:04:59.999 38.37000000 NaN -0.30 0.00 0.30 0.035000 0.093571 0.030367 0.083417 0.364036 26.688174 NaN NaN 38.37 38.37 38.48
34 38.58 1671070199999 2022-12-15 02:09:59.999 NaN NaN 0.21 0.21 0.00 0.050000 0.090000 0.043198 0.077459 0.557687 35.802263 NaN NaN 38.37 NaN 38.48
35 38.70 1671070499999 2022-12-15 02:14:59.999 NaN NaN 0.12 0.12 0.00 0.058571 0.090000 0.048684 0.071926 0.676857 40.364625 NaN 40.364625 38.58 NaN 38.37
import pandas as pd
# Load the data into a dataframe
df = pd.read_csv('data.csv')
# Calculate the minimum of the current cell, the cell above, and the cell below
# (center=True makes the 3-row window symmetric around the current row)
min_three_cells = df['value'].rolling(3, center=True, min_periods=1).min()
# View the results
print(min_three_cells)
This might help.
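Applied to the question's own data, a minimal sketch along the same lines (assuming the Close column and the data frame from the question; the helper column names here are made up) would flag a row as a pivot low when its close equals the centred 3-row minimum:
data['min3'] = data['Close'].rolling(3, center=True, min_periods=1).min()
# True where the close is the lowest of (previous, current, next) close
data['is_pivot_low'] = data['Close'].eq(data['min3'])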
I have a pandas DataFrame with a monthly date index up to the current month. I would like to add NA values for n periods into the future (in my case, 1 year). I tried adding future dates to the existing index in the following manner:
recentDate = inputFileDf.index[-1]
outputFileDf.index = outputFileDf.index.append(pd.date_range(recentDate , periods=12, freq="M"))
This throws ValueError: Length mismatch: Expected axis has 396 elements, new values have 408 elements.
Would appreciate any help to "extend" the dataframe by adding the dates and NA values.
You can use df.reindex here.
Example data:
df = pd.DataFrame(
{'num': [*range(5)]},
index=pd.date_range('2022-10-10', periods=5, freq='D'))
print(df)
num
2022-10-10 0
2022-10-11 1
2022-10-12 2
2022-10-13 3
2022-10-14 4
recentDate = df.index[-1]
new_data = pd.date_range(recentDate , periods=4, freq="M")
new_idx = df.index.append(new_data)
new_df = df.reindex(new_idx)
print(new_df)
num
2022-10-10 0.0
2022-10-11 1.0
2022-10-12 2.0
2022-10-13 3.0
2022-10-14 4.0
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
Use concat, which works whether or not the indices are unique:
recentDate = inputFileDf.index[-1]
df = pd.DataFrame(index=pd.date_range(recentDate, periods=12, freq="M"))
outputFileDf = pd.concat([inputFileDf, df])
If the indices in idx are unique, use DataFrame.reindex:
recentDate = inputFileDf.index[-1]
idx = inputFileDf.index.append(pd.date_range(recentDate , periods=12, freq="M"))
outputFileDf = outputFileDf.reindex(idx)
EDIT: If the original DataFrame already has a month-end index and you need to append new periods, add 1 month to the last index to avoid duplicating the last date of the original DataFrame:
inputFileDf = pd.DataFrame(columns=['col'],
index=pd.date_range('2022-10-31', periods=4, freq='M'))
print(inputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
recentDate = inputFileDf.index[-1]
idx = inputFileDf.index.append(pd.date_range(recentDate + pd.DateOffset(months=1) , periods=12, freq="M"))
outputFileDf = inputFileDf.reindex(idx)
print (outputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 NaN
2023-04-30 NaN
2023-05-31 NaN
2023-06-30 NaN
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
2024-01-31 NaN
Or use first index value:
firstDate = inputFileDf.index[0]
idx = pd.date_range(firstDate, periods=12 + len(inputFileDf), freq="M")
outputFileDf = inputFileDf.reindex(idx)
print (outputFileDf)
col
2022-10-31 NaN
2022-11-30 NaN
2022-12-31 NaN
2023-01-31 NaN
2023-02-28 NaN
2023-03-31 NaN
2023-04-30 NaN
2023-05-31 NaN
2023-06-30 NaN
2023-07-31 NaN
2023-08-31 NaN
2023-09-30 NaN
2023-10-31 NaN
2023-11-30 NaN
2023-12-31 NaN
2024-01-31 NaN
I have missing data and would like to replace the NaNs with random values drawn between the existing min and max of each column (a different fill value for each NaN). I have been trying things like the code below, but it doesn't work, and I am not sure how to loop through the columns correctly since the min and max change for each column.
import datetime
import numpy as np
import pandas as pd
def fill_blanks(df):
    for i in list(df):
        for x in i:
            if type(x) is datetime.datetime:
                return x
                continue
            if pd.isnull(x):
                #print (i,x)
                x = (np.random.uniform(df[i].min(), df[i].max()))
                return x
            else:
                return x
df.applymap(fill_blanks)
Example data:
d = {'Date': ['2015-09-01 09:00:00', '2015-09-02 09:00:00', '2015-09-03 09:00:00', '2015-09-01 09:00:00'],
     'col2': [np.nan, 102, np.nan, 105],
     'col3': [1, np.nan, 3, 2.5],
     'col4': [0.0001, 0.0002, np.nan, 0.0003]}
df = pd.DataFrame(data=d)
df
gives
Out[5]:
Date col2 col3 col4
0 2015-09-01 09:00:00 NaN 1.0 0.0001
1 2015-09-02 09:00:00 102.0 NaN 0.0002
2 2015-09-03 09:00:00 NaN 3.0 NaN
3 2015-09-01 09:00:00 105.0 2.5 0.0003
desired output might be:
Out[5]:
Date col2 col3 col4
0 2015-09-01 09:00:00 102.5 1.0 0.0001
1 2015-09-02 09:00:00 102.0 2.0 0.0002
2 2015-09-03 09:00:00 104.5 3.0 0.0002
3 2015-09-01 09:00:00 105.0 2.5 0.0003
You can use:
numeric_cols = df.select_dtypes([np.number]).columns
df[numeric_cols] = df[numeric_cols].apply(lambda x: x.fillna(np.random.uniform(x.min(), x.max(), 1)[0]))
Output:
Date col2 col3 col4
0 2015-09-01 09:00:00 100.00000 1.000000 0.000100
1 2015-09-02 09:00:00 102.00000 1.435334 0.000200
2 2015-09-03 09:00:00 103.97625 3.000000 0.962672
3 2015-09-01 09:00:00 105.00000 2.500000 0.000300
If you want every NaN in a column to be filled with a different random value, use:
df[numeric_cols] = df[numeric_cols].apply(lambda x: x.fillna(pd.Series(np.random.uniform(x.min(), x.max(), len(x)))))
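If you prefer an explicit loop, a minimal sketch (not part of the original answer) that reuses numeric_cols from above fills only the NaN positions, drawing an independent value per missing cell:
for col in numeric_cols:
    mask = df[col].isna()
    # one independent draw per missing cell, bounded by the column's observed min/max
    df.loc[mask, col] = np.random.uniform(df[col].min(), df[col].max(), mask.sum())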
I am trying to test different sell criteria based on a simple moving average (SMA), relative to a given buy date.
I have a DataFrame of buy data as follows (df_buy). I want to fill in the NaN values. (FCU = First Close Under)
Symbol Time buy_price LOD date_FCU_10dMA price_FCU_10dMA
0 AMD 2019-12-12 09:36:00 39.52 27.43 NaN NaN
1 AMD 2020-01-16 09:33:00 49.21 27.43 NaN NaN
2 BITF 2021-08-03 09:47:00 4.26 2.81 NaN NaN
3 DOCN 2021-06-14 09:32:00 41.76 35.35 NaN NaN
4 NVDA 2020-07-29 09:38:00 416.81 169.32 NaN NaN
5 NVDA 2020-09-25 10:34:00 499.72 169.32 NaN NaN
6 UPST 2021-02-09 09:32:00 76.83 22.61 NaN NaN
7 UPST 2021-03-18 09:32:00 88.56 22.61 NaN NaN
I have another DataFrame with daily stock data as follows (df_day):
Symbol Time Close LOD 10MA 20MA
2722244 AEHR 2019-11-25 16:00:00 1.90 1.29 2.005 1.8870
2722289 AEHR 2019-11-26 16:00:00 1.92 1.29 2.032 1.8925
2722383 AEHR 2019-12-02 16:00:00 1.88 1.29 2.056 1.8985
2722435 AEHR 2019-12-03 16:00:00 1.88 1.29 2.046 1.8995
2722471 AEHR 2019-12-04 16:00:00 1.89 1.29 2.020 1.9055
2722569 AEHR 2019-12-06 16:00:00 1.93 1.29 1.993 1.9140
Based on the strategy, the first close must be at least 2 days after the buy date. The following line is where it fails:
df_filt2['price_FCU_10dMA'] = (df_buy['buy_price'] > df_day['20MA'])
This throws: ValueError: Can only compare identically-labeled Series objects
My full loop so far:
for i, row in df_buy.iterrows():
    # drop unlisted symbols
    filt1 = (df_day['Symbol'] != df_buy['Symbol'].loc[i])
    df_filt1 = df_day.drop(index=df_day[filt1].index)
    # drop trades from before buy date + 2
    filt2 = (pd.to_datetime(df_day['Time']) < (pd.to_datetime(df_buy['Time'].loc[i] + pd.to_timedelta(2, unit='d'))))
    df_filt2 = df_filt1.drop(index=df_filt1[filt2].index)
    # sort values
    df_filt2.sort_values(by=['Symbol', 'Time'], inplace=True)
    # drop rows where Close is above the 10MA
    df_filt2['price_FCU_10dMA'] = (df_buy['buy_price'] > df_day['20MA'])
    filt3 = (df_filt2['price_FCU_10dMA'] == False)
    # trail_sell = df_filt2[filt3].loc[10]
    df_buy['price_FCU_10dMA'].loc[i] = df_filt2.loc[filt3, 'Close'].iloc[0]  # the single value of the first close under the 10MA
    df_buy['date_FCU_10dMA'].loc[i] = df_filt2.loc[filt3, 'Time'].iloc[0]  # the date of the first close under the 10MA
You can combine all three conditions into a single boolean expression:
two_days = pd.to_timedelta('2d')
no_match = pd.Series({'Time': np.nan, 'Close': np.nan})
for i, row in df_buy.iterrows():
    cond = (
        (df_day['Symbol'] == row['Symbol']) &
        (df_day['Time'] > row['Time'] + two_days) &
        (df_day['Close'] < df_day['10MA'])
    )
    # first matching row as a Series, or the NaN placeholder when nothing matches
    match = no_match if not cond.any() else df_day[cond].iloc[0]
    # assign positionally so 'Close'/'Time' are not index-aligned against the target column names
    df_buy.loc[i, ['price_FCU_10dMA', 'date_FCU_10dMA']] = match[['Close', 'Time']].to_numpy()
I have been trying to access elements from the following pivot table using the pandas DataFrame .ix slicing notation; however, I am getting errors: No Key.
pivot = c.pivot("date","stock_name","close").resample("A",how="ohlc")
pt = pd.DataFrame(pivot,index=pivot.index.year)
pt
What is the correct way to slice out only one or more rows and or columns from a pandas pivot table?
For example if I just want the prices for the year 2016 for Billabong?
pivot["2016-12-31"]["BBG"]
You can use loc (see the docs):
print(c)
date stock_name close
0 2012-08-31 ibm 1
1 2013-08-31 aapl 1
2 2014-08-31 goog 1
3 2015-08-31 bhp 1
4 2016-08-31 bhp 1
pivot = c.pivot("date","stock_name","close").resample("A",how="ohlc")
print(pivot)
aapl bhp goog ibm \
open high low close open high low close open high low close open
date
2012-12-31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1
2013-12-31 1 1 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-12-31 NaN NaN NaN NaN NaN NaN NaN NaN 1 1 1 1 NaN
2015-12-31 NaN NaN NaN NaN 1 1 1 1 NaN NaN NaN NaN NaN
2016-12-31 NaN NaN NaN NaN 1 1 1 1 NaN NaN NaN NaN NaN
high low close
date
2012-12-31 1 1 1
2013-12-31 NaN NaN NaN
2014-12-31 NaN NaN NaN
2015-12-31 NaN NaN NaN
2016-12-31 NaN NaN NaN
print(pivot.loc["2014", ('goog', slice(None))])
goog
open high low close
date
2014-12-31 1 1 1 1
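Note that on current pandas versions resample no longer accepts a how argument and pivot requires keyword arguments; a roughly equivalent sketch (an assumption, not part of the original answer, reusing the same c frame) would be:
pivot = c.pivot(index="date", columns="stock_name", values="close").resample("A").ohlc()
print(pivot.loc["2014", "goog"])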
In my example, I create a DataFrame of late shipments, group by freight_cost_group, and get the value_counts(). My goal was to calculate the p-value and test the H0 and Ha outcomes. I used a pivot table and loc to access the result set.
data="""id country managed_by fulfill_via vendor_inco_term weight_kilograms freight_cost_usd freight_cost_groups line_item_insurance_usd freight_cost_group late
36203.0 Nigeria PMO-US Direct_Drop EXW 1426.0 33279.83 expensive 373.83 expensive Yes
30998.0 Botswana PMO-US Direct_Drop EXW 10.0 559.89 reasonable 1.72 reasonable No
69871.0 Vietnam PMO-US Direct_Drop EXW 3723.0 19056.13 expensive 181.57 expensive No
17648.0 South_Africa PMO-US Direct_Drop DDP 7698.0 11372.23 expensive 779.41 expensive No
5647.0 Uganda PMO-US Direct_Drop EXW 56.0 360.00 reasonable 0.01 reasonable No
13608.0 Uganda PMO-US Direct_Drop DDP 43.0 199.00 reasonable 12.72 reasonable No
80394.0 Congo_DRC PMO-US Direct_Drop EXW 99.0 2162.55 reasonable 13.10 reasonable No
61675.0 Zambia PMO-US Direct_Drop EXW 881.0 14019.38 expensive 210.49 expensive Yes
39182.0 South_Africa PMO-US Direct_Drop DDP 16234.0 14439.17 expensive 1421.41 expensive No
5645.0 Botswana PMO-US Direct_Drop EXW 46.0 1028.18 reasonable 23.04 reasonable No
"""
late_shipments = pd.read_csv(io.StringIO(data), sep=r'\s+', header=0, index_col=["id"])
#print(late_shipments.head)
#late_by_freight_cost_group = late_shipments.groupby("freight_cost_group")["late"].value_counts()
#results=(late_by_freight_cost_group.unstack(fill_value=0))
#print(results)
results=late_shipments.pivot_table(index=['freight_cost_group'], columns='late', aggfunc='size', fill_value=0)
success_expensive=results.loc["expensive"]["Yes"]
fail_expensive=results.loc["expensive"]["No"]
success_reasonable=results.loc["reasonable"]["Yes"]
fail_reasonable=results.loc["reasonable"]["No"]
success_counts = np.array([success_expensive, success_reasonable])
n = np.array([success_expensive + fail_expensive, success_reasonable + fail_reasonable])
from statsmodels.stats.proportion import proportions_ztest
stat, p_value = proportions_ztest(count=success_counts, nobs=n, alternative="larger")
print(stat, p_value)
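As a brief follow-up, the one-sided p-value can then be compared against a significance level to decide between H0 and Ha (a minimal sketch; the 5% level is an assumption, not part of the original example):
alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject H0: the late-shipment proportion is larger for expensive freight")
else:
    print("Fail to reject H0")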