Replacing .loc method of pandas

When I try this statement, I get an error...
spy_daily.loc[fomc_events.index, "FOMC"] = spy_daily.loc[fomc_events.index, "days_since_fomc"]
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: DatetimeIndex(['2020-03-15', '2008-03-16'], dtype='datetime64[ns]', name='Date', freq=None).
Not sure how to correct it. The complete code is available here...
https://www.wrighters.io/analyzing-stock-data-events-with-pandas/

Try converting the index of the other dataframe to a list and using it to subset the first dataframe:
rows = fomc_events.index.tolist()
spy_daily.loc[rows, "FOMC"] = spy_daily.loc[rows, "days_since_fomc"]
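An alternative sketch, assuming the missing labels should simply be skipped: Index.intersection keeps only the dates present in both frames, so .loc never sees a missing label. The sample frames here are invented for illustration.

```python
import pandas as pd

# Invented sample data: one FOMC date (2020-03-15) is deliberately
# absent from spy_daily's index, which is what triggers the KeyError.
spy_daily = pd.DataFrame(
    {"days_since_fomc": [0, 1, 2], "FOMC": [pd.NA, pd.NA, pd.NA]},
    index=pd.to_datetime(["2020-03-13", "2020-03-16", "2020-03-17"]),
)
fomc_events = pd.DataFrame(
    index=pd.to_datetime(["2020-03-15", "2020-03-16"])
)

# Keep only labels that exist in both indexes, then assign as before.
rows = spy_daily.index.intersection(fomc_events.index)
spy_daily.loc[rows, "FOMC"] = spy_daily.loc[rows, "days_since_fomc"]
```

This sidesteps the error without converting anything to a list, at the cost of silently dropping dates that have no matching row.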

Related

I can only slice my pandas DataFrame with a slicing window such as [0:1] to get a specific row; why does df[0] raise a KeyError?

I have a dataframe df.
type(df) # pandas.core.frame.DataFrame
df[0] # KeyError 0
df[0:1] # Gives row 0 as expected
What is going on? I am sorry I come from an R background and have done some work with Python in the past but thought this was possible. What am I missing?
You're close! If you use .iloc it will return what you are expecting. For this case you would write
first_row = df.iloc[0]
When you do df[0] it is looking for a column named 0.
reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
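The three access patterns above can be sketched in one self-contained snippet (sample data invented):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})

# df[0] looks for a COLUMN named 0; the only column is "x", so it fails.
try:
    df[0]
except KeyError:
    print("KeyError: no column named 0")

first_row = df.iloc[0]   # positional row access -> returns a Series
first_slice = df[0:1]    # row slicing -> returns a one-row DataFrame
```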

DataFrame to DataFrameRow conversion (Julia)

I'm using Pingouin.jl to test normality.
In their docs, we have
dataset = Pingouin.read_dataset("mediation")
Pingouin.normality(dataset, method="jarque_bera")
Which should return a DataFrame with normality true or false for each name in the dataset.
Currently, this broadcasting is deprecated, and I'm unable to concatenate the result in one DataFrame for each unique-column-output (which is working and outputs a DataFrame).
So, what I have so far.
function var_norm(df)
    norm = DataFrame([])
    for i in 1:length(names(df))
        push!(norm, Pingouin.normality(df[!, names(df)[i]], method="jarque_bera"))
    end
    return norm
end
The error I get:
julia> push!(norm, Pingouin.normality(df[!,names(df)[1]], method="jarque_bera"))
ERROR: ArgumentError: `push!` does not allow passing collections of type DataFrame to be pushed into a DataFrame. Only `Tuple`, `AbstractArray`, `AbstractDict`, `DataFrameRow` and `NamedTuple` are allowed.
Stacktrace:
[1] push!(df::DataFrame, row::DataFrame; promote::Bool)
# DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1603
[2] push!(df::DataFrame, row::DataFrame)
# DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1601
[3] top-level scope
# REPL[163]:1
EDIT: push! function was not properly written at my first version of the post. But, the error persists after the change. How can I reformat the output of type DataFrame from Pingouin into DataFrameRow?
As Pingouin.normality returns a DataFrame, you will have to iterate over its rows and push them one by one:
df = Pingouin.normality(…)
for row in eachrow(df)
    push!(norms, row)
end
If you are sure Pingouin.normality returns a DataFrame with exactly one row, you can simply write
push!(norms, only(Pingouin.normality(…)))

SettingWithCopyWarning Error in Pandas when using groupby and transform

I get the warning
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
on my code df['mean'] = df.groupby('col1').transform('mean')
I want to add the calculation as a new column.
I understand the warning and have learned that I need to use .loc to solve the issue, but I don't know how to incorporate .loc into my current code.
Try following:
df.loc[:, 'mean'] = df.groupby('col1').transform('mean')
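A self-contained sketch of why the warning usually appears and how an explicit copy avoids it. The warning typically means df is itself a slice of another DataFrame; taking a copy before assigning makes ownership unambiguous. The sample data here is invented; the column names follow the question.

```python
import pandas as pd

source = pd.DataFrame({"col1": ["a", "a", "b"], "val": [1.0, 3.0, 5.0]})

# df is derived from a filter on `source`; without .copy() a later
# assignment may raise SettingWithCopyWarning.
df = source[source["val"] > 0].copy()

# Select the value column explicitly so transform returns a Series
# aligned with df's index.
df["mean"] = df.groupby("col1")["val"].transform("mean")
```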

KeyError: "None of [Index(['', ''], dtype='object')] are in the [columns]" when trying to select columns on a dask dataframe

I am creating a dask dataframe from a pandas dataframe using the from_pandas() function. When I try to select two columns from the dask dataframe using the square brackets [[ ]], I am getting a KeyError.
According to dask documentation, the dask dataframe supports the square bracket column selection like the pandas dataframe.
import dask.dataframe as ddf

# data is a pandas dataframe
dask_df = ddf.from_pandas(data, 30)
data = data[dask_df[['length', 'country']].apply(
    lambda x: myfunc(x, countries),
    meta=('Boolean'),
    axis=1
).compute()].reset_index(drop=True)
This is the error I am getting:
KeyError: "None of [Index(['length', 'country'], dtype='object')] are in the [columns]"
I was thinking that this might be something to do with providing the correct meta for the apply, but from the error it seems like the dask dataframe is not able to select the two columns, which should happen before the apply.
This works perfectly if I replace "dask_df" with "data" (the pandas df) in the apply line.
Is the index not being preserved when I am doing the from_pandas?
Try loading less data at once.
I had the same issue, but when I loaded only a subset of my data, it worked.
With the large dataset, I was able to run print(dask_df.columns) and see e.g.
Index(['apple', 'orange', 'pear'], dtype='object', name='fruit').
But when I ran dask_df.compute I would get KeyError: "None of [Index(['apple', 'orange', 'pear'], dtype='object')] are in the [columns]".
I knew that the data set was too big for my memory, and was trying dask hoping it would just figure it out for me =) I guess I have more work to do, but in any case I am glad to be in dask!
As the error states, the columns ['length', 'country']
do not exist in dask_df.
Create them first, then run your function.
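For reference, the pure-pandas version of the pattern (which the question reports works) can be sketched as follows. myfunc, the data, and the countries set are hypothetical stand-ins for the question's elided details:

```python
import pandas as pd

# Hypothetical stand-in for the question's myfunc: keep rows whose
# country is in an allowed set and whose length exceeds a threshold.
def myfunc(row, countries):
    return row["country"] in countries and row["length"] > 10

data = pd.DataFrame({
    "length": [5, 20, 30],
    "country": ["US", "US", "DE"],
})
countries = {"US"}

# Row-wise apply over the two columns yields a boolean mask,
# which then filters the original frame.
mask = data[["length", "country"]].apply(
    lambda x: myfunc(x, countries), axis=1
)
filtered = data[mask].reset_index(drop=True)
```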

Workaround for Pandas FutureWarning when sorting a DateTimeIndex

As described here, pandas' sort_index() sometimes emits a FutureWarning when sorting a DatetimeIndex. That question isn't actionable, since it contains no MCVE. Here's one:
import pandas as pd
idx = pd.DatetimeIndex(['2017-07-05 07:00:00', '2018-07-05 07:15:00','2017-07-05 07:30:00'])
df = pd.DataFrame({'C1':['a','b','c']},index=idx)
df = df.tz_localize('UTC')
df.sort_index()
The warning looks like:
FutureWarning: Converting timezone-aware DatetimeArray to
timezone-naive ndarray with 'datetime64[ns]' dtype
The stack (Pandas 0.24.1) is:
__array__, datetimes.py:358
asanyarray, numeric.py:544
nargsort, sorting.py:257
sort_index, frame.py:4795
The warning is emitted from datetimes.py, which requests that it be called with a dtype argument. However, there's no way to force that all the way up through nargsort -- it looks like obeying datetimes.py's request would require changes to both pandas and numpy.
Reported here. In the meantime, can you think of a workaround that I've missed?
Issue confirmed for the 0.24.2 milestone. Workaround is to filter the warning, thus:
import warnings

with warnings.catch_warnings():
    # Pandas 0.24.1 emits a useless warning when sorting a tz-aware index
    warnings.simplefilter("ignore")
    ds = df.sort_index()
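Putting the MCVE and the filter together as one runnable sketch (on newer pandas versions the warning no longer fires, but the filter is harmless):

```python
import warnings
import pandas as pd

# The MCVE from above: a tz-aware DatetimeIndex in unsorted order.
idx = pd.DatetimeIndex(['2017-07-05 07:00:00', '2018-07-05 07:15:00',
                        '2017-07-05 07:30:00'])
df = pd.DataFrame({'C1': ['a', 'b', 'c']}, index=idx).tz_localize('UTC')

# Suppress the (historical) FutureWarning only around the sort.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    ds = df.sort_index()
```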