Key error, parse error using *.agg method - pandas

Suppose I have a dataframe of some stock prices where the column 'Open' is the 0th column and 'Close' is the 3rd column. Suppose further that I want to find the maximum difference between the Close and Open prices. That can be done easily without using the agg method, but let me show the error I get when I use it.
def daily_value(df):
    df.iloc[:, 0] = df.iloc[:, 3] - df.iloc[:, 0]
    return df.max()

def daily_value(df):
    df['Open'] = df['Close'] - df['Open']
    return df.max()
Both versions replace the 0th column, namely 'Open', and return the maximum difference between Close and Open.
This works fine when I have df1 and I type daily_value(df1).
However, when I try df1.agg(daily_value), both versions fail. The first raises IndexingError: Too many indexers, while the latter raises KeyError: 'Close'.
How do I proceed if I indeed need to pass the function into *.agg method?
Thanks very much!

You need to provide the axis; an aggregation function call needs one. You should call your function with
df1.agg(daily_value, axis=1)

You don't have to use the agg method with a custom function. From what I understand, you are trying to get the maximum difference between two columns after creating a new column; for that you can simply use df.agg({'Column name': 'max'}).
See the example below:

1- Assign df to the return value instead of df.max().
2- Pass 'max' to agg, and call the function inside a print to check the result:
print(daily_value(df1).agg('max'))
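As a sketch of those two steps (the price data and values here are invented for illustration):

```python
import pandas as pd

# Hypothetical price data: 'Open' is the 0th column, 'Close' the 3rd,
# mirroring the question.
df1 = pd.DataFrame({
    "Open":  [10.0, 12.0, 11.0],
    "High":  [11.0, 13.0, 12.5],
    "Low":   [ 9.5, 11.5, 10.5],
    "Close": [10.5, 12.5, 11.8],
})

def daily_value(df):
    # Step 1: return the modified frame itself instead of df.max().
    df = df.copy()
    df['Open'] = df['Close'] - df['Open']
    return df

# Step 2: aggregate with 'max' afterwards; 'Open' now holds the
# per-row Close - Open difference, so its maximum is the answer.
result = daily_value(df1).agg('max')
print(result['Open'])
```

Note that df1.agg(daily_value) fails because agg hands the function one column (a Series) at a time by default, so the two-column arithmetic inside the function cannot work there.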


Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int  # column number

    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive columns work.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the ... operator, giving
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

DataFrame difference between where and query

I was able to solve a problem with pandas thanks to the answer provided in Grouping by with Where conditions in Pandas.
I was first trying to make use of the .where() function like the following:
df['X'] = df['Col1'].where(['Col1'] == 'Y').groupby('Z')['S'].transform('max').astype(int)
but got this error: ValueError: Array conditional must be same shape as self
By writing it like
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
it worked.
I'm trying to understand what the difference is, since I thought .where() would do the trick.
You have a typo in your first statement: .where(['Col1'] == 'Y') compares a single-element list with 'Y'. I think you meant to use .where(df['Col1'] == 'Y'); however, this will not work either, because you are filtering the dataframe down to just the 'Col1' column in front of the where method. This is what you really wanted to do, in my opinion:
df['X'] = df.where(df['Col1'] == 'Y').groupby('Z')['S'].transform('max')
This is equivalent to using
df['X'] = df.query('Col1 == "Y"').groupby('Z')['S'].transform('max').astype(int)
Also, note that the astype(int) is not going to do any good in either of these statements, because of a side effect in pandas: any column with an 'int' dtype that contains a NaN is automatically changed to 'float'.
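A small sketch of the difference, using invented data (the column names mirror the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["Y", "N", "Y", "N"],
    "Z":    ["a", "a", "b", "b"],
    "S":    [1, 2, 3, 4],
})

# .where keeps the original shape and masks non-matching rows with NaN;
# .query drops those rows entirely, and the shorter result is realigned
# on the index when assigned back to a column.
via_where = df.where(df["Col1"] == "Y").groupby("Z")["S"].transform("max")
via_query = df.query('Col1 == "Y"').groupby("Z")["S"].transform("max")

print(via_where.tolist())  # float dtype because of the NaNs
print(via_query.tolist())  # only the matching rows survive
```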

Count will not work for unique elements in my dataframe, only when repeated

I want to count the number of occurrences in a dataframe, and I need to do it using the following loop:
for x in homicides_prec.reset_index().DATE.drop_duplicates():
    count = homicides_prec.loc[x]['VICTIM_AGE'].count()
    print(count)
However, this only works when the specific date is repeated. It does not work when a date appears only once, and I don't understand why. I get this error:
TypeError: count() takes at least 1 argument (0 given)
That said, it really doesn't make sense to me, because I get that error for this specific value (which appears only once in the dataframe):
for x in homicides_prec.reset_index().DATE[49:50].drop_duplicates():
    count = homicides_prec.loc[x]['VICTIM_AGE'].count()
    print(count)
However, I don't get the error if I run this:
homicides_prec.loc[homicides_prec.reset_index().DATE[49:50].drop_duplicates()]['VICTIM_AGE'].count()
Why does that happen? I can't use the second option because I need to use the for loop.
More info, in case it helps: The problem seems to be that, when I run this (without counting), the output is just a number:
for x in homicides_prec.reset_index().DATE[49:50].drop_duplicates():
    count = homicides_prec.loc[x]['VICTIM_AGE']
print(count)
Output: 33
So, when I add the .count it will not accept that input. How can I fix this?
There are a few issues with the code you shared, but the short answer is that when x appears only once you are not getting a slice; rather, you are accessing a single value.
If x == '2019-01-01' and that value appears twice, then
homicides_prec.loc[x]
will be a pd.DataFrame with two rows, and
homicides_prec.loc[x]['VICTIM_AGE']
will be a pd.Series object with two rows, and it will happily take a .count() method.
However, if x == '2019-01-02' and that date is unique, then
homicides_prec.loc[x]
will be a pd.Series representing the row whose index is x.
From that we see that
homicides_prec.loc[x]['VICTIM_AGE']
is a single value, so .count() does not make sense.
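A minimal sketch of that distinction with an invented frame (the date index stands in for DATE, and the values are made up):

```python
import pandas as pd

homicides_prec = pd.DataFrame(
    {"VICTIM_AGE": [25, 30, 33]},
    index=["2019-01-01", "2019-01-01", "2019-01-02"],
)

dup = homicides_prec.loc["2019-01-01"]["VICTIM_AGE"]   # Series (label repeated)
uniq = homicides_prec.loc["2019-01-02"]["VICTIM_AGE"]  # scalar (label unique)

print(dup.count())  # 2
print(uniq)         # 33

# A loop-free way to count rows per date that handles both cases:
counts = homicides_prec.groupby(level=0)["VICTIM_AGE"].count()
print(counts.tolist())  # [2, 1]
```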

how to print by default the first n row for a pandas dataframe?

When printing a pandas dataframe, how can I print only the first n rows by default?
I find myself frequently doing df.head(10) to view the column names and the first couple of rows.
I would prefer that when I type df, it prints the first n rows by default instead of printing the whole df, in which case I cannot see the column names.
If I understand you correctly, you may set
pd.options.display.max_rows = 10
and whenever you evaluate just df in your notebook, only 10 rows will be displayed.
You can always set it back to the default value by doing
pd.reset_option('display.max_rows')
Check pd.describe_option('display') for more information.
Curry DataFrame.head using functools.partial.
from functools import partial
head10 = partial(pd.DataFrame.head, n=10)
Now you can either call the function passing your DataFrame as an argument,
head10(df)
Or, pass the function to df.pipe (which internally passes df as an argument to your function),
df.pipe(head10)
to get the first 10 rows by default.
The other option is to create a new class that extends DataFrame and add your own function (e.g., headXX) which internally calls df.head(n=10) and returns the result.
See the subclassing DataFrame section in the docs.
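A minimal sketch of that subclassing idea, assuming a hypothetical head10 helper (the class and method names here are invented):

```python
import pandas as pd

class HeadFrame(pd.DataFrame):
    # _constructor keeps pandas operations returning HeadFrame
    # instead of a plain DataFrame.
    @property
    def _constructor(self):
        return HeadFrame

    def head10(self):
        return self.head(n=10)

df = HeadFrame({"a": range(25)})
print(df.head10())  # shows only the first 10 rows
```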

How can I use `apply` with a function that takes multiple inputs

I have a function that takes multiple inputs, and I would like to use SFrame.apply to create a new column. I can't find a way to pass two arguments to SFrame.apply.
Ideally, it would take the entry in the column as the first argument, and I would pass in a second argument. Intuitively something like...
def f(arg_1, arg_2):
    return arg_1 + arg_2

sf['new_col'] = sf.apply(f, arg_2)
Suppose the first argument of function f is one of the columns, say argcolumn1 in sf. Then
sf['new_col'] = sf['argcolumn1'].apply(lambda x: f(x, arg_2))
should work.
Try this.
sf['new_col'] = sf.apply(lambda x: f(arg_1, arg_2))
The way I understand your question (and because none of the previous answers is marked as accepted), it seems to me that you are trying to apply a transformation using two different columns of a single SFrame, so:
As specified in the online documentation, the function you pass to the SFrame.apply method will be called for every row in the SFrame.
So you should rewrite your function to receive a single argument representing the current row, as follow:
def f(row):
    return row['column_1'] + row['column_2']

sf['new_col'] = sf.apply(f)
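The closure pattern from the earlier answer can be sketched with pandas standing in for SFrame (SFrame may not be installed; the names mirror the question):

```python
import pandas as pd

def f(arg_1, arg_2):
    return arg_1 + arg_2

sf = pd.DataFrame({"argcolumn1": [1, 2, 3]})
arg_2 = 10

# Fix the second argument in a lambda; apply passes each element as x.
sf["new_col"] = sf["argcolumn1"].apply(lambda x: f(x, arg_2))
print(sf["new_col"].tolist())  # [11, 12, 13]
```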