pandas questions about argmin and Timestamp

final_month = pd.Timestamp('2018-02-01')
df_final_month = df[df['week'] >= final_month]
df_final_month.iloc[:, 1:].sum().argmax()
index = df.set_index('week')
index['storeC'].argmin()
The code above is correct; I just don't fully understand how it works internally. I have some questions:
1. The type of week is datetime. Is the reason for setting final_month as a Timestamp that datetime is almost the same as Timestamp, so they recognise each other in Python?
2. About argmax() and argmin(): for df_final_month.iloc[:, 1:].sum().argmax(), I removed sum() and tried df_final_month.iloc[:, 1:].argmax(), and it returns
`AttributeError: 'DataFrame' object has no attribute 'argmax'`
Why is that? Why doesn't the second snippet need a max() or something similar before calling argmin()? What is the requirement for using argmin()/argmax()?
Please explain the details of how Python or pandas handle these data; the more detail the better.
Thanks! I am new to Python.

Is Timestamp almost the same as datetime?
Here is a quote from the pandas documentation itself:
Timestamp is the pandas equivalent of python’s Datetime and is interchangeable with it in most cases
In fact, if you look at the source code of pandas, you will see that Timestamp actually inherits from datetime. Here is code to check that these statements are true:
import datetime
import pandas as pd

dt = datetime.datetime(2018, 1, 1)
ts = pd.Timestamp('2018-01-01')
dt == ts                            # True
isinstance(ts, datetime.datetime)   # True
Why does calling the argmax method on a DataFrame, without calling sum first, throw an error?
Because a DataFrame object doesn't have an argmax method; only Series does. And sum, in your case, returns a Series instance.
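To make that concrete, here is a minimal sketch with made-up store columns (not your actual data) showing that a column-wise sum of a DataFrame produces a Series, and that Series is what exposes argmax:
import pandas as pd

# Made-up data shaped like the question: a 'week' column followed by store columns
df_final_month = pd.DataFrame({
    'week': pd.to_datetime(['2018-02-05', '2018-02-12']),
    'storeA': [10, 20],
    'storeB': [5, 50],
})

totals = df_final_month.iloc[:, 1:].sum()   # column-wise sums -> a Series indexed by column name
print(type(totals))      # <class 'pandas.core.series.Series'>
print(totals.argmax())   # 1 -> the position of the largest total
print(totals.idxmax())   # 'storeB' -> the label of the largest total
Note that in recent pandas versions Series.argmax/argmin return integer positions, while idxmax/idxmin return the index labels (here the store names); in older versions argmax/argmin were aliases of idxmax/idxmin and returned the label, which is what the code in the question relies on.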

Related

Convert dataframe column from Object to numeric

Hello, I have a conversion question. I'm using some code to conditionally add a value to a new column in my dataframe (df). The new column ('new_col') is created with dtype object. How do I convert 'new_col' in the dataframe to float for aggregation in the code that follows? I'm new to Python and have tried several functions and methods. Any help would be greatly appreciated.
conds = [(df['sc1']=='UP_MJB'),(df['sc1']=='UP_MSCI')]
actions = [df['st1'],df['st2']]
df['new_col'] = np.select(conds,actions,default=df['sc1'])
I tried astype(float) and got a ValueError. I talked to a teammate and tried pd.to_numeric(np.select(conds, actions, default=df['sc1'])). That worked.
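For reference, here is a hedged, self-contained sketch of that approach with invented values (only the column names sc1, st1, st2 come from the question); errors='coerce' is an optional extra for the case where the default column contains strings that cannot be parsed as numbers:
import numpy as np
import pandas as pd

# Invented example data using the question's column names
df = pd.DataFrame({
    'sc1': ['UP_MJB', 'UP_MSCI', 'OTHER'],
    'st1': [1.5, 2.5, 3.5],
    'st2': [10.0, 20.0, 30.0],
})

conds = [(df['sc1'] == 'UP_MJB'), (df['sc1'] == 'UP_MSCI')]
actions = [df['st1'], df['st2']]

# np.select returns an object array here because the default is a string column,
# so pd.to_numeric converts it; errors='coerce' turns unparseable defaults into NaN
df['new_col'] = pd.to_numeric(np.select(conds, actions, default=df['sc1']), errors='coerce')
print(df['new_col'].dtype)   # float64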

Is there a built-in pandas method to compare a Series of Timestamps to an Interval

I have a series of Timestamps, and I want to test to see if they fall within a global Interval. I feel like there should be a natural pandas API to achieve what I have done here. The only way I could find is to use the between function:
df['end_date'].between(MY_INTERVAL.left, MY_INTERVAL.right)
but that misses out the subtleties around the closed_left or closed_right properties of the Interval. Is there something better?
Things I have tried:
df['end_date'].isin(MY_INTERVAL) # raises as it expects a collection as argument
pd.arrays.IntervalArray.from_arrays(df['end_date'], df['end_date'], closed='both').overlaps(MY_INTERVAL)
# works, but it feels backward to create an array of point intervals just for this!
Not sure it fits exactly your needs, but I would do:
s = df['end_date']
s.where(s >= '2020-01-09').where('2020-02-15' > s)
or
df.query("'2020-01-02' <= end_date < '2020-01-17'")
df.query("#lower_bound <= end_date < #upper_bound")
to directly select the data in my interval.
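As a hedged, self-contained illustration (the data and interval below are invented, not from the question): the @-prefixed names in query() refer to Python variables in the calling scope, so the bounds can be pulled from the Interval first. Note that the comparison operators still have to be chosen by hand to match the interval's closed sides:
import pandas as pd

# Invented example: an end_date column and an interval closed on the left only
df = pd.DataFrame({'end_date': pd.to_datetime(['2020-01-01', '2020-01-10', '2020-02-20'])})
MY_INTERVAL = pd.Interval(pd.Timestamp('2020-01-09'), pd.Timestamp('2020-02-15'), closed='left')

lower_bound, upper_bound = MY_INTERVAL.left, MY_INTERVAL.right
print(df.query("@lower_bound <= end_date < @upper_bound"))   # keeps only the 2020-01-10 row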
It looks like there is no specific pandas API for this, but one way that respects the closedness of the interval is:
df['end_date'].map(MY_INTERVAL.__contains__)
Thanks to nmay123 here: https://github.com/pandas-dev/pandas/issues/40243
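As a small, invented sketch (the dates below are mine, not from the question) showing that Interval.__contains__ honours the closed sides; note this is element-wise Python rather than a vectorized operation:
import pandas as pd

# Interval closed on the left only, so the right endpoint is excluded
MY_INTERVAL = pd.Interval(pd.Timestamp('2020-01-09'), pd.Timestamp('2020-02-15'), closed='left')
end_dates = pd.Series(pd.to_datetime(['2020-01-09', '2020-02-15', '2020-03-01']))

mask = end_dates.map(MY_INTERVAL.__contains__)
print(mask.tolist())   # [True, False, False] -- left endpoint in, right endpoint out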

Koalas GroupBy > Apply > Lambda > Series

I am trying to port some code from Pandas to Koalas to take advantage of Spark's distributed processing. I am taking a dataframe and grouping it on A and B and then applying a series of functions to populate the columns of the new dataframe. Here is the code that I was using in Pandas:
new = old.groupby(['A', 'B']) \
.apply(lambda x: pd.Series({
'v1': x['v1'].sum(),
'v2': x['v2'].sum(),
'v3': (x['v1'].sum() / x['v2'].sum()),
'v4': x['v4'].min()
})
)
I believe that it is working well and the resulting dataframe appears to be correct value-wise.
I just have a few questions:
Does this warning mean that my method will be deprecated in the future?
/databricks/spark/python/pyspark/sql/pandas/group_ops.py:76: UserWarning: It is preferred to use 'applyInPandas' over this API. This API will be deprecated in the future releases. See SPARK-28264 for more details.
How can I rename the group-by columns to 'A' and 'B' instead of "__groupkey_0__ __groupkey_1__"?
As you noticed, I had to call pd.Series -- is there a way to do this in Koalas? Calling ks.Series gives me the following error, which I am unsure how to resolve:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Thanks for any help that you can provide!
I'm not sure about the error. I am using koalas==1.2.0 and pandas==1.0.5 and I don't get the error, so I wouldn't worry about it.
The groupby columns are already called A and B when I run the code. This again may have been a bug which has since been patched.
For this you have 3 options:
Keep utilising pd.Series. As long as your original DataFrame is a koalas DataFrame, your output will also be a koalas DataFrame (with the pd.Series automatically converted to ks.Series)
Keep the function and the data exactly the same and just convert the final dataframe to koalas using the from_pandas function (see the sketch after the code below)
Do the whole thing in koalas. This is slightly more tricky because you are computing an aggregate column based on two GroupBy columns, and koalas doesn't support lambda functions as a valid aggregation. One way we can get around this is by computing the other aggregations together and adding the multi-column aggregation afterwards:
import databricks.koalas as ks
ks.set_option('compute.ops_on_diff_frames', True)
# Dummy data
old = ks.DataFrame({"A":[1,2,3,1,2,3], "B":[1,2,3,3,2,3], "v1":[10,20,30,40,50,60], "v2":[4,5,6,7,8,9], "v4":[0,0,1,1,2,2]})
new = old.groupby(['A', 'B']).agg({'v1':'sum', 'v2':'sum', 'v4': 'min'})
new['v3'] = old.groupby(['A', 'B']).apply(lambda x: x['v1'].sum() / x['v2'].sum())
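And here is a hedged sketch of option 2 from the list above, with small invented data; the groupby/apply is the question's own code and only the ks.from_pandas conversion at the end is the point being shown:
import pandas as pd
import databricks.koalas as ks

# Invented pandas data with the question's column names
old_pd = pd.DataFrame({"A": [1, 2], "B": [1, 2], "v1": [10, 20], "v2": [4, 5], "v4": [0, 1]})

# Run the original pandas groupby/apply unchanged ...
new_pd = old_pd.groupby(['A', 'B']).apply(lambda x: pd.Series({
    'v1': x['v1'].sum(),
    'v2': x['v2'].sum(),
    'v3': x['v1'].sum() / x['v2'].sum(),
    'v4': x['v4'].min(),
}))

# ... then convert the finished result into a koalas DataFrame
new = ks.from_pandas(new_pd)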

Parse Datetime in Pandas Dataframe

I have a checkout column in a DataFrame of type 'object' in the '2017-08-04T23:31:19.000+02:00' format, but I want it split into separate date and time columns, in the format shown in the image.
Can anyone help me, please?
Thank you :)
You should be able to convert the object column to a datetime column, then use the built-in date and time accessors.
# create an intermediate column that we won't store on the DataFrame
checkout_as_datetime = pd.to_datetime(df['checkout'])
# Add the desired columns to the dataframe
df['checkout_date'] = checkout_as_datetime.dt.date
df['checkout_time'] = checkout_as_datetime.dt.time
Though, if your goal isn't to write these specific new columns out somewhere, but to use them for other calculations, it may be simpler to just overwrite your original column and use the datetime methods from there.
df['checkout'] = pd.to_datetime(df['checkout'])
df['checkout'].dt.date # to access the date
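The exact target format in the image isn't visible here, so purely as an assumed example: if the goal is formatted strings rather than date/time objects, dt.strftime can produce them (adjust the format codes to whatever the image actually shows):
import pandas as pd

# Invented example value in the question's format
df = pd.DataFrame({'checkout': ['2017-08-04T23:31:19.000+02:00']})
checkout_as_datetime = pd.to_datetime(df['checkout'])

df['checkout_date'] = checkout_as_datetime.dt.strftime('%Y-%m-%d')   # e.g. '2017-08-04'
df['checkout_time'] = checkout_as_datetime.dt.strftime('%H:%M:%S')   # e.g. '23:31:19'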
I haven't tested this, but something along the lines of:
df['CheckOut_date'] = pd.to_datetime(df["CheckOut_date"].dt.strftime('%Y-%m-%d'))
df['CheckOut_time'] = pd.to_datetime(df["CheckOut_time"].dt.strftime('%H:%M:%S'))

TypeError: <class 'datetime.time'> is not convertible to datetime

The problem is somewhat simple. My objective is to compute the days difference between two dates, say A and B.
These are my attempts:
df['daydiff'] = df['A']-df['B']
df['daydiff'] = ((df['A']) - (df['B'])).dt.days
df['daydiff'] = (pd.to_datetime(df['A'])-pd.to_datetime(df['B'])).dt.days
These worked for me before, but for some reason I keep getting this error this time:
TypeError: <class 'datetime.time'> is not convertible to datetime
When I export the df to Excel, the dates look just fine. Any thoughts?
Use pd.Timestamp to handle the awkward differences in your formatted times.
df['A'] = df['A'].apply(pd.Timestamp) # will handle parsing
df['B'] = df['B'].apply(pd.Timestamp) # will handle parsing
df['day_diff'] = (df['A'] - df['B']).dt.days
Of course, if you don't want to change the format of the df['A'] and df['B'] within the DataFrame that you are outputting, you can do this in a one-liner.
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
This will give you the days between as an integer.
When I applied the solution offered by emmet02, I got TypeError: Cannot convert input [00:00:00] of type ... as well. It's basically saying that the dataframe contains missing timestamp values which are represented as [00:00:00], and this value is rejected by the pandas.Timestamp function.
To address this, simply apply a suitable missing-value strategy to clean your data set, before using
df.apply(pd.Timestamp)
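As a minimal, invented sketch of one possible missing-value strategy (the data and the isinstance check below are mine, not from the answer, and pd.to_datetime is used for the final conversion instead of apply(pd.Timestamp) because it maps the missing entries straight to NaT):
import datetime
import pandas as pd

# Invented example: column B contains a stray datetime.time value standing in for a missing date
df = pd.DataFrame({
    'A': [datetime.datetime(2021, 3, 10), datetime.datetime(2021, 3, 15)],
    'B': [datetime.datetime(2021, 3, 1), datetime.time(0, 0)],
})

# Treat values that are not real dates as missing before converting
df['B'] = df['B'].where(df['B'].map(lambda v: not isinstance(v, datetime.time)))

df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
df['daydiff'] = (df['A'] - df['B']).dt.days   # NaN where the date was missing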