How to convert a dataframe of Timestamps to a numpy list of Timestamps - pandas

I have a dataframe like the following:
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp1": [
            pd.Timestamp("2021-01-01"),
            pd.Timestamp("2021-03-01"),
        ],
        "timestamp2": [
            pd.Timestamp("2022-01-01"),
            pd.Timestamp("2022-03-01"),
        ],
    }
)
I want to convert this to a numpy array of Timestamp objects, so I get something like the following:
array([[Timestamp('2021-01-01 00:00:00'),
        Timestamp('2022-01-01 00:00:00')],
       [Timestamp('2021-03-01 00:00:00'),
        Timestamp('2022-03-01 00:00:00')]], dtype=object)
I have tried df.to_numpy() but this doesn't seem to work as each item is a numpy.datetime64 object.

In [176]: df
Out[176]:
  timestamp1 timestamp2
0 2021-01-01 2022-01-01
1 2021-03-01 2022-03-01
I don't know much about pd.Timestamp, but the values are actually stored just as to_numpy() returns them, as numpy.datetime64[ns]:
In [179]: df.dtypes
Out[179]:
timestamp1 datetime64[ns]
timestamp2 datetime64[ns]
dtype: object
An individual column, a Series, has a tolist() method:
In [190]: df['timestamp1'].tolist()
Out[190]: [Timestamp('2021-01-01 00:00:00'), Timestamp('2021-03-01 00:00:00')]
That's why @jezrael's answer works:
In [191]: arr = np.array([list(df[x]) for x in df.columns])
In [192]: arr
Out[192]:
array([[Timestamp('2021-01-01 00:00:00'),
        Timestamp('2021-03-01 00:00:00')],
       [Timestamp('2022-01-01 00:00:00'),
        Timestamp('2022-03-01 00:00:00')]], dtype=object)
Once you have an array, you can easily transpose it:
In [193]: arr.T
Out[193]:
array([[Timestamp('2021-01-01 00:00:00'),
        Timestamp('2022-01-01 00:00:00')],
       [Timestamp('2021-03-01 00:00:00'),
        Timestamp('2022-03-01 00:00:00')]], dtype=object)
An individual Timestamp object can be converted/displayed in various ways:
In [196]: x=arr[0,0]
In [197]: type(x)
Out[197]: pandas._libs.tslibs.timestamps.Timestamp
In [198]: x.to_datetime64()
Out[198]: numpy.datetime64('2021-01-01T00:00:00.000000000')
In [199]: x.to_numpy()
Out[199]: numpy.datetime64('2021-01-01T00:00:00.000000000')
In [200]: x.to_pydatetime()
Out[200]: datetime.datetime(2021, 1, 1, 0, 0)
In [201]: print(x)
2021-01-01 00:00:00
In [202]: repr(x)
Out[202]: "Timestamp('2021-01-01 00:00:00')"
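If all you want is an object array of Timestamps in the original row order, a shortcut worth knowing (a sketch, not part of the answers above) is casting the frame to object before calling to_numpy(), which makes pandas box each value as a Timestamp:

# cast to object dtype first so to_numpy() yields Timestamp objects, not datetime64
df.astype(object).to_numpy()
# array([[Timestamp('2021-01-01 00:00:00'), Timestamp('2022-01-01 00:00:00')],
#        [Timestamp('2021-03-01 00:00:00'), Timestamp('2022-03-01 00:00:00')]], dtype=object)

This skips both the list comprehension and the transpose.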

Use a list comprehension to convert the column values to lists, then build a numpy array:
print (np.array([list(df[x]) for x in df.columns]))
[[Timestamp('2021-01-01 00:00:00') Timestamp('2021-03-01 00:00:00')]
[Timestamp('2022-01-01 00:00:00') Timestamp('2022-03-01 00:00:00')]]

Related

Select rows based on multiple columns that are later than a certain date

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
# create 5 consecutive daily dates starting at '2021-07-29'
rng = pd.date_range('2021-07-29', periods=5, freq='D')
rng_1 = pd.date_range('2021-07-30', periods=5, freq='D')
rng_2 = pd.date_range('2021-07-31', periods=5, freq='D')
df_status = ['received', 'send', 'received', 'send', 'send']
df = pd.DataFrame({ 'Date': rng, 'Date_1': rng_1, 'Date_2': rng_2, 'status': df_status })
print(df)
I would like to print all rows where at least one column contains a date equal to or later than 2021-08-01. What would be the most effective way to do this?
I have tried to do this with the following code, however, I get the following error:
start_date = '2022-08-01'
start_date = pd.to_datetime(start_date, format="%Y-%m-%d")
mask = (df['Date'] >= start_date | df['Date_1'] >= start_date | df['Date_3'] >= start_date)
TypeError: unsupported operand type(s) for &: 'Timestamp' and 'DatetimeArray'
Thank you in advance.
Adjusted dataframe:
df = {'sample_received': {1: np.nan,
                          2: np.nan,
                          3: '2022-08-01 20:31:24',
                          4: '2022-08-01 20:25:45',
                          5: '2022-08-01 20:41:22'},
      'result_received': {1: '2022-08-01 16:25:33',
                          2: '2022-08-01 13:25:36',
                          3: '2022-08-02 09:45:34',
                          4: '2022-08-02 09:52:59',
                          5: '2022-08-02 08:22:45'},
      'status': {1: 'Approved',
                 2: 'Approved',
                 3: 'Approved',
                 4: 'Approved',
                 5: 'Approved'}}
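In the adjusted dataframe the timestamps are strings with gaps, so the date columns should be parsed before any comparison. A minimal sketch, assuming the dict above is still bound to the name df:

adjusted = pd.DataFrame(df)
date_cols = ['sample_received', 'result_received']
adjusted[date_cols] = adjusted[date_cols].apply(pd.to_datetime)  # np.nan parses to NaT
# NaT compares as False, so a row with a missing sample date is kept
# only if its other date passes the threshold
adjusted[adjusted[date_cols].ge('2022-08-01').any(axis=1)]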
Use boolean indexing with any (on a frame that contains only the date columns):
df[df.ge('2021-08-01').any(axis=1)]
output:
        Date     Date_1     Date_2
1 2021-07-30 2021-07-31 2021-08-01
2 2021-07-31 2021-08-01 2021-08-02
3 2021-08-01 2021-08-02 2021-08-03
4 2021-08-02 2021-08-03 2021-08-04
intermediate:
df.ge('2021-08-01').any(axis=1)
0 False
1 True
2 True
3 True
4 True
dtype: bool
To use only the date columns, filter by name (Date in the column name):
df[df.filter(like='Date').ge('2021-08-01').any(axis=1)]
or filter by type:
df[df.select_dtypes('datetime64[ns]').ge('2021-08-01').any(axis=1)]
You may also use any inside apply, restricted to the datetime columns (comparing the status strings with a Timestamp would raise a TypeError):
date_cols = df.select_dtypes('datetime64[ns]').columns
df[df.apply(lambda x: any(x[col] >= pd.to_datetime('2021-08-01') for col in date_cols), axis=1)]
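For completeness, the error in the question comes from operator precedence: | binds more tightly than >=, so each comparison needs its own parentheses (note also that the frame has Date_2, not Date_3). A minimal corrected version of the attempted mask:

# parenthesize each comparison so | combines boolean Series, not raw operands
mask = (df['Date'] >= start_date) | (df['Date_1'] >= start_date) | (df['Date_2'] >= start_date)
df[mask]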

pandas: preserve dtypes saved to feather format

My understanding is that the feather format's advantage is that it preserves dtypes. So I expected the object dtype of the state column to be preserved, but it's not. Why? Is there a way around this?
import sys
import pandas
from pandas import Timestamp
print(pandas.__version__)
## 1.3.4
print(sys.version)
## 3.9.7 (default, Sep 16 2021, 08:50:36)
## [Clang 10.0.0 ]
d = pandas.DataFrame({
    'Date': {0: Timestamp('2020-12-01 00:00:00'), 1: Timestamp('2020-11-01 00:00:00'),
             2: Timestamp('2020-10-01 00:00:00'), 3: Timestamp('2020-09-01 00:00:00'),
             4: Timestamp('2020-08-01 00:00:00')},
    'state': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'value': {0: 3.1, 1: 3.4, 2: 3.9, 3: 5.9, 4: 6.4},
})
d.dtypes
# Date datetime64[ns]
# state int64
# value float64
# dtype: object
d["state"] = d["state"].astype(object)
d.dtypes
# Date datetime64[ns]
# state object
# value float64
# dtype: object
d.to_feather("test.feather")
d = pandas.read_feather("test.feather")
d.dtypes
# Date datetime64[ns]
# state int64
# value float64
# dtype: object
I want state to be a "string" or "object", but not an "int64". I don't want to have to recast every time I load the dataframe. Thanks!
A while back Quang Hoang suggested in the comments that the following works:
d["state"] = d["state"].astype(str)
I have no explanation to offer. I'll be happy to select any other, better answer.
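A likely reason (an inference, not from the original answer): Arrow has no generic "object" type, so pyarrow infers a concrete type from the values. An object column holding Python ints is written as int64, while genuine strings are written as an Arrow string column and come back with object dtype. A quick round-trip check, as a sketch:

d["state"] = d["state"].astype(str)
d.to_feather("test.feather")
d2 = pandas.read_feather("test.feather")
print(d2.dtypes["state"])    # object: the strings survive the round trip
print(type(d2["state"][0]))  # str, not int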

How to assign a new column after groupby in pandas

I want to groupby my data and create a new column assignment.
Given the following data frame
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6]})
df['col3']=df[['col1','col2']].groupby('col1').rolling(2).mean().reset_index()
Expected output: pd.DataFrame({'col1': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'], 'col2': [1, 2, 3, 4, 5, 6], 'col3': [np.nan, 1.5, 2.5, np.nan, 4.5, 5.5]})
However, this does not work. Is there a straightforward way to do it?
A combination of groupby, apply and assign:
df.groupby('col1', as_index=False).apply(
    lambda g: g.assign(col3=g['col2'].rolling(2).mean())
).reset_index(drop=True)
output:
  col1  col2  col3
0   x1     1   NaN
1   x1     2   1.5
2   x1     3   2.5
3   x2     4   NaN
4   x2     5   4.5
5   x2     6   5.5
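A hedged alternative that avoids apply: groupby().rolling() returns a Series whose MultiIndex carries the group key in its first level, so dropping that level lets the result align back onto the original frame. A sketch, assuming rows keep their original order within each group:

# drop the group-key level so the index matches df's and assignment aligns
df['col3'] = df.groupby('col1')['col2'].rolling(2).mean().droplevel(0)

This is close to what the question attempted; the reset_index() there turns the result into a DataFrame, which is why the assignment fails.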

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical column, with the index set to time data:
df = pd.DataFrame({
    'date': ['2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02',
             '2013-03-01', '2013-03-02'],
    'Kind': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Values': [1, 1.5, 2, 3, 5, 3],
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
            Kind  Values
date
2013-03-01     A     1.0
2013-03-02     B     1.5
2013-03-01     A     2.0
2013-03-02     B     3.0
2013-03-01     B     5.0
2013-03-02     B     3.0
My aim is to achieve the data-frame below:
            A_count  B_count  A_Val max  B_Val max
date
2013-03-01        2        1          2          5
2013-03-02        0        3          0          3
which also has the time as index. Note that if we use
pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
                 Kind
date       Kind
2013-03-01 A        2
           B        1
2013-03-02 B        3
Use DataFrame.pivot_table and flatten the MultiIndex columns with a list comprehension. Starting from the question's dataframe (setting 'date' as the index beforehand can be omitted, since pivot_table accepts index='date' either way):
out = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count', 'max'])
out.columns = [f'{b}_{a}' for a, b in out.columns]
print(out)
            A_count  B_count  A_max  B_max
date
2013-03-01      2.0      1.0    2.0    5.0
2013-03-02      NaN      3.0    NaN    3.0
Another solution with Grouper, resampling by days (this uses the datetime index set in the question):
out = df.groupby([pd.Grouper(freq='d'), 'Kind'])['Values'].agg(['count', 'max']).unstack()
out.columns = [f'{b}_{a}' for a, b in out.columns]
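The asker's target table shows 0 rather than NaN for the missing combinations; a small finishing step for either solution (a sketch, assuming the flattened column names produced above):

out = out.fillna(0)  # NaN -> 0 as in the target table
out[['A_count', 'B_count']] = out[['A_count', 'B_count']].astype(int)  # counts as integers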

Pandas apply to index

I have a df that has some columns and a MultiIndex whose levels are bytes. To clean the columns I can do:
for c in df.columns:
df[c] = df[c].apply(lambda x: x.decode('UTF-8'))
and for a single (flat) index this works, as long as the result is assigned back:
df.index = df.index.map(lambda x: x.decode('UTF-8'))
But it appears to fail with multi-index. Is there anything similar I can do for the multi-index?
EDIT:
example
pd.DataFrame.from_dict({'val': {(b'A', b'a'): 1,
                                (b'A', b'b'): 2,
                                (b'B', b'a'): 3,
                                (b'B', b'b'): 4,
                                (b'B', b'c'): 5}})
and the desired output is
pd.DataFrame.from_dict({'val': {('A', 'a'): 1,
                                ('A', 'b'): 2,
                                ('B', 'a'): 3,
                                ('B', 'b'): 4,
                                ('B', 'c'): 5}})
Method 1:
df.index = pd.MultiIndex.from_tuples([(x[0].decode('utf-8'), x[1].decode('utf-8')) for x in df.index])
%timeit result: 1000 loops, best of 3: 573 µs per loop
Method 2:
df.reset_index().set_index('val').applymap(lambda x: x.decode('utf-8')).reset_index().set_index(['level_0', 'level_1'])
%timeit result: 100 loops, best of 3: 4.17 ms per loop
Method 3, decoding each level (MultiIndex.levels cannot be assigned to directly, so use set_levels and reassign the index):
df.index = df.index.set_levels([level.map(lambda x: x.decode('UTF-8')) for level in df.index.levels])
OUTPUT:
     val
A a    1
  b    2
B a    3
  b    4
  c    5
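If the number of index levels is not known in advance, a small hypothetical helper (decode_index is not part of pandas) covers both the flat and the MultiIndex case:

import pandas as pd

def decode_index(idx):
    # decode every level of a MultiIndex, or a flat Index of bytes
    if isinstance(idx, pd.MultiIndex):
        return idx.set_levels(
            [level.map(lambda x: x.decode('UTF-8')) for level in idx.levels]
        )
    return idx.map(lambda x: x.decode('UTF-8'))

df.index = decode_index(df.index)  # map/set_levels return new objects, so reassign

For the columns, df[c].str.decode('UTF-8') is a vectorized alternative to the apply loop in the question.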