I'm trying to match a datetime object in a Pandas DataFrame with the query method. Given this code
import datetime
import pandas as pd
search_time = datetime.datetime(2019, 10, 27, 0, 0, 6)
df = pd.DataFrame([[0, 0, datetime.datetime(2019, 10, 27, 0, 0, 0)],
[1, 0, search_time]],
columns=(['0', '1', 'datetime']))
df1 = df[df.datetime == search_time]
print(df1)
df2 = df.query('datetime == @search_time')
I want df1 and df2 to be equal. While df1 returns what I expect,
0 1 datetime
1 1 0 2019-10-27 00:00:06
df2 raises KeyError: False. How can I correct the query syntax?
The problem is that the column name datetime collides with the datetime object, so the solution is to rename it, e.g. to datetime1:
import datetime
import pandas as pd
search_time = datetime.datetime(2019, 10, 27, 0, 0, 6)
df = pd.DataFrame([[0, 0, datetime.datetime(2019, 10, 27, 0, 0, 0)],
[1, 0, search_time]],
columns=(['0', '1', 'datetime1']))
df1 = df[df.datetime1 == search_time]
print(df1)
0 1 datetime1
1 1 0 2019-10-27 00:00:06
df2 = df.query('datetime1 == @search_time')
print(df2)
0 1 datetime1
1 1 0 2019-10-27 00:00:06
It is also possible to rename it with pandas directly:
df = pd.DataFrame([[0, 0, datetime.datetime(2019, 10, 27, 0, 0, 0)],
[1, 0, search_time]],
columns=(['0', '1', 'datetime']))
df2 = df.rename(columns={'datetime':'datetime1'}).query('datetime1 == @search_time')
print(df2)
0 1 datetime1
1 1 0 2019-10-27 00:00:06
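If you want the result to keep the original column name, a small extension of the rename approach (a sketch, not the only way) is to rename the column back after the query:
df2 = (df.rename(columns={'datetime': 'datetime1'})
         .query('datetime1 == @search_time')
         .rename(columns={'datetime1': 'datetime'}))
print(df2)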
I have a dictionary with 2 DataFrames: "quantity variation in %" and "prices". They are both DataFrames of the same shape.
Let's say I want to set the price to 0 if the quantity variation in percentage is greater than 100 %:
import pandas as pd
d = {'qty_pct': pd.DataFrame({ '2020': [200, 0.5, 0.4],
'2021': [0.9, 0.5, 500],
'2022': [0.9, 300, 0.4]}),
'price': pd.DataFrame({ '2020': [-6, -2, -9],
'2021': [ 2, 3, 4],
'2022': [ 4, 6, 8]})}
# I had something like this in mind ...
df = d['price'].applymap(lambda x: 0 if x[d['qty_pct']] >=1 else x)
P.S. If by any chance there is a way to do this on DataFrames of different shapes, I would be curious to see how it's done.
Thanks,
I want to obtain this DF :
price = pd.DataFrame({'2020': [ 0, -2, -9],
'2021': [ 2, 3, 0],
'2022': [ 4, 0, 8]})
Assuming price and qty_pct always have the same dimensions, you can just do:
d['price'][d['qty_pct'] >= 1] = 0
d['price']
2020 2021 2022
0 0 2 4
1 -2 3 0
2 -9 0 8
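For the differently-shaped case from the P.S., here is one possible sketch (assuming you only want to zero out prices whose cell also exists in qty_pct): align qty_pct onto price first. reindex_like fills missing cells with NaN, and NaN >= 1 evaluates to False, so those prices are left untouched:
# align qty_pct onto price's index/columns, then zero out matches
aligned = d['qty_pct'].reindex_like(d['price'])
d['price'] = d['price'].mask(aligned >= 1, 0)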
I have the following problem. I have this df:
import pandas as pd

d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
I would like to have a new column where the value is 1 for every row of an id if value is 1 in any row with that id. See the desired output:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1], 'newvalue': [1, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
How can I do it please?
If you need to set 0/1 by a condition (here: at least one 1 per group), use GroupBy.transform with 'any' on a boolean mask, then cast the True/False values to 1/0 integers:
df['newvalue'] = df['value'].eq(1).groupby(df['id']).transform('any').astype(int)
Alternative:
df['newvalue'] = df['id'].isin(df.loc[df['value'].eq(1), 'id']).astype(int)
Or, if only 0 and 1 values are possible, simplify the solution by taking the maximum value per group:
df['newvalue'] = df.groupby('id')['value'].transform('max')
print(df)
id value newvalue
0 1 0 1
1 1 1 1
2 2 0 0
3 2 0 0
4 3 1 1
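As a quick check that transform('max') propagates the 1 to every row of a group, here is the same idea on a hypothetical extra id 4 with mixed values:
extra = pd.DataFrame({'id': [4, 4], 'value': [1, 0]})
out = pd.concat([df[['id', 'value']], extra], ignore_index=True)
out['newvalue'] = out.groupby('id')['value'].transform('max')
print(out)  # both rows of id 4 get newvalue 1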
How can I use pd.cut() in Dask?
Because the dataset is large, I am not able to fit it into memory before finishing the pd.cut().
Current code that is working in Pandas but needs to be changed to Dask:
import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
#Group by name and add columns sum (of amounts) and count (number of grouped rows)
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))
#Group by bins and aggregate sum and count over the grouped rows
df = df.groupby(pd.cut(df['name'],
                       bins=[0,4,8,100],
                       labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
print(df.head(15))
Output:
name sum count
0 namebin1 5 3
1 namebin2 9 2
2 namebin3 8 1
I tried:
import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
df['name'],
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))
This gives the error:
TypeError("cut() got multiple values for argument 'bins'",)
The reason you're seeing this error is that map_partitions passes each partition to pd.cut() as the first positional argument, so the extra df['name'] argument lands in the bins slot (see the docs).
You can wrap it in a custom function and call that instead, like so:
import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    # map_partitions passes each partition as the first argument;
    # bin the partition's "name" column with pd.cut
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
bins=[0,4,8,100],
labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
name sum count
namebin1 5 3
namebin2 9 2
namebin3 8 1
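One caveat: Dask infers the metadata of the map_partitions output by running the function on a dummy partition, which can occasionally fail or warn. If that happens, you can pass it explicitly (a sketch; the ('name', 'category') meta assumes pd.cut returns a categorical Series named name):
binned = df.map_partitions(custom_cut,
                           bins=[0,4,8,100],
                           labels=['namebin1', 'namebin2', 'namebin3'],
                           meta=('name', 'category'))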
I have:
df = pd.DataFrame(
[
[22, 33, 44],
[55, 11, 22],
[33, 55, 11],
],
index=["abc", "def", "ghi"],
columns=list("abc")
) # size(3,3)
and:
unique = pd.Series([11, 22, 33, 44, 55]) # length 5
then I create a new df based on unique and df, so that:
df_new = pd.DataFrame(index=unique, columns=df.columns) # size(5,3)
From this newly created df, I'd like to create a new boolean df based on unique and df, so that the end result is:
df_new = pd.DataFrame(
[
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
],
index=unique,
columns=df.columns
)
This new df is either true or false depending on whether the value is present in the original dataframe or not. For example, the first column has three values: [22, 55, 33]. In a df with dimensions (5,3), this first column would be: [0, 1, 1, 0, 1], i.e. [0, 22, 33, 0, 55].
I tried filter2 = unique.isin(df), but this doesn't work, and notnull didn't help either. I tried applying a filter, but the dimensions returned were incorrect. How can I do this?
Use DataFrame.stack with DataFrame.reset_index and DataFrame.pivot, then check for non-missing values with DataFrame.notna, cast to integers (True -> 1, False -> 0), and finally remove the index and column names with DataFrame.rename_axis:
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print(df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
The helper Series is not necessary, but if there are more values, or you need a particular order, use the helper Series and add DataFrame.reindex:
#added 66
unique = pd.Series([11, 22, 33, 44, 55, 66])
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .reindex(unique)
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print(df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
66 0 0 0
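An alternative, if you prefer to avoid the reshape entirely, is a membership check per column with Series.isin (a sketch built from the same unique helper; it produces the same output):
df_new = pd.DataFrame({c: unique.isin(df[c]).astype(int).to_numpy() for c in df.columns},
                      index=unique)
print(df_new)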
I am trying to do a conditional assignment to the rows of a specific column: target. I have done some research, and it seemed that the answer was given here: "How to do row processing and item assignment in dask".
I will reproduce my need with a mock data set:
x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))
The look of mock is:
In [4]: mock.head(7)
Out [4]:
speed target
0 200 3
1 300 0
2 400 3
3 215 4
4 219 0
5 360 0
6 280 0
Having this Pandas DataFrame, I convert it into a Dask DataFrame:
mock_dask = dd.from_pandas(mock, npartitions = 2)
I apply my conditional rule: all values in target above 0 must be 1, all others 0 (binarize target). Following the thread mentioned above, it should be:
result = mock_dask.target.where(mock_dask.target > 0, 1)
I have a look at the result dataset and it is not working as expected:
In [7]: result.head(7)
Out [7]:
0 3
1 1
2 3
3 4
4 1
5 1
6 1
Name: target, dtype: object
As we can see, the target column in mock and result does not match the expected result. It seems that my code is converting all original 0 values to 1, instead of converting the values greater than 0 to 1 (the conditional rule).
Dask newbie here, thanks in advance for your help.
OK, the documentation in the Dask DataFrame API is pretty clear. Thanks to @MRocklin's feedback I have realized my mistake. In the documentation, the where function (the last one in the list) is used with the following syntax:
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Thus, the correct code line would be:
result = mock_dask.target.where(mock_dask.target <= 0, 1)
This will output:
In [7]: result.head(7)
Out [7]:
0 1
1 0
2 1
3 1
4 0
5 0
6 0
Name: target, dtype: int64
Which is the expected output.
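Note that if the goal is simply to binarize, a more direct sketch that avoids thinking about where's keep/replace semantics altogether is a boolean comparison cast to integers:
result = (mock_dask.target > 0).astype(int)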
They seem to be the same to me
In [1]: import pandas as pd
In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
...: mock = pd.DataFrame(dict(target = x, speed = y))
...:
In [3]: import dask.dataframe as dd
In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)
In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
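The confusion in the question comes down to where(cond, other) keeping values where cond is True and replacing them where it is False, which a tiny pandas example makes visible:
s = pd.Series([3, 0, 3])
print(s.where(s > 0, 1))   # keeps the 3s, replaces the 0 -> 3, 1, 3
print(s.where(s <= 0, 1))  # keeps the 0, replaces the 3s -> 1, 0, 1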