When there is missing data in a pandas DataFrame, comparison-based indexing does not work as I would expect.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                   'b': [datetime(2010, 1, 1), datetime(2014, 1, 1)]})
df > datetime(2012, 1, 1)
works as expected:
a b
0 False False
1 True True
but if there is a missing value
none_df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                        'b': [datetime(2010, 1, 1), None]})
none_df > datetime(2012, 1, 1)
the selection returns all True
a b
0 True True
1 True True
Am I doing something wrong? Is this desired behavior?
Python 3.5 64bit, Pandas 0.18.0, Windows 10
I agree that the behavior is unusual.
Here is a workaround that applies the comparison column by column:
>>> none_df.apply(lambda col: col > datetime(2012, 1, 1))
a b
0 False False
1 True False
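Another option (a sketch, not part of the original answer) is to make the missing value an explicit NaT by converting the column with pd.to_datetime; comparisons against NaT return False, so the per-column comparison then behaves as expected:
>>> none_df['b'] = pd.to_datetime(none_df['b'])  # None becomes NaT
>>> none_df['b'] > datetime(2012, 1, 1)
0    False
1    False
Name: b, dtype: bool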
Related
I am trying to read a CSV with 1s and 0s and convert them to True and False. Because I have a lot of columns, I would like to use the true_values and false_values arguments, but I get:
ValueError: Must be all encoded bytes:
from io import StringIO
import numpy as np
import pandas as pd
pd.read_csv(StringIO("""var1, var2
0, 0
0, 1
1, 1
0, 0
0, 1
1, 0"""), true_values=[1],false_values=[0])
I cannot find the problem with the code that I wrote.
You don't need the true_values and false_values parameters. Use dtype instead:
>>> pd.read_csv(StringIO("""var1,var2
0,0
0,1
1,1
0,0
0,1
1,0"""), dtype={'var1': bool, 'var2': bool})
var1 var2
0 False False
1 False True
2 True True
3 False False
4 False True
5 True False
If your columns share the same prefix, use filter:
df = pd.read_csv(StringIO("""..."""))
cols = df.filter(like='var').columns
df[cols] = df[cols].astype(bool)
If your columns are consecutive, use iloc:
df = pd.read_csv(StringIO("""..."""))
cols = df.iloc[:, 0:2].columns
df[cols] = df[cols].astype(bool)
Auto-detection (cast every column that contains only 0 and 1):
m = df.min().eq(0) & df.max().eq(1)
df.loc[:, m] = df.loc[:, m].astype(bool)
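A hypothetical end-to-end example of the auto-detection (a sketch; the extra other column is made up for illustration). A column is cast only when its minimum is 0 and its maximum is 1, i.e. it contains nothing but zeros and ones:
>>> df = pd.read_csv(StringIO("""var1,var2,other
0,0,5
1,1,7"""))
>>> m = df.min().eq(0) & df.max().eq(1)   # True for the 0/1 columns only
>>> cols = df.columns[m]
>>> df[cols] = df[cols].astype(bool)
>>> df.dtypes
var1      bool
var2      bool
other    int64
dtype: object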
I have a df and I need to select rows based on some conditions in multiple columns.
Here is what I have
import pandas as pd
dat = [('p','q', 5), ('k','j', 2), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3), ('pkjq','q', 2)]
df = pd.DataFrame(dat, columns = ['a', 'b', 'c'])
df_dat = df[(df[['a','b']].isin(['k','p','q','j']) & df['c'] > 3)] | df[(~df[['a','b']].isin(['k','p','q','j']) & df['c'] > 2 )]
Expected result = [('p','q', 5), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3)]
The result I am getting is an all-False DataFrame.
When you have a complicated condition, I recommend building the conditions outside the slice. Note that & binds more tightly than >, so each comparison needs its own parentheses, and isin over two columns returns a DataFrame that has to be collapsed with any(axis=1):
cond1 = df[['a','b']].isin(['k','p','q','j']).any(axis=1) & df['c'].gt(3)
cond2 = (~df[['a','b']].isin(['k','p','q','j'])).any(axis=1) & df['c'].gt(2)
out = df.loc[cond1 | cond2]
a b c
0 p q 5
2 p - 5
3 - p 4
4 q pkjq 3
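For completeness, the same selection can also be written as a single expression (a sketch of the asker's one-liner with the precedence and any(axis=1) issues fixed):
out = df[(df[['a','b']].isin(['k','p','q','j']).any(axis=1) & df['c'].gt(3))
         | ((~df[['a','b']].isin(['k','p','q','j'])).any(axis=1) & df['c'].gt(2))]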
I have a dataframe with two integer columns that represent the start and end of a string of text. I'd like to group my rows by length of text (end - start), but with a margin of error of +- 5 characters so that something like this would happen:
start end
0 251
1 250
2 250
0 500
1 500
0 499
How would I achieve something like this?
Here is the code I am using right now:
import pandas as pd

d = {'text': ["aaa", "bbb", "ccc", "ddd", "eee", "fff"],
     'start': [0, 1, 0, 2, 1, 0],
     'end': [250, 500, 501, 251, 249, 499]}
df = pd.DataFrame(data=d)
df = df.groupby(['start', 'end'])
I ended up solving the problem by rounding the length of my text.
df['rounded_length'] = (df['end'] - df['start']).round(-1)
df = df.groupby('rounded_length')
All my values become multiples of 10, and I can group them this way.
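To illustrate with the sample data above (a sketch): the lengths 250, 499, 501, 249, 248 and 499 round to 250, 500, 500, 250, 250 and 500, so two groups of three rows each come out.
print(df.size())  # df is the GroupBy object from the line above
rounded_length
250    3
500    3
dtype: int64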
How can I get the difference between two pandas DataFrames with the same dimensions?
import pandas as pd
df1 = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e'],
    'y': [1, 1, 1, 1, 1],
    'z': [2, 2, 2, 2, 2]})
print(df1)
df2 = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e'],
    'y': [1, 1, 1, 1, 1],
    'z': [3, 3, 3, 3, 3]})
print(df2)
I would like the output delta DataFrame to be:
x y z
0 a 0 1
1 b 0 1
2 c 0 1
3 d 0 1
4 e 0 1
Set x as the common index, subtract and reset the index (pandas aligns on the index before any operation):
df2.set_index('x').sub(df1.set_index('x')).reset_index()
x y z
0 a 0 1
1 b 0 1
2 c 0 1
3 d 0 1
4 e 0 1
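If the two frames are guaranteed to be row-aligned already (same order of x), an equivalent sketch is to subtract only the numeric columns and keep x from either frame; it prints the same delta frame as above:
delta = df1[['x']].copy()
delta[['y', 'z']] = df2[['y', 'z']] - df1[['y', 'z']]
print(delta)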
import pandas as pd

date_0 = list(pd.date_range('2017-01-01', periods=6, freq='MS'))
date_1 = list(pd.date_range('2017-01-01', periods=8, freq='MS'))
data_0 = [9, 8, 4, 0, 0, 0]
data_1 = [9, 9, 0, 0, 0, 7, 0, 0]
id_0 = [0]*6
id_1 = [1]*8
df = pd.DataFrame({'ids': id_0 + id_1, 'dates': date_0 + date_1, 'data': data_0 + data_1})
For each id (here 0 and 1) I want to know how long is the series of zeros at the end of the time frame.
For the given example, the result is id_0 = 3, id_1 = 2.
So how do I limit the timestamps so that I can run something like this:
df.groupby('ids').agg('count')
First, label the consecutive runs of zeros: compare the values with their shifted values (not equal), take the cumulative sum, and zero out the labels where the data itself is non-zero.
Then count the size of each (ids, label) group, drop the helper level of the MultiIndex, and keep the last row per id with drop_duplicates and keep='last'; since the labels increase along the series, that last row is the final run of zeros:
s = df['data'].ne(df['data'].shift()).cumsum().mul(~df['data'].astype(bool))
df = (s.groupby([df['ids'], s]).size()
.reset_index(level=1, drop=True)
.reset_index(name='val')
.drop_duplicates('ids', keep='last'))
print(df)
ids val
1 0 3
4 1 2
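An alternative sketch (not part of the original answer): reverse each id's series and count the entries before the first non-zero value seen from the end, which is exactly the length of the trailing run of zeros:
trailing = df.groupby('ids')['data'].apply(
    lambda s: int(s[::-1].ne(0).astype(int).cummax().eq(0).sum()))
print(trailing)
ids
0    3
1    2
Name: data, dtype: int64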