Pandas Boolean selection updated?

I was used to getting a single number that told me how many cases are True for either of the conditions in the code. However, since I ran conda update all, I now get a list of values that are either 0 or 1. I wonder what the simplest way in pandas is to get this done now. I guess this is due to a pandas update, but a Google search did not turn up any change to boolean indexing. What is the easiest way to get this sum of booleans? (I know how to get it, but I cannot imagine that this extra step is required.)
import pandas as pd
import numpy as np
x = np.random.randint(10,size=10)
y = np.random.randint(10,size=10)
d ={}
d['x'] = x
d['y'] = y
df = pd.DataFrame(d)
sum([df['x']>=6] or [df['y']<=3])

You need to use the vectorized OR operator, |:
(df.x.ge(6) | df.y.le(3)).sum()
# 9
Or equivalently: ((df.y <= 3) | (df.x >= 6)).sum() or sum((df.y <= 3) | (df.x >= 6)).
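For context, the original expression never looks at the second condition at all: [df['x']>=6] or [df['y']<=3] is a plain Python or between two non-empty lists, so it simply returns the first list, and sum(...) then adds 0 to that single boolean Series, which is why you see a Series of 0/1 values instead of a count. A minimal sketch of the difference (variable names are just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randint(10, size=10),
                   'y': np.random.randint(10, size=10)})

# `or` on two non-empty lists returns the first list unchanged, so only the
# x-condition survives and sum() yields an int Series of 0s and 1s
broken = sum([df['x'] >= 6] or [df['y'] <= 3])

# element-wise | combines both conditions; .sum() counts the True rows
count = (df['x'].ge(6) | df['y'].le(3)).sum()
print(broken)
print(count)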

Pandas rounding

I have the following sample dataframe:
Market Value
0 282024800.37
1 317460884.85
2 1260854026.24
3 320556927.27
4 42305412.79
I am trying to round the values in this dataframe to the nearest whole number. Desired output:
Market Value
282024800
317460885
1260854026
320556927
42305413
I tried:
df.values.round()
and the result was
Market Value
282025000.00
317461000.00
1260850000.00
320557000.00
42305400.00
What am I doing wrong?
Thanks
This might be more appropriate posted as a comment, but I'm putting it here for proper formatting.
I can't reproduce your result. With numpy 1.18.1 and pandas 1.1.0,
df.round().astype('int')
gives me:
Market Value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413
The only thing I can think of is that you may have a 32 bit system, where
df.astype('float32').round().astype('int')
gives me
Market Value
0 282024800
1 317460896
2 1260854016
3 320556928
4 42305412
The following will keep your underlying data intact as floats but will have it display/print to the nearest int.
Big caveat: this applies to ALL dataframes at once (it is a pandas-wide option) rather than just a single dataframe.
pd.set_option("display.precision", 0)
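A small usage sketch of that option (with a toy dataframe; pd.reset_option restores the default afterwards so other dataframes are not affected):
import pandas as pd

df = pd.DataFrame({'Market Value': [282024800.37, 317460884.85, 42305412.79]})

pd.set_option("display.precision", 0)
print(df)                     # displayed to the nearest int
print(df['Market Value'][0])  # the stored value is still 282024800.37

pd.reset_option("display.precision")  # back to the default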
If you like @noah's solution but don't want to have to change the options back after you output something, you can use the following helper context manager:
import pandas as pd
from contextlib import contextmanager

@contextmanager
def temp_pandas_options(options):
    seen_options = set()
    old_values = {}
    if isinstance(options, dict):
        options_pairs = list(options.items())
    else:
        options_pairs = options
    for option, value in options_pairs:
        assert option not in seen_options, f"Already saw option {option}"
        seen_options.add(option)
        old_values[option] = pd.get_option(option)
        pd.set_option(option, value)
    yield
    for option, old_value in old_values.items():
        pd.set_option(option, old_value)
Then you can run
with temp_pandas_options({'display.float_format': '{:.0f}'.format}):
    print(market_value_df)
and get
Market value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a': [['1234', 'abc', '444'],
                        ['5678'],
                        ['2468', 'def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
I also made an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
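A list comprehension is another reasonable option here (a sketch; like apply it still loops in Python, but it avoids iterrows):
import numpy as np
import pandas as pd

t = pd.DataFrame({'a': [['1234', 'abc', '444'], ['5678'], ['2468', 'def']]})

# take the 2nd element when the list is long enough, NaN otherwise
t['element_two'] = [lst[1] if len(lst) > 1 else np.nan for lst in t['a']]
print(t)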

There are three problems (load database, loop, and append series)

I found this problem to be more difficult than I thought when I started.
I want to read a particular column from tables in an SQLite database, make each into a Series, and then combine them into a single data frame.
I tried this, but it failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3
con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db")  # my sqldb
tmplist = ['A003060', 'A003070']  # the db contains those tables; I call only two for practice
for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s " % (i), con, index_col=None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
That code only prints the separate results, like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want to do is like this:
A003060 A003070
0 7150.0 14950.0
1 6770.0 15500.0
2 7450.0 15000.0
3 7240.0 14800.0
4 6710.0 14500.0
I asked a similar question before and got an answer, but that answer used predefined variables. I must use a loop because I have to deal with a series of large databases. I have already tried other approaches using dataframe.append and transpose(), but failed.
I would appreciate some small hints. Thank you.
To append pandas series using for loop
I think you can create a list, append the data to it, and finally use concat:
dfs = []
for i in tmplist:
    listSeries = pd.read_sql("SELECT * FROM %s " % (i), con, index_col=None)['Close'].head(5)
    dfs.append(listSeries)

df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)
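A slightly more compact variant (just a sketch, assuming the same con connection and table names as in the question) builds the frame in one step from a dict, so the column labels come straight from the table names:
import pandas as pd

# assumes `con` and `tmplist` are defined as in the question
df = pd.concat(
    {name: pd.read_sql("SELECT * FROM %s" % name, con, index_col=None)['Close'].head(5)
     for name in tmplist},
    axis=1,
)
print(df)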

Log values by SFrame column

Please, can anybody tell me how I can take the logarithm of every value in an SFrame (graphlab) or DataFrame (pandas) column, without iterating through the whole length of the column?
I am especially interested in functionality similar to the Groupby Aggregators for the log function. I couldn't find it myself...
Important: I am not interested in a for-loop iteration over the whole length of the column. I am only interested in a specific function that transforms all values in the column to their log values.
I'm also very sorry if this function is in the manual. Please just give me a link...
numpy provides implementations of a wide range of basic mathematical transformations. You can use those on all data structures that build on numpy's ndarray.
import pandas as pd
import numpy as np
data = pd.Series([np.exp(1), np.exp(2), np.exp(3)])
np.log(data)
Outputs:
0 1
1 2
2 3
dtype: float64
This example is for pandas data types, but it works for all data structures that are based on numpy arrays.
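For instance, the same ufunc can be applied to a whole DataFrame of positive values at once (a small sketch with made-up numbers):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.e, np.e ** 2],
                   'b': [10.0, 100.0, 1000.0]})

# np.log broadcasts over every column and returns a DataFrame of the same shape
print(np.log(df))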
The same "apply" pattern works for SFrames as well. You could do:
import graphlab
import math
sf = graphlab.SFrame({'a': [1, 2, 3]})
sf['b'] = sf['a'].apply(lambda x: math.log(x))
@cel
I think in my case it could also be possible to use the following pattern.
import numpy
import pandas
import graphlab
df
a b c
1 1 1
1 2 3
2 1 3
....
df['log c'] = df.groupby('a')['c'].apply(lambda x: numpy.log(x))
For an SFrame (an sf object instead of df) it would look a little different:
logvals = numpy.log(sf['c'])
log_sf = graphlab.SFrame(logvals)
sf = sf.join(log_sf, how = 'outer')
With numpy the code fragment is probably a little bit too long, but it works...
The main problem is of course time performance. I had hoped to find some specific function to minimise my time...

Selecting from pandas dataframe (or numpy ndarray?) by criterion

I find myself coding this sort of pattern a lot:
tmp = <some operation>
result = tmp[<boolean expression>]
del tmp
...where <boolean expression> is to be understood as a boolean expression involving tmp. (For the time being, tmp is always a pandas dataframe, but I suppose that the same pattern would show up if I were working with numpy ndarrays--not sure.)
For example:
tmp = df.xs('A')['II'] - df.xs('B')['II']
result = tmp[tmp < 0]
del tmp
As one can guess from the del tmp at the end, the only reason for creating tmp at all is so that I can use a boolean expression involving it inside an indexing expression applied to it.
I would love to eliminate the need for this (otherwise useless) intermediate, but I don't know of any efficient[1] way to do this. (Please, correct me if I'm wrong!)
As second best, I'd like to push off this pattern to some helper function. The problem is finding a decent way to pass the <boolean expression> to it. I can only think of indecent ones. E.g.:
def filterobj(obj, criterion):
    return obj[eval(criterion % 'obj')]
This actually works[2]:
filterobj(df.xs('A')['II'] - df.xs('B')['II'], '%s < 0')
# Int
# 0 -1.650107
# 2 -0.718555
# 3 -1.725498
# 4 -0.306617
# Name: II
...but using eval always leaves me feeling all yukky 'n' stuff... Please let me know if there's some other way.
[1] E.g., any approach I can think of involving the filter built-in is probably inefficient, since it would apply the criterion (some lambda function) by iterating, "in Python", over the pandas (or numpy) object...
[2] The definition of df used in the last expression above would be something like this:
import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = range(5)
ix = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
                               names=('Alpha', 'Int'))
c = ('I', 'II', 'III')
df = pd.DataFrame(np.random.randn(len(ix), len(c)), index=ix, columns=c)
Because of the way Python works, I think this one's going to be tough. I can only think of hacks which only get you part of the way there. Something like
def filterobj(obj, fn):
    return obj[fn(obj)]
filterobj(df.xs('A')['II'] - df.xs('B')['II'], lambda x: x < 0)
should work, unless I've missed something. Using lambdas this way is one of the usual tricks for delaying evaluation.
Thinking out loud: one could make a this object which isn't evaluated but just sticks around as an expression, something like
>>> this
this
>>> this < 3
this < 3
>>> df[this < 3]
Traceback (most recent call last):
File "<ipython-input-34-d5f1e0baecf9>", line 1, in <module>
df[this < 3]
[...]
KeyError: u'no item named this < 3'
and then either special-case the treatment of this into pandas or still have a function like
def filterobj(obj, criterion):
    return obj[eval(str(criterion.subs({"this": "obj"})))]
(with enough work we could lose the eval, this is simply proof of concept) after which something like
>>> tmp = df["I"] + df["II"]
>>> tmp[tmp < 0]
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
dtype: float64
>>> filterobj(df["I"] + df["II"], this < 0)
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
dtype: float64
would work. I'm not sure any of this is worth the headache, though; Python simply isn't very conducive to this style.
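One more possibility, if your pandas version supports callable indexing (added in pandas 0.18, if I remember correctly): you can pass a lambda straight to [] or .loc, which avoids naming the temporary at all. A hedged sketch, assuming df as defined in the question:
# the intermediate Series is built once and handed to the callable,
# so no named temporary (and no eval) is needed
result = (df.xs('A')['II'] - df.xs('B')['II']).loc[lambda s: s < 0]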
This is as concise as I could get:
(df.xs('A')['II'] - df.xs('B')['II']).apply(lambda x: x if (x<0) else np.nan).dropna()
Int
0 -4.488312
1 -0.666710
2 -1.995535
Name: II