Log values by SFrame column - pandas

Can anybody tell me how to take the logarithm of every value in a column of a graphlab SFrame (or a pandas DataFrame) without iterating over the whole length of the column?
I am especially interested in functionality similar to the Groupby Aggregators, but for the log function. I couldn't find it myself...
Important: I am not interested in a for-loop over the whole length of the column. I am only interested in a specific function that transforms all values of the column to their log values.
I'm also very sorry if this function is in the manual. Please just give me a link...

numpy provides implementations of a wide range of basic mathematical transformations. You can use them on all data structures that build on numpy's ndarray.
import pandas as pd
import numpy as np
data = pd.Series([np.exp(1), np.exp(2), np.exp(3)])
np.log(data)
Outputs:
0    1.0
1    2.0
2    3.0
dtype: float64
This example is for pandas data types, but it works for all data structures that are based on numpy arrays.
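To apply the same idea to a single DataFrame column, a minimal sketch could look like this (the frame `df` and the column names `value` and `log_value` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({"value": [1.0, 10.0, 100.0]})

# np.log is a NumPy ufunc, so it is applied element-wise to the whole
# column without an explicit Python loop and returns a pandas Series.
df["log_value"] = np.log(df["value"])
print(df)
```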

The same "apply" pattern works for SFrames as well. You could do:
import graphlab
import math
sf = graphlab.SFrame({'a': [1, 2, 3]})
sf['b'] = sf['a'].apply(lambda x: math.log(x))

@cel
I think that in my case the following pattern could also work.
import numpy
import pandas
import graphlab
df
a b c
1 1 1
1 2 3
2 1 3
....
df['log c'] = df.groupby('a')['c'].apply(lambda x: numpy.log(x))
For an SFrame (sf instead of the df object) it could look a little different:
logvals = numpy.log(sf['c'])
log_sf = graphlab.SFrame(logvals)
sf = sf.join(log_sf, how = 'outer')
With numpy the code fragment is probably a little too long, but it works...
The main problem is of course time performance. I had hoped to find a specific function that would minimise my run time...
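For the pandas side, note that the logarithm is an element-wise operation, so the groupby step should not be needed: applying np.log to the whole column gives the same values. A small sketch, reusing the three sample rows shown above (whether a similar shortcut helps for SFrame is not covered here):

```python
import numpy as np
import pandas as pd

# The three sample rows from the post above.
df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 1], "c": [1, 3, 3]})

# Element-wise log over the whole column, no groupby and no Python loop.
df["log c"] = np.log(df["c"])

# For comparison: transform keeps the original index, so the per-group
# result lines up with the rows and should match the direct call.
per_group = df.groupby("a")["c"].transform(np.log)
print(df)
```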

Related

In numpy, is there a function to find the inverse of ix_?

With numpy, my goal is to select a square submatrix from a square matrix, and then also look at the collection of elements that are not in that submatrix.
For the first submatrix, I'm using np.ix_:
import numpy as np
r = np.random.rand(3,3)
l = [1,2]
r[np.ix_(l, l)]
Then, r[np.ix_(l, l)] will pick out a 2x2 matrix, marked by **:
       0          1          2
0    r0,0       r0,1       r0,2
1    r1,0     **r1,1**   **r1,2**
2    r2,0     **r2,1**   **r2,2**
But now what is the best approach to select the difference between the submatrix and the parent matrix?
I have looked at:
~np.ix_, like ~np.eye, but this doesn't seem to be supported
np.subtract, but the problem is that I need to select the elements by their indices and not by their values.
Based on a comment by @hpaulj, I followed the approach with the numpy.ma submodule:
import numpy as np
r = np.random.rand(3,3)
l = [1,2]
r[np.ix_(l, l)]
import numpy.ma as ma
mask = ma.zeros(r.shape)
mask[np.ix_(l, l)] = 1
Then, ma.compressed() gives the desired result:
ma.compressed(ma.array(r, mask=mask))
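A plain boolean mask is another way to get the complement without numpy.ma. This is only a sketch of an alternative, not the approach from the comment; it uses a fixed array instead of random data so the result is easy to check:

```python
import numpy as np

r = np.arange(9).reshape(3, 3)   # fixed values instead of np.random.rand
l = [1, 2]

# Mark the submatrix selected by np.ix_ in a boolean array.
inside = np.zeros(r.shape, dtype=bool)
inside[np.ix_(l, l)] = True

sub = r[inside].reshape(len(l), len(l))  # elements inside the submatrix
rest = r[~inside]                        # everything else, flattened row by row
print(sub)
print(rest)
```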
Here, np.ix_ selects the same elements as basic slicing would, but it does so by triggering advanced indexing.
So it fetches all the elements belonging to rows 1 and 2 and columns 1 and 2 as a copy (basic indexing yields a view):
import numpy as np
r = np.random.rand(3,3)
l = [1,2]
r[np.ix_(l, l)]
array([[0.46899841, 0.49051596],
[0.00256912, 0.86447371]])
Equivalent to np.ix_, using basic indexing (this is a view and not a copy!) -
r[1:3, 1:3]
array([[0.46899841, 0.49051596],
[0.00256912, 0.86447371]])
If, however, you want to fetch the elements at indices (1,1) and (2,2), you can use advanced indexing directly, as below -
r[l,l]
array([0.46899841, 0.86447371])
As you can see, this returns the diagonal elements of the submatrix, which is what you would be looking for with np.eye, for example.
Read more about how indexing (basic and advanced) works in the NumPy indexing documentation.
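The view-versus-copy difference can be checked with np.shares_memory. A short sketch with a fixed array (the names adv and bas are only for illustration):

```python
import numpy as np

r = np.arange(9.0).reshape(3, 3)
l = [1, 2]

adv = r[np.ix_(l, l)]   # advanced indexing
bas = r[1:3, 1:3]       # basic slicing

# Same values, but only the slice shares memory with r.
print(np.array_equal(adv, bas))
print(np.shares_memory(r, adv), np.shares_memory(r, bas))

# Writing through the view changes r; writing to the copy does not.
bas[0, 0] = -1.0
adv[1, 1] = -2.0
print(r)
```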

Pandas rounding

I have the following sample dataframe:
Market Value
0 282024800.37
1 317460884.85
2 1260854026.24
3 320556927.27
4 42305412.79
I am trying to round the values in this dataframe to the nearest whole number. Desired output:
Market Value
282024800
317460885
1260854026
320556927
42305413
I tried:
df.values.round()
and the result was
Market Value
282025000.00
317461000.00
1260850000.00
320557000.00
42305400.00
What am I doing wrong?
Thanks
This might be more appropriate posted as a comment, but put here for proper format.
I can't produce your result. With numpy 1.18.1 and Pandas 1.1.0,
df.round().astype('int')
gives me:
Market Value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413
The only thing I can think of is that you may have a 32 bit system, where
df.astype('float32').round().astype('int')
gives me
Market Value
0 282024800
1 317460896
2 1260854016
3 320556928
4 42305412
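The float32 numbers above can be reproduced directly with NumPy. The sketch below only illustrates float32 precision; whether this is what actually happened on the asker's machine is an assumption.

```python
import numpy as np

value = 317460884.85

# float32 has a 24-bit significand, so near 3e8 neighbouring
# representable values are 32 apart and the cents cannot be stored.
as32 = np.float32(value)
print(np.spacing(as32))      # gap to the next float32
print(int(as32))             # nearest representable value
print(int(round(value)))     # float64 keeps enough precision
```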
The following keeps your data intact as floats but displays/prints it rounded to the nearest int.
Big caveat: it is only possible to have this apply to ALL dataframes at once (it is a pandas wide option) rather than just a single dataframe.
pd.set_option("display.precision", 0)
If you like @noah's solution but don't want to have to change the options back after you output something, you can use the following helper function:
import pandas as pd
from contextlib import contextmanager
@contextmanager
def temp_pandas_options(options):
    """Temporarily set pandas options and restore the old values on exit."""
    seen_options = set()
    old_values = {}
    # Accept either a dict or an iterable of (option, value) pairs.
    if isinstance(options, dict):
        options_pairs = list(options.items())
    else:
        options_pairs = options
    for option, value in options_pairs:
        assert option not in seen_options, f"Already saw option {option}"
        seen_options.add(option)
        old_values[option] = pd.get_option(option)
        pd.set_option(option, value)
    try:
        yield
    finally:
        # Restore the previous values even if the body raised.
        for option, old_value in old_values.items():
            pd.set_option(option, old_value)
Then you can run
with temp_pandas_options({'display.float_format': '{:.0f}'.format}):
    print(market_value_df)
and get
Market value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413
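pandas also ships a built-in context manager, pd.option_context, that restores the options on exit. A sketch using the sample values from the question (the variable name market_value_df follows the answer above):

```python
import pandas as pd

market_value_df = pd.DataFrame(
    {"Market Value": [282024800.37, 317460884.85, 1260854026.24,
                      320556927.27, 42305412.79]}
)

before = pd.get_option("display.float_format")

# The option only applies inside the with block; the data stay floats.
with pd.option_context("display.float_format", "{:.0f}".format):
    shown = market_value_df.to_string()
    print(shown)

after = pd.get_option("display.float_format")
```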

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a':[['1234','abc','444'],
['5678'],
['2468','def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
I also made an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
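To check that the str accessor and the apply version agree, here is a self-contained sketch (the column names second_str and second_apply are illustrative):

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"a": [["1234", "abc", "444"], ["5678"], ["2468", "def"]]})

# .str[i] indexes into each list and yields NaN where the list is too short.
t["second_str"] = t["a"].str[1]

# Same result with an explicit length check per row.
t["second_apply"] = t["a"].apply(lambda x: x[1] if len(x) > 1 else np.nan)
print(t)
```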

Pandas Boolean selection updated?

I was used to getting a single number telling me how many rows are True for either of the conditions in the code. However, since I ran conda update --all, I now get a list of values that are either 0 or 1. I guess this is due to a pandas update; I did a Google search but could not find that boolean indexing was changed. What is now the simplest way in pandas to get this sum of booleans? (I know how to get it, but I cannot imagine that an extra step is required.)
import pandas as pd
import numpy as np
x = np.random.randint(10,size=10)
y = np.random.randint(10,size=10)
d ={}
d['x'] = x
d['y'] = y
df = pd.DataFrame(d)
sum([df['x']>=6] or [df['y']<=3])
You need to use the vectorized or operator, |:
(df.x.ge(6) | df.y.le(3)).sum()
# 9
Or: ((df.y <= 3) | (df.x >= 6)).sum(), sum((df.y <= 3) | (df.x >= 6)).
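Why the original expression returns a Series rather than a single number can be seen in a small sketch (the data are made up so that the result is deterministic):

```python
import pandas as pd

df = pd.DataFrame({"x": [7, 2, 9, 1], "y": [5, 1, 8, 6]})

# `[A] or [B]` is plain Python: a non-empty list is truthy, so the
# expression evaluates to `[A]` and the y-condition is never used.
chosen = [df["x"] >= 6] or [df["y"] <= 3]

# sum() over a one-element list is 0 + A, which is still a Series.
element_wise = sum(chosen)

# The vectorized | combines both conditions row by row before counting.
count = int(((df["x"] >= 6) | (df["y"] <= 3)).sum())
print(element_wise.tolist(), count)
```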

Pandas fill cells in a column with NaN values, derive the value from other cells in the row

I have a dataframe:
a b c
0 1 2 3
1 1 1 1
2 3 7 NaN
3 2 3 5
...
I want to fill column "three" (c in the sample code) in place (update the values) where the values are NaN, using a machine learning algorithm.
I don't know how to do it in place. Sample code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.DataFrame([range(3), [1, 5, np.nan], [2, 2, np.nan], [4, 5, 9], [2, 5, 7]], columns=['a', 'b', 'c'])
x = []
y = []
# collect the rows where c is known as training data
for row in df.iterrows():
    index, data = row
    if not pd.isnull(data['c']):
        x.append(data[['a', 'b']].tolist())
        y.append(data['c'])
model = LinearRegression()
model.fit(x,y)
#this line does not do it in place.
df[~df.c.notnull()].assign(c = lambda x:model.predict(x[['a','b']]))
But this gives me a copy of the dataframe. The only option I have left is a for loop, but I don't want to do that. I think there should be a more pythonic way of doing it with pandas. Can someone please help? Or is there any other way of doing this?
You'll have to do something like:
df.loc[pd.isnull(df['three']), 'three'] = _result of model_
This modifies the dataframe df directly.
This way you first filter the dataframe to keep the rows you want to modify (pd.isnull(df['three'])), then from those rows you select the column you want to modify (three).
On the right-hand side of the equals sign, it expects an array / list / Series with as many elements as there are rows in the filtered dataframe (in your example, one row).
You may have to adjust depending on what exactly your model returns.
EDIT
You probably need to do something like this:
pred = model.predict(df[['a', 'b']])
df['pred'] = pred
df.loc[pd.isnull(df['c']), 'c'] = df.loc[pd.isnull(df['c']), 'pred']
Note that a significant part of the issue comes from the way you are using scikit-learn in your example. You need to pass the whole dataset to the model when you predict.
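Putting the pieces together, here is a self-contained sketch of the mask-based assignment. To keep it free of extra dependencies it fits the linear model with np.linalg.lstsq as a stand-in for scikit-learn's LinearRegression; that substitution, and the helper name design, are assumptions of this example and not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [1, 5, np.nan], [2, 2, np.nan], [4, 5, 9], [2, 5, 7]],
    columns=["a", "b", "c"],
)

missing = df["c"].isna()

def design(frame):
    # Hypothetical helper: feature matrix with an intercept column [1, a, b].
    features = frame[["a", "b"]].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(frame)), features])

# Fit on the rows where c is known (least-squares stand-in for LinearRegression).
known_c = df.loc[~missing, "c"].to_numpy()
coef, *_ = np.linalg.lstsq(design(df[~missing]), known_c, rcond=None)

# Predict only for the rows with missing c and write them back in place.
df.loc[missing, "c"] = design(df[missing]) @ coef
print(df)
```

Only the rows selected by the mask are overwritten; the known values of c stay as they were.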
The simplest way is to transpose first, then forward fill/backward fill at your convenience.
df.T.ffill().bfill().T