Vaex column doesn't evaluate - vaex

I have the following calculation:
df.t =100.0*((1.25/1023)*df.t-0.5)
Strangely, >>>df doesn't show result, only old values in that column. However, df.t shows calculated values.
So, when I export result to pandas with dfp = df.to_pandas_df(), it gets raw old values as well. And even dfp.t = df.t doesn't help.
So, how do I get calculated values?

Looking through the vaex docs I found that df.materialize('t', inplace=True) should likely help, but it didn't.
Finally, I tried
df['t'] =100.0*((1.25/1023)*df.t-0.5)
and that worked. I'm not sure why, since I thought df.t and df['t'] to be synonyms.
But anyways, problem is solved.

Related

Speeding up pandas multi row assignment with loc()

I am trying to assign value to a column for all rows selected based on a condition. Solutions for achieving this are discussed in several questions like this one.
The standard solution are of the following syntax:
df.loc[row_mask, cols] = assigned_val
Unfortunately, this standard solution takes forever. In fact, in my case, I didn't manage to get even one assignment complete.
Update: More info about my dataframe: I have ~2 Million rows in my dataframe and I am trying to update the value of one column in my dataframe for rows that are selected based on a condition. On average, the selection condition is satisfied by ~10 rows.
Is it possible to speed up this assignment operation? Also, are there any general guidelines for multiple assignments with pandas in general.
I believe .loc and .at are the differences you're looking for. .at is meant to be faster based on this answer.
You could give np.where a try.
Here is an simple example of np.where
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df['B'] = np.where(df['B']< 50, 100000, df['B'])
np.where() do nothing if condition fails
has another example.
In your case, it might be
df[col] = np.where(df[col]==row_condition, assigned_val, df[col])
I was thinking it might be a little quicker because it is going straight to numpy instead of going through pandas to the underlying numpy mechanism. This article talks about Pandas vs Numpy on large datasets: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7#:~:text=Numpy%20was%20faster%20than%20Pandas,exception%20of%20simple%20arithmetic%20operations.

Sparse columns in pandas: directly access the indices of non-null values

I have a large dataframe (approx. 10^8 rows) with some sparse columns. I would like to be able to quickly access the non-null values in a given column, i.e. the values that are actually saved in the array. I figured that this could be achieved by df.<column name>[<indices of non-null values>]. However, I can't see how to access <indices of non-null values> directly, i.e. without any computation. When I try df.<column name>.index it tells me that it's a RangeIndex, which doesn't help. I can even see <indices of non-null values> when I run df.<column name>.values, but looking through dir(df.<column name>.values) I still cant't see a way to access them.
To make clear what I mean, here is a toy example:
In this example <indices of non-null values> is [0,1,3].
EDIT: The answer below by #Piotr Żak is a viable solution, but it requires computation. Is there a way to access <indices of non-null values> directly via an attribute of the column or array?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]),
columns=['a'])
just filter without nan:
filtered_df = df[df['a'].notnull()]
transform column from df to array:
s_array = filtered_df[["a"]].to_numpy()
or - transform indexes from df to array:
filtered_df.index.tolist()

How to make pandas show the entire dataframe without cropping it by columns?

I am trying to represent cubic spline interpolation information for function f(x) as a dataframe.
When trying to print into a spyder, I find that the columns are being cut off. When trying to reproduce the output in Jupiter Lab, I got the same thing.
When I ran in ipython via terminal I got the desired full dataframe output.
I searched the integnet and tried the pandas commands setting options pd.set_options(), but nothing came of it.
I attach a screenshot with the output in ipython.
In Juputer can use:
from IPython.display import display, HTML
and instead of
print(dataframe)
use of in anyway place
display(HTML(dataframe.to_html()))
This will create a nice table.
Unfortunately, this will not work in the spyder. So you can try to adjust the width of the ipython were suggested. But in most cases this will make the output poorly or unreadable.
After trying the dataframe methods, I found what appears to be a cropping setting.
In Spyder I used:
pd.set_option('expand_frame_repr', False)
print(dataframe)
This method explains why increasing max_column didn't help me previously.
You can specify a maximum number for rows or columns using pd.set_options(display.max_columns=1000)
But you don't have to set an arbitrary value, but rather use None instead to make sure every size will be covered.
For rows, use:
pd.set_option('display.max_rows', None)
And for columns, use:
pd.set_option('display.max_columns', None)
It is a result of the display width. You can use the following set_options():
pd.set_options(display.width=1000) #make huge
You may also have to raise max columns but it should be smart enough to adjust automatically after you make width bigger:
pd.set_options(display.max_columns=None)

Holoviz panel will not print pandas dataframe row in Jupyter notebook

I'm trying to recreate the first panel.interact example in the Holoviz tutorial using a Pandas dataframe instead of a Dask dataframe. I get the slider, but the pandas dataframe row does not show.
See the original example at: http://holoviz.org/tutorial/Building_Panels.html
I've tried using Dask as in the Holoviz example. Dask rows print out just fine, but it demonstrates that panel seem to treat Dask dataframe rows differently for printing than Pandas dataframe rows. Here's my minimal code:
import pandas as pd
import panel
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
def select_row(rowno=0):
row = df.loc[rowno]
return row
panel.extension()
panel.extension('katex')
panel.interact(select_row, rowno=(0, 5))
I've included a line with the katex extension, because without it, I get a warning that it is needed. Without it, I don't even get the slider.
I can call the select_row(rowno=0) function separately in a Jupyter cell and get a nice printout of the row, so it appears the function is working as it should.
Any help in getting this to work would be most appreciated. Thanks.
Got a solution. With Pandas, loc[rowno:rowno] returns a pandas.core.frame.DataFrame object of length 1 which works fine with panel while loc[rowno] returns a pandas.core.series.Series object which does not work so well. Thus modifying the select_row() function like this makes it all work:
def select_row(rowno=0):
row = df.loc[rowno:rowno]
return row
Still not sure, however, why panel will print out the Dataframe object and not the Series object.
Note: if you use iloc, then you use add +1, i.e., df.iloc[rowno:rowno+1].

Tensorflow classifier.predict write to csv

I've not used Tensorflow for a while and when I updated it seemed to have broken my old code as many of the old functions are deprecated. I fixed them with the new code and it all seems to be running except for when I write out the results:
y_predicted = classifier.predict(X_test)
There is an as iterable option as well - which I don't think I need.
I use to write out the results of the predictions using:
pandas.DataFrame(y_predicted).to_csv(/dir/)
but now I am getting an error that not all elements can be converted into String type. Is there a class in y_predicted I am suppose to be calling instead of the whole thing?
Anyways, I found a solution using np.array instead of a pandas dataframe:
result = np.asarray(y_predicted)
formatInt = result.astype(np.int)
np.savetxt("dir",formatInt,delimiter=",")
You can also try,
df = pandas.DataFrame({'Prediction':list(y_predicted)})
df.to_csv('filename.csv')