What is the Vaex command for pd.isnull().sum()? - pandas

Someone please give me a VAEX alternative for this code:
df_train = vaex.open('../input/ms-malware-hdf5/train.csv.hdf5')
total = df_train.isnull().sum().sort_values(ascending = False)

Vaex does not at this time support counting missing values on a dataframe level, only on an expression (column) level. So you will have to do a bit of work yourself.
Consider the following example:
import vaex
import vaex.ml
import pandas as pd
df = vaex.ml.datasets.load_titanic()
count_na = [] # to count the missing value per column
for col in df.column_names:
    count_na.append(df[col].isna().sum().item())
s = pd.Series(data=count_na, index=df.column_names).sort_values(ascending=True)
If you think this is something you might need to use often, it might be worth it to create your own dataframe method following this example.
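As a rough sketch of what such a reusable helper could look like: the function below is hypothetical (`count_missing` is not part of Vaex or pandas) and relies only on `df[col].isna().sum()`, the same call used in the loop above. It assumes the result can be passed to `int()`, which the `.item()` call above suggests is true for Vaex. The demo uses a small pandas frame so it runs without Vaex; with a Vaex dataframe you would pass `df.column_names` as `columns`.

```python
import numpy as np
import pandas as pd


def count_missing(df, columns):
    """Return a pandas Series of missing-value counts, largest first."""
    # Hypothetical helper: it only needs df[col].isna().sum() to work,
    # which pandas columns provide and the answer above uses for Vaex.
    counts = {col: int(df[col].isna().sum()) for col in columns}
    return pd.Series(counts).sort_values(ascending=False)


# Demo on a small pandas frame; for Vaex, pass df.column_names as `columns`.
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": ["x", None, "z"],
    "c": [1, 2, 3],
})
missing = count_missing(demo, demo.columns)
print(missing)
```

Sorting with `ascending=False` matches the ordering asked for in the question.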

Related

Pandas: Value_counts() function explanation

Does anyone know whether it's possible to return a single value using value_counts() in pandas, or whether there is a way to isolate a single value?
thanks!
In answer to 'a possible way to isolate a single value' question:
iat accesses a single value:
import pandas as pd
data = pd.Series([f"value_{x}" for x in range(20)])
data.iat[4]
Combining this with value_counts() method:
import pandas as pd
numbers = [f"value_{x}" for x in range(20)]
numbers[5] = numbers[6]
data = pd.Series(numbers)
data.value_counts().iat[0]
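Since `iat[0]` depends on position, it may also help to look a count up by its label. The following is a small sketch using the same sample data; the variable names are only for illustration.

```python
import pandas as pd

numbers = [f"value_{x}" for x in range(20)]
numbers[5] = numbers[6]  # "value_6" now appears twice, "value_5" not at all
counts = pd.Series(numbers).value_counts()

count_of_value_6 = counts["value_6"]         # look up one label's count
count_of_value_5 = counts.get("value_5", 0)  # default for labels that never occur
most_common = counts.idxmax()                # label with the highest count
```

Using `.get()` with a default avoids a `KeyError` when the value is not present in the data at all.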

koalas Column assignment doesn't support type ndarray

All - I am trying to add a new column to an existing koalas dataframe but it fails with the error above. The value I am assigning with is an np array. Am I missing something at all? This works well with pandas.
import databricks.koalas as ks
from sklearn.datasets import load_iris
iris = load_iris()
df = ks.DataFrame(data=iris.data, columns=iris.feature_names)
# works so far!!
df["target"] = iris.target ## this errors out!
TypeError: Column assignment doesn't support type ndarray
Am I missing anything here?
thanks.
Unfortunately, even df.assign did not solve the problem; I was getting the same error.
I had to do this:
# allow operations that combine two different koalas objects
ks.set_option('compute.ops_on_diff_frames', True)
# convert target to a koalas series so that it can be assigned to the dataframe as a column
ks_series = ks.Series(iris.target)
df["target"] = ks_series
ks.reset_option('compute.ops_on_diff_frames')
My bad:
I misread where and what the issue was. Try the following:
...
df.assign(target=iris.target)
Could you try the following:
...
df = ks.DataFrame(data=iris.data, columns=list(iris.feature_names))
...
Looking into the load_iris documentation, they make a note of converting the returned array into a list.

Good alternative to exec in python 2.7

I have code where I need to create pandas dataframes whose names come from a list. I know this can be achieved with the exec() function, but it looks like it's slowing down my app. Is there a better alternative?
import pandas as pd
df_names = ["first","second","third"]
col_names = ['A','B','C']
for names in df_names:
    exec("%s=pd.DataFrame(columns=col_names)"%(names))
I found the method below, which stores the dataframes in a dict keyed by name, and it is working for me:
import pandas as pd
df_names = ["first","second","third"]
col_names = ['A','B','C']
d={}
for names in df_names:
    d[names]=pd.DataFrame(columns=col_names)
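A compact variant of the same idea, shown as a sketch: a dict comprehension builds the frames, and each one is then reached by its name as a key instead of as a separate variable. The name `frames` and the sample row are only for illustration.

```python
import pandas as pd

df_names = ["first", "second", "third"]
col_names = ["A", "B", "C"]

# One empty DataFrame per name, stored under that name as the key.
frames = {name: pd.DataFrame(columns=col_names) for name in df_names}

# Access a frame by name; here a single row is added to "first" only.
frames["first"].loc[0] = [1, 2, 3]
```

Each key maps to its own DataFrame object, so modifying one frame leaves the others untouched.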

How can I make different columns render as different colors in holoviews / hvplot?

I have a pandas dataframe with two columns of time series data. In my actual data, these columns are large enough that the render is unwieldy without datashader. I am attempting to compare events from these two timeseries. However, I need to be able to tell which data point is from which column. A simple functional example is below. How would I get columns A and B to use different color maps?
import numpy as np
import hvplot.pandas
import pandas as pd
A = np.random.randint(10, size=10000)
B = np.random.randint(30, size=10000)
d = {'A':A,'B':B}
df = pd.DataFrame(d)
df.hvplot(kind='scatter',datashade=True, height=500, width=1000, dynspread=False)
You will have to use the count_cat aggregator that counts each category separately, e.g. in the example above that would look like this:
import datashader as ds
df.hvplot(kind='scatter', aggregator=ds.count_cat('Variable'), datashade=True,
          height=500, width=1000)
The 'Variable' here corresponds to the default group_label that hvplot assigns to the columns. If you provided a different group_label you would have to update the aggregator to match. However instead of supplying an aggregator explicitly you can also use the by keyword:
df.hvplot(kind='scatter', by='Variable', datashade=True,
          height=500, width=1000)
Once hvplot 0.3.1 is released you'll also be able to supply an explicit cmap, e.g.:
df.hvplot(kind='scatter', by='Variable', datashade=True,
          height=500, width=1000, cmap={'A': 'red', 'B': 'blue'})
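To get a feel for what a per-category column such as 'Variable' contains, the wide frame can be reshaped by hand with pandas. This is only an illustration of the data layout, not a description of how hvplot works internally; the column names `index` and `value` are assumptions chosen for the sketch. Datashader's `count_cat` expects the column to have a categorical dtype, which is why the cast is included.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(10, size=1000),
    "B": rng.integers(30, size=1000),
})

# Wide -> long: one row per (index, column name, value) triple.
long_df = df.reset_index().melt(
    id_vars="index", var_name="Variable", value_name="value"
)

# A categorical column is what a per-category aggregator would count over.
long_df["Variable"] = long_df["Variable"].astype("category")
```

Each original column becomes one category, so the two series can be told apart and colored separately.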

How to Render Math Table Properly in IPython Notebook

The math problem that I'm solving gives different analytical solutions in different scenarios, and I would like to summarize the result in a nice table. IPython Notebook renders the list nicely:
for example:
import sympy
from pandas import DataFrame
from sympy import *
init_printing()
a, b, c, d = symbols('a b c d')
t = [[a/b, b/a], [c/d, d/c]]
t
However, when I summarize the answers into a table using DataFrame, the math cannot be rendered any more:
df = DataFrame(t, index=['Situation 1', 'Situation 2'], columns=['Answer1','Answer2'])
df
"print df.to_latex()" also gives the same result. I also tried "print(latex(t))", which gives this after compiling in LaTeX. That is alright, but I still need to convert it to a table manually:
How should I use DataFrame properly in order to render the math properly? Or is there any other way to export the math result into a table in Latex? Thanks!
Update: 01/25/14
Thanks again to @Jakob for solving the problem. It works perfectly for simple matrices, though there are still some minor problems for more complicated math expressions. But I guess, like @asmeurer said, perfection requires an update in IPython and Pandas.
Update: 01/26/14
If I render the result directly, i.e. just print the list, it works fine:
MathJax is currently not able to render tables, hence the most obvious approach (pure latex) does not work.
However, following the advice of @asmeurer, you should use an HTML table and render the cell content as LaTeX. In your case this could be easily achieved by the following intermediate step:
from sympy import latex
tl = list(map(lambda tc: '$'+latex(tc)+'$', t))  # list() so this also works on Python 3, where map is lazy
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer'])
df
which gives:
Update:
In case of two dimensional data, the simple map function will not work directly. To cope with this situation the numpy shape, reshape and ravel functions could be used like:
import numpy as np
t = [[a/b, b/a],[a*a,b*b]]
tl = np.reshape(list(map(lambda tc: '$'+latex(tc)+'$', np.ravel(t))), np.shape(t))  # list() for Python 3
df = DataFrame(tl, index=['Situation 1', 'Situation 2'], columns=['Answer 1','Answer 2'])
df
This gives:
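For comparison, a nested list comprehension can build the same two-dimensional table of `$...$` strings without the ravel and reshape round trip. This is a sketch of an alternative under the same idea, not the method from the answer above, and it uses the expressions from the question.

```python
from pandas import DataFrame
from sympy import latex, symbols

a, b, c, d = symbols("a b c d")
t = [[a / b, b / a], [c / d, d / c]]

# Wrap every expression's LaTeX form in $...$, keeping the row structure.
tl = [["$" + latex(expr) + "$" for expr in row] for row in t]

df = DataFrame(
    tl,
    index=["Situation 1", "Situation 2"],
    columns=["Answer 1", "Answer 2"],
)
```

Because the rows are kept as they are, this only works when `t` is a regular list of equally long rows.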
Update 2:
Pandas crops cell content if the string length exceeds a certain number of characters. For example, a more complicated expression like
t1 = [a/2+b/2+c/2+d/2]
tl = np.reshape(list(map(lambda tc: '$'+latex(tc)+'$', np.ravel(t1))), np.shape(t1))  # list() for Python 3
df = DataFrame(tl, index=['Situation 1'], columns=['Answer 1'])
df
gives:
To cope with this issue a pandas package option has to be altered, for details see here. For the present case the max_colwidth has to be changed. The default value is 50, hence let's change it to 100:
import pandas as pd
pd.options.display.max_colwidth=100
df
gives:
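If the wider column is only needed for one display, the option can also be set temporarily with `pd.option_context`, which restores the previous value on exit. A minimal sketch, assuming a placeholder 80-character string in place of the real LaTeX cell content:

```python
import pandas as pd

# Placeholder cell content standing in for a long LaTeX string.
df = pd.DataFrame({"Answer 1": ["x" * 80]})
default_width = pd.get_option("display.max_colwidth")

# The wider limit applies only inside the with-block.
with pd.option_context("display.max_colwidth", 100):
    inside_width = pd.get_option("display.max_colwidth")
    wide_repr = repr(df)

# After the block, the previous setting is back in effect.
restored_width = pd.get_option("display.max_colwidth")
```

This keeps the global display settings unchanged for the rest of the notebook.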