What's the equivalent of Pandas' value_counts() in PySpark? - dataframe

I have the following Python/pandas command:
df.groupby('Column_Name').agg(lambda x: x.value_counts().max())
where I am getting the value counts for ALL columns in a DataFrameGroupBy object.
How do I do this in PySpark?

It's more or less the same:
spark_df.groupBy('column_name').count().orderBy('count')
In groupBy you can pass multiple columns, separated by commas.
For example: groupBy('column_1', 'column_2')
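A minimal sketch with two hypothetical column names, counting each combination and showing the most frequent first:
spark_df.groupBy('column_1', 'column_2') \
    .count() \
    .orderBy('count', ascending=False) \
    .show()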

Try this when you want to control the order:
data.groupBy('col_name').count().orderBy('count', ascending=False).show()

Try this:
spark_df.groupBy('column_name').count().show()

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc
spark = SparkSession.builder.appName('whatever_name').getOrCreate()
spark_sc = spark.read.option('header', True).csv(your_file)
value_counts = (
    spark_sc.select('Column_Name')
    .groupBy('Column_Name')
    .agg(count('Column_Name').alias('counts'))
    .orderBy(desc('counts'))
)
value_counts.show()
But note that Spark is much slower than pandas' value_counts() on a single machine.
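If the column comfortably fits on the driver, one option (a sketch, assuming 'Column_Name' is small enough to collect) is to pull just that column into pandas and use value_counts() directly:
# assumes the selected column fits in driver memory
pdf = spark_sc.select('Column_Name').toPandas()
print(pdf['Column_Name'].value_counts())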

df.groupBy('column_name').count().orderBy('count').show()

Related


How can I have the same functions as shift() and cumsum() from pandas in pyspark?
import pandas as pd
temp = pd.DataFrame(data=[['a',0],['a',0],['a',0],['b',0],['b',1],['b',1],['c',1],['c',0],['c',0]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID').apply(lambda x: (x["X"].shift() != x["X"]).cumsum()).reset_index()['X']
print(temp)
My question is how to achieve this in PySpark.
PySpark handles these types of queries with Window utility functions.
You can read its documentation here.
Your PySpark code would look something like this:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# 'time' stands in for whatever column defines the row order within each ID
window = W.partitionBy('ID').orderBy('time')
new_df = (
    df
    .withColumn('shifted', F.lag('X').over(window))
    # same comparison as the pandas (x["X"].shift() != x["X"])
    .withColumn('isNotEqualToPrev', (F.col('shifted') != F.col('X')).cast('int'))
    .withColumn('cumsum', F.sum('isNotEqualToPrev').over(window))
)
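To try it end to end, a minimal self-contained sketch that rebuilds the example data; the 'time' column here is synthetic, added only to give Spark a deterministic row order, so substitute a real ordering column if you have one:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window as W

spark = SparkSession.builder.getOrCreate()
rows = [('a', 0), ('a', 0), ('a', 0), ('b', 0), ('b', 1),
        ('b', 1), ('c', 1), ('c', 0), ('c', 0)]
df = (spark.createDataFrame(rows, ['ID', 'X'])
      .withColumn('time', F.monotonically_increasing_id()))

window = W.partitionBy('ID').orderBy('time')
new_df = (
    df
    .withColumn('shifted', F.lag('X').over(window))
    .withColumn('isNotEqualToPrev', (F.col('shifted') != F.col('X')).cast('int'))
    .withColumn('cumsum', F.sum('isNotEqualToPrev').over(window))
)
new_df.show()
Note that F.lag() returns null for the first row of each partition, so the first cumsum per ID comes out null (and later values are off by one compared with pandas, which counts the first row as a change); wrap the comparison in F.coalesce((...).cast('int'), F.lit(1)) if you need an exact match.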

Flatten and rename multi-index agg columns

I have some Pandas / cudf code that aggregates a particular column using two aggregate methods, and then renames the multi-index columns to flattened columns.
df = (
some_df
.groupby(["some_dimension"])
.agg({"some_metric" : ["sum", "max"]})
.reset_index()
.rename(columns={"some_dimension" : "some_dimension__id", ("some_metric", "sum") : "some_metric_sum", ("some_metric", "max") : "some_metric_max"})
)
This works great in cudf, but does not work in Pandas 0.25 -- the hierarchy is not flattened out.
Is there a similar approach using Pandas? I like the cudf tuple syntax and how they just implicitly flatten the columns. Hoping to find a similarly easy way to do it in Pandas.
Thanks.
In pandas 0.25.0+ there is something called groupby aggregation with relabeling.
Here is a stab at your code:
df = (some_df
      .groupby(["some_dimension"])
      .agg(some_metric_sum=("some_metric", "sum"),
           some_metric_max=("some_metric", "max"))
      .reset_index()
      .rename(columns={"some_dimension": "some_dimension__id"}))
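Alternatively, if you prefer to keep the original agg dict, a sketch (plain pandas, column names copied from the question) that flattens the MultiIndex columns by hand and then renames:
df = (some_df
      .groupby(["some_dimension"])
      .agg({"some_metric": ["sum", "max"]}))
# ('some_metric', 'sum') -> 'some_metric_sum', ('some_metric', 'max') -> 'some_metric_max'
df.columns = ["_".join(col) for col in df.columns.to_flat_index()]
df = (df.reset_index()
        .rename(columns={"some_dimension": "some_dimension__id"}))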

PySpark Dataframe: append to each value of a column a word

I would like to append a word (for example, from a list of words) to each value of a column in a PySpark dataframe. I thought of just converting it to pandas because that is easier, but I need to do it in PySpark. Any ideas? Thank you :)
You can do it easily with the concat function:
from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn(col, F.concat(F.col(col), F.lit("new_word")))
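For a single column, a sketch with placeholder names ('my_column' and the 'suffix' literal are only examples); concat_ws is used here because concat returns null whenever any of its inputs is null:
from pyspark.sql import functions as F
# 'my_column' is a placeholder; concat_ws skips nulls instead of propagating them
df = df.withColumn('my_column', F.concat_ws('_', F.col('my_column'), F.lit('suffix')))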

Conditional aggregation in PySpark groupby

Easy question from a newbie in PySpark:
I have a df and I would like to make a conditional aggregation, returning the aggregation result if the denominator is different from 0, otherwise 0.
My attempt produces an error:
groupBy=["K"]
exprs=[(sum("A")+(sum("B"))/sum("C") if sum("C")!=0 else 0 ]
grouped_df=new_df.groupby(*groupBy).agg(*exprs)
Any hint?
Thank you
You have to use when/otherwise for if/else:
import pyspark.sql.functions as psf
new_df.groupby("K").agg(
    psf.when(psf.sum("C") == 0, psf.lit(0))
       .otherwise((psf.sum("A") + psf.sum("B")) / psf.sum("C"))
       .alias("sum")
)
But you can also do it this way:
import pyspark.sql.functions as psf
new_df.groupby("K").agg(
    ((psf.sum("A") + psf.sum("B")) / psf.sum("C")).alias("sum")
).na.fill({"sum": 0})
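To sanity-check both variants on a toy DataFrame (a sketch; the data and values are made up, and division by zero yields null in Spark, which is why na.fill works in the second form):
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("k1", 1.0, 2.0, 4.0), ("k1", 1.0, 2.0, 0.0), ("k2", 1.0, 1.0, 0.0)],
    ["K", "A", "B", "C"],
)
toy.groupby("K").agg(
    psf.when(psf.sum("C") == 0, psf.lit(0))
       .otherwise((psf.sum("A") + psf.sum("B")) / psf.sum("C"))
       .alias("sum")
).show()
# K = k1 -> (2 + 4) / 4 = 1.5, K = k2 -> 0 because sum("C") == 0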

pandas HDFStore select rows with non-null values in the data column

In a pandas DataFrame/Series there's an .isnull() method. Is there something similar in the syntax of the where= filter of HDFStore's select method?
WORKAROUND SOLUTION:
The /meta node of a data column inside the HDF5 file can be used as a hack:
import pandas as pd

store = pd.HDFStore('store.h5')
print(store.groups())  # inspect the nodes stored in the file
# read the values recorded under the column's /meta node
non_null = list(store.select("/df/meta/my_data_column/meta"))
df = store.select('df', where='my_data_column == non_null')
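If the table fits in memory, a simpler (but less memory-efficient) sketch is to load it and filter with .notnull() afterwards:
import pandas as pd

with pd.HDFStore('store.h5') as store:
    df = store.select('df')
# keep only the rows where the data column is not null
df = df[df['my_data_column'].notnull()]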