I have a dataframe like below
from pyspark import SparkContext, SparkConf,SQLContext
import numpy as np
config = SparkConf("local")
sc = SparkContext(conf=config)
sqlContext=SQLContext(sc)
df = sqlContext.createDataFrame([("doc_3",1,3,9), ("doc_1",9,6,0), ("doc_2",9,9,3) ]).withColumnRenamed("_1","doc").withColumnRenamed("_2","word1").withColumnRenamed("_3","word2").withColumnRenamed("_4","word3")
Now I need to keep the first column and the rest columns as a numpy array (two columns : "doc" and a numpy array column)
I know that
sdf=np.array(df.select([c for c in df.columns if c not in {'doc'}]).collect())
print sdf
Translate all the columns into a numpy array but how do I append the numpy array with the first column ? Any help is appreciated.
Unfortunately you cannot make numpy.array column in pyspark dataframe, but you can use regular python list instead, and convert it while reading:
>>> df = sqlContext.createDataFrame([("doc_3",[1,3,9]), ("doc_1",[9,6,0]), ("doc_2",[9,9,3]) ]).withColumnRenamed("_1","doc").withColumnRenamed("_2","words")
>>> df.show()
+-----+---------+
| doc| words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
>>> df
DataFrame[doc: string, words: array<bigint>]
And to get this from 4 columns you had, you can:
>>> from pyspark.sql.functions import *
>>> df2=df.select("doc", array("word1", "word2", "word3").alias("words"))
>>> df2
DataFrame[doc: string, words: array<bigint>]
>>> df2.show()
+-----+---------+
| doc| words|
+-----+---------+
|doc_3|[1, 3, 9]|
|doc_1|[9, 6, 0]|
|doc_2|[9, 9, 3]|
+-----+---------+
Related
I am having trouble with creating a Pandas UDF that performs a calculation on a pd Series based on a value in the same row of the underlying Spark Dataframe.
However, the most straight forward solution doesn't seem to be supported by the Pandas on Spark API:
A very simple example like below
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pandas as pd
#F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
return arr.add(addition)
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()
df.withColumn("added", addition(F.col("array"),F.col("addition")))
throws the following exception on the udf definition line
NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.
Am i tackling this problem in a wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and would mean an enormous amount of rework.
Loading the example, adding import array
from pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)
The response is,
+---------+--------+
| array|addition|
+---------+--------+
|[1, 2, 3]| 10|
|[4, 5, 6]| 20|
+---------+--------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]
If you must use a Pandas function to complete your task here is an option for a solution that uses a Pandas function within a PySpark UDF,
The Spark DF arr column is ArrayType, convert it into a Pandas Series
Apply the Pandas function
Then, convert the Pandas Series back to an array
#F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
pd_arr = pd.Series(arr)
added = pd_arr.add(addition)
return array("l", added)
df = df.withColumn("added", addition_pd(F.col("array"),F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)
Returns
+---------+--------+------------+
|array |addition|added |
+---------+--------+------------+
|[1, 2, 3]|10 |[11, 12, 13]|
|[4, 5, 6]|20 |[24, 25, 26]|
+---------+--------+------------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]
However, it is worth stating that when possible it is recommended to use PySpark Functions over the use of PySpark UDF (see here)
Strangely enough I cant find any where on the internet if its possible to be done.
I have a datafrme of array column.
arr_col
[1,3,4]
[4,3,5]
I want result
Result
3
4
I want the median for each row.
I managed to do it with a pandas udf but it iterates the column and applies np.median to each row. .
I dont want it as it's slow and tow at a time. I want it to act at all rows the same time.
Either in pandas or pyspark
Use numpy
import numpy as np
df['Result'] = np.median(np.vstack(df['arr_col']), axis=1)
Or explode and groupby.median:
df['Result'] = (df['arr_col'].explode()
.groupby(level=0).median()
)
Output:
arr_col Result
0 [1, 3, 4] 3.0
1 [4, 3, 5] 4.0
Used input:
df = pd.DataFrame({'arr_col': [[1,3,4], [4,3,5]]})
Can use a udf in pyspark.
m =udf(lambda x: int(np.median(x)),IntegerType())
df.withColumn('Result', m(col('arr_col'))).show()
+---+---------+------+
| Id| arr_col|Result|
+---+---------+------+
| 1|[1, 3, 4]| 3.0|
| 1|[4, 3, 6]| 4.0|
+---+---------+------+
How to implement a User Defined Aggregate Function (UDAF) in PySpark SQL?
pyspark version = 3.0.2
python version = 3.7.10
As a minimal example, I'd like to replace the AVG aggregate function with a UDAF:
sc = SparkContext()
sql = SQLContext(sc)
df = sql.createDataFrame(
pd.DataFrame({'id': [1, 1, 2, 2], 'value': [1, 2, 3, 4]}))
df.createTempView('df')
rv = sql.sql('SELECT id, AVG(value) FROM df GROUP BY id').toPandas()
where rv will be:
In [2]: rv
Out[2]:
id avg(value)
0 1 1.5
1 2 3.5
How can a UDAF replace AVG in the query?
For example this does not work
import numpy as np
def udf_avg(x):
return np.mean(x)
sql.udf.register('udf_avg', udf_avg)
rv = sql.sql('SELECT id, udf_avg(value) FROM df GROUP BY id').toPandas()
The idea is to implement a UDAF in pure Python for processing not supported by SQL aggregate functions (e.g. a low-pass filter).
A Pandas UDF can be used, where the definition is compatible from Spark 3.0 and Python 3.6+. See the issue and documentation for details.
Full implementation in Spark SQL:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
pd.DataFrame({'id': [1, 1, 2, 2], 'value': [1, 2, 3, 4]}))
df.createTempView('df')
#pandas_udf(DoubleType())
def avg_udf(s: pd.Series) -> float:
return s.mean()
spark.udf.register('avg_udf', avg_udf)
rv = spark.sql('SELECT id, avg_udf(value) FROM df GROUP BY id').toPandas()
with return value
In [2]: rv
Out[2]:
id avg_udf(value)
0 1 1.5
1 2 3.5
You can use a Pandas UDF with GROUPED_AGG type. It receives columns from Spark as Pandas Series, so that you can call Series.mean on the column.
import pyspark.sql.functions as F
#F.pandas_udf('float', F.PandasUDFType.GROUPED_AGG)
def avg_udf(s):
return s.mean()
df2 = df.groupBy('id').agg(avg_udf('value'))
df2.show()
+---+--------------+
| id|avg_udf(value)|
+---+--------------+
| 1| 1.5|
| 2| 3.5|
+---+--------------+
To register it for use in SQL is also possible:
df.createTempView('df')
spark.udf.register('avg_udf', avg_udf)
df2 = spark.sql("select id, avg_udf(value) from df group by id")
df2.show()
+---+--------------+
| id|avg_udf(value)|
+---+--------------+
| 1| 1.5|
| 2| 3.5|
+---+--------------+
tldr; I want to pass a series of positions on a DataFrame and receive a series of values, If possible with a DataFrame method.
I have a Dataframe with some columns and an index
import pandas as pd
df_a = pd.DataFrame(
{'A':[0,1,3,7],
'B':[2,3,4,5]}, index=[0,1,2,3])
I want to retrieve the values at specific (row, column) positions on the DataFrame
rows = [0, 2, 3]
cols = ['A','B','A']
df_a.loc[rows, cols] returns a 3x3 DataFrame
|A |B |A
0 0 2 0
2 3 4 3
3 7 5 7
I want the series of values corresponding to the (row, col) values, a series of length 3
[0, 4, 7]
What is the best way to do this in pandas?
Most certainly! you can use DataFrame.lookup to achieve exactly what you want:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.lookup.html
import pandas as pd
df_a = pd.DataFrame({'A':[0,1,3,7], 'B':[2,3,4,5]}, index=[0,1,2,3])
rows = [0, 2, 3]
cols = ['A','B','A']
values = df_a.lookup(rows, cols)
print(values)
array([0, 4, 7], dtype=int64)
Pandas does not support that kind of indexing, only numpy
>>> df.to_numpy()[rows, df.columns.get_indexer(cols)]
array([0, 4, 7])
I have a PySpark UDF that takes an array and returns its suffix:
func.udf( lambda ng: ng[1:], ArrayType(IntegerType()) )
Is it possible to turn it into a scalar pandas_udf?
Does Pandas offer the necessary vectorized operations?
Thanks
You can get the same functionality with scalar pandas udf but make sure that you return a Series with list of lists from the udf as the series normally expects a list of elements and your row array is flattened and converted to multiple rows if you return directly the list as series.
Edited As pointed out by OP in comments my previous answer(pd.Series([v[0][1:]]) using index is wrong and works only in certain conditions. Now corrected.
df = spark.createDataFrame([([1,2,3],'val1'),([4,5,6],'val2')],['col1','col2'])
df.show()
+---------+----+
| col1|col2|
+---------+----+
|[1, 2, 3]|val1|
|[4, 5, 6]|val2|
+---------+----+
from pyspark.sql.functions import pandas_udf,PandasUDFType
from pyspark.sql.types import *
import pandas as pd
#pandas_udf(ArrayType(LongType()))
def func(v):
res=[]
for row in v:
res.append(row[1:])
return pd.Series(res)
df.withColumn('col3',func(df.col1)).show()
+---------+----+------+
|col1 |col2|col3 |
+---------+----+------+
|[1, 2, 3]|val1|[2, 3]|
|[4, 5, 6]|val2|[5, 6]|
+---------+----+------+
The other way is to use apply similar to what you have done:
#pandas_udf(ArrayType(LongType()))
def func(v):
return v.apply(lambda x:x[1:])