I am stuck in a situation where I need to divide the output of one SQL DataFrame by another. Any suggestions on how this can be done?
scala> val TotalDie = sqlc.sql("select COUNT(DISTINCT XY) from Data")
TotalDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> TotalDie.show()
+---+
|_c0|
+---+
|887|
+---+
scala> val PassDie = sqlc.sql("select COUNT(DISTINCT XY) from Data where Sbin = '1'")
PassDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> PassDie.show()
+---+
|_c0|
+---+
|413|
+---+
I need to calculate the yield, which is (PassDie/TotalDie)*100.
I am new to spark-shell.
In case of multiple values (i.e. multiple rows): do you have a column (or key or id) to join the two dataframes (or tables) on?
In case of always a single value (i.e. a single row): something along the lines of 100 * PassDie.collect() / TotalDie.collect().
UPDATE
The exact syntax in case of 1 value:
100.0 * PassDie.collect()(0).getLong(0) / TotalDie.collect()(0).getLong(0)
res25: Double = 46.56144306651635
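If you only need the single ratio, a SQL statement with conditional aggregation avoids the two collects entirely. A minimal sketch, written against the same Data table with columns XY and Sbin (shown in PySpark syntax for brevity; the SQL itself is the important part and works the same from spark-shell):
yield_df = sqlc.sql("""
    SELECT 100.0 * COUNT(DISTINCT CASE WHEN Sbin = '1' THEN XY END)
                 / COUNT(DISTINCT XY) AS yield
    FROM Data
""")
yield_df.show()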
It is possible to do this with just SparkSQL, too.
Here's what I'd do to solve it that way:
>>> rdd1 = sc.parallelize([("a",1.12),("a",2.22)])
>>> rdd2 = sc.parallelize([("b",9.12),("b",12.22)])
>>> r1df = rdd1.toDF()
>>> r2df = rdd2.toDF()
>>> r1df.registerTempTable('r1')
>>> r2df.registerTempTable('r2')
>>> r3df = sqlContext.sql("SELECT * FROM r1 UNION SELECT * FROM r2")
>>> r3df.show()
>>> r3df.registerTempTable('r3')
>>> sqlContext.sql("SELECT * FROM r3")  # do your aggregation / math here
Now from here, in theory, you can do basic grouping and arithmetic just using SQL queries, since you've got this grand table of data. I realize that in my example code I didn't declare a proper schema with column names, which keeps this example from fully working as-is, but you have a schema, so you get the idea.
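To make the idea concrete, here is a minimal sketch with named columns; the table and column names are made up for illustration, and depending on your Spark version you may need to enable cross joins for the final query:
# Hypothetical single-row counts, just to show the join + arithmetic step.
pass_df = sqlContext.createDataFrame([(413,)], ["cnt"])
total_df = sqlContext.createDataFrame([(887,)], ["cnt"])
pass_df.registerTempTable("pass_cnt")
total_df.registerTempTable("total_cnt")

# With one row on each side the cross join is harmless, and the division
# can be written directly in SQL.
sqlContext.sql(
    "SELECT 100.0 * p.cnt / t.cnt AS yield FROM pass_cnt p CROSS JOIN total_cnt t"
).show()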
I am trying to build two user_id dataframes in PySpark that share no user_id with each other.
So I wrote the code below:
import pyspark.sql.functions as f
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_a = df_original.sort(f.rand()).limit(10000)
df_a.count()
# df_a: 10000
df_b = df_original.join(df_a,on="user_id",how="left_anti").sort(f.rand()).limit(10000)
df_b.count()
# df_b: 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# df_a - df_b = 9998
# What?????
As a result, df_a and df_b share 2 user_ids... sometimes 1, or 0.
The code itself looks fine, but I suspect this happens because of Spark's lazy evaluation: the random sort is re-evaluated, so df_a is not necessarily the same set of rows each time it is referenced.
I need a way to build two user_id dataframes that share no user_id at all.
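One common workaround (a sketch, reusing the df_original from above) is to materialize df_a before deriving df_b, so the random sample is not recomputed differently inside the anti-join; writing df_a out (or checkpointing) is the only hard guarantee, but cache() plus an action is usually enough:
import pyspark.sql.functions as f

df_a = df_original.sort(f.rand()).limit(10000).cache()
df_a.count()  # force materialization so the same sample is reused below

df_b = (df_original
        .join(df_a, on="user_id", how="left_anti")
        .sort(f.rand())
        .limit(10000))

df_a.join(df_b, on="user_id", how="left_anti").count()  # expected: 10000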
Since you want to generate two different sets of users from a given pool with no overlap, you can use this simple trick:
from pyspark.sql.functions import monotonically_increasing_id
import pyspark.sql.functions as f
#"Creation of Original DF"
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_original = df_original.withColumn("UNIQUE_ID", monotonically_increasing_id())
number_groups_needed = 2  ## you can adjust the number of groups you need for your use case
dfa = df_original.filter(df_original.UNIQUE_ID % number_groups_needed == 0)
dfb = df_original.filter(df_original.UNIQUE_ID % number_groups_needed == 1)
##dfa and dfb will not have any overlap for user_id
PS: if your user_id is itself an integer, you don't need to create a new UNIQUE_ID column; you can use it directly.
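A quick sanity check that the two groups really are disjoint (a small sketch using the dfa/dfb from above):
# An inner join on user_id returns no rows when the split is disjoint.
assert dfa.join(dfb, on="user_id", how="inner").count() == 0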
I chose the randomSplit function that PySpark provides.
df_a,df_b = df_original.randomSplit([0.6,0.4])
df_a = df_a.limit(10000)
df_a.count()
# 10000
df_b = df_b.limit(10000)
df_b.count()
# 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# 10000
There is no longer any overlap between df_a and df_b!
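If the split also needs to be reproducible across runs, randomSplit accepts a seed; a small sketch (determinism additionally assumes the input partitioning stays the same):
df_a, df_b = df_original.randomSplit([0.6, 0.4], seed=42)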
I have a spark dataframe (12m x 132) and I am trying to calculate the number of unique values by column, and remove columns that have only 1 unique value.
So far, I have used the pandas nunique function as such:
import pandas as pd
df = sql_dw.read_table(<table>)
df_p = df.toPandas()
nun = df_p.nunique(axis=0)
nundf = pd.DataFrame({'atr':nun.index, 'countU':nun.values})
dropped = []
for i, j in nundf.values:
    if j == 1:
        dropped.append(i)
        df = df.drop(i)
print(dropped)
Is there a way to do this that is more native to spark - i.e. not using pandas?
Please have a look at the commented example below. The solution requires more Python than PySpark-specific knowledge.
import pyspark.sql.functions as F
#creating a dataframe
columns = ['asin' ,'ctx' ,'fo' ]
l = [('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO2')
,('ASIN1','CTX2','FO1')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO3')
,('ASIN1','CTX3','FO1')
,('ASIN1','CTX3','FO3')]
df=spark.createDataFrame(l, columns)
df.show()
#we create a list of functions we want to apply
#in this case countDistinct for each column
expr = [F.countDistinct(c).alias(c) for c in df.columns]
#we apply those functions
countdf = df.select(*expr)
#this df has just one row
countdf.show()
#we extract the columns which have just one value
cols2drop = [k for k,v in countdf.collect()[0].asDict().items() if v == 1]
df.drop(*cols2drop).show()
Output:
+-----+----+---+
| asin| ctx| fo|
+-----+----+---+
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO2|
|ASIN1|CTX2|FO1|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO3|
|ASIN1|CTX3|FO1|
|ASIN1|CTX3|FO3|
+-----+----+---+
+----+---+---+
|asin|ctx| fo|
+----+---+---+
| 1| 3| 3|
+----+---+---+
+----+---+
| ctx| fo|
+----+---+
|CTX1|FO1|
|CTX1|FO1|
|CTX1|FO2|
|CTX2|FO1|
|CTX2|FO2|
|CTX2|FO2|
|CTX2|FO3|
|CTX3|FO1|
|CTX3|FO3|
+----+---+
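On a 12m-row, 132-column frame, an exact countDistinct over every column can be expensive. If a small error is acceptable, approx_count_distinct is a cheaper screen; a sketch along the same lines as above (constant columns still come out as exactly 1, but you may want to re-check borderline columns with an exact count):
import pyspark.sql.functions as F

# rsd is the maximum allowed relative standard deviation of the estimate
expr = [F.approx_count_distinct(c, rsd=0.01).alias(c) for c in df.columns]
countdf = df.select(*expr)
cols2drop = [k for k, v in countdf.collect()[0].asDict().items() if v == 1]
df = df.drop(*cols2drop)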
My apologies, as I don't have the solution in PySpark but in Scala Spark; it may be transferable, or usable in case you can't find a PySpark way.
You can create a blank list and then using a foreach, check which columns have a distinct count of 1, then append them to the blank list.
From there you can use the list as a filter and drop those columns from your dataframe.
var list_of_columns: List[String] = List()
df_p.columns.foreach { c =>
  if (df_p.select(c).distinct.count == 1)
    list_of_columns ++= List(c)
}
val df_p_new = df_p.drop(list_of_columns: _*)
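Since the above is Scala, here is a rough PySpark translation of the same idea (a sketch, assuming df_p is a Spark DataFrame):
# Collect the names of columns with exactly one distinct value, then drop them all at once.
cols_to_drop = [c for c in df_p.columns
                if df_p.select(c).distinct().count() == 1]
df_p_new = df_p.drop(*cols_to_drop)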
You can group your df by that column and count the distinct values of that column:
df = df.groupBy("column_name").agg(countDistinct("column_name").alias("distinct_count"))
And then filter your df to the rows that have more than 1 distinct_count:
df = df.filter(df.distinct_count > 1)
Is there an easy way to run the same SQL over multiple columns in Spark SQL?
For example, let's say I have a query that should be applied to most columns:
select
min(c1) as min,
max(c1) as max,
max(c1) - min(c1) range
from tb1
If there are multiple columns, is there a way to execute the query for all of them and get the result in one go?
Similar to how df.describe does.
Use the metadata (the columns, in this case) included in your dataframe (which you can get via spark.table("<table_name>") if you don't already have it in scope) to get the column names, then build the functions you want and pass them to df.select (or df.selectExpr).
Build some test data:
scala> val r = scala.util.Random

scala> var seq = Seq[(Int, Int, Float)]()
seq: Seq[(Int, Int, Float)] = List()
scala> (1 to 1000).foreach(n => { seq = seq :+ (n,r.nextInt,r.nextFloat) })
scala> val df = seq.toDF("id", "some_int", "some_float")
Denote some functions we want to run on all the columns:
scala> val functions_to_apply = Seq("min", "max")
functions_to_apply: Seq[String] = List(min, max)
Setup the final Seq of SQL Columns:
scala> var select_columns = Seq[org.apache.spark.sql.Column]()
select_columns: Seq[org.apache.spark.sql.Column] = List()
Iterate over the columns and functions to apply to populate the select_columns Seq:
scala> val cols = df.columns
scala> cols.foreach(col => { functions_to_apply.foreach(f => {select_columns = select_columns :+ expr(s"$f($col)")})})
Run the actual query:
scala> df.select(select_columns:_*).show
+-------+-------+-------------+-------------+---------------+---------------+
|min(id)|max(id)|min(some_int)|max(some_int)|min(some_float)|max(some_float)|
+-------+-------+-------------+-------------+---------------+---------------+
| 1| 1000| -2143898568| 2147289642| 1.8781424E-4| 0.99964607|
+-------+-------+-------------+-------------+---------------+---------------+
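For reference, roughly the same pattern in PySpark (a sketch, assuming a DataFrame df is already in scope):
from pyspark.sql.functions import expr

functions_to_apply = ["min", "max"]

# One aggregate expression per (function, column) pair, all run in a single select.
select_columns = [expr(f"{fn}({c})") for c in df.columns for fn in functions_to_apply]
df.select(*select_columns).show()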
Is there a way to reference Spark DataFrame columns by position using an integer?
Analogous Pandas DataFrame operation:
df.iloc[:, 0]  # Give me all the rows at column position 0
The closest equivalent of Python's df.iloc is collect:
PySpark examples:
X = df.collect()[0]['age']
or
X = df.collect()[0][1] #row 0 col 1
Not really, but you can try something like this:
Python:
df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1]) # I assume [:1] is what you really want
## DataFrame[_1: bigint]
or
df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
Scala
val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)
Note:
Spark SQL doesn't support row indexing, and it is unlikely ever to, so it is not possible to index across the row dimension.
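If you really do need something row-position-like, one workaround (not part of the answer above, just a sketch) is to attach an explicit index column and filter on it:
from pyspark.sql import Row

# zipWithIndex assigns a stable 0-based position to every row;
# note that this goes through the RDD API, so it is not free on large data.
indexed = (df.rdd.zipWithIndex()
             .map(lambda pair: Row(idx=pair[1], **pair[0].asDict()))
             .toDF())

indexed.filter(indexed.idx == 0).show()  # the "first" row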
You can do it like this in spark-shell:
scala> df.columns
Array[String] = Array(age, name)
scala> df.select(df.columns(0)).show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
As of Spark 3.1.1 on Databricks, it's a matter of selecting the column of interest and applying limit:
%python
from pyspark.sql.functions import col

retDF = (inputDF
         .select(col(inputDF.columns[0]))
         .limit(100))
I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating:
(df.groupBy("group")
   .agg({"money": "sum"})
   .show(100)
)
This will give me:
group SUM(money#2L)
A 137461285853
B 172185566943
C 271179590646
The aggregation works just fine but I dislike the new column name SUM(money#2L). Is there a way to rename this column into something human readable from the .agg method? Maybe something more similar to what one would do in dplyr:
df %>% group_by(group) %>% summarise(sum_money = sum(money))
Although I still prefer dplyr syntax, this code snippet will do:
import pyspark.sql.functions as sf
(df.groupBy("group")
   .agg(sf.sum('money').alias('money'))
   .show(100))
It gets verbose.
withColumnRenamed should do the trick; see the pyspark.sql API documentation.
df.groupBy("group")\
    .agg({"money":"sum"})\
    .withColumnRenamed("SUM(money)", "money")\
    .show(100)
I made a little helper function for this that might help some people out.
import re
from functools import partial

def rename_cols(agg_df, ignore_first_n=1):
    """Changes the default Spark aggregate names `avg(colname)`
    to something a bit more useful. Pass an aggregated dataframe
    and the number of leading (grouping) columns to ignore.
    """
    delimiters = "(", ")"
    split_pattern = '|'.join(map(re.escape, delimiters))
    splitter = partial(re.split, split_pattern)
    split_agg = lambda x: '_'.join(splitter(x))[0:-ignore_first_n]
    renamed = map(split_agg, agg_df.columns[ignore_first_n:])
    renamed = zip(agg_df.columns[ignore_first_n:], renamed)
    for old, new in renamed:
        agg_df = agg_df.withColumnRenamed(old, new)
    return agg_df
An example:
gb = (df.selectExpr("id", "rank", "rate", "price", "clicks")
        .groupby("id")
        .agg({"rank": "mean",
              "*": "count",
              "rate": "mean",
              "price": "mean",
              "clicks": "mean",
              })
      )
>>> gb.columns
['id',
'avg(rate)',
'count(1)',
'avg(price)',
'avg(rank)',
'avg(clicks)']
>>> rename_cols(gb).columns
['id',
'avg_rate',
'count_1',
'avg_price',
'avg_rank',
'avg_clicks']
It at least saves people a bit of typing.
It's as simple as:
val maxVideoLenPerItemDf = requiredItemsFiltered.groupBy("itemId").agg(max("playBackDuration").as("customVideoLength"))
maxVideoLenPerItemDf.show()
Use .as in agg to name the new column it creates.
.alias and .withColumnRenamed both work if you're willing to hard-code your column names. If you need a programmatic solution, e.g. friendlier names for an aggregation of all remaining columns, this provides a good starting point:
import pyspark.sql.functions as F

grouping_column = 'group'
cols = [F.sum(F.col(x)).alias(x) for x in df.columns if x != grouping_column]
(
    df
    .groupBy(grouping_column)
    .agg(
        *cols
    )
)
df = df.groupby('Device_ID').agg(aggregate_methods)
for column in df.columns:
    start_index = column.find('(')
    end_index = column.find(')')
    if start_index != -1 and end_index != -1:
        df = df.withColumnRenamed(column, column[start_index+1:end_index])
The above code strips out anything outside the "()"; for example, "sum(foo)" will be renamed to "foo".
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, "siva", 100), (2, "siva2", 200),(3, "siva3", 300),(4, "siva4", 400),(5, "siva5", 500)]
schema = ['id', 'name', 'sallary']
df = spark.createDataFrame(data, schema=schema)
df.show()
+---+-----+-------+
| id| name|sallary|
+---+-----+-------+
| 1| siva| 100|
| 2|siva2| 200|
| 3|siva3| 300|
| 4|siva4| 400|
| 5|siva5| 500|
+---+-----+-------+
df.agg({"sallary": "max"}).withColumnRenamed('max(sallary)', 'max').show()
+---+
|max|
+---+
|500|
+---+
While the previously given answers are good, I think they lack a neat way to deal with dictionary usage in .agg().
If you want to use a dict, which might also be generated dynamically because you have hundreds of columns, you can use the following without writing dozens of lines of code:
# Your dictionary-version of using the .agg()-function
# Note: The provided logic could actually also be applied to a non-dictionary approach
df = df.groupBy("group")\
       .agg({
           "money": "sum",
           "...": "..."
       })
# Now do the renaming
newColumnNames = ["group", "money", "..."] # Provide the names for ALL columns of the new df
df = df.toDF(*newColumnNames) # Do the renaming
Of course, the newColumnNames list can also be generated dynamically; e.g., if you only append columns from the aggregation to your df, you can pre-store newColumnNames = df.columns and then just append the additional names (see the sketch below).
In any case, be aware that newColumnNames must contain all column names of the dataframe, not only those to be renamed, because .toDF() creates a new dataframe (Spark's underlying RDDs are immutable)!
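A sketch of that dynamic variant (the aggregation dict here is only an example; the new names are derived from the aggregated frame's own columns so the order always matches):
agg_dict = {"money": "sum", "moreMoney": "sum"}
grouped = df.groupBy("group").agg(agg_dict)

# Keep the grouping column's name and strip the function wrapper from the rest.
new_names = [grouped.columns[0]] + [c[c.find("(") + 1:c.rfind(")")] for c in grouped.columns[1:]]
grouped = grouped.toDF(*new_names)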
Another quick little snippet to add to the mix:
aggDF = df.groupBy('group') \
    .agg({'money': 'sum',
          'moreMoney': 'sum',
          'evenMoreMoney': 'sum'})

aggDF = aggDF.select(*(col(i).alias(i.replace("(", '_').replace(')', '')) for i in aggDF.columns))
Just change the alias function to whatever you'd like to name the columns. The above generates sum_money and sum_moreMoney, since I like seeing the operator in the column name.