I've read a parquet file with sparklyr: df_pqt_tbl <- spark_read_parquet(...).
My environment is Databricks.
I'd like to use the SparkR::sampleBy function to do stratified sampling, but I'm getting an error:
class(df_pqt_tbl)
df_train <- SparkR::sampleBy(df_pqt_tbl, col = 'labels',
                             fractions = list('0' = 0.7, '1' = 0.7),
                             seed = 12345)
Error in (function (classes, fdef, mtable) : unable to find an
inherited method for function ‘sampleBy’ for signature ‘"tbl_spark",
"character", "list", "numeric"’
Is there a way to transform a tbl_spark into a SparkR DataFrame so that I can use the sampleBy function on it?
Yes, there is! SparkSQL tables are the common interface between SparkR and sparklyr. The error is telling you that you have a type mismatch, but if you work with SparkSQL tables you can find common ground.
In your case, use the sql() function from SparkR and select all from your sparklyr table.
## Read parquet using sparklyr
df_pqt_tbl <- spark_read_parquet(sc, name = "sparklyr_tbl", path = "path")
## Use SparkSQL to access in SparkR
pqt_tblDF <- SparkR::sql("select * from sparklyr_tbl")
## Now use functions from SparkR
df_train <- SparkR::sampleBy(pqt_tblDF, col = 'labels',
                             fractions = list('0' = 0.7, '1' = 0.7),
                             seed = 12345)
To go the other way, you would have to create a temporary view from SparkR and then read the view into a sparklyr object.
SparkR::createOrReplaceTempView(sparkRDF, "sparkSQL_table")
then use
sparklyr_tbl <- sparklyr::spark_read_table(sc, "sparkSQL_table")
Then you can apply functions from sparklyr to your sparklyr_tbl object.
I am trying to port some code from Pandas to Koalas to take advantage of Spark's distributed processing. I am taking a dataframe and grouping it on A and B and then applying a series of functions to populate the columns of the new dataframe. Here is the code that I was using in Pandas:
new = old.groupby(['A', 'B']) \
    .apply(lambda x: pd.Series({
        'v1': x['v1'].sum(),
        'v2': x['v2'].sum(),
        'v3': (x['v1'].sum() / x['v2'].sum()),
        'v4': x['v4'].min()
    })
)
I believe that it is working well and the resulting dataframe appears to be correct value-wise.
I just have a few questions:
Does this warning mean that my method will be deprecated in the future?
/databricks/spark/python/pyspark/sql/pandas/group_ops.py:76: UserWarning: It is preferred to use 'applyInPandas' over this API. This API will be deprecated in the future releases. See SPARK-28264 for more details.
How can I rename the group-by columns to 'A' and 'B' instead of "__groupkey_0__ __groupkey_1__"?
As you noticed, I had to call pd.Series -- is there a way to do this in Koalas? Calling ks.Series gives me the following error that I am unsure how to implement:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Thanks for any help that you can provide!
I'm not sure about the warning. I am using koalas==1.2.0 and pandas==1.0.5 and I don't see it, so I wouldn't worry about it.
The groupby columns are already called A and B when I run the code. This again may have been a bug which has since been patched.
For this you have 3 options:
Keep utilising pd.Series. As long as your original DataFrame is a koalas DataFrame, your output will also be a koalas DataFrame (with the pd.Series automatically converted to a ks.Series).
Keep the function and the data exactly the same and just convert the final dataframe to koalas using the from_pandas function (see the sketch after the code below).
Do the whole thing in koalas. This is slightly trickier because you are computing an aggregate column based on two GroupBy columns, and koalas doesn't accept a lambda function as a valid aggregation. One way to get around this is to compute the other aggregations together and add the multi-column aggregation afterwards:
import databricks.koalas as ks
ks.set_option('compute.ops_on_diff_frames', True)
# Dummy data
old = ks.DataFrame({"A":[1,2,3,1,2,3], "B":[1,2,3,3,2,3], "v1":[10,20,30,40,50,60], "v2":[4,5,6,7,8,9], "v4":[0,0,1,1,2,2]})
new = old.groupby(['A', 'B']).agg({'v1':'sum', 'v2':'sum', 'v4': 'min'})
new['v3'] = old.groupby(['A', 'B']).apply(lambda x: x['v1'].sum() / x['v2'].sum())
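For completeness, option 2 might look something like the following (a sketch that reuses the pandas code from the question and assumes old is still a plain pandas DataFrame at that point):

import pandas as pd
import databricks.koalas as ks

# Option 2 sketch: run the original pandas groupby/apply unchanged,
# then convert the resulting pandas DataFrame to a koalas DataFrame.
new_pd = old.groupby(['A', 'B']).apply(lambda x: pd.Series({
    'v1': x['v1'].sum(),
    'v2': x['v2'].sum(),
    'v3': x['v1'].sum() / x['v2'].sum(),
    'v4': x['v4'].min()
}))
new = ks.from_pandas(new_pd)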
I have previously written a Dataset[T] to a csv file.
In this case T is a case class that contains field x: Option[BigDecimal]
When I attempt to load the file back into a Dataset[T] I see the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `x` from double to decimal(38,18) as it may truncate.
I guess the reason is that the inferred schema contains a double rather than a decimal column. Is there a way around this issue? I wish to avoid casting based on column name because the read code is part of a generic function. My read code is below:
val a = spark
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(file)
  .as[T]
My case classes reflect tables read from JDBC with Option[T] used to represent a nullable field. Option[BigDecimal] is used to receive a Decimal field from JDBC.
I have written some extension ("pimped") code to read/write csv files when working on my local machine so I can easily inspect the contents.
So my next attempt was this:
var df = spark
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(implicitly[Encoder[T]].schema)
  .load(file)

val schema = df.schema

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

schema.foreach { field =>
  field.dataType match {
    case t: DoubleType =>
      df = df.withColumn(field.name, col(field.name).cast(DecimalType(38, 18)))
    case _ => // do nothing
  }
}

df.as[T]
Unfortunately my case class now contains all Nones rather than the values expected. If I just load the csv as a DF with inferred types all of the column values are correctly populated.
It looks like I actually have two issues:
1. Conversion from Double -> BigDecimal.
2. Nullable fields are not being wrapped in Options.
Any help/advice would be gratefully received. Happy to adjust my approach if easily writing/reading Options/BigDecimals from csv files is problematic.
First I would fill the null values with dfB.na.fill(0.0), and then I would try the following solution:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

case class MyCaseClass(id: String, cost: Option[BigDecimal])

// Option.empty produces a real null in the "cost" column, which na.fill(0.0) then replaces
val dfB = spark.createDataset(Seq(
  ("a", Option(12.45)),
  ("b", Option.empty[Double]),
  ("c", Option(123.33)),
  ("d", Option(1.3444))
)).toDF("id", "cost")

dfB
  .na.fill(0.0)
  .withColumn("cost", col("cost").cast(DecimalType(38, 18)))
  .as[MyCaseClass]
  .show()
First cast the cost column to DecimalType(38,18) explicitly, then retrieve the Dataset[MyCaseClass]. I believe the issue here is that Spark can't convert a double to a BigDecimal without the scale and precision being specified explicitly, so you first need to cast it to a specific decimal type and then use it as a BigDecimal.
UPDATE:
I slightly modified the previous code to make it possible to handle members of type Option[BigDecimal] as well.
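For the generic read function in the question, a rough (untested) sketch along these lines might help: infer the schema as before, then cast every column that T's encoder declares as a decimal, and only then call .as[T]:

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Sketch only: read with inferred types (decimal values arrive as doubles),
// then cast the columns that T's encoder expects to be decimals.
def readCsvAs[T: Encoder](spark: SparkSession, file: String): Dataset[T] = {
  val raw = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(file)

  val target = implicitly[Encoder[T]].schema
  val casted = target.fields.foldLeft(raw) { (df, field) =>
    field.dataType match {
      case d: DecimalType => df.withColumn(field.name, col(field.name).cast(d))
      case _              => df
    }
  }
  casted.as[T]
}

Null values in the CSV should then come through as None for any Option fields of T.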
Good luck
Calling collect() on an RDD returns the entire dataset to the driver, which can cause out-of-memory errors, so it should be avoided.
Will collect() behave the same way if called on a dataframe?
What about the select() method?
Does it also work the same way as collect() if called on a dataframe?
Actions vs Transformations
Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or
other operation that returns a sufficiently small subset of the data.
spark-sql doc
select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.
Parameters: cols – list of column names (string) or expressions
(Column). If one of the column names is ‘*’, that column is expanded
to include all columns in the current DataFrame.
df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
Executing the select(column-name1, column-name2, etc.) method on a dataframe returns a new dataframe that holds only the columns selected in the select() call.
For example, assume df has several columns, including "name" and "value", among others.
df2 = df.select("name","value")
df2 will hold only those two columns ("name" and "value") out of all the columns of df.
df2, being the result of select, stays on the executors and is not brought to the driver (as would happen with collect()).
sql-programming-guide
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Select only the "name" column
df.select("name").show()
# +-------+
# | name|
# +-------+
# |Michael|
# | Andy|
# | Justin|
# +-------+
You can run collect() on a dataframe (spark docs):
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]
spark docs
To print all elements on the driver, one can use the collect() method
to first bring the RDD to the driver node thus:
rdd.collect().foreach(println). This can cause the driver to run out
of memory, though, because collect() fetches the entire RDD to a
single machine; if you only need to print a few elements of the RDD, a
safer approach is to use the take(): rdd.take(100).foreach(println).
Calling select results in lazy evaluation. For example:
val df1 = df.select("col1")
val df2 = df1.filter("col1 == 3")
Both statements above build a lazy plan that is executed only when you call an action on that dataframe, such as show, collect, etc.
val df3 = df2.collect()
Use .explain at the end of your transformations to follow the plan.
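Continuing the snippet above, a small sketch:

df2.explain()   // prints the planned operations; nothing has run yet
df2.show()      // an action: only at this point does the plan actually execute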
Here is more detailed info: Transformations and Actions.
Select is used for projecting some or all fields of a dataframe. It won't give you a value as output but a new dataframe. It's a transformation.
To answer the questions directly:
Will collect() behave the same way if called on a dataframe?
Yes, spark.DataFrame.collect is functionally the same as spark.RDD.collect. They serve the same purpose on these different objects.
What about the select() method?
There is no such thing as spark.RDD.select, so it cannot be the same as spark.DataFrame.select.
Does it also work the same way as collect() if called on a dataframe?
The only thing that is similar between select and collect is that they are both functions on a DataFrame. They have absolutely zero overlap in functionality.
Here's my own description: collect is the opposite of sc.parallelize. select is the same as the SELECT in any SQL statement.
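To make that concrete, a small sketch (assuming a running SparkSession spark and SparkContext sc):

# parallelize: driver list -> distributed RDD; collect: distributed data -> driver list
rdd = sc.parallelize([1, 2, 3, 4])
local_list = rdd.collect()        # [1, 2, 3, 4], back on the driver

# select: projection, like SQL SELECT; the result is still a distributed DataFrame
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
names_df = df.select("name")      # no data is moved to the driver
names_rows = names_df.collect()   # only now do rows come back to the driver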
If you are still having trouble understanding what collect actually does (for either RDD or DataFrame), then you need to look up some articles about what spark is doing behind the scenes. e.g.:
https://dzone.com/articles/how-spark-internally-executes-a-program
https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
Select is a transformation, not an action, so it is lazily evaluated (it won't actually do the calculations, just record the operations). Collect is an action.
Try:
df.limit(20).collect()
Short answer:
collect is mainly to serialize (a loss of parallelism, while preserving all the other data characteristics of the dataframe). For example, with a PrintWriter pw you can't do df.foreach( r => pw.write(r) ) directly; you must collect before the foreach, df.collect.foreach(...) (see the sketch after these two points). PS: the "loss of parallelism" is not a total loss, because after serialization the data can be distributed to the executors again.
select is mainly to select columns, similar to projection in relational algebra (only similar in the framework's context, because Spark's select does not deduplicate data). So it is also a complement of filter in the framework's context.
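A sketch of the PrintWriter point (assuming a small dataframe df):

import java.io.PrintWriter

// A driver-side PrintWriter cannot be serialized and shipped to executors,
// so df.foreach(r => pw.write(...)) would fail; collect() the rows first.
val pw = new PrintWriter("rows.txt")
df.collect().foreach(row => pw.println(row.mkString(",")))
pw.close()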
Commenting on the other answers: I like Jeff's classification of Spark operations into transformations (like select) and actions (like collect). It is also good to remember that transformations (including select) are lazily evaluated.
I am trying to use the DSL over pure SQL in Spark SQL jobs, but I cannot get my UDF to work.
sqlContext.udf.register("subdate",(dateTime: Long)=>dateTime.toString.dropRight(6))
This doesn't work:
rdd1.toDF.join(rdd2.toDF).where("subdate(rdd1(date_time)) === subdate(rdd2(dateTime))")
I would also like to add another join condition, like in this working pure SQL:
val results = sqlContext.sql("select * from rdd1 join rdd2 on rdd1.id=rdd2.id and subdate(rdd1.date_time)=subdate(rdd2.dateTime)")
Thanks for your help
The SQL expression you pass to the where method is incorrect, for at least a few reasons:
=== is a Column method, not valid SQL equality. You should use a single equals sign (=).
Bracket notation (table(column)) is not a valid way to reference columns in SQL; in this context it will be parsed as a function call. SQL uses dot notation (table.column).
Even if it were, neither rdd1 nor rdd2 is a valid table alias here.
Since it looks like the column names are unambiguous, you can simply use the following:
df1.join(df2).where("subdate(date_time) = subdate(dateTime)")
If that weren't the case, dot syntax wouldn't work without providing aliases first. See for example Usage of spark DataFrame "as" method.
Moreover, registering UDFs mostly makes sense when you use raw SQL all the way. If you want to use the DataFrame API, it is better to use the UDF directly:
import org.apache.spark.sql.functions.udf
val subdate = udf((dateTime: Long) => dateTime.toString.dropRight(6))
val df1 = rdd1.toDF
val df2 = rdd2.toDF
df1.join(df2, subdate($"date_time") === subdate($"dateTime"))
or if column names were ambiguous:
df1.join(df2, subdate(df1("date_time")) === subdate(df2("dateTime")))
Finally, for simple functions like this, it is better to compose built-in expressions than to create UDFs.
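For instance, a hedged sketch: if date_time / dateTime are plain Long values, dropping the last six digits is just integer division by 1,000,000, which can be written with built-ins and combined with the id condition from the question:

// Sketch only: assumes date_time / dateTime are Long values, so dropping the
// last six digits is equivalent to integer division by 1,000,000.
val cond =
  df1("id") === df2("id") &&
    (df1("date_time") / 1000000L).cast("long") === (df2("dateTime") / 1000000L).cast("long")

df1.join(df2, cond)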
In dplyr running on R data frames, it is easy to run
df <- df %>%
  mutate(income_topcoded = ifelse(income > topcode, income, topcode))
I'm now working with a large SQL database, using dplyr to send commands to the SQL server. When I run the same command, I get back
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR:
function ifelse (boolean, numeric, numeric) does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
How would you suggest implementing ifelse() statements? I'd be fine with something in PivotalR (which seems to support ifelse(), but I don't know how to integrate it with dplyr and couldn't find any examples on SO), some piece of SQL syntax which I can use in-line here, or some feature of dplyr which I was unaware of.
(I have the same problem that I'd like to use grepl() as an in-db operation, but I don't know how to do so.)
Based on #hadley's reply on this thread, you can use an SQL-style if() statement inside mutate() on dplyr's in-db dataframes:
df <- df %>%
mutate( income_topcoded = if (income > topcode) income else topcode)
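If you want to double-check what gets sent to the database, newer versions of dplyr/dbplyr can print the generated SQL (a sketch; show_query() may not be available in very old dplyr versions):

## Sketch: inspect the SQL generated for the in-db mutate
df %>%
  mutate(income_topcoded = if (income > topcode) income else topcode) %>%
  show_query()
## Roughly: CASE WHEN income > topcode THEN income ELSE topcode END AS income_topcoded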
As far as using grepl() goes...well, you can't. But you can use the SQL like operator:
df <- df %>%
filter( topcode %like% "ABC%" )
I had a similar problem. The best I could do was to use an in-db operation as you suggest:
topcode <- 10000
queryString <- sprintf("UPDATE db.table SET income_topcoded = %s WHERE income_topcoded > %s",topcode,topcode)
dbGetQuery(con, queryString)
In my case, I was using MySQL with dplyr, but it wasn't able to translate my ifelse() into valid SQL.