How to write a Spark DataFrame into multiple JDBC tables based on a column

I'm working with a batch Spark (v2.4) pipeline written in Scala. I would like to save a dataframe into a PostgreSQL database. However, instead of saving all rows into a single table, I want to save them to multiple tables based on the value of a column.
Suppose the dataframe has a column named country; I want to write each record into the table for its respective country, e.g.
df.show()
+-------+----+
|country|val1|
+-------+----+
|     CN| 1.0|
|     US| 2.5|
|     CN| 3.0|
+-------+----+
Then I would like to save the records ((CN,1.0) and (CN,3.0)) into table app_CN and the record (US,2.5) into table app_US. Assume that the tables already exist.
Can I use the DataFrame API to achieve this? Or should I repartition into an RDD, provide a JDBC-like connection object to the executors, and save the records manually?
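One idea that stays within the DataFrame API would be to collect the distinct country values to the driver and then issue one filtered JDBC write per country. A minimal sketch, assuming the set of countries is small and the target tables follow the app_<country> naming above; the connection details and the writePerCountry helper are placeholders:
import java.util.Properties
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Placeholder connection details -- substitute your own.
val jdbcUrl = "jdbc:postgresql://host:5432/mydb"
val props = new Properties()
props.setProperty("user", "<user>")
props.setProperty("password", "<password>")

def writePerCountry(df: DataFrame): Unit = {
  // The distinct countries are assumed to be few enough to collect to the driver.
  val countries = df.select("country").distinct.collect().map(_.getString(0))
  countries.foreach { c =>
    df.filter(col("country") === c)
      .write
      .mode("append")                  // the tables already exist
      .jdbc(jdbcUrl, s"app_$c", props) // e.g. app_CN, app_US
  }
}

writePerCountry(df.cache()) // caching avoids rescanning the source once per country
If a single pass over the data is required, the alternative is indeed to drop down to foreachPartition and manage the JDBC connections on the executors manually.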

Related

How to find the uncommon rows between two PySpark DataFrames?

I have to compare two dataframes and find the column differences based on one or more key fields, using PySpark, in the most performance-efficient way possible, since I have to deal with huge dataframes.
I have already built a solution for comparing two dataframes using a hash match without key-field matching, like data_compare.df_subtract(self.df_db1_hash, self.df_db2_hash), but the scenario is different if I want to use a key-field match.
Note: I have provided a sample expected dataframe. The actual requirement is that any difference from DataFrame 2, in any column, should be retrieved in the output/expected dataframe.
DataFrame 1:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
DataFrame 2:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sandiego| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Expected dataframe after comparing dataframe 1 and 2
+------+---------+--------+----------+
|emp_id| emp_city|emp_name| emp_phone|
+------+---------+--------+----------+
| 4| sandiego| romino|9848022331|
+------+---------+--------+----------+
The subtract function is what you are looking for; it checks all column values for each row and gives you a dataframe containing the rows that differ from the other dataframe.
df2.subtract(df1).select("emp_id","emp_city","emp_name","emp_phone")
As the API documentation says:
Return a new :class:DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.

Equivalent of `takeWhile` for Spark dataframe

I have a dataframe looking like this:
scala> val df = Seq((1,.5), (2,.3), (3,.9), (4,.0), (5,.6), (6,.0)).toDF("id", "x")
scala> df.show()
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
| 4|0.0|
| 5|0.6|
| 6|0.0|
+---+---+
I would like to take the first rows of the data as long as the x column is nonzero (note that the dataframe is sorted by id, so talking about the first rows is meaningful). For this given dataframe, it would give something like this:
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
+---+---+
I only kept the first 3 rows, as the 4th row was zero.
For a simple Seq, I can do something like Seq(0.5, 0.3, 0.9, 0.0, 0.6, 0.0).takeWhile(_ != 0.0). So for my dataframe I thought of something like this:
df.takeWhile('x =!= 0.0)
But unfortunately, the takeWhile method is not available for dataframes.
I know that I can transform my dataframe to a Seq to solve my problem, but I would like to avoid gathering all the data to the driver as it will likely crash it.
The take and limit methods let me get the first n rows of a dataframe, but I can't specify a predicate. Is there a simple way to do this?
Can you guarantee that IDs will be in ascending order? New data is not necessarily guaranteed to be added in a specific order. If you can guarantee the order, then you can use this query to achieve what you want. It's not going to perform well on large data sets, but it may be the only way to achieve what you are interested in.
We'll mark all 0's as '1' and everything else as '0', then compute a running total over the entire data set. Since the running total only increases at a zero, it splits the data set into the sections between zeros, and the rows before the first zero are exactly those where the total is still 0.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

val windowSpec = Window.partitionBy().orderBy("id")

df.select(
    col("id"),
    col("x"),
    sum( // running total, which is 0 for all rows before the first 0
      when(col("x") === lit(0), lit(1)).otherwise(lit(0)) // mark 0's to help partition the data set
    ).over(windowSpec).as("partition")
  )
  .where(col("partition") === lit(0))
  .show()
+---+---+---------+
| id| x|partition|
+---+---+---------+
| 1|0.5| 0|
| 2|0.3| 0|
| 3|0.9| 0|
+---+---+---------+

Shift pyspark column value to left by one

I have a pyspark dataframe that looks like this:
+----+----+------+------+
|name| age|height|weight|
+----+----+------+------+
|    |Mike|    20|   6-7|
+----+----+------+------+
As you can see the values and the column names are not aligned. For example, "Mike" should be under the column of "name", instead of age.
How can I shift the values to left by one so it can match the column name?
The ideal dataframe looks like:
+----+---+------+------+
|name|age|height|weight|
+----+---+------+------+
|Mike| 20|   6-7|   160|
+----+---+------+------+
Please note that the above data is just an example. In reality I have more than 200 columns and more than 1M rows of data.
Try .toDF with new column names after dropping the name column from the dataframe.
Example:
df=spark.createDataFrame([('','Mike',20,'6-7',160)],['name','age','height','weight'])
df.show()
#+----+----+------+------+---+
#|name| age|height|weight| _5|
#+----+----+------+------+---+
#| |Mike| 20| 6-7|160|
#+----+----+------+------+---+
#select all columns except name
df1=df.select(*[i for i in df.columns if i != 'name'])
drop_col=df.columns.pop()
req_cols=[i for i in df.columns if i != drop_col]
df1.toDF(*req_cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20| 6-7| 160|
#+----+---+------+------+
Using spark.createDataFrame():
cols=['name','age','height','weight']
spark.createDataFrame(df.select(*[i for i in df.columns if i != 'name']).rdd,cols).show()
#+----+---+------+------+
#|name|age|height|weight|
#+----+---+------+------+
#|Mike| 20| 6-7| 160|
#+----+---+------+------+
If you are creating the dataframe while reading a file, define a schema whose first column is a dummy name, then drop that column using the .drop() function once the data has been read.
spark.read.schema(<struct_type schema>).csv(<path>).drop('<dummy_column_name>')
spark.read.option("header","true").csv(<path>).toDF(<columns_list_with dummy_column>).drop('<dummy_column_name>')

Adding a column from one dataframe (df1) to another dataframe (df2)

I need some help with this Apache Spark (pyspark) issue.
I have a dataframe (df1) which has a single column and a single row; it contains max_timestamp:
+-------------------+
|max_timestamp      |
+-------------------+
|2019-10-24 21:18:26|
+-------------------+
I have another dataframe, which contains two columns - EmpId and Timestamp:
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

masterData = [(1, '1999-10-24 21:18:23',), (1, '2019-10-24 21:18:26',), (2, '2020-01-24 21:18:26',)]
df_masterdata = spark.createDataFrame(masterData, ['dsid', 'txnTime_str'])
df_masterdata = df_masterdata.withColumn('txnTime_ts', col('txnTime_str').cast(TimestampType())).drop('txnTime_str')
df_masterdata.show(5, False)
+----+-------------------+
|dsid|txnTime_ts |
+----+-------------------+
|1 |1999-10-24 21:18:23|
|1 |2019-10-24 21:18:26|
|2 |2020-01-24 21:18:26|
+----+-------------------+
The objective is to filter the records in the second dataframe based on the condition txnTime_ts < max_timestamp.
What I'm trying to do: add the column 'max_timestamp' to the second dataframe and filter records by comparing the two values.
df_masterdata1 = df_masterdata.withColumn('maxTime', maxTS2['TEMP_MAX'])
Pyspark does not let me add the column from maxTS2 to the dataFrame - df_masterdata
Error -
AnalysisException: 'Resolved attribute(s) TEMP_MAX#207255 missing from dsid#207263L,txnTime_ts#207267 in operator
!Project [dsid#207263L, txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280].;;\n!Project [dsid#207263L,
txnTime_ts#207267, TEMP_MAX#207255 AS maxTime#207280]\n+- Project [dsid#207263L, txnTime_ts#207267]\n +- Project
[dsid#207263L, txnTime_str#207264, cast(txnTime_str#207264 as timestamp) AS txnTime_ts#207267]\n +- LogicalRDD
[dsid#207263L, txnTime_str#207264], false\n'
Any ideas on how to resolve this issue?
If you actually have a DF with a single row/column, the most efficient way to accomplish this would be to extract the value from the dataframe and then filter df_masterdata against it. If you nevertheless need to do this within the context of a dataframe, you should use a join, e.g.:
df_masterdata1 = df_masterdata.join(df1, df_masterdata.txnTime_ts <= df1.max_timestamp)
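For completeness, a minimal sketch of the first suggestion, extracting the single value and filtering against it as a literal (shown here in Scala; the same first()/filter pattern applies in PySpark):
import org.apache.spark.sql.functions.{col, lit}

// df1 has exactly one row and one column, so pull the value out on the driver...
val maxTs = df1.first().getTimestamp(0)
// ...and filter the master data against it as a literal; no join is needed.
val filtered = df_masterdata.filter(col("txnTime_ts") < lit(maxTs))
filtered.show(false)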

Fetching distinct values on a column using Spark DataFrame

Using Spark version 1.6.1, I need to fetch distinct values of a column and then perform some specific transformation on top of them. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the call back to the driver program. Currently I am performing this task as below, is there a better approach?
import sqlContext.implicits._

preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
  val applicationId = x.getAs[String](ApplicationId)
  val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
  // DO SOME TASK PER applicationId
})

preProcessedData.unpersist()
Well, to obtain all the distinct values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that, you can create a UDF to transform each record.
For example:
import org.apache.spark.sql.functions.{col, udf}

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// Obtain all the distinct values. If you show this you should see only {1, 3}.
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. In this case it is a simple function, but it can get complicated.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
In PySpark, try this:
df.select('col_name').distinct().show()
This solution demonstrates how to transform data with Spark native functions, which are better than UDFs. It also demonstrates how to use dropDuplicates, which is more suitable than distinct for certain queries.
Suppose you have this DataFrame:
+-------+-------------+
|country| continent|
+-------+-------------+
| china| asia|
| brazil|south america|
| france| europe|
| china| asia|
+-------+-------------+
Here's how to take all the distinct countries and run a transformation:
import org.apache.spark.sql.functions.{col, concat, lit}

df
  .select("country")
  .distinct
  .withColumn("country", concat(col("country"), lit(" is fun!")))
  .show()
+--------------+
| country|
+--------------+
|brazil is fun!|
|france is fun!|
| china is fun!|
+--------------+
You can use dropDuplicates instead of distinct if you don't want to lose the continent information:
df
  .dropDuplicates("country")
  .withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
  .show(false)
+-------+-------------+------------------------------------+
|country|continent |description |
+-------+-------------+------------------------------------+
|brazil |south america|brazil is a country in south america|
|france |europe |france is a country in europe |
|china |asia |china is a country in asia |
+-------+-------------+------------------------------------+
See here for more information about filtering DataFrames and here for more information on dropping duplicates.
Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.
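For instance, a minimal sketch of such a custom transformation (the withGreeting name is just illustrative):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, lit}

// A custom transformation is just a DataFrame => DataFrame function,
// so it can be chained with .transform.
def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("country", concat(col("country"), lit(" is fun!")))

df.select("country")
  .distinct
  .transform(withGreeting)
  .show()
Keeping each step as a named DataFrame => DataFrame function makes the logic easy to test and to chain with other transformations.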
df = df.select("column1", "column2", ..., "columnN").distinct.[].collect()
In the empty brackets you can insert a call such as toJSON() if you want the df in a JSON format.