I am using spark-sql-2.4.1v with Java 8. I have a scenario where I am passed column names as a list/Seq, and only for those columns do I need to perform certain operations like sum, avg, percentages, etc.
In my scenario, let's say I have columns column1, column2, and column3. The first time, I will pass the name column1.
I will pull/select the "column1" data and perform some operation based on "column1". The second time, I will pass the name column2; "column1" is not pulled this time, so my dataset does not contain "column1" and the earlier conditions break with the error "AnalysisException: cannot resolve 'column1' given input columns".
Hence I need to check the columns: if a column exists, perform the operations related to that column; otherwise ignore those operations.
How can I do this in Spark?
Sample data, as it is in the database:
val data = List(
  ("20", "score", "school", "2018-03-31", 14, 12, 20),
  ("21", "score", "school", "2018-03-31", 13, 13, 21),
  ("22", "rate", "school", "2018-03-31", 11, 14, 22),
  ("21", "rate", "school", "2018-03-31", 13, 12, 23)
)
val df = data.toDF("id", "code", "entity", "date", "column1", "column2", "column3")
  .select("id", "code", "entity", "date", "column2") // these are passed for each run; this set will keep changing
Dataset<Row> enrichedDs = df
    .withColumn("column1_org", col("column1"))
    .withColumn("column1",
        when(col("column1").isNotNull(),
            functions.callUDF("lookUpData", col("column1").cast(DataTypes.StringType))));
The above logic is only applicable when "column1" is among the selected columns. It fails for the second set because "column1" is not selected, so I need to understand why it only works when "column1" is among the selected columns, and I need some logic to handle this.
Check if this is helpful: you can filter the list of columns and process only the valid ones.
df.show(false)
/**
 * +---+-----+------+----------+-------+-------+-------+
 * |id |code |entity|date      |column1|column2|column3|
 * +---+-----+------+----------+-------+-------+-------+
 * |20 |score|school|2018-03-31|14     |12     |20     |
 * |21 |score|school|2018-03-31|13     |13     |21     |
 * |22 |rate |school|2018-03-31|11     |14     |22     |
 * |21 |rate |school|2018-03-31|13     |12     |23     |
 * +---+-----+------+----------+-------+-------+-------+
*/
// list of columns
val cols = Seq("column1", "column2" ,"column3", "column4")
val processColumns = cols.filter(df.columns.contains).map(sqrt)
df.select(processColumns: _*).show(false)
/**
 * +------------------+------------------+-----------------+
 * |SQRT(column1)     |SQRT(column2)     |SQRT(column3)    |
 * +------------------+------------------+-----------------+
 * |3.7416573867739413|3.4641016151377544|4.47213595499958 |
 * |3.605551275463989 |3.605551275463989 |4.58257569495584 |
 * |3.3166247903554   |3.7416573867739413|4.69041575982343 |
 * |3.605551275463989 |3.4641016151377544|4.795831523312719|
 * +------------------+------------------+-----------------+
*/
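If the operations you need are aggregates such as sum or avg rather than sqrt, the same filtering idea applies. A minimal sketch along those lines (the names wanted, existing and aggExprs are just illustrative; column4 is again skipped because it is not in the DataFrame):
import org.apache.spark.sql.functions.{avg, col, sum}

// keep only the requested columns that actually exist in df
val wanted = Seq("column1", "column2", "column3", "column4")
val existing = wanted.filter(df.columns.contains)

// build sum/avg expressions only for the existing columns
val aggExprs = existing.flatMap(c => Seq(sum(col(c)).as(s"${c}_sum"), avg(col(c)).as(s"${c}_avg")))

if (aggExprs.nonEmpty) df.agg(aggExprs.head, aggExprs.tail: _*).show(false)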
Not sure if I fully understand your requirement, but are you simply trying to perform some conditional operation depending on which columns are available in your DataFrame, which is not known prior to execution?
If so, DataFrame.columns returns the list of columns, which you can inspect and select from accordingly, i.e.:
df.columns.foreach { println }
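For example, a minimal sketch (assuming an enrichment step like the withColumn in the question; enrichColumn1IfPresent is just an illustrative name) that applies the column1-specific operations only when that column is present:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// run the column1-related enrichment only if the DataFrame actually contains column1
def enrichColumn1IfPresent(df: DataFrame): DataFrame =
  if (df.columns.contains("column1"))
    df.withColumn("column1_org", col("column1")) // column1-specific operations go here
  else
    df // column1 was not selected this run, so skip those operations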
Related
I have this SQL query, which is a left join and has a select statement at the beginning that chooses columns from the right table as well.
Can you please help me convert it to Spark DataFrames and get the result using spark-shell?
I don't want to use the SQL code in Spark; instead I want to use DataFrames.
I know the join syntax in Scala, but I don't know how to choose from the right table (here it is count(w.id2)) when the DataFrame resulting from the left join doesn't seem to have access to the right table's columns.
Thank you!
select count(x.user_id) user_id_count, count(w.id2) current_id2_count
from
(select
user_id
from
tb1
where
year='2021'
and month=1
) x
left join
(select id1, max(id2) id2 from tb2 group by id1) w
on
x.user_id=w.id1;
In Spark I would create two DataFrames, x and w, and join them:
var x = spark.sqlContext.table("tb1").where("year='2021' and month=1")
var w = spark.sqlContext.table("tb2").groupBy("id1").agg(max("id2")).alias("id2")
var joined = x.join(w, x("user_id")===w("id1"), "left")
EDIT:
I was confused about the left join. There was an error from Spark that column id2 is not available, and I thought it was because the DataFrame resulting from the left join would have only the left table's columns. However, the reason was that when selecting max(id2) I had to give it an alias correctly.
Here is a sample and the solution:
var x = Seq("1","2","3","4").toDF("user_id")
var w = Seq (("1", 1), ("1",2), ("3",10),("1",5),("5",4)).toDF("id1", "id2")
var z= w.groupBy("id1").agg(max("id2").alias("id2"))
val xJoinsZ= x.join(z, x("user_id") === z("id1"), "left").select(count(col("user_id").alias("user_id_count")), count(col("id2").alias("current_id2_count")))
scala> x.show(false)
+-------+
|user_id|
+-------+
|1      |
|2      |
|3      |
|4      |
+-------+
scala> z.show(false)
+---+---+
|id1|id2|
+---+---+
|3  |10 |
|5  |4  |
|1  |5  |
+---+---+
scala> xJoinsZ.show(false)
+---------------------------------+---------------------------------+
|count(user_id AS `user_id_count`)|count(id2 AS `current_id2_count`)|
+---------------------------------+---------------------------------+
|4                                |2                                |
+---------------------------------+---------------------------------+
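If you would rather have the result columns named simply user_id_count and current_id2_count, the alias can go on the count itself instead of inside it; a small sketch of that variant:
import org.apache.spark.sql.functions.{col, count}

val xJoinsZ2 = x.join(z, x("user_id") === z("id1"), "left")
  .select(
    count(col("user_id")).alias("user_id_count"),
    count(col("id2")).alias("current_id2_count")
  )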
Your request is quite difficult to understand; however, I am going to try to reply, taking the SQL code you provided as a baseline and reproducing it with Spark.
// Reading tb1 (x) and filtering for Jan 2021, selecting only "user_id"
val x: DataFrame = spark.read
  .table("tb1")
  .filter(col("year") === "2021")
  .filter(col("month") === 1)
  .select("user_id")
// Reading tb2 (w) and for each "id1" getting the max "id2"
val w: DataFrame = spark.read
  .table("tb2")
  .groupBy(col("id1"))
  .max("id2")
// Joining tb1 (x) and tb2 (w) on "user_id" === "id1", then counting user_id and id2
val xJoinsW: DataFrame = x
  .join(w, x("user_id") === w("id1"), "left")
  .select(count(col("user_id")).as("user_id_count"), count(col("max(id2)")).as("current_id2_count"))
A small but relevant remark: since you're using Scala and Spark, I would suggest you use val and not var. val means the reference is final and cannot be reassigned, whereas var can be reassigned later. You can read more here.
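For instance, a quick illustration of the difference:
val fixed = 1
// fixed = 2 // does not compile: reassignment to val
var mutable = 1
mutable = 2 // allowed: a var can be reassigned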
Lastly, feel free to replace the Spark reading mechanism with whatever you like.
Consider the following BigQuery tables schemas in my dataset my_dataset:
Table_0001: NAME (string); NUMBER (string)
Table_0002: NAME (string); NUMBER (string)
Table_0003: NAME (string); NUMBER (string)
...
Table_0865: NAME (string); CODE (string)
Table_0866: NAME (string); CODE (string)
...
I now want to union all tables using:
select * from `my_dataset.*`
However, this will not yield the CODE column of the second set of tables. From my understanding, the schema of the first table in the dataset is adopted instead.
So the result will be something like:
| NAME  | NUMBER |
------------------
| John  | 123456 |
| Mary  | 123478 |
| ...   | ...... |
| Abdul | null   |
| Ariel | null   |
I tried to tap into the INFORMATION_SCHEMA so as to select the two sets of tables separately and then union them:
with t_code as (
  select
    table_name
  from my_dataset.INFORMATION_SCHEMA.COLUMNS
  where column_name = 'CODE'
)
select t.NAME, t.CODE as NUMBER from `my_dataset.*` as t
where _TABLE_SUFFIX in (select * from t_code)
However, the script will still take its schema from the first table of my_dataset and will return: Error Running Query: Name CODE not found inside t.
So now I'm at a loss: how can I union all my tables without having to union them one by one, i.e., how do I select CODE as NUMBER in the second set of tables?
Note: Although it seems the question was asked over here, the accepted answer did not seem to actually respond to the question (as far as I'm concerned).
The trick I can see here is to first gather all the codes by running:
create table `my_another_dataset.codes` as
select * from `my_dataset.*` where not code is null
Then do a simple fake update of just one table that has the NUMBER column; this will make the schema with the NUMBER column the default. Now you can gather all the numbers:
create table `my_another_dataset.numbers` as
select * from `my_dataset.*` where not number is null
Finally, you can do a simple union:
select * from `my_another_dataset.numbers` union all
select * from `my_another_dataset.codes`
Note: see also my comment below your question
SELECT
borrow.id AS `borrowId`,
IF(borrow.created_date IS NULL, '', borrow.created_date) AS `borrowCreatedDate`,
IF(borrow.return_date IS NULL, '', borrow.return_date) AS `borrowReturnDate`,
IF(borrow.return_date IS NULL, '0', '1') AS `borrowIsReturn`,
IF(person.card_identity IS NULL, '', person.card_identity) AS `personCardIdentity`,
IF(person.fullname IS NULL, '', person.fullname) AS `personFullname`,
IF(person.phone_number IS NULL, '', person.phone_number) AS `personPhoneNumber`,
IF(book.book_name IS NULL, '', book.book_name) AS `bookName`,
IF(book.year IS NULL, '', book.year) AS `bookYear`
FROM tbl_tbl_borrow AS borrow
LEFT JOIN tbl_person AS person
ON person.card_identity = borrow.person_card_identity
LEFT JOIN tbl_book AS book
ON book.unique_id = borrow.book_unique_id
ORDER BY
borrow.return_date ASC, person.fullname ASC;
I'm currently looking to produce a table that shows, for every column in a given table, the counts of null values, non-null values, and distinct values, plus the total row count. This happens to be in Databricks (Apache Spark).
Something that looks like what is shown below.
I know I can do this with something like the SQL shown below. Also, I can use something like Java or Python, etc., to generate the SQL.
The question is:
Is this the most efficient approach?
Is there a better way to write this query (less typing and/or more efficient)?
select
1 col_position,
'location_id' col_name,
count(*) all_records,
count(location_id) not_null,
count(*) - count(location_id) null,
count(distinct location_id) distinct_values
from
admin
union
select
2 col_position,
'location_zip' col_name,
count(*) all_records,
count(location_zip) not_null,
count(*) - count(location_zip) null,
count(distinct location_zip) distinct_values
from
admin
union
select
3 col_position,
'provider_npi' col_name,
count(*) all_records,
count(provider_npi) not_null,
count(*) - count(provider_npi) null,
count(distinct provider_npi) distinct_values
from
admin
order by col_position
;
As said in the comments, using UNION ALL should be efficient.
Using SQL
To avoid typing out all the columns and subqueries, you can generate the SQL query from the list of columns like this:
val df = spark.sql("select * from admin")
// generate the same query from the columns list
val sqlQuery =
df.columns.zipWithIndex.map { case (c, i) =>
Seq(
s"$i col_position",
s"$c col_name",
"count(*) all_records",
s"count($c) not_null",
s"count(*) - count($c) null",
s"count(distinct $c) distinct_values"
).mkString("select ", ", ", " from admin")
}.mkString("", " union all\n", " order by col_position")
spark.sql(sqlQuery).show
Using DataFrame (Scala)
There are some optimizations you can do with the DataFrame API, like calculating count(*) only once, avoiding typing all the column names, and the possibility of using caching.
Example input DataFrame:
//+---+---------+--------+---------------------+------+
//|id |firstName|lastName|email                |salary|
//+---+---------+--------+---------------------+------+
//|1  |michael  |armbrust|no-reply@berkeley.edu|100K  |
//|2  |xiangrui |meng    |no-reply@stanford.edu|120K  |
//|3  |matei    |null    |no-reply@waterloo.edu|140K  |
//|4  |null     |wendell |null                 |160K  |
//|5  |michael  |jackson |no-reply@neverla.nd  |null  |
//+---+---------+--------+---------------------+------+
First, get the row count and the column list:
val cols = df.columns
val allRecords = df.count
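Since df is scanned once per metric pass below, the caching mentioned above can help avoid recomputing the source each time; a minimal sketch:
// mark df for caching so the metric computations below reuse the same data
// (ideally call this before the first action on df)
df.cache()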
Then, calculate each metric by looping through the column list (you could, for example, create a function for each metric):
val nullsCountDF = df.select(
(Seq(expr("'nulls' as metric")) ++ cols.map(c =>
sum(when(col(c).isNull, lit(1)).otherwise(lit(0))).as(c)
)): _*
)
val distinctCountDF = df.select(
(Seq(expr("'distinct_values' as metric")) ++ cols.map(c =>
countDistinct(c).as(c)
)): _*
)
val maxDF = df.select(
(Seq(expr("'max_value' as metric")) ++ cols.map(c => max(c).as(c))): _*
)
val minDF = df.select(
(Seq(expr("'min_value' as metric")) ++ cols.map(c => min(c).as(c))): _*
)
val allRecordsDF = spark.sql("select 'all_records' as metric," + cols.map(c => s"$allRecords as $c").mkString(","))
Finally, union the DataFrames created above:
val metricsDF = Seq(allRecordsDF, nullsCountDF, distinctCountDF, maxDF, minDF).reduce(_ union _)
metricsDF.show
//+---------------+---+---------+--------+---------------------+------+
//|metric         |id |firstName|lastName|email                |salary|
//+---------------+---+---------+--------+---------------------+------+
//|all_records    |5  |5        |5       |5                    |5     |
//|nulls          |0  |1        |1       |1                    |1     |
//|distinct_values|5  |3        |4       |4                    |4     |
//|max_value      |5  |xiangrui |wendell |no-reply@waterloo.edu|160K  |
//|min_value      |1  |matei    |armbrust|no-reply@berkeley.edu|100K  |
//+---------------+---+---------+--------+---------------------+------+
Using DataFrame (Python)
For Python example, you can see my other answer.
You can also use count(ifnull(field, 0)) total_count; because the NULLs are replaced with 0, this counts all rows (count(field) on its own counts only the non-null rows).
The input DataFrame may not always have all the columns. In SQL or Scala, I want to create a select statement such that, even if the DataFrame does not have a column, it won't error out and will only output the columns that do exist.
For example, this statement will work.
Select store, prod, distance from table
+-----+------+--------+
|store|prod  |distance|
+-----+------+--------+
|51   |42    |2       |
|51   |42    |5       |
|89   |44    |9       |
+-----+------+--------+
If the DataFrame looks like the one below, I want the same statement to work: just ignore what's not there and output only the existing columns (in this case 'store' and 'prod').
+-----+------+
|store|prod  |
+-----+------+
|51   |42    |
|51   |42    |
|89   |44    |
+-----+------+
You can have a list of all the columns, either hard-coded or prepared from other metadata, and use intersect:
val columnNames = Seq("c1","c2","c3","c4")
df.select( df.columns.intersect(columnNames).map(x=>col(x)): _* ).show()
You can make use of the columns method on DataFrame. It would look like this:
val result =
  if (df.columns.contains("distance")) df.select("store", "prod", "distance")
  else df.select("store", "prod")
Edit:
If you have many such columns, you can keep them in an array, for example cols, and filter it for the ones present in the DataFrame:
val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols:_*)
Assuming you use an expanded SQL template, like select a, b, c from tab, you could do something like the following to get the required result.
1. Get the SQL string and convert it to lowercase.
2. Split the SQL on spaces or commas to get the individual words in an array.
3. Remove "select" and "from" from the array, as they are SQL keywords.
4. Now the last index is the table name.
5. The first index up to the last-but-one contains the list of selected columns.
6. To get the required columns, just filter that list against df2.columns. The columns that are in the SQL but not in the table will be filtered out.
7. Now construct the SQL from the individual pieces.
8. Run it using spark.sql(reqd_sel_string) to get the results.
Check this out
scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]
scala> df2.createOrReplaceTempView("tab2")
scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2
scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)
scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2
scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)
scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)
scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2
scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51   |42  |
|51   |42  |
|89   |44  |
+-----+----+
scala>
I have a Table in MS Access like this:
The Columns are:
-------------------------------------------
| *Date* | *Article* | *Distance* | Value |
-------------------------------------------
Date, Article and Distance are Primary Keys, so the combination of them is always unique.
The column Distance has discrete values from 0 to 27.
I need to transform this table into a table like this:
-----------------------------------------------------------------------------------
| *Date* | *Article* | Value from Distance 0 | Value Dis. 1 | ... | Value Dis. 27 |
-----------------------------------------------------------------------------------
I really don't know an SQL statement for this task. I needed a quick solution, which is why I wrote an Excel macro; it worked fine but was very inefficient and needed several hours to complete. Now that the amount of data is 10 times higher, I can't use this macro anymore.
You can try the following pivot query:
SELECT
Date,
Article,
MAX(IIF(Distance = 0, Value, NULL)) AS val_0,
MAX(IIF(Distance = 1, Value, NULL)) AS val_1,
...
MAX(IIF(Distance = 27, Value, NULL)) AS val_27
FROM yourTable
GROUP BY
Date,
Article
Note that Access does not support CASE expressions, but it does offer a function called IIF() which takes the form of:
IIF(condition, value if true, value if false)
which essentially behaves the same way as CASE in other RDBMS.