Spark dataframe groupby unique values in a column - apache-spark-sql

I have the following dataframe:
import spark.implicits._

val df = Seq(
  ("A", 2.0),
  ("A", 1.5),
  ("B", 8.0),
  ("B", 9.0)
).toDF("id", "val")
I would like to group by the unique id and, for each group:
1. have a running count, i.e. the first row is 0, the second is 1
2. have the total count of rows inside the group.
The result should look like:
+---+---+-----+-----+
| id|val|order|count|
+---+---+-----+-----+
|  A|2.0|    0|    2|
|  A|1.5|    1|    2|
|  A|2.5|    2|    2|
|  B|8.0|    0|    2|
|  B|9.0|    1|    2|
+---+---+-----+-----+
I don't see how to do this with Spark SQL or the built-in functions.

Here is one way.
Input Data:
+---+---+
|id |val|
+---+---+
|A |2.0|
|A |1.5|
|A |4.5|
|A |0.5|
|B |8.0|
|B |9.0|
+---+---+
Use the row_number window function to get the order, and count over the same window for the group total:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w1 = Window.partitionBy("id").orderBy(lit(null))

df.withColumn("rank", row_number().over(w1))
  .withColumn("order", 'rank - 1)
  .withColumn("count", count('order).over(w1))
  .drop('rank)
  .orderBy('id)
  .show(false)
Result:
+---+---+-----+-----+
|id |val|order|count|
+---+---+-----+-----+
|A |2.0|0 |4 |
|A |1.5|1 |4 |
|A |4.5|2 |4 |
|A |0.5|3 |4 |
|B |8.0|0 |2 |
|B |9.0|1 |2 |
+---+---+-----+-----+
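For reference, the same result can also be produced with raw Spark SQL. This is only a sketch under a couple of assumptions: the DataFrame is registered as a temporary view named t, and the ORDER BY id inside the window is an arbitrary tie-breaker, playing the same role as orderBy(lit(null)) above.

// Sketch only: assumes a SparkSession named `spark` and the df defined in the question.
df.createOrReplaceTempView("t")

spark.sql("""
  SELECT id, val,
         row_number() OVER (PARTITION BY id ORDER BY id) - 1 AS `order`,
         count(*)     OVER (PARTITION BY id)                 AS `count`
  FROM t
  ORDER BY id
""").show(false)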

Related

combine result of 2 data frame generated in for loop for 2 input values
Here are the data frames:
1st DF for first value in for loop:
+--------+-------------------------------+---+
|order_id|Diff |id |
+--------+-------------------------------+---+
|12 |order_status |1 |
|1 |order_customer_id order_status |1 |
|68885 |New row in DataFrame 2 |1 |
|68886 |New row in DataFrame 2 |1 |
|2 |order_customer_id |1 |
+--------+-------------------------------+---+
2nd DF for the second value in the for loop:
+--------+-------------------------------+---+
|order_id|Diff |id |
+--------+-------------------------------+---+
|12 |order_status |2 |
|1 |order_customer_id order_status |2 |
|68885 |New row in DataFrame 2 |2 |
|68886 |New row in DataFrame 2 |2 |
|2 |order_customer_id |2 |
+--------+-------------------------------+---+
I want to combine both of the above at the end. There can also be more than two, so I want the final result as a single combined DataFrame. Does anyone have logic for this?
Let's say you have the following loop that generates a sequence of DataFrames:
import org.apache.spark.sql.DataFrame
import spark.implicits._

val dfs: Seq[DataFrame] = List(List((1, 1)), List((2, 2)), List((3, 3))).map(l => l.toDF("a", "b"))
You can use the union function in order to combine them:
val combinedDf = dfs.reduce(_ union _)
combinedDf.show()
+---+---+
| a| b|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
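If the DataFrames produced by the loop might not list their columns in the same order, a variation worth considering is unionByName, which matches columns by name rather than by position. This is a sketch, not part of the original answer; it assumes Spark 2.3+ (where unionByName is available) and reuses the dfs sequence from above, with reduceOption guarding against an empty sequence.

// Sketch only: union by column name; the Option is empty if `dfs` is empty.
val combinedByName: Option[DataFrame] = dfs.reduceOption(_ unionByName _)
combinedByName.foreach(_.show())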

sparksql how to resolve Data dependency issues

scala> sql(""" select * from demo1""").show(false)
+-----+---+
|from1|to1|
+-----+---+
|c |d |
|b |c |
|a |b |
+-----+---+
demo1 is my input table.
From the table we can see: a goes to b, b goes to c, and c goes to d; so every element on the chain should end up pointing to d.
So I need a result like this:
+-----+---+
|from1|to1|
+-----+---+
|c |d |
|b |d |
|a |d |
+-----+---+
Note: a, b, c are not sorted alphabetically or by size; the relationship transitions are arbitrary.
How do I write this in Spark SQL?
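No answer is included here in the source; one common way to tackle this kind of transitive chain (a sketch, not from the original thread) is to follow the edges iteratively with a self-join until a fixpoint is reached. It assumes the chains contain no cycles and that each from1 value appears only once.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

// Sketch: rewrite to1 by following one more hop (l.to1 -> r.from1) per pass,
// keeping the old to1 when there is no further hop, until nothing changes.
def resolveChains(edges: DataFrame): DataFrame = {
  var current = edges
  var changed = true
  while (changed) {
    val next = current.as("l")
      .join(current.as("r"), col("l.to1") === col("r.from1"), "left")
      .select(col("l.from1"), coalesce(col("r.to1"), col("l.to1")).as("to1"))
    changed = next.except(current).count() > 0  // converged when the edge set stops moving
    current = next
  }
  current
}

resolveChains(spark.sql("select * from demo1")).show(false)

For long chains it may be worth caching or checkpointing current between iterations so the query plan does not grow unboundedly.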

how to merge specific cells table data in oracle

I want to conditionally concatenate text cells in an Oracle table according to a sequence (SEQ) number attribute. Is it possible to do it? I need your help with the query.
For example I have the following table DATA:
|-----------------|
|ID|CODE|SEQ|TEXT |
|--|----|---|-----|
|1 |a |1 |text1|
|1 |a |2 |text2|
|2 |b |1 |text3|
|3 |c |1 |text4|
|4 |d |1 |text6|
|4 |d |2 |text7|
|4 |d |3 |text8|
-------------------
What I want to do is create a new table DATA1 that concatenates the TEXT values sharing the same ID and CODE (concatenating in SEQ order when there is more than one row). The new table should look like this:
|-------------------------|
|ID|CODE|TEXT |
|--|----|-----------------|
|1 |a |text1 text2 |
|2 |b |text3 |
|3 |c |text4 |
|4 |d |text6 text7 text8|
---------------------------
The listagg() function can be used, grouping by id and code:
select id, code,
       listagg(text, ' ') within group (order by seq) as text
from data
group by id, code

Add two elements in a dataframe (based on the index)

I have a dataframe in which some rows are useless except for one variable. I want to add that variable's value from those rows to the previous row and then delete the useless rows, so that the only useful piece of information they carry is preserved.
More precisely, my dataframe looks something like
|cat1| cat2|var1|var2|
|A |x |1 |2 |
|A |x |1 |0 |
|A |x |. |5 |
|A |y |1 |2 |
|A |y |1 |2 |
|A |y |1 |3 |
|A |y |. |6 |
|B |x |1 |2 |
|B |x |1 |4 |
|B |x |1 |2 |
|B |x |1 |1 |
|B |x |. |3 |
and I want to get
|cat1| cat2|var1|var2|
|A |x |1 |2 |
|A |x |1 |5(5+0)|
|A |y |1 |2 |
|A |y |1 |2 |
|A |y |1 |9(6+3)|
|B |x |1 |2 |
|B |x |1 |4 |
|B |x |1 |2 |
|B |x |1 |4(3+1)|
I've tried code like
test = df[df['var1'] == '.'].index
for num in test:
    df['var2'][num - 1] = df['var2'][num - 1] + df['var2'][num]
but it doesn't work.
Any help would be appreciated.
For a very readable solution, combine np.where with a shifted var1 column to find the rows whose next row contains a '.' (shift(-1) looks at the next row). If that is the case, add the next row's var2 to the current one; otherwise just keep the original value. Afterwards, drop all the rows whose var1 is '.'.
import numpy as np

df['var2_new'] = np.where(df['var1'].shift(-1) == '.',
                          df['var2'] + df['var2'].shift(-1),
                          df['var2'])
df[df['var1'] != '.']
# cat1 cat2 var1 var2 var2_new
#0 A x 1 2 2.0
#1 A x 1 0 5.0
#3 A y 1 2 2.0
#4 A y 1 2 2.0
#5 A y 1 3 9.0
#7 B x 1 2 2.0
#8 B x 1 4 4.0
#9 B x 1 2 2.0
#10 B x 1 1 4.0

Select rows with different values on different columns

I'm new to SQL, so I spent a long time on this without being able to figure it out.
My table looks like this:
+------+------+------+
|ID |2016 | 2017 |
+------+------+------+
|1 |A |A |
+------+------+------+
|2 |A |B |
+------+------+------+
|3 |B |B |
+------+------+------+
|4 |B |C |
+------+------+------+
I would like to have only the rows which have changed from 2016 to 2017:
+------+------+------+
|ID |2016 | 2017 |
+------+------+------+
|2 |A |B |
+------+------+------+
|4 |B |C |
+------+------+------+
Could you please help?
select * from mytable where column_2016 <> column_2017
assuming your columns are named column_2016 and column_2017.