SQL table transformation. How to pivot a certain table? - sql

How would I do the pivot below?
I have a table like this:
+------+---+----+
| round| id| kpi|
+------+---+----+
| 0 | 1 | 0.1|
| 1 | 1 | 0.2|
| 0 | 2 | 0.5|
| 1 | 2 | 0.4|
+------+---+----+
I want to convert the id column into multiple columns (one per distinct id), with the kpi values as their values, while keeping the round column as in the first table.
+------+----+----+
| round| id1| id2|
+------+----+----+
| 0 | 0.1| 0.5|
| 1 | 0.2| 0.4|
+------+----+----+
Is it possible to do this in SQL? How can I do it?

You are looking for a pivot function. You can find details on how to do this here and here. The first link also explains how to handle an unknown number of column names.
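If you happen to be working in Spark SQL / PySpark rather than a traditional RDBMS (an assumption on my part, since the question does not name a dialect), a minimal sketch of the same reshape would be:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, 1, 0.1), (1, 1, 0.2), (0, 2, 0.5), (1, 2, 0.4)],
    ['round', 'id', 'kpi'])

# pivot() discovers the distinct ids itself, so the number of ids does not
# need to be known in advance; the new columns are named after the id values
# and can be renamed afterwards if needed.
df.groupBy('round').pivot('id').agg(F.first('kpi')).show()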

Related

Add a corresponding column for every column of the Spark df that assigns 1 for every non-null value in the original column, using case when

I have a sample dataframe:
df = spark.createDataFrame([('name1','id1',1,None,3),('name2','id2',None,2,5)],['NAME','personID','col1','col2','col3'])
My use case has 15 columns
What I would like to do, using case when and a loop, is add a new column for each column of the original dataframe except the first two. Each new column should hold 1 if the corresponding value is not null, otherwise 0.
I am aiming to get something like below:
+--------+--------+--------+-------+-------+-------+------+------+
|Name | ID | col1 | col2 | col3 | col1_N|col2_N|col3_N|
+--------+--------+--------+-------+-------+-------+------+------+
|name1 | id1 | 1 | Null | 3 | 1 | 0 | 1 |
|name2 | id2 | Null | 2 | 5 | 0 | 1 | 1 |
+--------+--------+--------+-------+-------+-------+------+------+
the first five columns are the original columns, the last three columns will be added with corresponding 1 or 0 from 'col1', 'col2', and 'col3' values.
The last code I was working on creates a new dataframe but does not keep the original dataframe's values.
df.select([when(col(c).isNotNull(), 1).otherwise(0).alias(c + '_N') for c in df.columns])
for which I get:
+-------+-------+-------+------+------+
| Name_N| ID_N | col1_N|col2_N|col3_N|
+-------+-------+-------+------+------+
| 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 |
+-------+-------+-------+------+------+
The above could have been acceptable, but I need to keep the original values of the Name and ID columns.
I got a TypeError (invalid argument) with the code below:
df.select(['*'],[when(col(c).isNotNull(), 1).otherwise(0).alias(c + '_N') for c in df.columns])
TypeError: Invalid argument, not a string or column: ['*'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I thought selecting all columns first would give me all the columns of the original dataframe.
UPDATE:
somehow this worked, but I only get the last column:
for c in df.columns[2:]:
    sdf = df.withColumn(c+'_N', when(col(c).isNotNull(), 1).otherwise(0))
but this is what I get:
+--------+--------+--------+-------+-------+------+
|Name | ID | col1 | col2 | col3 |col3_N|
+--------+--------+--------+-------+-------+------+
|name1 | id1 | 1 | Null | 3 | 1 |
|name2 | id2 | Null | 2 | 5 | 1 |
+--------+--------+--------+-------+-------+------+
I only got the flag for the last original column.
Using a list comprehension as shown below will give the expected result.
df.select([col(c) if c in ['NAME', 'personID'] else when(col(c).isNotNull(), 1).otherwise(0).alias(f"{c}_N") for c in df.columns]).show()
+-----+--------+------+------+------+
| NAME|personID|col1_N|col2_N|col3_N|
+-----+--------+------+------+------+
|name1| id1| 1| 0| 1|
|name2| id2| 0| 1| 1|
+-----+--------+------+------+------+
Just fix your first approach by specifying a slice of columns and simplifying the boolean condition:
df.select([col(c).isNotNull().cast("integer").alias(c + '_N') for c in df.columns[2:]])
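If you also want to keep every original column next to the new flag columns, a small variation (a sketch, not a tested drop-in for your full 15-column data) is to pass '*' as a plain string and unpack the comprehension, which avoids the TypeError you hit when passing a list:
from pyspark.sql.functions import col, when

flag_cols = [when(col(c).isNotNull(), 1).otherwise(0).alias(c + '_N')
             for c in df.columns[2:]]
df.select('*', *flag_cols).show()
As for the withColumn loop in the update: it only kept the last flag because every iteration started again from the original df; reassigning the result each time (df = df.withColumn(...)) would keep all of them.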

SQL pivot table for unknown number of columns

I need some tips for the Postgres pivot below, please.
I have a table like this:
+------+---+----+
| round| id| kpi|
+------+---+----+
| 0 | 1 | 0.1|
| 1 | 1 | 0.2|
| 0 | 2 | 0.5|
| 1 | 2 | 0.4|
+------+---+----+
The number of Ids is unknown.
I need to convert the id column into multiple columns (one per distinct id), with the kpi values as their values, while keeping the round column as in the first table.
+------+----+----+
| round| id1| id2|
+------+----+----+
| 0 | 0.1| 0.5|
| 1 | 0.2| 0.4|
+------+----+----+
Is it possible to do this in SQL? How can I do it?
It's possible, check this question.
Here is another pivot I did, also with an unknown number of columns; maybe it can help you too: Advanced convert rows to columns (pivot) in SQL Server
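Since plain SQL needs the output column list up front, one common workaround is to query the distinct ids first and build the statement dynamically. A rough Python sketch (the table name kpi_table, the psycopg2 connection string, and the use of the Postgres FILTER clause are assumptions on my part, not taken from the question):
import psycopg2

conn = psycopg2.connect('dbname=mydb')  # adjust connection parameters
with conn, conn.cursor() as cur:
    cur.execute('SELECT DISTINCT id FROM kpi_table ORDER BY id')
    ids = [row[0] for row in cur.fetchall()]

    # one aggregate per distinct id, e.g. max(kpi) FILTER (WHERE id = 1) AS id1
    cols = ', '.join(f'max(kpi) FILTER (WHERE id = {i}) AS id{i}' for i in ids)
    cur.execute(f'SELECT round, {cols} FROM kpi_table GROUP BY round ORDER BY round')
    for row in cur.fetchall():
        print(row)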

Using pyspark to create a segment array from a flat record

I have a sparsely populated table with values for various segments for unique user ids. I need to create an array with the user_id and only the relevant segment headers.
Please note that this is just an indicative dataset. I have several hundreds of segments like these.
------------------------------------------------
| user_id | seg1 | seg2 | seg3 | seg4 | seg5 |
------------------------------------------------
| 100 | M | null| 25 | null| 30 |
| 200 | null| null| 43 | null| 250 |
| 300 | F | 3000| null| 74 | null|
------------------------------------------------
I am expecting the output to be
-------------------------------
| user_id| segment_array |
-------------------------------
| 100 | [seg1, seg3, seg5] |
| 200 | [seg3, seg5] |
| 300 | [seg1, seg2, seg4] |
-------------------------------
Is there any function available in pyspark or pyspark-sql to accomplish this?
Thanks for your help!
I cannot find a direct way, but you can do this:
from pyspark.sql.functions import array, array_remove, col, lit, when

cols = df.columns[1:]  # every segment column, skipping user_id
r = df.withColumn('array', array(*[when(col(c).isNotNull(), lit(c)).otherwise('notmatch') for c in cols])) \
      .withColumn('array', array_remove('array', 'notmatch'))
r.show()
+-------+----+----+----+----+----+------------------+
|user_id|seg1|seg2|seg3|seg4|seg5| array|
+-------+----+----+----+----+----+------------------+
| 100| M|null| 25|null| 30|[seg1, seg3, seg5]|
| 200|null|null| 43|null| 250| [seg3, seg5]|
| 300| F|3000|null| 74|null|[seg1, seg2, seg4]|
+-------+----+----+----+----+----+------------------+
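To match the two-column layout from the question, you could additionally select just the id and the array (the name segment_array is taken from the expected output):
r.select('user_id', col('array').alias('segment_array')).show(truncate=False)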
Not sure this is the best way, but I'd attack it this way:
There's the collect_set function, which will always give you unique values across the list of values you aggregate over.
Do a union for each segment:
import pyspark.sql.functions as fn

df_seg_1 = df.select(
    'user_id',
    fn.when(
        fn.col('seg1').isNotNull(),
        fn.lit('seg1')
    ).alias('segment')
)
# repeat for all segments
df = df_seg_1.union(df_seg_2).union(...)
df.groupBy('user_id').agg(collect_list('segment'))
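Since the question mentions several hundred segments, writing each df_seg_N by hand is impractical; here is a rough sketch of the same union idea generated in a loop (functools.reduce, unionByName and collect_set are my choices, not part of the original answer):
from functools import reduce
from pyspark.sql import functions as fn

seg_cols = df.columns[1:]  # every column except user_id; assumes df is the original input
per_segment = [
    df.select('user_id', fn.when(fn.col(c).isNotNull(), fn.lit(c)).alias('segment'))
    for c in seg_cols
]
stacked = reduce(lambda a, b: a.unionByName(b), per_segment)
# collect_set / collect_list skip nulls, so only populated segments remain
stacked.groupBy('user_id').agg(fn.collect_set('segment').alias('segment_array')).show(truncate=False)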

How to get distinct value, count of a column in dataframe and store in another dataframe as (k,v) pair using Spark2 and Scala

I want to get the distinct values and their respective counts of every column of a dataframe and store them as (k,v) in another dataframe.
Note: My columns are not static, they keep changing, so I cannot hard-code the column names; instead, I should loop through them.
For example, below is my dataframe:
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| Blaze | IND| 19950312|
| Scarlet | USA| 19950313|
| Jonas | CAD| 19950312|
| Blaze | USA| 19950312|
| Jonas | CAD| 19950312|
| mark | USA| 19950313|
| mark | CAD| 19950313|
| Smith | USA| 19950313|
| mark | UK | 19950313|
| scarlet | CAD| 19950313|
My final result should be a new dataframe of (k,v) pairs, where k is the distinct value and v is its count.
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| (Blaze,2) | (IND,1) |(19950312,3)|
| (Scarlet,2) | (USA,4) |(19950313,6)|
| (Jonas,3) | (CAD,4) | |
| (mark,3) | (UK,1) | |
| (smith,1) | | |
Can anyone please help me with this? I'm using Spark 2.4.0 and Scala 2.11.12.
Note: My columns are dynamic, so I can't hard-code the columns and do a groupBy on them.
I don't have an exact solution to your query, but I can provide some help to get you started on your issue.
Create the dataframe:
scala> val df = Seq(("Blaze ","IND","19950312"),
| ("Scarlet","USA","19950313"),
| ("Jonas ","CAD","19950312"),
| ("Blaze ","USA","19950312"),
| ("Jonas ","CAD","19950312"),
| ("mark ","USA","19950313"),
| ("mark ","CAD","19950313"),
| ("Smith ","USA","19950313"),
| ("mark ","UK ","19950313"),
| ("scarlet","CAD","19950313")).toDF("name", "country","dob")
Next, calculate the count of each distinct value per column:
scala> val distCount = df.columns.map(c => df.groupBy(c).count)
Create a range to iterate over distCount
scala> val range = Range(0,distCount.size)
range: scala.collection.immutable.Range = Range(0, 1, 2)
Aggregate your data
scala> val aggVal = range.toList.map(i => distCount(i).collect().mkString).toSeq
aggVal: scala.collection.immutable.Seq[String] = List([Jonas ,2][Smith ,1][Scarlet,1][scarlet,1][mark ,3][Blaze ,2], [CAD,4][USA,4][IND,1][UK ,1], [19950313,6][19950312,4])
Create data frame:
scala> Seq((aggVal(0),aggVal(1),aggVal(2))).toDF("name", "country","dob").show()
+--------------------+--------------------+--------------------+
| name| country| dob|
+--------------------+--------------------+--------------------+
|[Jonas ,2][Smith...|[CAD,4][USA,4][IN...|[19950313,6][1995...|
+--------------------+--------------------+--------------------+
I hope this helps you in some way.

SSRS Distinct Count Of Condition Once Per Group

Let's say I have data similar to this:
|NAME |AMOUNT|RANDOM_FLAG|
|------|------|-----------|
|MARK | 100| X |
|MARK | 400| |
|MARK | 200| X |
|AMY | 100| X |
|AMY | 400| |
|AMY | 300| |
|ABE | 300| |
|ABE | 900| |
|ABE | 700| |
How can I get a distinct count of names with at least one RANDOM_FLAG set? In my total row, I want to see a count of 2, since both Mark and Amy had the flag set, regardless of how many times it is set. I have tried everything I can think of in SSRS. I'm guessing there is a way to nest aggregates to get to this, but I can't come up with it. I do have a group on NAME.
You can use a conditional COUNTDISTINCT in SSRS.
=CountDistinct(
    IIF(Not IsNothing(Fields!Random_Flag.Value) And Fields!Random_Flag.Value <> "", Fields!Name.Value, Nothing),
    "DataSetName"
)
Replace DataSetName with the name of your dataset.
select count(distinct name) from table_name where name in (select distinct name from table_name where random_flag = 'X');