How to merge specific cells of table data in Oracle SQL

I want to conditionally concatenate text cells in an Oracle table according to a sequence (SEQ) number attribute. Is it possible to do this? I need your help with the query.
For example I have the following table DATA:
|ID|CODE|SEQ|TEXT |
|--|----|---|-----|
|1 |a   |1  |text1|
|1 |a   |2  |text2|
|2 |b   |1  |text3|
|3 |c   |1  |text4|
|4 |d   |1  |text6|
|4 |d   |2  |text7|
|4 |d   |3  |text8|
What I want to do is create a new table DATA1 that, for rows sharing the same ID and CODE, concatenates their TEXT values in SEQ order (so rows with SEQ > 1 get appended). The new table should look like this:
|ID|CODE|TEXT             |
|--|----|-----------------|
|1 |a   |text1 text2      |
|2 |b   |text3            |
|3 |c   |text4            |
|4 |d   |text6 text7 text8|

The listagg() function can be used, grouping by id and code (wrap the query in a CREATE TABLE DATA1 AS ... if the result is needed as a new table):
select id, code,
       listagg(text, ' ') within group (order by seq) as text
from data
group by id, code

Related

Add rows of data to each group in a Spark dataframe

I have this dataframe -
data = [(0,1,1,201505,3),
(1,1,1,201506,5),
(2,1,1,201507,7),
(3,1,1,201508,2),
(4,2,2,201750,3),
(5,2,2,201751,0),
(6,2,2,201752,1),
(7,2,2,201753,1)
]
cols = ['id','item','store','week','sales']
data_df = spark.createDataFrame(data=data,schema=cols)
display(data_df)
What I want is this:
data_new = [(0,1,1,201505,3,0),
(1,1,1,201506,5,0),
(2,1,1,201507,7,0),
(3,1,1,201508,2,0),
(4,1,1,201509,0,0),
(5,1,1,201510,0,0),
(6,1,1,201511,0,0),
(7,1,1,201512,0,0),
(8,2,2,201750,3,0),
(9,2,2,201751,0,0),
(10,2,2,201752,1,0),
(11,2,2,201753,1,0),
(12,2,2,201801,0,0),
(13,2,2,201802,0,0),
(14,2,2,201803,0,0),
(15,2,2,201804,0,0)]
cols_new = ['id','item','store','week','sales','flag',]
data_df_new = spark.createDataFrame(data=data_new,schema=cols_new)
display(data_df_new)
So basically, I want 8 (this can also be 6 or 10) weeks of data for each item-store groupby combination. Wherever the year's 52/53 weeks end, I need the weeks to roll over into the next year, as shown in the sample. I need this in PySpark, thanks in advance!
See my attempt below. I could have made it shorter, but I felt it should be as explicit as possible, so I didn't chain the solutions. Code below:
import sys
from pyspark.sql import functions as F
from pyspark.sql.functions import expr, col, lit, sequence, row_number
from pyspark.sql.window import Window

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
# Convert the week of the year to a date
s = data_df.withColumn("week", expr("cast(week as string)")).withColumn('date', F.to_date(F.concat("week", F.lit("6")), "yyyywwu"))
# Put sales and dates in arrays per item/store group
s = (s.groupby('item', 'store').agg(F.collect_list('sales').alias('sales'), F.collect_list('date').alias('date'))
     .withColumn("id", sequence(lit(0), lit(6)))  # Create sequence ids with the required expansion range per group
    )
# Explode the dataframe back, one row per item/store/id combination
s = s.selectExpr('item', 'store', 'inline(arrays_zip(date,id,sales))')
# Partition window spanning from start to end of each item/store combination
w = Window.partitionBy('item', 'store').orderBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
# Partition window per item/store/date combination; its purpose is to aggregate over the null dates as a group
w1 = Window.partitionBy('item', 'store', 'date').orderBy('id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
s = (s.withColumn('increment', F.when(col('date').isNull(), (row_number().over(w1)) * 7).otherwise(0))  # Create increment values per item/store combination
     .withColumn('date1', F.when(col('date').isNull(), F.max('date').over(w)).otherwise(col('date')))  # Get the last date in each item/store combination
    )
# Compute the week of year and drop the columns no longer needed
s = s.withColumn("weekofyear", expr("weekofyear(date_add(date1, cast(increment as int)))")).drop('date', 'increment', 'date1').na.fill(0)
s.show(truncate=False)
Outcome
+----+-----+---+-----+----------+
|item|store|id |sales|weekofyear|
+----+-----+---+-----+----------+
|1 |1 |0 |3 |5 |
|1 |1 |1 |5 |6 |
|1 |1 |2 |7 |7 |
|1 |1 |3 |2 |8 |
|1 |1 |4 |0 |9 |
|1 |1 |5 |0 |10 |
|1 |1 |6 |0 |11 |
|2 |2 |0 |3 |50 |
|2 |2 |1 |0 |51 |
|2 |2 |2 |1 |52 |
|2 |2 |3 |1 |1 |
|2 |2 |4 |0 |2 |
|2 |2 |5 |0 |3 |
|2 |2 |6 |0 |4 |
+----+-----+---+-----+----------+

How to select rows based on exact count of array elements in a different column

Suppose I have a dataframe like this, where B_C is the concatenation of col B and col C, and column selected_B_C is an array formed by picking a few B_C values from within the group.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
I want to append a column selected that takes the value 1 if, for a row, col B_C is found in col selected_B_C, and 0 otherwise, so the final dataframe looks like this.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |selected|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|0 |
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|0 |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
The tricky part for col selected is that I only want the exact number of occurrences of a value in selected_B_C to be flagged with 1 in selected.
For example, in group 2, even though there are 3 records with the value 76498.0_20150501 in col B_C, I want only two of those records to have selected = 1, because selected_B_C for group 2 contains exactly 2 elements with the value 76498.0_20150501.
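
No answer was posted for this one. As a rough sketch in PySpark, assuming the dataframe is named df and that selected_B_C is a flat array of strings, one approach is to count how many elements of selected_B_C match each row's B_C and then flag at most that many duplicates using a row_number window:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the duplicates of each B_C value within its group A
w = Window.partitionBy('A', 'B_C').orderBy('C')

df_flagged = (
    df
    # How many elements of selected_B_C equal this row's B_C?
    .withColumn('occurrences', F.expr("size(filter(selected_B_C, x -> x = B_C))"))
    # Flag a row only while its duplicate index is within that count
    .withColumn('selected', F.when(F.row_number().over(w) <= F.col('occurrences'), 1).otherwise(0))
    .drop('occurrences')
)
df_flagged.show(truncate=False)

With the sample data this flags both group 1 rows whose B_C appears in the array, and only two of the three identical 76498.0_20150501 rows in group 2.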

Transform a table with duplicates

I'm trying to transform a table with duplicates into a new table, without the duplicates, according to the model below. I don't see how I can do it. Thanks in advance for your help.
Original table:
IDu| ID | Information
1  | A  | 1
2  | A  | 2
3  | A  | 3
4  | A  | 4
5  | A  | 5
6  | B  | 1
7  | B  | 2
8  | B  | 3
9  | B  | 4
10 | C  | 1
11 | D  | 1
12 | D  | 2
13 | D  | 3
Table to reach:
ID | Result/table2 | greatest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
You can use GROUP_CONCAT
(https://www.w3resource.com/mysql/aggregate-functions-and-grouping/aggregate-functions-and-grouping-group_concat.php):
SELECT ID, GROUP_CONCAT(INFORMATION), COUNT(INFORMATION)
FROM TABLE
GROUP BY ID
A huge thank you, quick and perfect response. On the other hand, how can I filter to get the greatest value? This query goes from smallest to largest, but how do I keep only the largest value?
ID | Result/table2 | greatest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
I tried the following, but without success:
SELECT ID,GROUP_CONCAT(ID1)
from tournee_reduite
GROUP BY ID
ORDER BY MAX(ID1) desc;
another huge thank you

Distinct count on multiple unrelated columns

I have a dataset from which I want a distinct count of more than one column, and I want the result in a single select. How do I go about it?
Example:
Table:
|Col_A|Col_B|
|a    |c    |
|a    |d    |
|b    |c    |
|b    |d    |
|b    |c    |
I want something like this (using a single select query):
|Col_A|Count_of_A|Col_B|Count_of_B|
|a    |2         |c    |3         |
|b    |3         |d    |2         |
How can this be done? Given that the data is unknown each time, we cannot use where or case statements for a specific use case.
Ideally this is a Spark-Streaming problem, where I want to do this operation on a Spark-streaming dataframe every time new data comes in from Kafka.
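
No answer was posted here either. As a minimal sketch, assuming a batch dataframe named df with columns Col_A and Col_B (with a streaming source this would have to run inside foreachBatch, since row_number windows are not supported on streaming dataframes), one way is to aggregate each column independently and then zip the two summaries side by side on a row index:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count occurrences of each distinct value, independently per column
counts_a = (df.groupBy('Col_A').agg(F.count('*').alias('Count_of_A'))
              .withColumn('rn', F.row_number().over(Window.orderBy('Col_A'))))
counts_b = (df.groupBy('Col_B').agg(F.count('*').alias('Count_of_B'))
              .withColumn('rn', F.row_number().over(Window.orderBy('Col_B'))))

# The columns are unrelated, so join the two summaries only on the row index
result = (counts_a.join(counts_b, 'rn', 'full_outer')
                  .orderBy('rn')
                  .select('Col_A', 'Count_of_A', 'Col_B', 'Count_of_B'))
result.show()

The unpartitioned row_number window pulls everything onto one partition, which is fine for small per-column summaries but worth keeping in mind.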

SUM specific set of values in SQL

I've been trying to find a way to sum a specific set of values in a query.
Currently I have a query that returns all the values needed, but I now want it to sum several of those values.
|Name|Value|
|x   |1    |
|x   |2    |
|x   |3    |
|x   |5    |
|y   |3    |
|y   |2    |
|y   |2    |
|y   |3    |
|z   |3    |
|z   |2    |
|z   |1    |
I don't know if I should run a subquery. I'm not necessarily summing up distinct values; instead I want something along the lines of this:
|Name|Value|
|x   |11   |
|y   |10   |
|z   |6    |
Although, each entry has its own unique ID for its respective row. I'm fairly new at this, so I don't know if I need to take that into account in my query.
Select q.name, sum(q.value)
from (YOUR_SELECT_QUERY) q
group by q.name
A CTE might be the easiest way to go here:
;WITH CTE AS (Your Query Here)
SELECT Name, SUM(Value) AS Value
FROM CTE
GROUP BY Name