I have this dataframe -
data = [(0,1,1,201505,3),
(1,1,1,201506,5),
(2,1,1,201507,7),
(3,1,1,201508,2),
(4,2,2,201750,3),
(5,2,2,201751,0),
(6,2,2,201752,1),
(7,2,2,201753,1)
]
cols = ['id','item','store','week','sales']
data_df = spark.createDataFrame(data=data,schema=cols)
display(data_df)
What I want is this:
data_new = [(0,1,1,201505,3,0),
(1,1,1,201506,5,0),
(2,1,1,201507,7,0),
(3,1,1,201508,2,0),
(4,1,1,201509,0,0),
(5,1,1,201510,0,0),
(6,1,1,201511,0,0),
(7,1,1,201512,0,0),
(8,2,2,201750,3,0),
(9,2,2,201751,0,0),
(10,2,2,201752,1,0),
(11,2,2,201753,1,0),
(12,2,2,201801,0,0),
(13,2,2,201802,0,0),
(14,2,2,201803,0,0),
(15,2,2,201804,0,0)]
cols_new = ['id','item','store','week','sales','flag']
data_df_new = spark.createDataFrame(data=data_new,schema=cols_new)
display(data_df_new)
So basically, I want 8 weeks of data (this could also be 6 or 10) for each item-store group combination. Wherever the 52/53 weeks of a year end, the weeks should roll over into the next year, as shown in the sample above. I need this in PySpark; thanks in advance!
See my attempt below. I could have made it shorter, but I felt I should be as explicit as possible, so I didn't chain the steps. Code below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "yyyywwu"-style week patterns need the legacy parser in Spark 3+
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

# Convert week of the year to a date (the Saturday of that week)
s = data_df.withColumn("week", F.col("week").cast("string")).withColumn("date", F.to_date(F.concat("week", F.lit("6")), "yyyywwu"))

# Put sales and dates in arrays per item/store group, with sequence ids covering the required expansion range
s = (s.groupby("item", "store")
     .agg(F.collect_list("sales").alias("sales"), F.collect_list("date").alias("date"))
     .withColumn("id", F.sequence(F.lit(0), F.lit(6))))

# Explode back to one row per item/store/id; positions past the end of the collected arrays come back as null dates/sales
s = s.selectExpr("item", "store", "inline(arrays_zip(date,id,sales))")

# Window spanning each whole item/store partition
w = Window.partitionBy("item", "store").orderBy("id").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
# Window per item/store/date; its purpose is to rank the null dates as a group
w1 = Window.partitionBy("item", "store", "date").orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)

s = (s.withColumn("increment", F.when(F.col("date").isNull(), F.row_number().over(w1) * 7).otherwise(0))  # days to add per missing week: 7, 14, 21, ...
     .withColumn("date1", F.when(F.col("date").isNull(), F.max("date").over(w)).otherwise(F.col("date"))))  # anchor null dates to the last known date in the group

# Compute the week of year and drop the helper columns
s = (s.withColumn("weekofyear", F.expr("weekofyear(date_add(date1, cast(increment as int)))"))
     .drop("date", "increment", "date1").na.fill(0))
s.show(truncate=False)
Outcome
+----+-----+---+-----+----------+
|item|store|id |sales|weekofyear|
+----+-----+---+-----+----------+
|1 |1 |0 |3 |5 |
|1 |1 |1 |5 |6 |
|1 |1 |2 |7 |7 |
|1 |1 |3 |2 |8 |
|1 |1 |4 |0 |9 |
|1 |1 |5 |0 |10 |
|1 |1 |6 |0 |11 |
|2 |2 |0 |3 |50 |
|2 |2 |1 |0 |51 |
|2 |2 |2 |1 |52 |
|2 |2 |3 |1 |1 |
|2 |2 |4 |0 |2 |
|2 |2 |5 |0 |3 |
|2 |2 |6 |0 |4 |
+----+-----+---+-----+----------+
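For what it's worth, here is a shorter alternative sketch of the same idea (it assumes the same legacy parser setting and yyyywwu parsing trick as above; n_weeks and the helper names with_dates, starts, padded, result are my own, not from the question): take the first date per item/store group, explode a 7-day-step date sequence out to n_weeks, and left-join the original sales back, filling misses with 0.

from pyspark.sql import functions as F

n_weeks = 8  # total number of weeks wanted per item/store group

# Parse the week column into a date, as in the answer above
with_dates = data_df.withColumn(
    "date", F.to_date(F.concat(F.col("week").cast("string"), F.lit("6")), "yyyywwu"))

# First date of each item/store group
starts = with_dates.groupBy("item", "store").agg(F.min("date").alias("start"))

# One row per wanted week: explode a date sequence stepping 7 days
padded = (starts
          .withColumn("date", F.explode(F.expr(
              f"sequence(start, date_add(start, {(n_weeks - 1) * 7}), interval 7 days)")))
          .drop("start"))

# Join the observed sales back; weeks with no data get sales = 0
result = (padded.join(with_dates.select("item", "store", "date", "sales"),
                      ["item", "store", "date"], "left")
          .withColumn("sales", F.coalesce(F.col("sales"), F.lit(0)))
          .withColumn("weekofyear", F.weekofyear("date")))
result.orderBy("item", "store", "date").show()

Because weekofyear is computed from an actual date, the rollover from week 52/53 into week 1 of the next year falls out for free.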
Suppose I have a dataframe like this, where B_C is the concatenation of col B and col C, and column selected_B_C is an array formed by picking a few B_C values from within the group.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[76498.0_20150501, 76498.0_20150501]  |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[76498.0_20150501, 76498.0_20150501]  |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[76498.0_20150501, 76498.0_20150501]  |
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[76498.0_20150501, 76498.0_20150501]  |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
I want to append a column selected that takes the value 1 if, for a row, col B_C is found in col selected_B_C, and 0 otherwise, so the final dataframe looks like this.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |selected|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[76498.0_20150501, 76498.0_20150501]  |1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[76498.0_20150501, 76498.0_20150501]  |1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[76498.0_20150501, 76498.0_20150501]  |0 |
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[76498.0_20150501, 76498.0_20150501]  |0 |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
The tricky part for col selected is that I only want as many rows flagged with 1 as there are occurrences of that value in selected_B_C.
For example, in group 2, even though there are 3 records with B_C value 76498.0_20150501, I want only two of them to get selected = 1, because selected_B_C for group 2 contains exactly 2 elements with the value 76498.0_20150501.
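One hedged sketch of how this could be done in PySpark (df, dup_rank, and occ_in_selected are my own names, not from the question): number the duplicate B_C values within each group, count how many times each row's B_C occurs in selected_B_C, and flag a row only while its duplicate rank is within that count.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank duplicate B_C values within each group A (tie order is arbitrary here)
w = Window.partitionBy("A", "B_C").orderBy(F.monotonically_increasing_id())

result = (df
          .withColumn("dup_rank", F.row_number().over(w))
          # How many times this row's B_C occurs inside selected_B_C
          .withColumn("occ_in_selected",
                      F.expr("size(filter(selected_B_C, x -> x = B_C))"))
          .withColumn("selected",
                      F.when(F.col("dup_rank") <= F.col("occ_in_selected"), 1)
                      .otherwise(0))
          .drop("dup_rank", "occ_in_selected"))

For group 2 this gives the three 76498.0_20150501 rows ranks 1, 2, 3 against an occurrence count of 2, so exactly two of them receive selected = 1, as required.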
I'm trying to transform a table containing duplicates into a new table according to the model below. It seems impossible because of the duplicates, and I don't see how to do it. Thanks in advance for your help.
Original table:
IDu| ID | Information
1 |A |1
2 |A |2
3 |A |3
4 |A |4
5 |A |5
6 |B |1
7 |B |2
8 |B |3
9 |B |4
10 |C |1
11 |D |1
12 |D |2
13 |D |3
Target table:
ID | Resultat/table2 | greatest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
You can use GROUP_CONCAT
(https://www.w3resource.com/mysql/aggregate-functions-and-grouping/aggregate-functions-and-grouping-group_concat.php):
SELECT
ID, GROUP_CONCAT(INFORMATION), COUNT(INFORMATION)
FROM
TABLE
GROUP BY
ID
A huge thank you, quick and perfect response.
On the other hand, how can I filter to get the greatest value? The query lists the values from smallest to largest, but how do I keep only the largest value?
ID | Resultat/table2 | greatest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
I tried the following, but without success:
SELECT ID,GROUP_CONCAT(ID1)
from tournee_reduite
GROUP BY ID
ORDER BY MAX(ID1) desc;
Another huge thank you!
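If the goal is the largest value per group as its own column (rather than sorting the groups), aggregate with MAX alongside GROUP_CONCAT. A sketch, keeping the ID1 column name from your attempt above:

SELECT
    ID,
    GROUP_CONCAT(ID1 ORDER BY ID1),
    MAX(ID1)
FROM
    tournee_reduite
GROUP BY
    ID;

ORDER BY belongs inside GROUP_CONCAT to sort the concatenated list, while MAX(ID1) as a separate select expression returns the largest value for each group.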