Add rows of data to each group in a Spark dataframe

I have this dataframe -
data = [(0,1,1,201505,3),
(1,1,1,201506,5),
(2,1,1,201507,7),
(3,1,1,201508,2),
(4,2,2,201750,3),
(5,2,2,201751,0),
(6,2,2,201752,1),
(7,2,2,201753,1)
]
cols = ['id','item','store','week','sales']
data_df = spark.createDataFrame(data=data,schema=cols)
display(data_df)
What I want is this -
data_new = [(0,1,1,201505,3,0),
(1,1,1,201506,5,0),
(2,1,1,201507,7,0),
(3,1,1,201508,2,0),
(4,1,1,201509,0,0),
(5,1,1,201510,0,0),
(6,1,1,201511,0,0),
(7,1,1,201512,0,0),
(8,2,2,201750,3,0),
(9,2,2,201751,0,0),
(10,2,2,201752,1,0),
(11,2,2,201753,1,0),
(12,2,2,201801,0,0),
(13,2,2,201802,0,0),
(14,2,2,201803,0,0),
(15,2,2,201804,0,0)]
cols_new = ['id','item','store','week','sales','flag']
data_df_new = spark.createDataFrame(data=data_new,schema=cols_new)
display(data_df_new)
So basically, I want 8 weeks of data (this could also be 6 or 10) for each item-store groupby combination. Wherever the 52/53-week year ends, the weeks should roll over into the next year, as I have shown in the sample. I need this in PySpark, thanks in advance!

See my attempt below. I could have made it shorter, but I felt I should be as explicit as I can, so I didn't chain the solutions. Code below:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, expr, lit, row_number, sequence
from pyspark.sql.window import Window
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
# Convert week of the year to a date (the appended "6" fixes the day of week)
s = data_df.withColumn("week", expr("cast(week as string)")).withColumn(
    "date", F.to_date(F.concat("week", F.lit("6")), "yyyywwu"))
# Put sales and dates in arrays per item/store group, then create sequence ids
# covering the required expansion range (0..6 gives 7 weeks; widen as needed)
s = (s.groupBy("item", "store")
     .agg(F.collect_list("sales").alias("sales"), F.collect_list("date").alias("date"))
     .withColumn("id", sequence(lit(0), lit(6))))
# Explode the dataframe back, one row per week and item/store combination;
# arrays_zip pads the shorter date/sales arrays with nulls
s = s.selectExpr("item", "store", "inline(arrays_zip(date, id, sales))")
# Window spanning the whole partition for each item/store combination
w = Window.partitionBy("item", "store").orderBy("id").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)
# Window per item/store/date; it numbers the null-date (newly added) rows as a group
w1 = Window.partitionBy("item", "store", "date").orderBy("id").rowsBetween(
    Window.unboundedPreceding, Window.currentRow)
# Increment (in days) for the new rows, and the last known date per group
s = (s.withColumn("increment",
                  F.when(col("date").isNull(), row_number().over(w1) * 7).otherwise(0))
     .withColumn("date1",
                 F.when(col("date").isNull(), F.max("date").over(w)).otherwise(col("date"))))
# Compute the week of year and drop the helper columns
s = (s.withColumn("weekofyear", expr("weekofyear(date_add(date1, cast(increment as int)))"))
     .drop("date", "increment", "date1")
     .na.fill(0))
s.show(truncate=False)
Outcome
+----+-----+---+-----+----------+
|item|store|id |sales|weekofyear|
+----+-----+---+-----+----------+
|1 |1 |0 |3 |5 |
|1 |1 |1 |5 |6 |
|1 |1 |2 |7 |7 |
|1 |1 |3 |2 |8 |
|1 |1 |4 |0 |9 |
|1 |1 |5 |0 |10 |
|1 |1 |6 |0 |11 |
|2 |2 |0 |3 |50 |
|2 |2 |1 |0 |51 |
|2 |2 |2 |1 |52 |
|2 |2 |3 |1 |1 |
|2 |2 |4 |0 |2 |
|2 |2 |5 |0 |3 |
|2 |2 |6 |0 |4 |
+----+-----+---+-----+----------+
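For comparison, here is a more compact sketch of the same idea (my own suggestion, not part of the original attempt): derive each group's last date, generate the missing weeks with sequence() and date arithmetic (the interval step needs Spark 2.4+), and union them back onto the original rows. It reuses data_df and the legacy time parser policy set above.
from pyspark.sql import functions as F

base = data_df.withColumn(
    "date", F.to_date(F.concat(F.col("week").cast("string"), F.lit("6")), "yyyywwu"))
extra = (base.groupBy("item", "store")
         .agg(F.max("date").alias("last_date"))
         # 4 extra weeks per group; widen the upper bound for 6, 8 or 10 weeks
         .withColumn("date", F.explode(F.expr(
             "sequence(date_add(last_date, 7), date_add(last_date, 28), interval 7 days)")))
         .withColumn("sales", F.lit(0))
         .select("item", "store", "date", "sales"))
result = (base.select("item", "store", "date", "sales")
          .unionByName(extra)
          .withColumn("weekofyear", F.weekofyear("date"))
          .orderBy("item", "store", "date"))
result.show(truncate=False)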

Related

Is it possible to find the sum of values in one row in SQLite?

If I have data in a table with integers like the example below, is it possible to calculate, for each row, the sum of several columns and output that sum alongside several other columns through an SQLite query?
My table looks like this below
|Timestamp |Email |Name |Year|Make |Model |Car_ID|Judge_ID|Judge_Name|Racer_Turbo|Racer_Supercharged|Racer_Performance|Racer_Horsepower|Car_Overall|Engine_Modifications|Engine_Performance|Engine_Chrome|Engine_Detailing|Engine_Cleanliness|Body_Frame_Undercarriage|Body_Frame_Suspension|Body_Frame_Chrome|Body_Frame_Detailing|Body_Frame_Cleanliness|Mods_Paint|Mods_Body|Mods_Wrap|Mods_Rims|Mods_Interior|Mods_Other|Mods_ICE|Mods_Aftermarket|Mods_WIP|Mods_Overall|
|--------------|---------------------------|----------|----|--------|---------|------|--------|----------|-----------|------------------|-----------------|----------------|-----------|--------------------|------------------|-------------|----------------|------------------|------------------------|---------------------|-----------------|--------------------|----------------------|----------|---------|---------|---------|-------------|----------|--------|----------------|--------|------------|
|8/5/2018 14:10|honoland13#japanpost.jp |Hernando |2015|Acura |TLX |48 |J04 |Bob |0 |0 |2 |2 |4 |4 |0 |2 |4 |4 |2 |4 |2 |2 |2 |2 |2 |0 |4 |4 |4 |6 |2 |0 |4 |
|8/5/2018 15:11|nlighterness2q#umn.edu |Noel |2015|Jeep |Wrangler |124 |J02 |Carl |0 |6 |4 |2 |4 |6 |6 |4 |4 |4 |6 |6 |6 |6 |6 |4 |6 |6 |6 |6 |6 |4 |6 |4 |6 |
|8/5/2018 17:10|eguest47#microsoft.com |Edan |2015|Lexus |Is250 |222 |J05 |Adrian |0 |0 |0 |0 |0 |0 |0 |0 |6 |6 |6 |0 |0 |6 |6 |6 |0 |0 |0 |0 |0 |0 |0 |0 |4 |
|8/5/2018 17:34|hchilley40#fema.gov |Hieronymus|1993|Honda |Civic eG |207 |J06 |Aaron |0 |0 |2 |2 |2 |2 |2 |2 |0 |4 |2 |2 |2 |2 |2 |2 |4 |2 |2 |0 |0 |0 |2 |2 |0 |
|8/5/2018 14:30|nnowick3d#tuttocitta.it |Nickolas |2016|Ford |Mystang |167 |J02 |Carl |0 |0 |2 |2 |0 |2 |2 |0 |0 |0 |0 |2 |0 |2 |2 |2 |0 |0 |2 |0 |0 |0 |0 |0 |2 |
|8/5/2018 16:12|mdearl39#amazon.co.uk |Martin |2013|Hyundai |Gen coupe|159 |J04 |Bob |0 |0 |2 |0 |0 |0 |2 |0 |0 |0 |0 |2 |0 |2 |2 |0 |2 |0 |2 |0 |0 |0 |0 |0 |0 |
How can I find the sum of columns 10 to 34 for each row, then output each row up to column 7 followed by a column with that row's total? So far I've only figured out how to get the sum of each column individually, not the sum across several columns for each row, nor how to output only the desired columns.
SELECT Car_ID, Year, Make, Model, SUM(Mods_ICE) FROM Carstable
But this only outputs a single row at the bottom of the table with the sum. The expected outcome would be something like below:
|Car_ID|Year |Make |Model |Total |
|------|------|------|---------|-------|
|48 |2015 |Acura |TLX |89 |
|22 |2015 |Chevy |Camaro |101 |
|19 |2006 |Ford |Mustang |55 |
|101 |2011 |Subaru|WRX |91 |
For the sum of columns in a single row you need no aggregate function like SUM. Use the + operator:
SELECT column10 + ... + column34 FROM Carstable

How to get the week number using the year and day of year in PySpark?

I am trying to add a numbering column to a table. I need to assign 1 to the first 7 rows in the dataframe, then 2 to the second 7 rows, and so on. For an example, please refer to the last column in the dataframe.
I am basically trying to get week number based on day of the year and year
+-------+---------------+----------------+------------------+---------+----------+
|datekey|datecalendarday|datecalendaryear|weeknumberofseason|indicator|weeknumber|
+-------+---------------+----------------+------------------+---------+----------+
|4965   |1              |2018            |2                 |1        |1         |
|4966   |2              |2018            |2                 |2        |1         |
|4967   |3              |2018            |2                 |3        |1         |
|4968   |4              |2018            |2                 |4        |1         |
|4969   |5              |2018            |2                 |5        |1         |
|4970   |6              |2018            |2                 |6        |1         |
|4971   |7              |2018            |3                 |7        |1         |
|4972   |8              |2018            |3                 |8        |2         |
|4973   |9              |2018            |3                 |9        |2         |
|4974   |10             |2018            |3                 |10       |2         |
|4975   |11             |2018            |3                 |11       |2         |
|4976   |12             |2018            |3                 |12       |2         |
|4977   |13             |2018            |3                 |13       |2         |
|4978   |14             |2018            |4                 |14       |2         |
+-------+---------------+----------------+------------------+---------+----------+
I stumbled upon a solution where I use the ntile function to get the week number from the days available in that year. Any other efficient solution would also help. Thanks in advance!
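Since the desired column just groups every 7 consecutive days, a window-free sketch (my suggestion, not from the original thread) is plain integer arithmetic on datecalendarday, assuming that column is the day of the year:
from pyspark.sql import functions as F

# df is a dataframe with the columns shown above; ceil(day_of_year / 7)
# numbers days 1-7 as week 1, days 8-14 as week 2, and so on
df = df.withColumn("weeknumber",
                   F.ceil(F.col("datecalendarday") / F.lit(7)).cast("int"))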

How to select rows based on exact count of array elements in a different column

Suppose I have a dataframe like this, where B_C is the concatenation of col B and col C, and column selected_B_C is an array formed by picking a few B_C values from within the group.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
I want to append a column selected that takes the value 1 if, for a row, col B_C is found in col selected_B_C, and 0 otherwise, so that the final dataframe looks like this.
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|A |grp_count_A|B |C |B_C |D |selected_B_C |selected|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|1 |6 |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|1 |
|1 |6 |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|1 |6 |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|0 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1 |
|2 |4 |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|0 |
|2 |4 |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|0 |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
The tricky part for col selected is that I only want as many rows flagged with 1 as there are occurrences of that value in selected_B_C.
For example, in group 2, even though there are 3 records with the value 76498.0_20150501 for col B_C, I want only two of them to have selected = 1, because selected_B_C for group 2 contains exactly 2 elements with the value 76498.0_20150501.
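One way to sketch this (my own suggestion; the occurrences and rn columns are hypothetical helpers): count how often each row's B_C appears in selected_B_C with a higher-order filter (Spark 2.4+), then flag only that many duplicates per (A, B_C) group using row_number:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# How many times this row's B_C occurs in selected_B_C (0 if absent)
df = df.withColumn("occurrences",
                   F.expr("size(filter(selected_B_C, x -> x = B_C))"))
# Number duplicate B_C rows within each group so that only the first
# `occurrences` of them are flagged with 1
w = Window.partitionBy("A", "B_C").orderBy(F.monotonically_increasing_id())
df = (df.withColumn("rn", F.row_number().over(w))
        .withColumn("selected",
                    F.when(F.col("rn") <= F.col("occurrences"), 1).otherwise(0))
        .drop("occurrences", "rn"))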

Oracle: Recursively self-referential join with nth-level record

I have a self-referential table like this:
id |level | parent_id
----------------------
1 |1 |null
2 |1 |null
3 |2 |1
4 |2 |1
5 |2 |2
6 |3 |5
7 |3 |3
8 |4 |7
9 |4 |6
------------------------
I need the nth-level parent in the result, for example the 2nd-level parent:
id |level | parent_id| second_level_parent_id
------------------------------------------------
1 |1 |null |null
2 |1 |null |null
3 |2 |1 |null
4 |2 |1 |null
5 |2 |2 |null
6 |3 |5 |5
7 |3 |3 |3
8 |4 |7 |3
9 |4 |6 |5
-------------------------------------------------
This works for me:
SELECT m.*,
CONNECT_BY_ROOT id AS second_level_parent_id
FROM my_table m
WHERE CONNECT_BY_ROOT level =2
CONNECT BY prior id = parent_id;
Thanks @Jozef Dúc

SQL: Need to SUM column for each type

How can I find the SUM of all scores for the minimum date of each lesson_id please:
-----------------------------------------------------------
|id |uid |group_id |lesson_id |game_id |score |date |
-----------------------------------------------------------
|1 |145 |1 |1 |0 |40 |1391627323 |
|2 |145 |1 |1 |0 |80 |1391627567 |
|3 |145 |1 |2 |0 |40 |1391627323 |
|4 |145 |1 |3 |0 |30 |1391627323 |
|5 |145 |1 |3 |0 |90 |1391627567 |
|6 |145 |1 |4 |0 |20 |1391628000 |
|7 |145 |1 |5 |0 |35 |1391628000 |
-----------------------------------------------------------
I need output:
-------------------
|sum_first_scores |
-------------------
|165 |
-------------------
I have this so far, which lists the score for each minimum date, per lesson, but I need to sum those results as above:
SELECT lesson_id, MIN(date), score AS first_score FROM cdu_user_progress
WHERE cdu_user_progress.uid = 145
GROUP BY lesson_id
You can identify the first score as the one where no earlier record exists. Then just take the sum:
select sum(score)
from cdu_user_progress eup
where eup.uid = 145 and
      not exists (select 1
                  from cdu_user_progress eup2
                  where eup2.uid = eup.uid and
                        eup2.lesson_id = eup.lesson_id and
                        eup2.date < eup.date
                 );
This assumes that the minimum date for the lesson id has only one score.