Group column by year range in a PySpark DataFrame

I am trying to turn my year column into year ranges instead of specific year values. This is a movie dataset.
This is my code:
group = join_DF.groupby("relYear").avg("rating").withColumnRenamed("relYear", "year_range")
group.show()
This is what I have right now:
+----------+------------------+
|year_range|       avg(rating)|
+----------+------------------+
|      1953|3.7107686857952533|
|      1903|3.0517241379310347|
|      1957|3.9994918537809254|
|      1897|2.9177215189873418|
|      1987|3.5399940908663594|
|      1956|3.7077949616896153|
|      2016|3.5318961695914055|
|      1936|3.8356813313560724|
|      2012|3.5490157995509457|
|          |3.5151401495104130|
+----------+------------------+
This is what I want to achieve:
+-----------------+------------------+
|       year_range|       avg(rating)|
+-----------------+------------------+
|        1970-1979|3.7773614199240319|
|        1960-1969|3.8007319471419123|
|                 |3.5455419410410923|
|        1980-1989|3.5778570247142313|
|     2000 onwards|3.5009940908663594|
| 1959 and earlier|3.8677949616896153|
|        1990-1999|3.4618961695914055|
+-----------------+------------------+
The rows where year_range is null are movie titles without a stated release year.
1874 is the earliest year and 2019 is the latest.

You can divide the year by 10 to make your bin.
from pyspark.sql import functions as F

# df is the grouped DataFrame from the question (year_range, avg(rating));
# bucket each year into a decade bin, clamping the two open-ended extremes
df = (df.withColumn('bin', F.floor(F.col('year_range') / 10))
        .withColumn('bin', F.when(F.col('bin') >= 200, 200)
                            .when(F.col('bin') <= 195, 195)
                            .otherwise(F.col('bin')))
        .groupby('bin')
        # unweighted mean of the per-year averages
        .agg(F.avg('avg(rating)').alias('10_years_avg')))
Then, format your year_range column.
df = df.withColumn('year_range',
                   F.when(F.col('bin') >= 200, F.lit('2000 onwards'))
                    .when(F.col('bin') <= 195, F.lit('1959 and earlier'))
                    .otherwise(F.concat(F.col('bin').cast('string'), F.lit('0-'),
                                        F.col('bin').cast('string'), F.lit('9'))))
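A quick way to eyeball the result (a sketch, assuming the column names above; note that rows with a null year_range fall through both when branches and stay null, matching the blank row in the desired output):
df.orderBy('bin').select('year_range', '10_years_avg').show(truncate=False)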

Related

Is there a way to extract data from a map(varchar, varchar) column in SQL?

The data is stored as map(varchar, varchar) and looks like this:
Date        Info                                                  ID
2020-06-10  {"Price":"102.45", "Time":"09:31", "Symbol":"AAPL"}  10
2020-06-10  {"Price":"10.28", "Time":"12:31", "Symbol":"MSFT"}   10
2020-06-11  {"Price":"12.45", "Time":"09:48", "Symbol":"T"}      10
Is there a way to split up the info column and return a table where each entry has its own column?
Something like this:
Date        Price   Time   Symbol  ID
2020-06-10  102.45  09:31  AAPL    10
2020-06-10  10.28   12:31  MSFT    10
Note that the Time key may not appear in every entry. For example, an entry can look like this:
Date        Info                                 ID
2020-06-10  {"Price":"10.28", "Symbol":"MSFT"}  10
In this case, I would like it to just be filled with a null/NaN value.
Thanks
You can use the subscript operator ([]) or the element_at function to access the values in the map. The difference between the two is that [] will fail with an error if the key is missing from the map.
WITH data(dt, info, id) AS (VALUES
(DATE '2020-06-10', map_from_entries(ARRAY[('Price', '102.45'), ('Time', '09:31'), ('Symbol','AAPL')]), 10),
(DATE '2020-06-10', map_from_entries(ARRAY[('Price', '10.28'), ('Time', '12:31'), ('Symbol','MSFT')]), 10),
(DATE '2020-06-11', map_from_entries(ARRAY[('Price', '12.45'), ('Time', '09:48'), ('Symbol','T')]), 10),
(DATE '2020-06-12', map_from_entries(ARRAY[('Price', '20.99'), ('Symbol','X')]), 10))
SELECT
dt AS "date",
element_at(info, 'Price') AS price,
element_at(info, 'Time') AS time,
element_at(info, 'Symbol') AS symbol,
id
FROM data
date | price | time | symbol | id
------------+--------+-------+--------+----
2020-06-10 | 102.45 | 09:31 | AAPL | 10
2020-06-10 | 10.28 | 12:31 | MSFT | 10
2020-06-11 | 12.45 | 09:48 | T | 10
2020-06-12 | 20.99 | NULL | X | 10
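For completeness, since the other questions here use PySpark: Spark also has an element_at with the same missing-key behavior (it returns NULL). A minimal sketch, assuming a DataFrame df with a MapType column info and columns dt and id as in the SQL above:
from pyspark.sql import functions as F

result = df.select(
    'dt',
    F.element_at('info', F.lit('Price')).alias('price'),
    F.element_at('info', F.lit('Time')).alias('time'),
    F.element_at('info', F.lit('Symbol')).alias('symbol'),
    'id',
)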
This answers the original version of the question.
If that is really a string, you can use regular expressions:
select t.*,
regexp_extract(info, '"Price":"([^"]*)"', 1) as price,
regexp_extract(info, '"Symbol":"([^"]*)"', 1) as symbol,
regexp_extract(info, '"Time":"([^"]*)"', 1) as time
from t;

Group by year with pd.Timestamp / datetime64 in the format YYYY-MM-DD while keeping the full timestamp

I have a dataframe with columns "time" and "value", of types YYYY-MM-DD dates and np.int64 respectively:
time | value
2009-11-03 | 13
2009-11-14 | 25
2009-12-05 | 25
2016-03-02 | 80
2016-05-17 | 56
I need to group by year, getting the maximum value per year. If several days within the same year share the highest value, I need to keep them all. But I need to keep the full timestamp as well.
Desired output:
time | value
2009-11-14 | 25
2009-12-05 | 25
2016-03-02 | 80
My code so far:
df["year"] = df["time"].dt.year
df = df.groupby(["year"], sort=False)['value'].max()
But this removes the timestamp, and I am left with only the year and value columns. How can I get the desired result?
Let us try transform first, then filter:
m=df.value.eq(df.groupby(df.time.dt.year).value.transform('max'))
df=df[m]
Out[111]:
time value
1 2009-11-14 25
2 2009-12-05 25
3 2016-03-02 80
Calculate the maximum values per year, and then join the result with the original data frame:
df["year"] = pd.to_datetime(df["time"]).dt.year
max_val = df.groupby(["year"], sort=False)['value'].max()
pd.merge(max_val, df, on=["value", "year"])
result:
value year time
0 25 2009 2009-11-14
1 25 2009 2009-12-05
2 80 2016 2016-03-02
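To match the desired output exactly, you can then drop the helper column (a small follow-up to the merge above):
pd.merge(max_val, df, on=["value", "year"])[["time", "value"]]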

Query a table so that data in one column can be shown as different fields

I have a table that stores customer care data. The table/view has the following structure:
userid calls_received calls_answered calls_rejected call_date
-----------------------------------------------------------------------
1030 134 100 34 28-05-2018
1012 140 120 20 28-05-2018
1045 120 80 40 28-05-2018
1030 99 39 50 28-04-2018
1045 50 30 20 28-04-2018
1045 200 100 100 28-05-2017
1030 160 90 70 28-04-2017
1045 50 30 20 28-04-2017
This is the sample data; it is stored on a daily basis.
I have to create a report in report designer software that takes a date as input. When the user selects a date, e.g. 28/05/2018, it is sent as the parameter ${call_date}. I have to query the view so that the result looks like the table below: if the user selects 28/05/2018, then the data for 28/04/2018 and 28/05/2017 should be displayed side by side, in the column order shown.
userid | cl_cur | ans_cur | rej_cur |success_percentage |diff_percent|position_last_month| cl_last_mon | ans_las_mon | rej_last_mon |percentage_lm|cl_last_year | ans_last_year | rej_last_year
1030 | 134 | 100 | 34 | 74.6 % | 14% | 2 | 99 | 39 | 50 | 39.3% | 160 | 90 | 70
1045 | 120 | 80 | 40 | 66.6% | 26.7% | 1 | 50 | 30 | 20 | 60% | 50 | 30 | 20
The objective of this query is to show the data for the selected day, the same day of the previous month, and the same day of the previous year side by side, so the user can compare them. The result is ordered in descending order of the selected day's percentage (ans_cur/cl_cur), shown under success_percentage.
The column position_last_month is the position of that particular employee in the previous month when ordered in descending order of percentage. In this example, userid 1030 was in 2nd position last month and userid 1045 in 1st position. I have to calculate the same for the year as well.
There is also a field called diff_percent, which is the percentage difference from the person who was in the same position last month. The same goes for last year. How can I achieve this result? Please help.
This answers the original version of the question.
One method is a join:
select t.userid,
       t.calls_received as cl_cur, t.calls_answered as ans_cur, t.calls_rejected as rej_cur,
       tm.calls_received as cl_last_mon, tm.calls_answered as ans_last_mon, tm.calls_rejected as rej_last_mon,
       ty.calls_received as cl_last_year, ty.calls_answered as ans_last_year, ty.calls_rejected as rej_last_year
from t left join
     t tm
     on tm.userid = t.userid and
        tm.call_date = dateadd(month, -1, t.call_date) left join
     t ty
     on ty.userid = t.userid and
        ty.call_date = dateadd(year, -1, t.call_date)
where t.call_date = ${call_date};
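If it is easier to prototype outside the report designer, the same self-join translates directly to pandas. A sketch, assuming a DataFrame df with the columns shown and call_date parsed as datetime:
import pandas as pd

sel = pd.Timestamp('2018-05-28')
cur = df[df['call_date'] == sel]
last_mon = df[df['call_date'] == sel - pd.DateOffset(months=1)]
last_year = df[df['call_date'] == sel - pd.DateOffset(years=1)]

# left joins mirror the SQL: keep every user present on the selected day
out = (cur.merge(last_mon, on='userid', how='left', suffixes=('', '_last_mon'))
          .merge(last_year, on='userid', how='left', suffixes=('', '_last_year')))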

How to group dates from an MS Access database as week of month using Excel VBA

I am using an MS Access 2010 database and working with Excel VBA to connect to the database and run queries. Suppose I have a table named "MyTable" like the one below:
---------------------
| Date    | Count   |
---------------------
| 7/7/16  | 12      |
| 7/8/16  | 15      |
| 7/15/16 | 18      |
| 7/18/16 | 16      |
| 8/7/16  | 15      |
| 8/8/16  | 10      |
| 8/15/16 | 9       |
| 8/16/16 | 18      |
---------------------
Now I want to use query to get a table like this:
-----------------------
| Week by Month | Sum |
-----------------------
| July Week 2   | 27  |
| July Week 3   | 18  |
| July Week 4   | 16  |
| Aug Week 2    | 25  |
| Aug Week 3    | 27  |
-----------------------
Use DatePart to get the week of the year, then subtract the week of the first day of the month (giving a zero-based week of the month), and then add 1 (to get a one-based week of the month):
Public Function WeekOfMonth(x As Date) As Integer
WeekOfMonth = DatePart("ww", x) - _
DatePart("ww", DateSerial(Year(x), Month(x), 1)) _
+ 1
End Function
Note that the Access SQL version should be identical to what's after the = sign.
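If you need the same computation outside VBA, here is a rough Python equivalent (a sketch; it assumes Sunday-start week numbering, which is both Access's default for DatePart("ww", ...) and what strftime's %U uses):
from datetime import date

def week_of_month(d):
    # %U: week of the year, with Sunday as the first day of the week
    def week_of_year(x):
        return int(x.strftime('%U'))
    return week_of_year(d) - week_of_year(d.replace(day=1)) + 1

print(week_of_month(date(2016, 7, 7)))  # 2, i.e. "July Week 2" in the table above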
I have solved this as below:
select weeknum, sum(count1)
from (
    select format(date1, 'MMM') & " Week - " & int((datepart('d', date1, 1, 1) - 1) / 7 + 1) as weeknum, count1
    from MyTable)
group by weeknum
Show the week of the month where Week 1 is always the first full week starting in that month (the first Sunday may be day 1, 2, 3, 4, 5, 6, or 7); days of the month prior to the first Sunday are counted as week 4/5 of the previous month.
After searching and failing to find EXACTLY the right answer for my situation, I modified ComIntern's solution as follows. This is used as a CONTROL on a REPORT, where [StartDate] is a criterion on the form that calls/generates the report:
=IIf((DatePart("ww",[StartDate]-7)-DatePart("ww",DateSerial(Year([StartDate]-7),Month([StartDate]-7),1))+1)="5","1",DatePart("ww",[StartDate])-DatePart("ww",DateSerial(Year([StartDate]),Month([StartDate]),1))+0)
This results in showing the week of the month based on FULL weeks, and accounts for when the previous month's week 5 included 1 or more days from this month.
For example, Week 5 of Oct 2017 is 29 OCT - 04 NOV. If I did not include the IIf statement to adjust the formula, 05-11 NOV would be returned as Week 2, but for my reporting purposes it is Week 1 of NOV. I have tested this and it appears to ALWAYS work; if you need to see the week of the month based on FULL weeks, this should work for you!

How to calculate the broadcast year and month out of the given date?

Is there a way to calculate the broadcast year and month for a given Gregorian date?
The advertising broadcast calendar differs from the regular calendar in that every month starts on a Monday, ends on a Sunday, and has exactly 4 or 5 weeks. You can read about it here: http://en.wikipedia.org/wiki/Broadcast_calendar
This is a pretty common thing in TV advertising, so I guess there is a standard mathematical formula for it that uses a combination of date functions (week(), month(), etc.).
Here is an example mapping between Gregorian and broadcast dates:
| gregorian_date | broadcast_month | broadcast_year |
+----------------+-----------------+----------------+
| 2014-12-27 | 12 | 2014 |
| 2014-12-28 | 12 | 2014 |
| 2014-12-29 | 1 | 2015 |
| 2014-12-30 | 1 | 2015 |
| 2014-12-31 | 1 | 2015 |
| 2015-01-01 | 1 | 2015 |
| 2015-01-02 | 1 | 2015 |
Here is example how the broadcast calendar looks for 2015:
http://www.rab.com/public/reports/BroadcastCalendar_2015.pdf
As far as I can see, the pattern is that the first of the Gregorian month always falls within the first week of the broadcast month, and any days from the previous month are pulled forward into that month to create full weeks. In Excel, with the first gregorian_date above in cell A2, you can use the following formula in cell B2 to calculate the broadcast month:
=MONTH(A2+(7-WEEKDAY(A2,2)))
Similarly, in cell C2:
=IF(AND(MONTH(A2)=12,B2=1),YEAR(A2)+1,YEAR(A2))
This will return the broadcast month and year for any dates you put into your data set.
Hope that helps!
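The same rule is straightforward to express in code. A minimal Python sketch (my own translation of the Excel formulas above): the broadcast month is the month of the Sunday that closes the week containing the date, and the broadcast year rolls forward when a late-December date lands in broadcast January.
from datetime import date, timedelta

def broadcast_month_year(d):
    # the Sunday that ends the broadcast week containing d
    # (date.weekday(): Monday = 0 ... Sunday = 6)
    week_end = d + timedelta(days=6 - d.weekday())
    b_month = week_end.month
    # a late-December date in broadcast January belongs to the next year
    b_year = d.year + 1 if d.month == 12 and b_month == 1 else d.year
    return b_month, b_year

print(broadcast_month_year(date(2014, 12, 28)))  # (12, 2014)
print(broadcast_month_year(date(2014, 12, 29)))  # (1, 2015)
For reference, another answer provides the full first/last-day mapping for broadcast months from 2018 through 2025: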
month,first,last
2018_1,2018-01-01,2018-01-28
2018_2,2018-01-29,2018-02-25
2018_3,2018-02-26,2018-03-25
2018_4,2018-03-26,2018-04-29
2018_5,2018-04-30,2018-05-27
2018_6,2018-05-28,2018-06-24
2018_7,2018-06-25,2018-07-29
2018_8,2018-07-30,2018-08-26
2018_9,2018-08-27,2018-09-30
2018_10,2018-10-01,2018-10-28
2018_11,2018-10-29,2018-11-25
2018_12,2018-11-26,2018-12-30
2019_1,2018-12-31,2019-01-27
2019_2,2019-01-28,2019-02-24
2019_3,2019-02-25,2019-03-31
2019_4,2019-04-01,2019-04-28
2019_5,2019-04-29,2019-05-26
2019_6,2019-05-27,2019-06-30
2019_7,2019-07-01,2019-07-28
2019_8,2019-07-29,2019-08-25
2019_9,2019-08-26,2019-09-29
2019_10,2019-09-30,2019-10-27
2019_11,2019-10-28,2019-11-24
2019_12,2019-11-25,2019-12-29
2020_1,2019-12-30,2020-01-26
2020_2,2020-01-27,2020-02-23
2020_3,2020-02-24,2020-03-29
2020_4,2020-03-30,2020-04-26
2020_5,2020-04-27,2020-05-31
2020_6,2020-06-01,2020-06-28
2020_7,2020-06-29,2020-07-26
2020_8,2020-07-27,2020-08-30
2020_9,2020-08-31,2020-09-27
2020_10,2020-09-28,2020-10-25
2020_11,2020-10-26,2020-11-29
2020_12,2020-11-30,2020-12-27
2021_1,2020-12-28,2021-01-31
2021_2,2021-02-01,2021-02-28
2021_3,2021-03-01,2021-03-28
2021_4,2021-03-29,2021-04-25
2021_5,2021-04-26,2021-05-30
2021_6,2021-05-31,2021-06-27
2021_7,2021-06-28,2021-07-25
2021_8,2021-07-26,2021-08-29
2021_9,2021-08-30,2021-09-26
2021_10,2021-09-27,2021-10-31
2021_11,2021-11-01,2021-11-28
2021_12,2021-11-29,2021-12-26
2022_1,2021-12-27,2022-01-30
2022_2,2022-01-31,2022-02-27
2022_3,2022-02-28,2022-03-27
2022_4,2022-03-28,2022-04-24
2022_5,2022-04-25,2022-05-29
2022_6,2022-05-30,2022-06-26
2022_7,2022-06-27,2022-07-31
2022_8,2022-08-01,2022-08-28
2022_9,2022-08-29,2022-09-25
2022_10,2022-09-26,2022-10-30
2022_11,2022-10-31,2022-11-27
2022_12,2022-11-28,2022-12-25
2023_1,2022-12-26,2023-01-29
2023_2,2023-01-30,2023-02-26
2023_3,2023-02-27,2023-03-26
2023_4,2023-03-27,2023-04-30
2023_5,2023-05-01,2023-05-28
2023_6,2023-05-29,2023-06-25
2023_7,2023-06-26,2023-07-30
2023_8,2023-07-31,2023-08-27
2023_9,2023-08-28,2023-09-24
2023_10,2023-09-25,2023-10-29
2023_11,2023-10-30,2023-11-26
2023_12,2023-11-27,2023-12-31
2024_1,2024-01-01,2024-01-28
2024_2,2024-01-29,2024-02-25
2024_3,2024-02-26,2024-03-31
2024_4,2024-04-01,2024-04-28
2024_5,2024-04-29,2024-05-26
2024_6,2024-05-27,2024-06-30
2024_7,2024-07-01,2024-07-28
2024_8,2024-07-29,2024-08-25
2024_9,2024-08-26,2024-09-29
2024_10,2024-09-30,2024-10-27
2024_11,2024-10-28,2024-11-24
2024_12,2024-11-25,2024-12-29
2025_1,2024-12-30,2025-01-26
2025_2,2025-01-27,2025-02-23
2025_3,2025-02-24,2025-03-30
2025_4,2025-03-31,2025-04-27
2025_5,2025-04-28,2025-05-25
2025_6,2025-05-26,2025-06-29
2025_7,2025-06-30,2025-07-27
2025_8,2025-07-28,2025-08-31
2025_9,2025-09-01,2025-09-28
2025_10,2025-09-29,2025-10-26
2025_11,2025-10-27,2025-11-30
2025_12,2025-12-01,2025-12-28