How to create a pivot table from a given dataframe? - pandas

I have imported the tips data set from seaborn and tried to find the maximum bill amount for lunch and dinner on Saturday and Sunday.
I tried the code below but got an error:
pd.pivot_table(df, values=df['total_bill'], index=df['day'],
columns=df['time'], aggfunc='max')

To restrict the result to 'Sat' and 'Sun', pass the column names (as strings) to pivot_table and then use .loc on the pivot you created:
pd.pivot_table(df, values='total_bill', index='day',
               columns='time', aggfunc='max').loc[['Sat', 'Sun']]
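A runnable end-to-end version, assuming the seaborn tips dataset mentioned in the question, would look like this:
import pandas as pd
import seaborn as sns

# Load the tips dataset referenced in the question
df = sns.load_dataset('tips')

# Column names are passed as strings; .loc keeps only Saturday and Sunday
result = pd.pivot_table(df, values='total_bill', index='day',
                        columns='time', aggfunc='max').loc[['Sat', 'Sun']]
print(result)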

Related

Cross apply historical date range in BigQuery

I have a growing table of orders which looks something like this:
units_sold  timestamp
1           2021-03-02 10:00:00
2           2021-03-02 11:00:00
4           2021-03-02 12:00:00
3           2021-03-03 13:00:00
9           2021-03-03 14:00:00
I am trying to partition the table into each day, and gather statistics on units sold on the day, and on the day before. I can pretty easily get the units sold today and yesterday for just today, but I need to cross apply a date range for every date in my orders table.
The expected result would look like this:
units_sold_yesterday  units_sold_today  date_measured
12                    7                 2021-03-02
NULL                  12                2021-03-03
One way of doing it is by creating or appending the order data every day to a new table. However, this table could grow very large, and I also need the historical data.
In my mind's eye I know I have to cascade the data, so that BigQuery compares the data to "today's date", which would shift across all the dates in the table.
I'm thinking this shift could come from a cross apply of all the distinct dates in the table, so I would get a copy of the orders table for each date, but with a different "today's date" column that I can derive the units_sold_today data from by date-diffing the sales date against that column.
This would still, however, create a massive amount of data to process, and I guess maybe there is a simple function for this in BigQuery or standard SQL syntax.
This sounds like aggregation and lag():
select timestamp_trunc(timestamp, day), count(*) as sold_today,
lag(count(*)) over (order by min(timestamp)) as sold_yesterday
from t
group by 1
order by 1;
Note: This assumes that you have data for every day.
Consider below:
select date_measured, units_sold_today,
       lag(units_sold_today) over(order by date_measured) units_sold_yesterday
from (
  select date(timestamp) date_measured,
         sum(units_sold) units_sold_today
  from `project.dataset.table`
  group by date_measured
)
If applied to the sample data in your question, this returns each day's total alongside the previous day's total.
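If the same aggregate-then-lag logic is ever needed outside BigQuery, a small pandas sketch using the sample rows from the question could look like this (like the SQL, shift(1) assumes there is a row for every day):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'units_sold': [1, 2, 4, 3, 9],
    'timestamp': pd.to_datetime(['2021-03-02 10:00:00', '2021-03-02 11:00:00',
                                 '2021-03-02 12:00:00', '2021-03-03 13:00:00',
                                 '2021-03-03 14:00:00'])
})

# Aggregate to one row per day; shift(1) plays the role of LAG() OVER (ORDER BY date)
daily = df.groupby(df['timestamp'].dt.date)['units_sold'].sum().to_frame('units_sold_today')
daily['units_sold_yesterday'] = daily['units_sold_today'].shift(1)
print(daily)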

How to add query results to table in SQL?

I am trying to add the results of a query to an existing table, dependent on the values of an existing column. For example, using the table below
Store  Sales  Weekday
10     11000  Weekday
11     5000   Weekday
12     8000   Weekday
10     19000  Weekend
11     20000  Weekend
12     5000   Weekend
I want the averages per store and weekday type, which (for weekdays) I can get using the following:
SELECT AVG(Sales) AS weekday_avg, store
FROM store_sales
WHERE Weekday = 'Weekday'
GROUP BY store;
But then I'd like to add these results to the same table and store in a column named 'weekday_avg'.
I've tried the following, and while I don't get an error, the column doesn't have any values added:
ALTER TABLE store_sales ADD COLUMN weekday_avg numeric;
UPDATE store_sales SET weekday_avg = (
SELECT AVG(Sales) AS weekday_avg
FROM store_sales
WHERE Weekday = 'Weekday'
GROUP BY store
);
I know this probably isn't best database practice, but I'm working with what has been provided and all I need is to end up with a table with columns for averages per store / weekday type that I can export into R for further analysis.
Many thanks in advance!
As mentioned in the other answer, don't store calculated values in the table.
But if you want to know how it can be done, then one of the options is to use a correlated query as follows:
UPDATE store_sales s SET s.weekday_avg = (
    SELECT AVG(ss.Sales) AS weekday_avg
    FROM store_sales ss
    WHERE s.Weekday = ss.Weekday
      AND ss.Store = s.Store
);
Don't store the values. They are easily calculated on the fly:
select ss.*,
avg(sales) over (partition by store, weekday) as weekday_avg_sales
from store_sales ss;
Perhaps if you have a really big table, you might want to store the summary values. But even so, I would recommend a second table for the summaries and joins. With indexes, that would be much more efficient.
Note that calculating the data on the fly means that it is always up-to-date when existing data is updated or new data is inserted.
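Since the data is being exported for further analysis anyway, the same on-the-fly calculation is easy to reproduce after export; the same idea carries over to R. A small pandas sketch using the sample rows above:
import pandas as pd

# Sample rows from the question
store_sales = pd.DataFrame({
    'Store':   [10, 11, 12, 10, 11, 12],
    'Sales':   [11000, 5000, 8000, 19000, 20000, 5000],
    'Weekday': ['Weekday', 'Weekday', 'Weekday', 'Weekend', 'Weekend', 'Weekend'],
})

# Equivalent of AVG(Sales) OVER (PARTITION BY Store, Weekday):
# each row gets the average of its (Store, Weekday) group, computed on the fly
store_sales['weekday_avg_sales'] = (store_sales.groupby(['Store', 'Weekday'])['Sales']
                                               .transform('mean'))
print(store_sales)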

SQL: Dynamic Join Based on Row Value

Context:
I am working with some complicated schema and have got many CTEs and joins to get to this point. This is a watered-down version and completely different source data and example to illustrate my point (data anonymity). Hopefully it provides enough of a snapshot.
Data Overview:
I have a service which generates a production forecast looking ahead 30 days. The forecast is generated for each facility, for each shift (morning/afternoon). Each forecast produced covers all shifts (morning/afternoon/evening) so they share a common generation_id but different forecast_profile_key.
What I am trying to do: I want to find the SUM of the forecast error for a given forecast generation constrained by a dynamic date range based on whether the date is a weekday or weekend. The SUM must be grouped only on similar IDs.
Basically, the temp table provides one record per facility per date per shift with the forecast error. I want to SUM the historical error dynamically for a facility/shift/date based on whether the date is weekday/weekend, and only SUM the error where the IDs match up.. (hope that makes sense!!)
Specifics: I want to find the SUM grouped by 'week_part_grouping', 'forecast_profile_key', 'forecast_profile' and 'forecast_generation_id'. The part I am struggling with is that I only want to SUM the error dynamically based on date: (a) if the date is a weekday, I want to SUM the error from up to the 5 recent-most days in a 7 day look back period, or (b) if the date is a weekend, I want to SUM the error from up to the 3 recent-most days in a 16 day look back period.
Ideally, I'd end up with an extra column for 'total_forecast_error_in_lookback_range'.
Specific examples:
For 'facility_a', '2020-11-22' is a weekend. The lookback range is 16 days, so any date between '2020-11-21' and '2020-11-05' is eligible. The 3 recent-most dates would be '2020-11-21', '2020-11-15' and '2020-11-14'. Therefore, the sum of error would be 2000+3250+1050.
For 'facility_a', '2020-11-20' is a weekday. The lookback range is 7 days, so any date between '2020-11-19' and '2020-11-13' is eligible. That would work out to be '2020-11-19' through '2020-11-16' plus '2020-11-13'.
For 'facility_b', notice there is a change in the 'forecast_generation_id'. So, the error for '2020-11-20' would only be 4565.
What I have tried: I'll confess to not being quite sure how to break down this portion. I did consider a case statement on the week_part but then got into a nested mess. I considered using a RANK windowed function but I didn't make much progress as was unsure how to implement the dynamic lookback component. I then also thought about doing some LISTAGG to get all the dates and do a REGEXP wildcard lookup but that would be very slow..
I am seeking pointers how to go about achieving this in SQL. I don't know if I am missing something from my toolkit here to go about breaking this down into something I can implement.
DROP TABLE IF EXISTS seventh__error_calc;
create temporary table seventh__error_calc
(
facility_name varchar,
shift varchar,
date_actuals date,
week_part_grouping varchar,
forecast_profile_key varchar,
forecast_profile_id varchar,
forecast_generation_id varchar,
count_dates_in_forecast bigint,
forecast_error bigint
);
Insert into seventh__error_calc
VALUES
('facility_a','morning','2020-11-22','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1000'),
('facility_a','morning','2020-11-21','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2000'),
('facility_a','morning','2020-11-20','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3000'),
('facility_a','morning','2020-11-19','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2500'),
('facility_a','morning','2020-11-18','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1200'),
('facility_a','morning','2020-11-17','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','5000'),
('facility_a','morning','2020-11-16','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','4400'),
('facility_a','morning','2020-11-15','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','3250'),
('facility_a','morning','2020-11-14','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','1050'),
('facility_a','morning','2020-11-13','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-12','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-11','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-10','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-09','weekday','facility_a_morning_Mon_Fri','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_a','morning','2020-11-08','weekend','facility_a_morning_Sat_Sun','Profile#facility_a#dfc3989b#b6e5386a','6809dea6','8','2450'),
('facility_b','morning','2020-11-22','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3400'),
('facility_b','morning','2020-11-21','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','2800'),
('facility_b','morning','2020-11-20','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','3687'),
('facility_b','morning','2020-11-19','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','6809dea6','8','4565'),
('facility_b','morning','2020-11-18','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1262'),
('facility_b','morning','2020-11-17','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','8765'),
('facility_b','morning','2020-11-16','weekday','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','5678'),
('facility_b','morning','2020-11-15','weekend','facility_b_morning_Mon_Fri','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','2893'),
('facility_b','morning','2020-11-14','weekend','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','1928'),
('facility_b','morning','2020-11-13','weekday','facility_b_morning_Sat_Sun','Profile#facility_b#dfc3989b#b6e5386a','7252fzw5','8','4736')
;
SELECT *
FROM seventh__error_calc
This achieved what I was trying to do. There were two learning points here:
1. Self joins. I've never used one before but can now see why they are powerful!
2. Using a CASE statement in the WHERE clause.
Hope this might help someone else some day!
select facility_name,
       forecast_profile_key,
       forecast_profile_id,
       shift,
       date_actuals,
       week_part_grouping,
       forecast_generation_id,
       sum(forecast_error) forecast_err_calc
from (
    select rank() over (partition by forecast_profile_id, forecast_profile_key, facility_name, a.date_actuals
                        order by b.date_actuals desc) rnk,
           a.facility_name, a.forecast_profile_key, a.forecast_profile_id, a.shift, a.date_actuals,
           a.week_part_grouping, a.forecast_generation_id, b.forecast_error
    from seventh__error_calc a
    join seventh__error_calc b
      using (facility_name, forecast_profile_key, forecast_profile_id, week_part_grouping, forecast_generation_id)
    where case when a.week_part_grouping = 'weekend' then b.date_actuals between a.date_actuals - 16 and a.date_actuals
               when a.week_part_grouping = 'weekday' then b.date_actuals between a.date_actuals - 7 and a.date_actuals
          end
) src
where case when week_part_grouping = 'weekend' then rnk < 4
           when week_part_grouping = 'weekday' then rnk < 6
      end
group by facility_name, forecast_profile_key, forecast_profile_id, shift, date_actuals, week_part_grouping, forecast_generation_id
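For anyone who later needs the same dynamic lookback outside the database, the self-join-plus-rank idea can also be sketched in pandas. This is only a rough sketch, assuming err is a DataFrame holding the seventh__error_calc rows with date_actuals parsed as datetime:
import pandas as pd

# Assumption: err holds the seventh__error_calc rows, date_actuals as datetime64
keys = ['facility_name', 'forecast_profile_key', 'forecast_profile_id',
        'week_part_grouping', 'forecast_generation_id']

# Self-join: pair every row with every row sharing the same IDs
pairs = err.merge(err, on=keys, suffixes=('', '_hist'))

# Dynamic lookback window: 16 days for weekend rows, 7 days for weekday rows
window_days = pairs['week_part_grouping'].map({'weekend': 16, 'weekday': 7})
in_window = ((pairs['date_actuals_hist'] <= pairs['date_actuals']) &
             (pairs['date_actuals_hist'] >= pairs['date_actuals'] - pd.to_timedelta(window_days, unit='D')))
pairs = pairs[in_window].copy()

# Number the historical dates, most recent first, within each facility/profile/date
# (a row_number-style stand-in for the SQL rank())
pairs['rnk'] = (pairs.sort_values('date_actuals_hist', ascending=False)
                     .groupby(keys + ['date_actuals'])
                     .cumcount() + 1)

# Keep the 3 recent-most (weekend) or 5 recent-most (weekday) rows and sum the error
limit = pairs['week_part_grouping'].map({'weekend': 3, 'weekday': 5})
result = (pairs[pairs['rnk'] <= limit]
          .groupby(keys + ['shift', 'date_actuals'])['forecast_error_hist']
          .sum()
          .reset_index(name='forecast_err_calc'))
print(result)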

Visualizing headcount data over a particular time period

I have a data visualization question that I would like to get some input on. I'm currently using python pandas to clean up a data set then subsequently uploading it in SISENSE for use. What I am trying to do is visualize active jobs grouped by week/month based on the start and end dates of particular assignments. For example, I have a set of jobs with the following start dates, organized in rows within a dataframe:
Job ID Start Date End Date
Job 1 5/25/2020 6/7/2020
Job 2 5/25/2020 5/31/2020
For the week of 5/25/2020 I have two active jobs, and for the week of 6/1/2020 I have 1 active job. The visualization should look like a bar chart with the x axis being the week/time period and y axis being the count of active jobs.
How can I best organize this into a data frame and visualize it?
Something like:
import pandas as pd

df = pd.DataFrame({'Job ID': [1, 2], 'Start Date': ['5/25/2020', '5/25/2020'], 'End Date': ['6/7/2020', '5/31/2020']})
You could then apply a function to generate a new column 'Week Beginning'; take a look here for a solution in Python: Get week start date (Monday) from a date column in Python (pandas)?
import datetime as dt
# Convert 'Start Date' to datetime objects
df['Start Date'] = pd.to_datetime(df['Start Date'])
# 'daysoffset' will contain the weekday, as an integer (Monday = 0)
df['daysoffset'] = df['Start Date'].apply(lambda x: x.weekday())
# We apply, row by row (axis=1), a timedelta operation
df['Week Beginning'] = df.apply(lambda x: x['Start Date'] - dt.timedelta(days=x['daysoffset']), axis=1)
and then groupby on this week beginning:
count = df.groupby('Week Beginning').count()
Following this, you could plot using
count_by_job_id = count['Job ID']
pd.DataFrame(count_by_job_id).plot.bar()
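One caveat with the snippet above: grouping on the start week only counts jobs in the week they begin. If a job should be counted in every week it is active (as in the example, where Job 1 spans two weeks), one option is to expand each job into one row per week it covers before counting. A sketch, redefining the same small df so it stands alone:
import pandas as pd

df = pd.DataFrame({'Job ID': [1, 2],
                   'Start Date': ['5/25/2020', '5/25/2020'],
                   'End Date': ['6/7/2020', '5/31/2020']})
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])

# List every Monday between the job's start week and its end date
def active_weeks(row):
    week_start = row['Start Date'] - pd.Timedelta(days=row['Start Date'].weekday())
    return list(pd.date_range(week_start, row['End Date'], freq='W-MON'))

# One row per (job, week) pair, then count distinct jobs per week
weekly = df.assign(week_beginning=df.apply(active_weeks, axis=1)).explode('week_beginning')
active_per_week = weekly.groupby('week_beginning')['Job ID'].nunique()
active_per_week.plot.bar()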
You will need a custom SQL table in the Sisense ElastiCube to make this work easily. You will then join your dataframe table with the dim_dates table (Excel file from the link below).
This is similar to the scenario described here: https://support.sisense.com/hc/en-us/articles/230644208-Date-Dimension-File
Your custom SQL will be something like this:
Select JobID,
CAST(startdate as date) as Startdate,
CAST(enddate as date) as Enddate,
C.RECORD_DATE AS week_start
FROM JOB j
JOIN tbl_Calendar C ON c.RECORD_DATE BETWEEN j.StartDate and j.EndDate
WHERE DATENAME(DW,C.RECORD_DATE) = 'MONDAY'
Then you can just create a column chart, drop the field week_start (you can format it in a few different ways) under the categories section, and drop the field count(JobID) under the values section.

SQL Query / Regular Expression to Split Custom string into columns

Hi, I have source data in a field in different formats, like below:
1Y3M6D (1 Year, 3 Months, 6 Days). I need to split this into 3 fields (Year, Month, Days), but the source data format can change: the month can come first, as in 3M1Y6D, or the source data may contain only 3M with no year and day. How do I write a query to get the number preceding M, Y or D?
Thanks in advance for the help.
Thanks everyone, Unoembre's command helped.
select my_value,
       REGEXP_SUBSTR(my_value, '(\d+)Y', 1, 1, NULL, 1) REG_Y,
       REGEXP_SUBSTR(my_value, '(\d+)M', 1, 1, NULL, 1) REG_M,
       REGEXP_SUBSTR(my_value, '(\d+)D', 1, 1, NULL, 1) REG_D
from (select '3M6Y2D' my_value from dual);
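For anyone doing the same split outside Oracle, a pandas equivalent of those REGEXP_SUBSTR calls (assuming the values sit in a column named my_value) is:
import pandas as pd

df = pd.DataFrame({'my_value': ['1Y3M6D', '3M1Y6D', '3M']})

# Pull out the digits immediately preceding Y, M and D; missing parts become NaN
df['REG_Y'] = df['my_value'].str.extract(r'(\d+)Y', expand=False)
df['REG_M'] = df['my_value'].str.extract(r'(\d+)M', expand=False)
df['REG_D'] = df['my_value'].str.extract(r'(\d+)D', expand=False)
print(df)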