Let's say that we have this table:

Employee   EmploymentStarted   EmploymentEnded
Sara       20210115            20210715
Lora       20210215            20210815
Viki       20210515            20210615
Now what I need is a table in which we can see all the employees that we had each month. For example, Sara started on January 15th 2021 and she left the company on July 15th 2021. This means that she has been with us during January, February, March, April, May, June and July.
The result table should look like this:
Month      Year   Employee
January    2021   Sara
February   2021   Sara
February   2021   Lora
March      2021   Sara
March      2021   Lora
April      2021   Sara
April      2021   Lora
May        2021   Sara
May        2021   Lora
May        2021   Viki
June       2021   Sara
June       2021   Lora
June       2021   Viki
July       2021   Sara
July       2021   Lora
August     2021   Lora
How can I get a table like this in SQL?
I tried a GROUP BY, but it does not seem to be the right way to do it.
It would be interesting to find out in practice how much performance decreases when using recursion. In this case the calendar table contains only about 12 records per year; most of the query cost is the JOIN to the Employee (staff) table.
with FromTo as (
    -- overall range: first employment start to the end of the month of the last employment end
    select min(EmploymentStarted) fromDt, eomonth(max(EmploymentEnded)) toDt
    from staff
)
--,FromTo as(select fromDt=#fromDt,toDt=#toDt)  -- alternative: supply the range directly
,rdT as (
    -- one row per month in the range: bM = first day covered, eM = last day of the month
    select 1 n, fromDt bM, eomonth(fromDt) eM, fromDt, toDt
    from FromTo
    union all
    select n+1 n, dateadd(day,1,eM) bM, eomonth(dateadd(month,1,bM)) eM, fromDt, toDt
    from rdT
    where dateadd(month,1,bM) < toDt
)
-- keep every employee whose employment overlaps the month [bM, eM]
select month(bM) as 'Month', year(bM) as 'Year', Employee --*
from rdT
left join staff s on s.EmploymentEnded >= bM and s.EmploymentStarted <= eM
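Since the answer wonders about the cost of recursion, here is a non-recursive variant one could benchmark against. It is a sketch only: it assumes SQL Server 2022+ for GENERATE_SERIES, reuses the staff table from above, and generates whole calendar months (the recursive query starts the first month at fromDt rather than on the 1st):

with FromTo as (
    select min(EmploymentStarted) fromDt, eomonth(max(EmploymentEnded)) toDt
    from staff
)
select month(m.bM) as 'Month', year(m.bM) as 'Year', s.Employee
from FromTo f
cross apply (
    -- one row per month offset in the range, no recursion needed
    select dateadd(month, gs.value, datefromparts(year(f.fromDt), month(f.fromDt), 1)) bM
    from generate_series(0, datediff(month, f.fromDt, f.toDt)) gs
) m
left join staff s
    on s.EmploymentEnded >= m.bM and s.EmploymentStarted <= eomonth(m.bM)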
I have a table in HANA that has the following columns:
Customer(varchar), Day(varchar), Item(varchar), Quantity (decimal), Cost(decimal)
My goal is to have a SQL procedure that duplicates the table's rows and appends them to the existing table daily, while setting the Day column values to the next day. So it will just be the same data over and over, but with new values in the Day column.
I believe this needs a SELECT * from the table into a variable, a loop over that variable that pushes the Day column forward one day, and then an INSERT of all the updated rows. I'm stuck at the part of testing the selection of one column into a variable, and keep receiving this error:
DO
BEGIN
    DECLARE V1 VARCHAR(20);
    SELECT 'ITEM' INTO V1 FROM "TABLE_NAME";
    SELECT :V1 FROM "TABLE_NAME";
END;
DBTech JDBC: fetch returns more than requested number of rows: "TABLE_NAME"."(DO statement)": line 4 col 5 (at pos 43):
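The error itself has nothing to do with looping: SELECT ... INTO a scalar variable must return exactly one row, and this query returns one row per record in the table. Also, 'ITEM' in single quotes is a string literal, not the column. A minimal sketch that avoids the error, assuming the goal was just to fetch a single item value:

DO
BEGIN
    DECLARE V1 VARCHAR(20);
    -- double quotes reference the column; LIMIT 1 guarantees a single row
    SELECT "ITEM" INTO V1 FROM "TABLE_NAME" LIMIT 1;
    -- echo the variable once, instead of once per table row
    SELECT :V1 AS "ITEM" FROM DUMMY;
END;

That said, as the answers below show, no variable or loop is needed for the actual task.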
If you want to double your values, you don't need loops or variables.
The following doubles all ITEMs with a current timestamp:
INSERT INTO "TABLE_NAME"
("ITEM", "MYDATETIME")
SELECT "ITEM", NOW ( )
FROM "TABLE_NAME";
The requirement seems to be:
For each customer, copy all entries of a reference day (e.g., the most recent entries) to new entries for the day following the reference day.
Such a function could support, e.g., a daily "roll over" of inventory entries.
In its most basic form, this requirement can be met in plain SQL - no procedure code required.
create column table cust_items
    (customer nvarchar(20)
    , item_date date
    , item nvarchar(20)
    , quantity decimal (10,2)
    , cost decimal (10,2)) ;
insert into cust_items values ('Aardvark Inc.', add_days(current_date, -3), 'Yellow Balls', 10, 23.23);
insert into cust_items values ('Aardvark Inc.', add_days(current_date, -3), 'Bird Food', 4.5, 47.11);
insert into cust_items values ('Aardvark Inc.', add_days(current_date, -3), 'Carrot Cake', 3, 08.15);
insert into cust_items values ('Wolf Ltd.', add_days(current_date, -3), 'Red Ballon', 1, 47.11);
insert into cust_items values ('Wolf Ltd.', add_days(current_date, -3), 'Black Bile', 2, 23.23);
insert into cust_items values ('Wolf Ltd.', add_days(current_date, -3), 'Carrot Cake', 3, 08.15);
insert into cust_items values ('Wolf Ltd.', add_days(current_date, -2), 'Red Ballon', 1, 47.11);
insert into cust_items values ('Wolf Ltd.', add_days(current_date, -2), 'Black Bile', 2, 23.23);
insert into cust_items values ('Wolf Ltd.', add_days(current_date, -2), 'Carrot Cake', 3, 08.15);
select * from cust_items
order by item_date, customer, item;
/*
CUSTOMER ITEM_DATE ITEM QUANTITY COST
Aardvark Inc. 6 Apr 2022 Bird Food 4.5 47.11
Aardvark Inc. 6 Apr 2022 Carrot Cake 3 8.15
Aardvark Inc. 6 Apr 2022 Yellow Balls 10 23.23
Wolf Ltd. 6 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 6 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 6 Apr 2022 Red Ballon 1 47.11
Wolf Ltd. 7 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 7 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 7 Apr 2022 Red Ballon 1 47.11
*/
We see that the two customers have individual entries: "Wolf Ltd."'s most recent entries are for April 7th, while "Aardvark Inc."'s are for April 6th.
The first part of the task now is to find the entries for the most recent ITEM_DATE per customer. A simple join with a sub-query is sufficient here:
select co.customer, add_days(co.item_date, 1) as new_date, co.item, co.quantity, co.cost
from
cust_items co
join (select customer, max(item_date) max_date
from cust_items ci
group by customer) m_date
on (co.customer, co.item_date)
= (m_date.customer, m_date.max_date);
/* new date entries for each customer, based on previous most recent entry per customer
CUSTOMER NEW_DATE ITEM QUANTITY COST
Aardvark Inc. 7 Apr 2022 Yellow Balls 10 23.23
Aardvark Inc. 7 Apr 2022 Bird Food 4.5 47.11
Aardvark Inc. 7 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 8 Apr 2022 Red Ballon 1 47.11
Wolf Ltd. 8 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 8 Apr 2022 Carrot Cake 3 8.15
*/
Note that the add_days(co.item_date, 1) as new_date expression takes care of the "move the day one day ahead" requirement.
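For comparison, the same "most recent entries per customer" selection can be written with a window function instead of the join (a sketch, using the same table):

select customer, add_days(item_date, 1) as new_date, item, quantity, cost
from (select ci.*,
             -- latest item_date per customer, attached to every row
             max(item_date) over (partition by customer) as max_date
      from cust_items ci) t
where item_date = max_date;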
The second part of the requirement is INSERTing the new entries into the same table:
insert into cust_items
(select co.customer, add_days(co.item_date, 1) as new_date, co.item, co.quantity, co.cost
from
cust_items co
join (select customer, max(item_date) max_date
from cust_items ci
group by customer) m_date
on (co.customer, co.item_date)
= (m_date.customer, m_date.max_date)
);
/* execute 3 times
Statement 'insert into cust_items (select co.customer, add_days(co.item_date, 1) as new_date, co.item, ...'
successfully executed in 25 ms 530 µs (server processing time: 14 ms 94 µs) - Rows Affected: 6
Statement 'insert into cust_items (select co.customer, add_days(co.item_date, 1) as new_date, co.item, ...'
successfully executed in 9 ms 288 µs (server processing time: 3 ms 900 µs) - Rows Affected: 6
Statement 'insert into cust_items (select co.customer, add_days(co.item_date, 1) as new_date, co.item, ...'
successfully executed in 11 ms 311 µs (server processing time: 4 ms 586 µs) - Rows Affected: 6
--> number of new records always the same as only the most recent values are copied
*/
The table content now looks like this:
/*
CUSTOMER ITEM_DATE ITEM QUANTITY COST
Aardvark Inc. 6 Apr 2022 Bird Food 4.5 47.11
Aardvark Inc. 6 Apr 2022 Carrot Cake 3 8.15
Aardvark Inc. 6 Apr 2022 Yellow Balls 10 23.23
Wolf Ltd. 6 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 6 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 6 Apr 2022 Red Ballon 1 47.11
Aardvark Inc. 7 Apr 2022 Bird Food 4.5 47.11
Aardvark Inc. 7 Apr 2022 Carrot Cake 3 8.15
Aardvark Inc. 7 Apr 2022 Yellow Balls 10 23.23
Wolf Ltd. 7 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 7 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 7 Apr 2022 Red Ballon 1 47.11
Aardvark Inc. 8 Apr 2022 Bird Food 4.5 47.11
Aardvark Inc. 8 Apr 2022 Carrot Cake 3 8.15
Aardvark Inc. 8 Apr 2022 Yellow Balls 10 23.23
Wolf Ltd. 8 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 8 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 8 Apr 2022 Red Ballon 1 47.11
Aardvark Inc. 9 Apr 2022 Bird Food 4.5 47.11
Aardvark Inc. 9 Apr 2022 Carrot Cake 3 8.15
Aardvark Inc. 9 Apr 2022 Yellow Balls 10 23.23
Wolf Ltd. 9 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 9 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 9 Apr 2022 Red Ballon 1 47.11
Wolf Ltd. 10 Apr 2022 Black Bile 2 23.23
Wolf Ltd. 10 Apr 2022 Carrot Cake 3 8.15
Wolf Ltd. 10 Apr 2022 Red Ballon 1 47.11
--> only Wolf Ltd. has entries on 10 Apr as it was the only one starting off with values on 7 Apr.
*/
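If a procedure is still wanted, e.g., as the target of a daily scheduled job, the INSERT can be wrapped as-is. A minimal sketch (the procedure name is made up):

create procedure roll_over_cust_items as
begin
    -- same statement as above: copy each customer's most recent entries to the next day
    insert into cust_items
        (select co.customer, add_days(co.item_date, 1) as new_date, co.item, co.quantity, co.cost
         from cust_items co
         join (select customer, max(item_date) max_date
               from cust_items ci
               group by customer) m_date
           on (co.customer, co.item_date) = (m_date.customer, m_date.max_date));
end;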
I have this dataset:
year  month  victims
2016  7      4869
2016  8      4817
2016  9      4900
2016  10     4873
2016  11     4461
2016  12     4908
2017  1      4717
2017  2      4489
2017  3      4733
2017  4      4549
2017  5      4928
2017  6      4767
2017  7      4713
2017  8      4992
2017  9      4885
2017  10     5049
2017  11     4861
2017  12     4667
....
I want to plot the number of victims by year and month.
I used this code:
import seaborn as sb
import matplotlib.pyplot as plt

sb.lineplot(data=number_victims, x='month', y='victims', hue='year')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title = 'month', title_fontsize = 12)
This is the result:
I tried to use FacetGrid to get a better view of each year.
This is my code:
g = sb.FacetGrid(number_victims, col="year", margin_titles=True, col_wrap=3, height=3, aspect=2)
g.map_dataframe(
sb.lineplot, x="month", y="victims"
)
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('The overall trend of victims number by month and year');
And this is the result:
So I have two questions:
How do I sort the months in the first graph (so that they run from 1 to 12)?
Why are the month labels in the second graph running from 1 to 10? And for 2016 it is supposed to start at month #7, not #1. How can I fix that?
Thank you so much.
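The thread does not show the column dtypes, but both symptoms are consistent with month (and possibly year) being stored as strings, which makes seaborn treat the values as unordered categories and number the facet points positionally. A sketch of the likely fix under that assumption:

# cast to integers so both plots get a numeric, ordered month axis
number_victims['month'] = number_victims['month'].astype(int)
number_victims['year'] = number_victims['year'].astype(int)
number_victims = number_victims.sort_values(['year', 'month'])

g = sb.FacetGrid(number_victims, col="year", col_wrap=3, height=3, aspect=2, margin_titles=True)
g.map_dataframe(sb.lineplot, x="month", y="victims")
# same 1..12 axis on every facet; 2016 then starts at month 7
g.set(xticks=range(1, 13))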
I am trying to populate a DataFrame using the result of a calculation performed on a different DataFrame.
These calculations should be run on a series when conditions are met in two separate series.
Here is what I have tried.
I have built a dataframe, rswcapacity, on which the calculations should be run, then created another dataframe, annualcapacity, where I would like the conditional calculations to be stored.
#First DataFrame
import pandas as pd
import numpy as np

d = {'technology': ['EAF', 'EAF', 'EAF', 'BOF', 'BOF', 'BOF'], 'equip_detail1': [150, 130, 100, 200, 200, 150], 'equip_number' : [1, 2, 3, 1, 2, 3], 'capacity_actual': [2400, 2080, 1600, 3200, 3200, 2400], 'start_year': [1992, 1993, 1994, 1989, 1990, 1991], 'closure_year': [ '', 2002, '', '', 2001, 2011] }
rswcapacity = pd.DataFrame(data = d)
rswcapacity['closure_year'].replace('', np.nan, inplace = True)
#Second DataFrame
annualcapacity = pd.DataFrame(columns=['years', 'capacity'])
annualcapacity ['years'] = [1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
#Neither of the attempts below yields the desired results:
for y in years:
    annualcapacity['capacity'].append(rswcapacity['capacity_actual'].apply(lambda x : x['capacity_actual'].sum() (x['start_year'] >= y & (x['closure_year'] <= y | x['closure_year'].isnull()))).sum())
annualcapacity
#other attempt:
for y in years:
    if (rswcapacity['start_year'] >= y).any() & ((rswcapacity['closure_year'].isnull()).any() | (rswcapacity['closure_year'] <= y).any()):
        annualcapacity['capacity'].append(rswcapacity['capacity_actual'].sum())
annualcapacity
The result I would like to obtain is a sum performed for every year.
For instance:
1985 should return NaN, as 1985 is smaller than any of the years in start_year.
1992 should return 14880, as 1992 is larger than any start_year and smaller than any closure_year.
2001 should return 7200, as it is larger than all start_years and larger than all closure_years.
Instead, all my attempts return only NaN across the list of years.
There is something wrong with how I set the conditions, but I have not managed to work out what.
Any insight much appreciated!
You can do this as follows:
# DataFrame.append was removed in pandas 2.0, so collect one row per year
# in a plain list and build the result frame once at the end
rows = []
# use your list of years
years = [1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
for y in years:
    # sum the capacity of all equipment open in year y: started on or before y
    # and either never closed or closed on or after y
    indexer = (rswcapacity['start_year'] <= y) & ((rswcapacity['closure_year'].isnull()) | (rswcapacity['closure_year'] >= y))
    capa = rswcapacity.loc[indexer, 'capacity_actual'].sum()
    rows.append({'years': y, 'capacity': capa})
# int32 for the year, float32 for the capacity
annualcapacity = pd.DataFrame(rows).astype({'years': 'int32', 'capacity': 'float32'})
annualcapacity
The result looks like this:
years capacity
0 1980 0.0
1 1981 0.0
2 1982 0.0
3 1983 0.0
4 1984 0.0
5 1985 0.0
6 1986 0.0
7 1987 0.0
8 1988 0.0
9 1989 3200.0
10 1990 6400.0
11 1991 8800.0
12 1992 11200.0
13 1993 13280.0
14 1994 14880.0
15 1995 14880.0
16 1996 14880.0
17 1997 14880.0
18 1998 14880.0
19 1999 14880.0
20 2000 14880.0
21 2001 14880.0
22 2002 11680.0
23 2003 9600.0
24 2004 9600.0
25 2005 9600.0
26 2006 9600.0
27 2007 9600.0
28 2008 9600.0
29 2009 9600.0
30 2010 9600.0
31 2011 9600.0
32 2012 7200.0
33 2013 7200.0
34 2014 7200.0
35 2015 7200.0
36 2016 7200.0
37 2017 7200.0
38 2018 7200.0
39 2019 7200.0
40 2020 7200.0
Note: the sums are always numeric, so if there is no capacity for a year, the value is 0.0 instead of NaN. If you need NaN for some reason, you can replace the zeros as shown further down.
The second point is that I switched your condition,
(rswcapacity['start_year'] >= y) & ((rswcapacity['closure_year'].isnull()) | (rswcapacity['closure_year'] <= y))
so >= became <= (and <= became >= for closure_year), because I assume you want to sum all capacities that were available in that year, right?
So if you need NaN entries instead of 0.0 when no capacity is available at all, you can do that as follows:
annualcapacity.loc[annualcapacity['capacity'] == 0, 'capacity'] = np.nan
For this, you need to add import numpy as np at the top of your script.
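For completeness, a loop-free variant of the same computation (a sketch; how='cross' requires pandas 1.2+). Years without any open equipment come out as NaN directly here:

# pair every year with every equipment row, then keep the rows open in that year
yrs = pd.DataFrame({'years': years})
cross = yrs.merge(rswcapacity, how='cross')
open_rows = cross[(cross['start_year'] <= cross['years'])
                  & (cross['closure_year'].isnull() | (cross['closure_year'] >= cross['years']))]
sums = (open_rows.groupby('years', as_index=False)['capacity_actual'].sum()
                 .rename(columns={'capacity_actual': 'capacity'}))
# left-join back so years with no open equipment appear with NaN
annualcapacity = yrs.merge(sums, on='years', how='left')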
I have a dataframe like this:
Year Month ProductCategory Sales(In ThousandDollars)
0 2009 1 WomenClothing 1755.0
1 2009 1 MenClothing 524.0
2 2009 1 OtherClothing 936.0
3 2009 2 WomenClothing 1729.0
4 2009 2 MenClothing 496.0
5 2009 2 OtherClothing 859.0
6 2009 3 WomenClothing 2256.0
7 2009 3 MenClothing 542.0
8 2009 3 OtherClothing 921.0
9 2009 4 WomenClothing 2662.0
10 2009 4 MenClothing 669.0
11 2009 4 OtherClothing 914.0
12 2009 5 WomenClothing 2732.0
13 2009 5 MenClothing 650.0
14 2009 5 OtherClothing 989.0
15 2009 6 WomenClothing 2220.0
16 2009 6 MenClothing 607.0
17 2009 6 OtherClothing 932.0
18 2009 7 WomenClothing 2164.0
19 2009 7 MenClothing 575.0
20 2009 7 OtherClothing 901.0
21 2009 8 WomenClothing 2371.0
22 2009 8 MenClothing 551.0
23 2009 8 OtherClothing 865.0
24 2009 9 WomenClothing 2421.0
25 2009 9 MenClothing 579.0
26 2009 9 OtherClothing 819.0
27 2009 10 WomenClothing 2579.0
28 2009 10 MenClothing 610.0
29 2009 10 OtherClothing 914.0
Every month of a year has 3 different product categories (WomenClothing, MenClothing, OtherClothing), so each month is represented by 3 rows. I want to take the average of the Sales column for every month, i.e. the average of every 3 rows, and use that as a single value per month, so that I can reduce the number of rows.
That is, at the end, I just want to have one row for every month in a year.
Just like this:
Year Month Average Sale of each month
0 2009 1 1071.66
3 2009 2 1028.0
6 2009 3 1239.66
10 2009 4 1415.0
You can use:
df.groupby(['Year','Month'])['Sales(In ThousandDollars)'].mean().reset_index()
Year Month Sales(In ThousandDollars)
0 2009 1 1071.666667
1 2009 2 1028.000000
2 2009 3 1239.666667
3 2009 4 1415.000000
4 2009 5 1457.000000
5 2009 6 1253.000000
6 2009 7 1213.333333
7 2009 8 1262.333333
8 2009 9 1273.000000
9 2009 10 1367.666667
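If the result column should carry the header from the expected output, a rename can be chained on (a sketch of the same statement):

(df.groupby(['Year', 'Month'], as_index=False)['Sales(In ThousandDollars)']
   .mean()
   .rename(columns={'Sales(In ThousandDollars)': 'Average Sale of each month'}))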
You can utilize the index for your grouping. It would look something like this:
df.groupby(df.index // 3).mean(numeric_only=True)
(numeric_only=True is needed on recent pandas versions, which would otherwise raise an error on the string ProductCategory column.) If your data is consistent in that you always have 3 rows for each month in a year, you can also group by year and month to get the same result.
This gives you:
Year Month Sales
0 2009 1 1071.666667
1 2009 2 1028.000000
2 2009 3 1239.666667
3 2009 4 1415.000000
4 2009 5 1457.000000
5 2009 6 1253.000000
6 2009 7 1213.333333
7 2009 8 1262.333333
8 2009 9 1273.000000
9 2009 10 1367.666667