I have data in which, around the first of each year, the day_of_year sequence rolls over, and the "year" column should change to the new year when day_of_year == 1. I have not been able to figure out how to do this and am not sure where to start, so any help is much appreciated. My data looks like this:
Here is my df1 =
day_of_year year var_1
364 2017 17.71666667
364 2018 5.166666667
364 2019 2
364 2020 1.595833333
364 2021 3.75
364 2022 6.8875
365 2017 14.83333333
365 2018 2.758333333
365 2019 4.108333333
365 2020 5.766666667
365 2021 5.291666667
365 2022 10.58636364
1 2017 2.0125
1 2018 14.0125
1 2019 -0.504166667
1 2020 7.666666667
1 2021 5.520833333
1 2022 1.229166667
2 2017 1.7625
2 2018 15.10416667
2 2019 -0.391666667
2 2020 9.5
2 2021 7.645833333
2 2022 0.9125
After re-formatting, I need it to look like the sorted df below, with "n/a" for any expected data that is missing in a given year. Thank you again.
final df:
day_of_year year var_1
364 2017 17.71666667
365 2017 14.83333333
1 2018 14.0125
2 2018 15.10416667
364 2018 5.166666667
365 2018 2.758333333
1 2019 -0.504166667
2 2019 -0.391666667
364 2019 2
365 2019 4.108333333
1 2020 7.666666667
2 2020 9.5
364 2020 1.595833333
365 2020 5.766666667
1 2021 5.520833333
2 2021 7.645833333
364 2021 3.75
365 2021 5.291666667
1 2022 1.229166667
2 2022 0.9125
364 2022 6.8875
365 2022 10.58636364
n/a n/a n/a
n/a n/a n/a
Why would you change the year based on the day? Just sort by the two columns:
df.sort_values(by=['year', 'day_of_year'])
Output:
day_of_year year var_1
12 1 2017 2.012500
18 2 2017 1.762500
0 364 2017 17.716667
6 365 2017 14.833333
13 1 2018 14.012500
19 2 2018 15.104167
1 364 2018 5.166667
7 365 2018 2.758333
14 1 2019 -0.504167
20 2 2019 -0.391667
2 364 2019 2.000000
8 365 2019 4.108333
15 1 2020 7.666667
21 2 2020 9.500000
3 364 2020 1.595833
9 365 2020 5.766667
16 1 2021 5.520833
22 2 2021 7.645833
4 364 2021 3.750000
10 365 2021 5.291667
17 1 2022 1.229167
23 2 2022 0.912500
5 364 2022 6.887500
11 365 2022 10.586364
If for some reason you really need to fix the year, use a conditional with mask:
(df.assign(year=df['year'].mask(df['day_of_year'].le(2), df['year'].add(1)))
.sort_values(by=['year', 'day_of_year'])
)
Or, if you want to update the years after a change from 365 to a lower day:
(df.assign(year=df['year'].add(df['day_of_year'].diff().lt(0).cumsum()))
.sort_values(by=['year', 'day_of_year'])
)
Output:
day_of_year year var_1
0 364 2017 17.716667
6 365 2017 14.833333
12 1 2018 2.012500
18 2 2018 1.762500
1 364 2018 5.166667
7 365 2018 2.758333
13 1 2019 14.012500
19 2 2019 15.104167
2 364 2019 2.000000
8 365 2019 4.108333
14 1 2020 -0.504167
20 2 2020 -0.391667
3 364 2020 1.595833
9 365 2020 5.766667
15 1 2021 7.666667
21 2 2021 9.500000
4 364 2021 3.750000
10 365 2021 5.291667
16 1 2022 5.520833
22 2 2022 7.645833
5 364 2022 6.887500
11 365 2022 10.586364
17 1 2023 1.229167
23 2 2023 0.912500
I would convert everything to datetime first and assign the result to a column ymd:
df['ymd'] = pd.to_datetime(df['day_of_year'].astype(str) + '-' + df['year'].astype(str),
                           format='%j-%Y')
Sorting on that column yields the following:
>>> df.sort_values('ymd')
day_of_year year var_1 ymd
12 1 2017 2.012500 2017-01-01
18 2 2017 1.762500 2017-01-02
0 364 2017 17.716667 2017-12-30
6 365 2017 14.833333 2017-12-31
13 1 2018 14.012500 2018-01-01
19 2 2018 15.104167 2018-01-02
1 364 2018 5.166667 2018-12-30
7 365 2018 2.758333 2018-12-31
14 1 2019 -0.504167 2019-01-01
20 2 2019 -0.391667 2019-01-02
2 364 2019 2.000000 2019-12-30
8 365 2019 4.108333 2019-12-31
15 1 2020 7.666667 2020-01-01
21 2 2020 9.500000 2020-01-02
3 364 2020 1.595833 2020-12-29
9 365 2020 5.766667 2020-12-30
16 1 2021 5.520833 2021-01-01
22 2 2021 7.645833 2021-01-02
4 364 2021 3.750000 2021-12-30
10 365 2021 5.291667 2021-12-31
17 1 2022 1.229167 2022-01-01
23 2 2022 0.912500 2022-01-02
5 364 2022 6.887500 2022-12-30
11 365 2022 10.586364 2022-12-31
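None of the approaches above adds placeholder rows for missing data, which the question also asks for. A minimal sketch of one way to do that, assuming the expected rows are every combination of the years and day numbers that occur in the frame (that assumption, the small sample frame, and the names expected and filled are mine, not the asker's): reindex against the full grid so that missing combinations show up as NaN.

```python
import pandas as pd

# Small illustrative frame with two (year, day_of_year) combinations missing
df = pd.DataFrame({
    "day_of_year": [364, 365, 1, 364],
    "year": [2017, 2017, 2018, 2018],
    "var_1": [17.7, 14.8, 14.0, 5.2],
})

# Every (year, day_of_year) pair assumed to be expected
expected = pd.MultiIndex.from_product(
    [sorted(df["year"].unique()), sorted(df["day_of_year"].unique())],
    names=["year", "day_of_year"],
)

# Reindexing inserts the missing pairs with NaN in var_1
filled = (
    df.set_index(["year", "day_of_year"])
      .reindex(expected)
      .reset_index()
)
print(filled)
```

The NaN values can be turned into the literal string "n/a" with fillna("n/a") if that is needed for display, at the cost of making the column non-numeric.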
Good afternoon -
I have a table in Teradata that stores a rolling cumulative sum that resets every month. I would like to calculate the incremental gain between each day of the month. Is this something that I can accomplish with OLAP functions, or should it be handled in a recursive CTE? Would love assistance thinking through this. Thanks!
example source
date        month      cum_value
2022-07-02  July 2022  25
2022-07-01  July 2022  5
2022-06-30  June 2022  100
2022-06-29  June 2022  70
2022-06-28  June 2022  65
2022-06-27  June 2022  50
example result
date        month      cum_value  incremental_value
2022-07-02  July 2022  25         20
2022-07-01  July 2022  5          5
2022-06-30  June 2022  100        30
2022-06-29  June 2022  70         5
2022-06-28  June 2022  65         15
2022-06-27  June 2022  50         ..
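The arithmetic behind the expected result is "current cumulative value minus the previous day's value within the same month", which is the comparison a window function partitioned by month and ordered by date would express. Below is a minimal sketch of that logic in pandas rather than Teradata SQL, using the sample rows from the question; treating the first row of each month as having a previous value of 0 is my assumption.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-07-02", "2022-07-01", "2022-06-30",
        "2022-06-29", "2022-06-28", "2022-06-27",
    ]),
    "month": ["July 2022", "July 2022", "June 2022",
              "June 2022", "June 2022", "June 2022"],
    "cum_value": [25, 5, 100, 70, 65, 50],
})

# Order chronologically, then look one row back within each month
df = df.sort_values("date")
prev = df.groupby("month")["cum_value"].shift(1)

# Assumption: the first day of a month has no predecessor, so subtract 0
df["incremental_value"] = df["cum_value"] - prev.fillna(0).astype(int)
print(df)
```

For 2022-06-27 this sketch reports 50, because the sample contains no earlier June row; the question leaves that cell as ".." since the true previous value is not shown.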
How could I calculate a field based on values from previous and next rows?
I have this list of users with a date (month and year) and a field indicating whether the user has 1+ purchases in that month-year
id_user  Date      Has_purchases  Active
15678    Jan 2021  0              1
15678    feb 2021  1              1
15678    mar 2021  0              1
15678    Apr 2021  0              1
15678    may 2021  0              0
15678    jun 2021  0              1
15678    jul 2021  0              1
15678    Aug 2021  1              1
15678    sep 2021  0              1
15678    oct 2021  0              1
15678    nov 2021  0              1
15678    Dec 2021  1              1
I need to calculate whether the user was active on a given date (month-year). An active user is defined as a user who has at least one purchase in the last 3 months.
E.g., user 15678 is 'active' in March because the user has a purchase in February. The same user is inactive in May because there are no purchases in March and April, and also none in June and July.
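One reading that reproduces the Active column in the sample is a centered window: a month counts as active if there is a purchase anywhere from two months before to two months after it. That window is inferred from the sample table and the May example, not stated explicitly, so treat it as an assumption. A minimal pandas sketch under that assumption, also assuming one row per user per month with no gaps:

```python
import pandas as pd

# Sample data from the question, with Date held as monthly periods
df = pd.DataFrame({
    "id_user": [15678] * 12,
    "Date": pd.period_range("2021-01", periods=12, freq="M"),
    "Has_purchases": [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
})

df = df.sort_values(["id_user", "Date"])

# Assumed rule: any purchase within 2 rows before or after marks the month active
df["Active"] = (
    df.groupby("id_user")["Has_purchases"]
      .transform(lambda s: s.rolling(5, center=True, min_periods=1).max())
      .astype(int)
)
print(df)
```

If only past months should count, dropping center=True and using a window of 3 gives the "last 3 months" rule as literally worded, but that does not match the January and June rows of the sample.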
I am trying to scrape a website called WikiCFP and return the information in the table as a dataframe.
As of now I have this code
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
df = pd.DataFrame(columns=["abbreviation", "name", "dates", "place", "deadline"])
url = "http://www.wikicfp.com/cfp/call?conference=computer%20science&page=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find("table", align="center", cellpadding="3", cellspacing="1", width="100%")
for row in table.find_all("tr")[1:]:
    values = row.find_all("td")
    print(values[0].text.split("\n")[0])
I specifically don't know how to convert the text in each row into a viable list or some other thing from which a dataframe can be made.
Thanks in advance
You can use read_html directly:
dfs = pd.read_html(url, header=0) # return all tables in a list of dataframe
df = dfs[4] # 4 is the index of the dataframe you want
Output:
>>> df
Event When Where Deadline Unnamed: 4
0 DSA 2021 2nd International Conference on Data Science a... 2nd International Conference on Data Science a... 2nd International Conference on Data Science a... NaN
1 DSA 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
2 NCWMC 2022 7th International Conference on Networks, Comm... 7th International Conference on Networks, Comm... 7th International Conference on Networks, Comm... NaN
3 NCWMC 2022 Jul 30, 2022 - Jul 31, 2022 London, United Kingdom Oct 24, 2021 NaN
4 ICAIT 2022 11th International Conference on Advanced Comp... 11th International Conference on Advanced Comp... 11th International Conference on Advanced Comp... NaN
5 ICAIT 2022 Jul 23, 2022 - Jul 24, 2022 Toronto, Canada Oct 24, 2021 NaN
6 CNDC 2021 8th International Conference on Computer Netwo... 8th International Conference on Computer Netwo... 8th International Conference on Computer Netwo... NaN
7 CNDC 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
8 CAIML 2022 3rd International Conference on Artificial Int... 3rd International Conference on Artificial Int... 3rd International Conference on Artificial Int... NaN
9 CAIML 2022 Jul 23, 2022 - Jul 24, 2022 Toronto, Canada Oct 24, 2021 NaN
10 ITCSE 2022 11th International Conference on Information T... 11th International Conference on Information T... 11th International Conference on Information T... NaN
11 ITCSE 2022 Jul 23, 2022 - Jul 24, 2022 Toronto, Canada Oct 24, 2021 NaN
12 CCSIT 2022 12th International Conference on Computer Scie... 12th International Conference on Computer Scie... 12th International Conference on Computer Scie... NaN
13 CCSIT 2022 Jul 30, 2022 - Jul 31, 2022 London, United Kingdom Oct 24, 2021 NaN
14 SOEN 2022 7th International Conference on Software Engin... 7th International Conference on Software Engin... 7th International Conference on Software Engin... NaN
15 SOEN 2022 Jul 30, 2022 - Jul 31, 2022 London, United Kingdom Oct 24, 2021 NaN
16 AIAA 2021 11th International Conference on Artificial In... 11th International Conference on Artificial In... 11th International Conference on Artificial In... NaN
17 AIAA 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
18 NLPTA 2021 2nd International Conference on NLP Techniques... 2nd International Conference on NLP Techniques... 2nd International Conference on NLP Techniques... NaN
19 NLPTA 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
20 CSTY 2021 7th International Conference on Computer Scien... 7th International Conference on Computer Scien... 7th International Conference on Computer Scien... NaN
21 CSTY 2021 Dec 18, 2021 - Dec 19, 2021 Dubai, UAE Oct 24, 2021 NaN
22 KG#SAC 2022 ACM SAC 2022 Track on Knowledge Graphs ACM SAC 2022 Track on Knowledge Graphs ACM SAC 2022 Track on Knowledge Graphs NaN
23 KG#SAC 2022 Apr 25, 2022 - Apr 29, 2022 Brno, Czech Republic Oct 24, 2021 NaN
24 E&C 2021 5th International Conference on Electrical & C... 5th International Conference on Electrical & C... 5th International Conference on Electrical & C... NaN
25 E&C 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
26 IoTE 2021 2nd International Conference on Internet of Th... 2nd International Conference on Internet of Th... 2nd International Conference on Internet of Th... NaN
27 IoTE 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
28 EEEN 2021 5th International Conference on Electrical and... 5th International Conference on Electrical and... 5th International Conference on Electrical and... NaN
29 EEEN 2021 Nov 27, 2021 - Nov 28, 2021 Dubai, UAE Oct 24, 2021 NaN
30 CVIE--EI, Scopus 2022 2022 2nd International Conference on Computer ... 2022 2nd International Conference on Computer ... 2022 2nd International Conference on Computer ... NaN
31 CVIE--EI, Scopus 2022 Feb 18, 2022 - Feb 20, 2022 Sanya, China Oct 25, 2021 NaN
32 ICCDA--Ei, Scopus 2022 2022 The 6th International Conference on Compu... 2022 The 6th International Conference on Compu... 2022 The 6th International Conference on Compu... NaN
33 ICCDA--Ei, Scopus 2022 Feb 18, 2022 - Feb 20, 2022 Sanya, China Oct 25, 2021 NaN
34 ACM--ICMLC--Ei and Scopus 2022 ACM--2022 14th International Conference on Mac... ACM--2022 14th International Conference on Mac... ACM--2022 14th International Conference on Mac... NaN
35 ACM--ICMLC--Ei and Scopus 2022 Feb 18, 2022 - Feb 20, 2022 Guangzhou, China Oct 25, 2021 NaN
36 IEEE CSP--EI Compendex, Scopus 2022 2022 IEEE 6th International Conference on Cryp... 2022 IEEE 6th International Conference on Cryp... 2022 IEEE 6th International Conference on Cryp... NaN
37 IEEE CSP--EI Compendex, Scopus 2022 Jan 14, 2022 - Jan 16, 2022 Tianjin, China Oct 25, 2021 NaN
38 ACM--ICMIP--Ei and Scopus 2022 ACM--2022 7th International Conference on Mult... ACM--2022 7th International Conference on Mult... ACM--2022 7th International Conference on Mult... NaN
39 ACM--ICMIP--Ei and Scopus 2022 Jan 14, 2022 - Jan 16, 2022 Tianjin, China Oct 25, 2021 NaN
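In the output above each conference occupies two rows: the first repeats the full title across the When/Where/Deadline columns, and the second holds the actual values. A minimal sketch of folding each pair into one row with the column names from the question, using a small stand-in frame whose values are copied from the first rows of the output (titles truncated exactly as displayed); it assumes the rows always alternate in this way.

```python
import pandas as pd

# Stand-in for dfs[4]: title row followed by detail row for each event
title_1 = "2nd International Conference on Data Science a..."
title_2 = "7th International Conference on Networks, Comm..."
raw = pd.DataFrame({
    "Event": ["DSA 2021", "DSA 2021", "NCWMC 2022", "NCWMC 2022"],
    "When": [title_1, "Nov 27, 2021 - Nov 28, 2021",
             title_2, "Jul 30, 2022 - Jul 31, 2022"],
    "Where": [title_1, "Dubai, UAE", title_2, "London, United Kingdom"],
    "Deadline": [title_1, "Oct 24, 2021", title_2, "Oct 24, 2021"],
})

# Even positions carry the title, odd positions carry the details
names = raw.iloc[::2].reset_index(drop=True)
details = raw.iloc[1::2].reset_index(drop=True)

tidy = pd.DataFrame({
    "abbreviation": names["Event"],
    "name": names["When"],
    "dates": details["When"],
    "place": details["Where"],
    "deadline": details["Deadline"],
})
print(tidy)
```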
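Try calling getText() on each cell rather than on the whole list:
[v.getText() for v in values]
The getText() function returns the textual content of a bs4 HTML object; values itself is a list of such objects, so it has no getText() of its own.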
Looking for an Oracle SQL query to show the month and year for every month from one year before the current month to one year after it.
Eg: December 2019, January 2020, February 2020,......December 2021
You can use a hierarchical query as follows:
SQL> SELECT trunc(ADD_MONTHS(ADD_MONTHS(sysdate,-12), LEVEL-1), 'Mon') as month_year
2 FROM DUAL CONNECT BY LEVEL <= 24 + 1;
MONTH_YEAR
--------------
December 2019
January 2020
February 2020
March 2020
April 2020
May 2020
June 2020
July 2020
August 2020
September 2020
October 2020
November 2020
December 2020
January 2021
February 2021
March 2021
April 2021
May 2021
June 2021
July 2021
August 2021
September 2021
October 2021
November 2021
December 2021
25 rows selected.
SQL>
There are multiple methods for doing this. I think simple examples like this are a good opportunity to learn about recursive CTEs:
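with dates(yyyymm, n) as (
      -- anchor: first day of the month one year ago
      select add_months(trunc(sysdate, 'Mon'), -12) as yyyymm, 1 as n
      from dual
      union all
      -- step forward one month at a time, 25 months in total
      select add_months(yyyymm, 1), n + 1
      from dates
      where n <= 24
     )
select yyyymm
from dates;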
WITH d AS (
SELECT
'JAN' m,
2021 y
FROM
dual
), d1 AS (
SELECT
to_date(m || y, 'MONYYYY') first_day,
last_day(to_date(m || y, 'MONYYYY')) last_day1,
last_day(to_date(m || y, 'MONYYYY')) - to_date(m || y, 'MONYYYY') no_of_days
FROM
d
)
SELECT
level - 1 + first_day dates
FROM
d1
CONNECT BY
level <= no_of_days + 1;