Ensuring years and months run consecutively as part of data cleaning - pandas

I have 2 datasets:
rainfall per month (mm) from 1982-01 to 2022-08
no. of rainy days per month per year from 1982-01 to 2022-08.
month no_of_rainy_days
0 1982-01 10
1 1982-02 5
2 1982-03 11
3 1982-04 14
4 1982-05 10
month total_rainfall
0 1982-01 107.1
1 1982-02 27.8
2 1982-03 160.8
3 1982-04 157.0
4 1982-05 102.2
Qn 1: As part of ensuring data integrity, how do I ensure that the dates run consecutively? i.e., 1982-01 is followed by 1982-02, not a skip to 1982-03?
I am unsure how to perform this check and have searched online. Is it common practice to simply assume that the years and months run consecutively?

First, separate the year from the month.
df.rename(columns={"month": "ym"}, inplace=True)
df[["year", "month"]] = df["ym"].astype(str).str.split("-", expand=True)
Then you can group the dataframe by year and count the number of observations per year (i.e., the number of rows per year).
observations_per_year = df["year"]\
    .groupby(df["year"])\
    .agg("count")\
    .reset_index(name="observations")
observations_per_year[observations_per_year["observations"] < 12]
If any years have fewer than 12 observations, they will be displayed like so:
year observations
0 1982 11
4 1986 11
5 1987 11
6 1988 10
11 1993 11
Given the limited detail and sample data provided, I made some assumptions about your data:
Each data set will not have more than one row for any month of the year (i.e., a maximum of 12 rows/observations per year).
Each dataframe contains a single observation per row, as shown in your examples (so you would do this for each dataframe prior to merging them). As such, counting rows per year-month is an accurate means of counting the number of observations for any given month.
The sorted order of the data is irrelevant (you can later sort by year-month if needed).
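Alternatively, here is a minimal sketch of another way to do the check (not part of the answer above): working on the original dataframe before the rename, convert the year-month strings to monthly periods and compare them against a complete monthly range, which reports any gaps or duplicates directly. The column name month matches the question's data; everything else is illustrative.
import pandas as pd

# Minimal sketch: verify that the "month" column runs consecutively.
# Assumes df has a "month" column of strings such as "1982-01".
months = pd.to_datetime(df["month"], format="%Y-%m").dt.to_period("M")

# The full range we expect between the first and last observed month.
expected = pd.period_range(months.min(), months.max(), freq="M")

missing = expected.difference(pd.PeriodIndex(months))  # gaps in the sequence
dupes = months[months.duplicated()]                    # repeated months

print("missing months:   ", list(missing.astype(str)))
print("duplicated months:", list(dupes.astype(str)))
If both lists come back empty, every month between the first and last observation is present exactly once.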

Related

How to calculate monthly counts per season using dataframe in pandas

Need to calculate the monthly average count per season for the dataset given below
season months daily counts
1 2 280
1 3 360
2 1 290
3 2 750
3 4 360
I tried using the code below, but the counts are daily for each month, so I couldn't get the average monthly counts
dataseason = pd.read_csv(path, usecols=['season', 'mnth', 'cnt'])
dataseason['col5'] = dataseason.groupby(dataseason['season'].ne(dataseason['season'].shift()).cumsum())['cnt'].transform('mean')
print(dataseason.drop_duplicates(subset='col5'))
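No answer was posted in the thread; one possible approach, sketched under the assumption that the CSV really has columns season, mnth and cnt as in the attempted code, is to total the daily counts per (season, month) first and then average those monthly totals within each season:
import pandas as pd

# Sketch, not a verified solution: assumes columns season, mnth, cnt,
# where cnt holds daily counts.
dataseason = pd.read_csv(path, usecols=['season', 'mnth', 'cnt'])

# 1) total the daily counts for each (season, month) pair
monthly_totals = dataseason.groupby(['season', 'mnth'], as_index=False)['cnt'].sum()

# 2) average the monthly totals within each season
avg_per_season = monthly_totals.groupby('season', as_index=False)['cnt'].mean()
print(avg_per_season.rename(columns={'cnt': 'avg_monthly_count'}))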

How to redistribute outliers over the previous time period?

Imagine a dataframe that looks like this:
1
2
3
4
5
6
7
50
16
17
Normally we would apply an algorithm from Detect and exclude outliers in a pandas DataFrame to remove the 50 entirely; however, my particular dataset instead requires me to distribute the value of the 50 over the previous 7 days:
8
9
10
11
12
13
14
15
16
17
How can I make this work in Pandas? I can detect the outliers easily enough, but I'm not sure how to spread the values out into the previous days. Note that a simple moving average doesn't work well for this type of data, as there would still be a jump in the average value when the 50 shows up. What I need is to smooth the 50 out into the previous days so that no jump is visible.
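This one also has no answer in the thread; below is a minimal sketch of the redistribution step only. The detection rule, the choice of replacement value, and the even split over the previous rows are all assumptions, so adjust them to whatever smoothness rule your data actually needs.
import pandas as pd

# Sketch of one way to spread an outlier over the previous 7 rows.
# The detection rule and the "expected" value are placeholders.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 50, 16, 17], dtype=float)

window = 7
outliers = s.index[s > 2 * s.mean()]        # crude detection rule, swap in your own

for i in outliers:
    prev = s.index[max(i - window, 0):i]    # rows that absorb the excess
    expected = s.loc[prev].median()         # a guess at what the point "should" be
    excess = s.loc[i] - expected
    s.loc[prev] += excess / len(prev)       # distribute the excess evenly
    s.loc[i] = expected

print(s)
Because the excess is split evenly, the result will not exactly match the smoothly increasing series shown in the question; if the filled-in values need to continue the local trend instead, replace the even split with an interpolation between the neighbouring values.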

Forecasting with Postgres

I'm trying to do some forecasting on some data I have. The tables below are just examples.
Basically I have an integer value x for today's date in a table.
todays_date, x
07/15/2018, 3
I have another query that has generated the monthly avg change of x for the past 3 years.
month, change
jan, 1
feb, 2
mar, 1
apr,-1
may, 1
jun, -2
jul, 2 ...
All I want to do now is create entries in a new table that hold the "forecast" for the next 6 months: add the current month's historical avg change to the current value of x, then keep adding the following month's change to that running total for each of the next 6 months, putting a row in the table for each month.
todays_date, forecast_date, value
07/15/2018, 08/01/2018, 6
07/15/2018, 09/01/2018, 8
07/15/2018, 10/01/2018, 9
07/15/2018, 11/01/2018, 11
07/15/2018, 12/01/2018, 13
07/15/2018, 01/01/2019, 13
I could do this in Go but I would much rather do it in Postgres and possibly create a trigger to populate this forecast table.
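No answer was posted, and the question wants this done in Postgres; purely to make the intended arithmetic concrete, here is the same running-total logic sketched in pandas. The starting value and the jan-jul changes come from the question; the six-month horizon below is a placeholder, since the aug-dec changes are not shown.
import pandas as pd

# Sketch of the forecast arithmetic only; the question itself asks for Postgres.
x_today = 3
monthly_change = pd.Series([1, 2, 1, -1, 1, -2, 2],
                           index=['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul'])

# Placeholder horizon: six months whose changes we actually know.
horizon = ['feb', 'mar', 'apr', 'may', 'jun', 'jul']

# Keep adding each month's average change to the running value.
forecast = x_today + monthly_change[horizon].cumsum()
print(forecast)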

pandas: get third business day of month

I have a date range
pd.bdate_range("2001-01-01", "2018-01-01")
and want to find the third business day of the month (ignore holidays for now). How do I do that?
As you are already in business dates, you could resample to the start of the business month ('BMS') and add an offset of 3 business days:
>>> pd.Series(index=pd.bdate_range("2001-01-01", "2018-01-01"),
...           dtype=float).resample('BMS').asfreq().index + pd.offsets.BDay(3)
DatetimeIndex(['2001-01-04', '2001-02-06', '2001-03-06', '2001-04-05',
'2001-05-04', '2001-06-06', '2001-07-05', '2001-08-06',
'2001-09-06', '2001-10-04',
...
'2017-04-06', '2017-05-04', '2017-06-06', '2017-07-06',
'2017-08-04', '2017-09-06', '2017-10-05', '2017-11-06',
'2017-12-06', '2018-01-04'],
dtype='datetime64[ns]', length=205, freq=None)
You'll find further details on how to work with dates in pandas in the documentation.
Plan: groupby year, month. Choose third with nth().
This example will be easier with a series:
dates = pd.Series(pd.bdate_range("2001-01-01", "2018-01-01"))
dates.groupby((dates.dt.year, dates.dt.month)).nth(3)
Partial output:
2001 1 2001-01-04
2 2001-02-06
3 2001-03-06
4 2001-04-05
5 2001-05-04
6 2001-06-06
7 2001-07-05
8 2001-08-06
9 2001-09-06
10 2001-10-04
11 2001-11-06
12 2001-12-06
2002 1 2002-01-04
2 2002-02-06
3 2002-03-06
4 2002-04-04

How to get a moving average in SQL

If I have data from week 1 to week 52 and I want a 4-week moving average with a 1-week step, how can I write a SQL query for this? For example, for week 5 I want the week 1-week 4 average, for week 6 the week 2-week 5 average, and so on.
I have the columns week and target_value in table A.
Sample data is like this:
Week target_value
1 20
2 10
3 10
4 20
5 60
6 20
So the output I want will start from week 5, since a full week 1-week 4 window is only available from that point on.
Output data will look like:
Week Output
5 15 (20+10+10+20)/4=15 Moving Average week1-week4
6 25 (10+10+20+60)/4=25 Moving Average week2-week5
The data is in hive but I can move it to oracle if it is simpler to do this there.
SELECT
    Week,
    (SELECT ISNULL(AVG(B.target_value), A.target_value)
     FROM tblA B
     WHERE B.Week < A.Week
       AND B.Week >= (A.Week - 4)
    ) AS Moving_Average
FROM tblA A
The ISNULL keeps you from getting a null for your first week, since there is no week 0; if you want it to be null, just leave the ISNULL function out. (ISNULL is SQL Server syntax; on Hive or Oracle, COALESCE or NVL plays the same role.)
If you want it to start at week 5 only, then add the following line to the end of the SQL that I wrote:
WHERE A.Week > 4
Results:
Week Moving_Average
1 20
2 20
3 15
4 13
5 15
6 25