time dataframe to single column data - pandas

I have data that looks like this:
df = pd.DataFrame( np.random.randn(140,13),columns=['Year', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
df['Year']=np.arange(1876,2016)
df.head()
Out[54]:
Year Jan Feb Mar Apr May Jun Jul \
1 1877 -0.341183 -2.369659 -0.301529 1.268756 0.291787 -0.433796 1.846660
2 1878 0.015547 -1.248171 -0.961130 -2.473062 -1.227789 -0.291215 -0.552831
3 1879 -1.643790 0.238561 1.120954 0.273184 -2.255050 0.189526 -0.528215
4 1880 1.800950 0.900657 -1.785493 -0.505400 -0.909594 0.829114 0.310907
Aug Sep Oct Nov Dec
0 -0.540807 1.041048 -0.392727 0.526774 0.482579
1 0.087704 1.520229 0.008850 -0.052644 1.255057
2 0.475701 -0.402313 0.860482 -1.331818 1.248075
3 1.746745 -0.362812 -0.357801 -1.649273 -0.884970
4 1.064974 -2.636122 0.300357 0.523165 1.047123
I want to transform it into a single column of data indexed by year-month. I tried stacking the original data, but the result is a series in which the Year values are mixed in with my data values.
df=df.stack()
df
Out[60]:
0 Year 1876.000000
Jan -1.375433
Feb 0.115271
Mar 0.160305
Apr 0.962201
May -1.170467
Jun -0.312078
Jul -1.046972
Aug -0.540807
Sep 1.041048
Oct -0.392727
Nov 0.526774
Dec 0.482579
1 Year 1877.000000
Jan -0.341183
...
What I really want looks like:
result=pd.DataFrame(data=np.random.randn(10,1),columns=['values'],index=pd.date_range('1876/1/1',periods=10,freq='BM'))
result.head()
Out[58]:
values
1876-01-31 0.593254
1876-02-29 0.777550
1876-03-31 -1.777443
1876-04-28 -0.880476
1876-05-31 -1.698800

Set Year as the index first with set_index, and then stack.
# data
# =====================
Year Jan Feb Mar Apr ... Aug Sep Oct Nov Dec
0 1876 1.8309 0.6724 0.6230 0.3548 ... 0.6316 0.7837 -0.0132 -0.3274 -0.0795
1 1877 1.1363 -2.5042 1.8929 -0.2806 ... 2.0662 0.5430 -0.2887 1.2593 0.6788
2 1878 -0.4730 -1.3182 1.2255 1.1420 ... -0.3064 -1.0505 0.8774 -0.7551 1.0743
3 1879 -0.6651 -0.1462 0.5634 1.7074 ... 0.1588 0.8856 -2.9899 -0.2085 0.3358
4 1880 -0.1305 1.2971 -0.6043 -1.1446 ... 0.7274 -0.8798 0.0978 -0.7801 -1.7695
5 1881 0.0165 -0.6090 -0.2994 -0.5597 ... -1.3628 0.6206 1.4357 1.1800 -1.8132
6 1882 -0.3365 -0.0699 -1.2027 -0.4825 ... -0.3016 1.7806 0.9992 -1.4172 0.4250
7 1883 0.7963 -1.1474 0.8532 -0.9619 ... -0.8057 -1.0750 -0.5305 0.3533 -0.0818
.. ... ... ... ... ... ... ... ... ... ... ...
132 2008 -0.0440 -2.2967 -1.0145 0.1504 ... -0.4940 0.2150 0.2712 0.5997 0.2958
133 2009 -0.2410 -0.6169 1.1429 0.1749 ... 0.8128 0.9391 1.1312 -0.0915 1.1761
134 2010 0.8155 0.3567 1.1648 0.7068 ... -0.8204 -0.3549 1.5648 -0.2102 1.6549
135 2011 0.4847 -0.4535 0.5300 -0.8678 ... -0.2837 0.8821 1.1700 0.0899 -0.5830
136 2012 0.1835 0.9730 -0.7666 -1.0301 ... 0.3203 -0.2747 -1.8450 0.0942 0.2149
137 2013 0.2517 0.8293 1.9907 -1.0461 ... -0.3113 0.7177 0.8896 0.2329 2.0546
138 2014 -1.6106 -1.3285 -0.1870 0.2511 ... -0.3264 1.3578 1.5639 -1.3799 -1.1196
139 2015 -2.0050 0.3680 -0.5553 -0.6471 ... 0.6217 -0.0965 1.3019 -1.0420 -1.3107
[140 rows x 13 columns]
# processing
# =================================
df.set_index('Year').stack()
Year
1876 Jan 1.8309
Feb 0.6724
Mar 0.6230
Apr 0.3548
May 1.4329
Jun -0.3263
Jul 1.7276
Aug 0.6316
...
2015 May -0.5075
Jun -1.4982
Jul -1.9434
Aug 0.6217
Sep -0.0965
Oct 1.3019
Nov -1.0420
Dec -1.3107
dtype: float64
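The stacked result above still carries a (Year, month name) MultiIndex rather than the date index the question asks for. A minimal sketch of one way to finish the conversion, assuming first-of-month timestamps are acceptable. The `month_number` lookup and the `values` column name are illustrative choices, not part of the original answer:

```python
import numpy as np
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Sample data shaped like the question's frame: one row per year, one column per month
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((140, 12)), columns=months)
df.insert(0, 'Year', np.arange(1876, 2016))

# Year becomes the outer index level, month name the inner level
stacked = df.set_index('Year').stack()

# Hypothetical helper mapping: month name -> month number
month_number = {name: i for i, name in enumerate(months, start=1)}

# Build ISO date strings such as "1876-01-01" and parse them into a DatetimeIndex
stacked.index = pd.to_datetime(
    [f"{year}-{month_number[name]:02d}-01" for year, name in stacked.index]
)
result = stacked.to_frame('values')
print(result.head())
```

If month-end labels are preferred, adding `pd.offsets.MonthEnd(0)` to the index should roll each timestamp forward to the last day of its month.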

Related

Get the number of unique days with overlapping dates (in SAS)

I couldn't explain the problem briefly, so I'll try to explain it this way. Let's say I have a table similar to the one below.
How do I get, per student, the total number of days in October on which that student has at least 1 book checked out?
Please note that a single student can check out more than 1 book at a time, which causes the overlapping dates.
|Student|Book                                 |Date_Borrowed|Date_Returned|
|-------|-------------------------------------|-------------|-------------|
|David  |A Thousand Splendid Suns             |01 Oct 2021  |05 Oct 2021  |
|David  |Jane Eyre                            |09 Oct 2021  |13 Oct 2021  |
|David  |Please Look After Mom                |21 Oct 2021  |29 Oct 2021  |
|Fiona  |Sense and Sensibility                |05 Oct 2021  |14 Oct 2021  |
|Fiona  |The Girl Who Saved the King of Sweden|05 Oct 2021  |14 Oct 2021  |
|Fiona  |A Fort of Nine Towers                |02 Oct 2021  |17 Oct 2021  |
|Fiona  |One Hundred Years of Solitude        |20 Oct 2021  |30 Oct 2021  |
|Fiona  |The Unbearable Lightness of Being    |20 Oct 2021  |30 Oct 2021  |
|Greg   |Fahrenheit 451                       |06 Oct 2021  |11 Oct 2021  |
|Greg   |One Hundred Years of Solitude        |10 Oct 2021  |17 Oct 2021  |
|Greg   |Please Look After Mom                |15 Oct 2021  |21 Oct 2021  |
|Greg   |4 3 2 1                              |20 Oct 2021  |27 Oct 2021  |
|Greg   |The Girl Who Saved the King of Sweden|27 Oct 2021  |03 Nov 2021  |
|Marcus |Fahrenheit 451                       |01 Oct 2021  |04 Oct 2021  |
|Marcus |Nectar in a Sieve                    |15 Oct 2021  |15 Oct 2021  |
|Marcus |Please Look After Mom                |30 Oct 2021  |31 Oct 2021  |
|Priya  |Like Water for Chocolate             |02 Oct 2021  |21 Oct 2021  |
|Priya  |Fahrenheit 451                       |21 Oct 2021  |22 Oct 2021  |
|Sasha  |Baudolino                            |03 Oct 2021  |29 Oct 2021  |
|Sasha  |A Thousand Splendid Suns             |07 Oct 2021  |16 Oct 2021  |
|Sasha  |A Fort of Nine Towers                |26 Oct 2021  |01 Nov 2021  |
Thanks in advance!
Using the data step, you can expand each date into a long format. From there, you can use SQL to do a simple count by student after removing overlapping dates.
data foo;
set have;
do date = date_borrowed to date_returned;
output;
end;
keep student date;
format date date9.;
run;
This gets us a long table of all the dates with at least one book checked out for each student.
student date
David 01OCT2021
David 02OCT2021
David 03OCT2021
David 04OCT2021
David 05OCT2021
David 09OCT2021
...
Now we need to remove the overlapping dates.
proc sort data=foo nodupkey;
by student date;
run;
From here, we can do a simple SQL count per student.
proc sql noprint;
create table want as
select student
, intnx('month', date, 0, 'B') as month format=monyy7.
, count(*) as days_checked_out
from foo
where calculated month = '01OCT2021'd
group by student, calculated month
;
quit;
Output:
student month days_checked_out
David OCT2021 19
Fiona OCT2021 27
Greg OCT2021 26
Marcus OCT2021 7
Priya OCT2021 21
Sasha OCT2021 29
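For readers who want to check these numbers outside SAS, here is a rough pandas sketch of the same three steps (expand to one row per day, drop duplicate days, count per student). It uses only a subset of the question's rows, and the variable names are illustrative assumptions:

```python
import pandas as pd

# Subset of the question's table: (student, date borrowed, date returned)
loans = [
    ("David", "2021-10-01", "2021-10-05"),
    ("David", "2021-10-09", "2021-10-13"),
    ("David", "2021-10-21", "2021-10-29"),
    ("Priya", "2021-10-02", "2021-10-21"),
    ("Priya", "2021-10-21", "2021-10-22"),
    ("Sasha", "2021-10-03", "2021-10-29"),
    ("Sasha", "2021-10-07", "2021-10-16"),
    ("Sasha", "2021-10-26", "2021-11-01"),
]

# Expand each loan into one row per day it covers (both ends inclusive)
rows = [
    (student, day)
    for student, borrowed, returned in loans
    for day in pd.date_range(borrowed, returned)
]
long = pd.DataFrame(rows, columns=["student", "date"])

# Remove overlapping days, keep October only, then count per student
unique_days = long.drop_duplicates()
in_october = unique_days["date"].between(
    pd.Timestamp("2021-10-01"), pd.Timestamp("2021-10-31")
)
days_checked_out = unique_days[in_october].groupby("student")["date"].count()
print(days_checked_out)
```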
An easy way is to make a temporary array with one variable for each day in the time period you want to count. Then just use a do loop to set the variables representing those days to 1. When you have reached the last record for a student then take the sum to find the number of days covered.
First let's convert your posted table into a dataset.
data have;
infile cards dsd dlm='|' truncover;
input Student :$20. Book :$100. (Date_Borrowed Date_Returned) (:date.);
format Date_Borrowed Date_Returned date11.;
cards;
David|A Thousand Splendid Suns|01 Oct 2021|05 Oct 2021
David|Jane Eyre|09 Oct 2021|13 Oct 2021
David|Please Look After Mom|21 Oct 2021|29 Oct 2021
Fiona|Sense and Sensibility|05 Oct 2021|14 Oct 2021
Fiona|The Girl Who Saved the King of Sweden|05 Oct 2021|14 Oct 2021
Fiona|A Fort of Nine Towers|02 Oct 2021|17 Oct 2021
Fiona|One Hundred Years of Solitude|20 Oct 2021|30 Oct 2021
Fiona|The Unbearable Lightness of Being|20 Oct 2021|30 Oct 2021
Greg|Fahrenheit 451|06 Oct 2021|11 Oct 2021
Greg|One Hundred Years of Solitude|10 Oct 2021|17 Oct 2021
Greg|Please Look After Mom|15 Oct 2021|21 Oct 2021
Greg|4 3 2 1|20 Oct 2021|27 Oct 2021
Greg|The Girl Who Saved the King of Sweden|27 Oct 2021|03 Nov 2021
Marcus|Fahrenheit 451|01 Oct 2021|04 Oct 2021
Marcus|Nectar in a Sieve|15 Oct 2021|15 Oct 2021
Marcus|Please Look After Mom|30 Oct 2021|31 Oct 2021
Priya|Like Water for Chocolate|02 Oct 2021|21 Oct 2021
Priya|Fahrenheit 451|21 Oct 2021|22 Oct 2021
Sasha|Baudolino|03 Oct 2021|29 Oct 2021
Sasha|A Thousand Splendid Suns|07 Oct 2021|16 Oct 2021
Sasha|A Fort of Nine Towers|26 Oct 2021|01 Nov 2021
;
Now we can use BY group processing in a data step to aggregate per student. We can set the lower and upper index bounds of the array to the values SAS uses to represent those days. Temporary arrays are automatically retained across observations, so we just need to clear the array out when we start a new student.
The SAS compiler does not expect to see a date literal as the index boundaries for an array so we can use %SYSEVALF() to convert the date literal to the integer it represents.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=1;
end;
if last.student;
days = sum(0, of october[*]);
keep student days;
run;
Results:
Obs Student days
1 David 19
2 Fiona 27
3 Greg 26
4 Marcus 7
5 Priya 21
6 Sasha 29
You could also modify it slightly to not only count the number of "covered" (or unique) days, but also the total number of "book" days.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=sum(october[date],1);
end;
if last.student;
unique_days = n(of october[*]);
book_days = sum(0,of october[*]);
keep student unique_days book_days;
run;
Results:
unique_ book_
Obs Student days days
1 David 19 19
2 Fiona 27 58
3 Greg 26 34
4 Marcus 7 7
5 Priya 21 22
6 Sasha 29 43
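The array idea translates to other languages as well. Below is a minimal Python sketch of it, assuming a plain list with one counter slot per October day stands in for the temporary array. The function name `october_coverage` and the tuple layout are made up for illustration, and only Fiona's and Greg's rows are included:

```python
from datetime import date

START, END = date(2021, 10, 1), date(2021, 10, 31)
N_DAYS = (END - START).days + 1  # 31 slots, one per day of October

loans = [
    ("Fiona", date(2021, 10, 5), date(2021, 10, 14)),
    ("Fiona", date(2021, 10, 5), date(2021, 10, 14)),
    ("Fiona", date(2021, 10, 2), date(2021, 10, 17)),
    ("Fiona", date(2021, 10, 20), date(2021, 10, 30)),
    ("Fiona", date(2021, 10, 20), date(2021, 10, 30)),
    ("Greg", date(2021, 10, 6), date(2021, 10, 11)),
    ("Greg", date(2021, 10, 10), date(2021, 10, 17)),
    ("Greg", date(2021, 10, 15), date(2021, 10, 21)),
    ("Greg", date(2021, 10, 20), date(2021, 10, 27)),
    ("Greg", date(2021, 10, 27), date(2021, 11, 3)),
]

def october_coverage(loans):
    """Return {student: (unique_days, book_days)} for October 2021."""
    per_student = {}
    for student, borrowed, returned in loans:
        counts = per_student.setdefault(student, [0] * N_DAYS)
        # Clip the loan to October, then bump the counter of every covered day
        first = (max(borrowed, START) - START).days
        last = (min(returned, END) - START).days
        for slot in range(first, last + 1):
            counts[slot] += 1
    return {
        student: (sum(1 for c in counts if c), sum(counts))
        for student, counts in per_student.items()
    }

coverage = october_coverage(loans)
print(coverage)
```

Because the counters live in a dictionary keyed by student, this sketch does not rely on the input being sorted by student the way BY group processing does.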

How to get the latest values day wise from a timeseries table?

I want to get the latest values of each SIZE_TYPE day wise, ordered by TIMESTAMP. So, only 1 value of each SIZE_TYPE must be present for a given day, and that is the latest value for the day.
How do I get the desired output? I'm using PostgreSQL here.
Input
|TIMESTAMP |SIZE_TYPE|SIZE|
|----------------------------------------|---------|----|
|1595833641356 [Mon Jul 27 2020 07:07:21]|0 |541 |
|1595833641356 [Mon Jul 27 2020 07:07:21]|1 |743 |
|1595833641356 [Mon Jul 27 2020 07:07:21]|2 |912 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|1 |714 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|2 |987 |
|1595963241356 [Tue Jul 28 2020 19:07:21]|0 |498 |
|1595920041356 [Tue Jul 28 2020 07:07:21]|2 |974 |
|1595920041356 [Tue Jul 28 2020 07:07:21]|0 |512 |
*Note: the TIMESTAMP values are in UNIX time. I have given
the date-time string for reference*
Output
|TIMESTAMP |SIZE_TYPE|SIZE|
|----------------------------------------|---------|----|
|1595833641356 [Mon Jul 27 2020 07:07:21]|0 |541 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|1 |714 |
|1595876841356 [Mon Jul 27 2020 19:07:21]|2 |987 |
|1595920041356 [Tue Jul 28 2020 07:07:21]|2 |974 |
|1595963241356 [Tue Jul 28 2020 19:07:21]|0 |498 |
*Note: the TIMESTAMP values are in UNIX time. I have given
the date-time string for reference*
Explanation
For July 27, the latest values for
0: 541 (no other entries for the day)
1: 714
2: 987
For July 28, the latest values for
0: 498
1: nothing (ignore)
2: 974 (no other entries for the day)
You can use distinct on:
select distinct on (floor(timestamp / (24 * 60 * 60 * 1000)), size_type) t.*
from input t
order by floor(timestamp / (24 * 60 * 60 * 1000)), size_type,
timestamp desc;
The arithmetic is just to extract the day from the timestamp.
Here is a db<>fiddle.
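To see what the query is doing, the same grouping can be sketched in Python: bucket each row by (day number, size type) and keep the row with the largest timestamp in each bucket. This is only an illustration of the logic, and it assumes day boundaries in UTC, as the integer arithmetic in the query does. The function name `latest_per_day` is hypothetical:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# (timestamp in ms, size_type, size), copied from the question's input table
rows = [
    (1595833641356, 0, 541),
    (1595833641356, 1, 743),
    (1595833641356, 2, 912),
    (1595876841356, 1, 714),
    (1595876841356, 2, 987),
    (1595963241356, 0, 498),
    (1595920041356, 2, 974),
    (1595920041356, 0, 512),
]

def latest_per_day(rows):
    latest = {}
    for ts, size_type, size in rows:
        key = (ts // MS_PER_DAY, size_type)  # same bucket as the DISTINCT ON key
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, size_type, size)
    # Order by day, then size_type, like the ORDER BY prefix in the query
    return [latest[key] for key in sorted(latest)]

result = latest_per_day(rows)
for row in result:
    print(row)
```

The rows come out ordered by day and then by size type, so within July 28 the type 0 row appears before the type 2 row even though its timestamp is later.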

Is there a way to rejig a dataframe to show a better time series dataset?

Hi I have the following df:
Variable Total Month
Year
2011 110 01
2011 111 02
2011 112 03
2011 113 04
2011 114 05
2011 115 06
....
....
2021 302 04
2021 303 05
2021 304 06
Is it possible to rejig the dataset to this:
Jan Feb Mar Apr May .... Nov Dec
Year
2011 110 111 112 113 114
2012 ...
2013 ...
2014 ...
2015 ...
....
2020
2021
** I would also like to remove the "Variable" word at the corner of the table.
My eventual goal is to do some simple data visualization using matplotlib to create line plots of the respective years (2011...2021)
Thank you in advance!
Use pivot()+reindex():
from calendar import month_abbr
df['Month']=pd.to_datetime(df['Month'],format='%m').dt.strftime('%b')
df=df.pivot(columns='Month',values='Total').rename_axis(columns=None)
df=df.reindex(columns=month_abbr[1:])
OR
via pivot()+pd.Categorical():
df['Month']=pd.to_datetime(df['Month'],format='%m').dt.strftime('%b')
df=df.pivot(columns='Month',values='Total').rename_axis(columns=None)
df.columns=pd.Categorical(df.columns,month_abbr[1:],ordered=True)
df=df.sort_index(axis=1)
Now if you print df you will get your expected output
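A small self-contained run of the pivot idea, for anyone who wants to try it without the original data. The sample values below are made up, and this sketch maps the month numbers through `calendar.month_abbr` directly instead of going through `pd.to_datetime`:

```python
from calendar import month_abbr

import pandas as pd

# Made-up sample in the question's shape: Year as the index, Month as zero-padded strings
df = pd.DataFrame(
    {
        "Total": [110, 111, 112, 210, 211, 212],
        "Month": ["01", "02", "03", "01", "02", "03"],
    },
    index=pd.Index([2011, 2011, 2011, 2012, 2012, 2012], name="Year"),
)
df.columns.name = "Variable"

# Month number -> abbreviated month name
df["Month"] = df["Month"].astype(int).map(lambda m: month_abbr[m])

# One row per year, one column per month; drop the name shown in the table corner
wide = df.pivot(columns="Month", values="Total").rename_axis(columns=None)

# Put the months that are present into calendar order
wide = wide.reindex(columns=[m for m in month_abbr[1:] if m in wide.columns])
print(wide)
```

For the line plots mentioned in the question, `wide.T.plot()` should draw one line per year, assuming matplotlib is installed.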

How to calculate median monthly from date of month table?

My dataset:
Date Num_orders
Mar 21 2019 69
Mar 22 2019 82
Mar 24 2019 312
Mar 25 2019 199
Mar 26 2019 2,629
Mar 27 2019 2,819
Mar 28 2019 3,123
Mar 29 2019 3,332
Mar 30 2019 1,863
Mar 31 2019 1,097
Apr 01 2019 1,578
Apr 02 2019 2,353
Apr 03 2019 2,768
Apr 04 2019 2,648
Apr 05 2019 3,192
Apr 06 2019 2,363
Apr 07 2019 1,578
Apr 08 2019 3,090
Apr 09 2019 3,814
Apr 10 2019 3,836
...
I need to calculate the monthly median number of orders from days of the same month:
The desired results:
Month Median_monthly
Mar 2019 1,863
Apr 2019 2,768
May 2019 2,876
Jun 2019 ...
...
I tried to use the date_trunc function to extract the month from the dataset and then group by 'month', but it didn't work. Thanks for your help. I use the Google BigQuery (#standard) environment!
You probably tried to use PERCENTILE_CONT, which cannot be used with GROUP BY.
Try APPROX_QUANTILES(x, 100)[OFFSET(50)] instead. It should work with GROUP BY.
SELECT APPROX_QUANTILES(Num_orders, 100)[OFFSET(50)] AS median
FROM myTable
GROUP BY Month
Alternatively, you can use PERCENTILE_CONT within a subquery:
SELECT
DISTINCT Month, median
FROM (
SELECT
Month,
PERCENTILE_CONT(Num_orders, 0.5) OVER(PARTITION BY Month) AS median
FROM myTable
)
This would often be done using DISTINCT:
SELECT DISTINCT DATE_TRUNC(date, MONTH) AS month,
PERCENTILE_CONT(Num_orders, 0.5) OVER (PARTITION BY DATE_TRUNC(date, MONTH)) AS median
FROM myTable;
Note: There are two percentile functions, PERCENTILE_CONT() and PERCENTILE_DISC(). They have different results when there is a "tie" in the middle of the data.
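The effect of such a tie can be seen on the ten March rows from the question, which have no single middle value. The sketch below uses Python's `statistics` module purely as an analogy: an interpolating median behaves like a continuous percentile, while a discrete percentile returns one of the actual data values. It makes no claim about which value a particular BigQuery function returns:

```python
import statistics

# Num_orders for Mar 21-31 2019, copied from the question (thousands separators removed)
march = [69, 82, 312, 199, 2629, 2819, 3123, 3332, 1863, 1097]

# Ten values, so the two middle values are 1097 and 1863
interpolated = statistics.median(march)    # average of the two middle values
lower = statistics.median_low(march)       # smaller of the two middle values
upper = statistics.median_high(march)      # larger of the two middle values

print(interpolated, lower, upper)
```

Note that the 1,863 shown for Mar 2019 in the desired results matches the larger of the two middle values rather than the interpolated one.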

Trying to pull the required rows from the single table with applying conditional statements on columns in sql server?

I have tried many ways to solve this problem, but unfortunately I got stuck each time.
source table
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 12 15 16 17 18 19 20 21 22 23
1234 2013 05 06 12 15 16 17 18 19 20 21 22 23
Task: Assume that the current month is March 2014 and we need the data for the previous 12 months (i.e., from Mar 2013 to Feb 2014). All remaining month values need to be zero; only year and id stay unchanged.
Solution:
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 0 0 0 0 0 0 0 0 0 0
1234 2013 0 0 12 15 16 17 18 19 20 21 22 23
This needs a code solution for SQL Server 2008. I would be very happy if anybody can solve this.
Note:
I got stuck trying to pull the column names dynamically.
You can try this:
select id, year,
case when datediff(month, convert(datetime, cast(year as varchar(4)) + '0101'), getdate()) between 1 and 12 then jan else 0 end as jan,
case when datediff(month, convert(datetime, cast(year as varchar(4)) + '0201'), getdate()) between 1 and 12 then feb else 0 end as feb ....
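The windowing rule itself is easy to check outside the database. Here is a minimal Python sketch of it, assuming "12 months back" means the 12 full months before the current month, as in the question's example. The function name `mask_to_trailing_year` is hypothetical:

```python
def mask_to_trailing_year(row_year, values, current_year, current_month):
    """Zero out every month that is not within the 12 months before the current month.

    `values` holds the 12 monthly values for `row_year`, January first.
    """
    now = current_year * 12 + (current_month - 1)
    masked = []
    for month, value in enumerate(values, start=1):
        # Whole months between this cell and the current month
        age = now - (row_year * 12 + (month - 1))
        masked.append(value if 1 <= age <= 12 else 0)
    return masked

# Monthly values from the question's source table (both rows carry the same numbers)
values = [5, 6, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23]

row_2014 = mask_to_trailing_year(2014, values, current_year=2014, current_month=3)
row_2013 = mask_to_trailing_year(2013, values, current_year=2014, current_month=3)
print(row_2014)
print(row_2013)
```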