How to separate columns by a given character width in pandas?

How can I separate a dataframe like the following into columns?
yr mon day Tmax Tmin pcp
2013 4 22 5.09-10.92 0.0
2013 4 23 2.77 -9.63 0.5
2013 4 24 0.28 -9.90 9.9
2013 4 25 0.76 -6.70 12.2
2013 4 26 -0.35 -9.48 0.0
2013 4 27 7.22-10.47 0.0
2013 4 28 4.19-10.78 0.0
As you can see, there is not always whitespace between Tmax and Tmin. Tmax and Tmin are each at most 6 characters wide; shorter values are padded with whitespace. I want to read the file into a DataFrame and separate each column.
How can I separate columns by a given character width?

Try this:
df = pd.read_fwf(filename)

It seems you need str.extract to pull out the floats and ints. This solution works if all the data are in a single column, which is selected here with iloc:
pat = r"(\d+)\s*(\d+)\s*(\d+)\s*([-+]?\d+\.\d+|\d+)\s*([-+]?\d+\.\d+|\d+)\s*([-+]?\d+\.\d+|\d+)"
df1 = df.iloc[:, 0].str.extract(pat, expand=True)
df1.columns = ['year', 'mont','day','Tmax','Tmin','pcp']
print (df1)
year mont day Tmax Tmin pcp
0 2013 4 22 5.09 -10.92 0.0
1 2013 4 23 2.77 -9.63 0.5
2 2013 4 24 0.28 -9.90 9.9
3 2013 4 25 0.76 -6.70 12.2
4 2013 4 26 -0.35 -9.48 0.0
5 2013 4 27 7.22 -10.47 0.0
6 2013 4 28 4.19 -10.78 0.0
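Note that str.extract returns strings, so the extracted columns still need converting to numbers. A minimal sketch of that step, assuming the raw lines sit in a single column (the column name "line" and the helper pattern name num are made up for illustration):

```python
import pandas as pd

# Two lines from the question, held in one string column (hypothetical name "line").
raw = pd.DataFrame({"line": ["2013 4 22 5.09-10.92 0.0",
                             "2013 4 23 2.77 -9.63 0.5"]})

# One signed number, with an optional fractional part.
num = r"[-+]?\d+(?:\.\d+)?"
pat = rf"(\d+)\s*(\d+)\s*(\d+)\s*({num})\s*({num})\s*({num})"

parsed = raw["line"].str.extract(pat, expand=True)
parsed.columns = ["year", "mon", "day", "Tmax", "Tmin", "pcp"]

# Convert every extracted string column to a numeric dtype.
parsed = parsed.apply(pd.to_numeric)
print(parsed.dtypes)
```

Because the sign is part of each numeric group, the pattern still splits "5.09-10.92" into two values even without whitespace between them.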
Another solution is to use read_fwf and specify colspecs:
import pandas as pd
from io import StringIO
temp=u"""yr mon day Tmax Tmin pcp
2013 4 22 5.09-10.92 0.0
2013 4 23 2.77 -9.63 0.5
2013 4 24 0.28 -9.90 9.9
2013 4 25 0.76 -6.70 12.2
2013 4 26 -0.35 -9.48 0.0
2013 4 27 7.22-10.47 0.0
2013 4 28 4.19-10.78 0.0 """
# after testing, replace 'StringIO(temp)' with 'filename.csv'
names=['year', 'mont','day','Tmax','Tmin','pcp']
colspecs = [(0, 6), (9, 10), (12, 14), (21, 26),(26,32),(34,38)]
df = pd.read_fwf(StringIO(temp),colspecs=colspecs, names=names, header=0)
print (df)
year mont day Tmax Tmin pcp
0 2013 4 22 5.09 -10.92 0.0
1 2013 4 23 2.77 -9.63 0.5
2 2013 4 24 0.28 -9.90 9.9
3 2013 4 25 0.76 -6.70 12.2
4 2013 4 26 -0.35 -9.48 0.0
5 2013 4 27 7.22 -10.47 0.0
6 2013 4 28 4.19 -10.78 0.0
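If every field has a known fixed width, read_fwf also accepts a widths list instead of explicit colspecs. The sketch below is only an illustration: the widths [4, 3, 3, 6, 6, 6] are an assumed layout (the question only states that Tmax and Tmin are 6 characters wide), and the text is generated with format strings so that it matches those widths exactly.

```python
from io import StringIO

import pandas as pd

# Three rows from the question, re-rendered with ASSUMED field widths 4, 3, 3, 6, 6, 6.
rows = [
    (2013, 4, 22, 5.09, -10.92, 0.0),
    (2013, 4, 26, -0.35, -9.48, 0.0),
    (2013, 4, 27, 7.22, -10.47, 0.0),
]
text = "\n".join(
    f"{y:4d}{m:3d}{d:3d}{tmax:6.2f}{tmin:6.2f}{pcp:6.1f}"
    for y, m, d, tmax, tmin, pcp in rows
)

names = ["year", "mon", "day", "Tmax", "Tmin", "pcp"]
# widths gives the length of each consecutive field; no separators are required.
df = pd.read_fwf(StringIO(text), widths=[4, 3, 3, 6, 6, 6], names=names, header=None)
print(df)
```

For a real file, the widths would have to be taken from the file's actual layout rather than from this assumed one.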

Related

How to change monthly table into one column with date index?

I downloaded the Broad Dollar Index from FRED with the following format:
DATE RTWEXBGS
0 2006-01-01 100.0000
1 2006-02-01 100.2651
2 2006-03-01 100.5424
3 2006-04-01 100.0540
4 2006-05-01 97.8681
.. ... ...
194 2022-03-01 111.2659
195 2022-04-01 111.8324
196 2022-05-01 114.6075
197 2022-06-01 115.6957
198 2022-07-01 118.2674
I also got an Excel file of inflation rate with a different format:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual
0 2022 0.07480 0.07871 0.08542 0.08259 0.08582 0.09060 0.08525 NaN NaN NaN NaN NaN NaN
1 2021 0.01400 0.01676 0.02620 0.04160 0.04993 0.05391 0.05365 0.05251 0.05390 0.06222 0.06809 0.07036 0.04698
2 2020 0.02487 0.02335 0.01539 0.00329 0.00118 0.00646 0.00986 0.01310 0.01371 0.01182 0.01175 0.01362 0.01234
3 2019 0.01551 0.01520 0.01863 0.01996 0.01790 0.01648 0.01811 0.01750 0.01711 0.01764 0.02051 0.02285 0.01812
4 2018 0.02071 0.02212 0.02360 0.02463 0.02801 0.02872 0.02950 0.02699 0.02277 0.02522 0.02177 0.01910 0.02443
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104 1918 0.19658 0.17500 0.16667 0.12698 0.13281 0.13077 0.17969 0.18462 0.18045 0.18519 0.20741 0.20438 0.17284
105 1917 0.12500 0.15385 0.14286 0.18868 0.19626 0.20370 0.18519 0.19266 0.19820 0.19469 0.17391 0.18103 0.17841
106 1916 0.02970 0.04000 0.06061 0.06000 0.05941 0.06931 0.06931 0.07921 0.09901 0.10784 0.11650 0.12621 0.07667
107 1915 0.01000 0.01010 0.00000 0.02041 0.02020 0.02020 0.01000 -0.00980 -0.00980 0.00990 0.00980 0.01980 0.00915
108 1914 0.02041 0.01020 0.01020 0.00000 0.02062 0.01020 0.01010 0.03030 0.02000 0.01000 0.00990 0.01000 0.01349
How do I change the inflation table into a format similar to the dollar index?
Something like this (the Annual column is not taken into account):
df
###
Year Jan Feb Mar Apr May Jun Jul Aug \
0 2022 0.07480 0.07871 0.08542 0.08259 0.08582 0.09060 0.08525 NaN
1 2021 0.01400 0.01676 0.02620 0.04160 0.04993 0.05391 0.05365 NaN
2 2020 0.02487 0.02335 0.01539 0.00329 0.00118 0.00646 0.00986 NaN
Sep Oct Nov Dec Annual
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df_melt = pd.melt(df, id_vars=['Year'], value_vars=month, var_name='Month', value_name='Sales')
# build a datetime Date column from the year and the month abbreviation
df_melt['Date'] = pd.to_datetime(df_melt['Year'].astype(str) + '-' + df_melt['Month'].astype(str), format='%Y-%b')
df_melt = df_melt[['Date', 'Sales']]
df_melt
###
Date Sales
0 2022-01-01 0.07480
1 2021-01-01 0.01400
2 2020-01-01 0.02487
3 2022-02-01 0.07871
4 2021-02-01 0.01676
5 2020-02-01 0.02335
6 2022-03-01 0.08542
7 2021-03-01 0.02620
8 2020-03-01 0.01539
9 2022-04-01 0.08259
10 2021-04-01 0.04160
11 2020-04-01 0.00329
12 2022-05-01 0.08582
13 2021-05-01 0.04993
14 2020-05-01 0.00118
15 2022-06-01 0.09060
16 2021-06-01 0.05391
17 2020-06-01 0.00646
18 2022-07-01 0.08525
19 2021-07-01 0.05365
20 2020-07-01 0.00986
21 2022-08-01 NaN
22 2021-08-01 NaN
23 2020-08-01 NaN
24 2022-09-01 NaN
25 2021-09-01 NaN
26 2020-09-01 NaN
27 2022-10-01 NaN
28 2021-10-01 NaN
29 2020-10-01 NaN
30 2022-11-01 NaN
31 2021-11-01 NaN
32 2020-11-01 NaN
33 2022-12-01 NaN
34 2021-12-01 NaN
35 2020-12-01 NaN
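The melted result above is ordered month by month rather than chronologically, and it keeps the empty future months. To get closer to the layout of the dollar index table, one option is to drop the NaN rows and sort by date. A minimal sketch on a few values from the question (the column name "Inflation" is an assumption, and only three months are used to keep it short):

```python
import pandas as pd

# A small slice of the inflation table from the question.
df = pd.DataFrame({
    "Year": [2022, 2021],
    "Jan": [0.07480, 0.01400],
    "Jul": [0.08525, 0.05365],
    "Aug": [float("nan"), 0.05251],
})
months = ["Jan", "Jul", "Aug"]

# Wide to long: one row per (Year, Month) pair.
long = df.melt(id_vars="Year", value_vars=months,
               var_name="Month", value_name="Inflation")
long["DATE"] = pd.to_datetime(long["Year"].astype(str) + "-" + long["Month"],
                              format="%Y-%b")

# Drop months without data, then order chronologically like the FRED series.
out = (long.dropna(subset=["Inflation"])
           .sort_values("DATE")[["DATE", "Inflation"]]
           .reset_index(drop=True))
print(out)
```

Calling out.set_index("DATE") afterwards would give a date index if that is preferred over a DATE column.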

creating multi index from data grouped by month in Pandas

Consider this sample data:
Month Location Products Sales Profit
JAN 1 43 32 20
JAN 2 82 54 25
JAN 3 64 43 56
FEB 1 37 28 78
FEB 2 18 15 34
FEB 3 5 2 4
MAR 1 47 40 14
The multi-index transformation I am trying to achieve is this:
JAN FEB MAR
Location Products Sales Profit Products Sales Profit Products Sales Profit
1 43 32 20 37 28 78 47 40 14
2 82 54 25 18 15 34 null null null
3 64 43 56 5 2 4 null null null
I tried versions of this:
df.stack().to_frame().T
It put all the data into one row. So, that's not the goal.
I presume I am close in that it should be a stacking or unstacking, melting or unmelting, but my attempts have all resulted in data oatmeal at this point. Appreciate your time trying to solve this one.
You can use pivot with reorder_levels and sort_index():
df.pivot(index='Location',columns='Month').reorder_levels(order=[1,0],axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
In case you are interested, this answer elaborates on the difference between swaplevel and reorder_levels.
Use pivot:
>>> df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
To preserve the order, convert the Month column to a CategoricalDtype first:
df['Month'] = df['Month'].astype(pd.CategoricalDtype(df['Month'].unique(), ordered=True))
out = df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
print(out)
# Output:
Month JAN FEB MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 43.0 20.0 32.0 37.0 78.0 28.0 47.0 14.0 40.0
2 82.0 25.0 54.0 18.0 34.0 15.0 NaN NaN NaN
3 64.0 56.0 43.0 5.0 4.0 2.0 NaN NaN NaN
Update 2
Try to force the order of level 2 columns:
df1 = df.set_index(['Month', 'Location'])
df1.columns = pd.CategoricalIndex(df1.columns, ordered=True)
df1 = df1.unstack('Month').swaplevel(axis=1).sort_index(axis=1)
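Another way to control both column levels at once is to build the desired column MultiIndex explicitly and reindex to it. This is only a sketch on the sample data from the question; the variable names wide, months and cols are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["JAN", "JAN", "JAN", "FEB", "FEB", "FEB", "MAR"],
    "Location": [1, 2, 3, 1, 2, 3, 1],
    "Products": [43, 82, 64, 37, 18, 5, 47],
    "Sales": [32, 54, 43, 28, 15, 2, 40],
    "Profit": [20, 25, 56, 78, 34, 4, 14],
})

# Pivot, then put Month on the outer column level.
wide = df.pivot(index="Location", columns="Month").swaplevel(axis=1)

# Desired order: months as they first appear, value columns as in the question.
months = df["Month"].unique()
cols = pd.MultiIndex.from_product([months, ["Products", "Sales", "Profit"]])
wide = wide.reindex(columns=cols)
print(wide)
```

The values come out as floats because the missing MAR entries for locations 2 and 3 are filled with NaN.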

From 10 years of data, I want to select only calendar days with max or min value

Ok, so I have a dataset of temperatures for each day of the year, over a period of ten years. Index is date converted to datetime.
I want to get a dataset with only the min and max value for each calendar day throughout the 10-year period.
I can convert the index to a string, remove the year and get the dataset that way, but I'm guessing there is a smarter way to do it.
Use DatetimeIndex.strftime to build a month-day key, then aggregate with GroupBy.agg using min and max:
import numpy as np
import pandas as pd
np.random.seed(2020)
d = pd.date_range('2000-01-01', '2010-12-31')
df = pd.DataFrame({"temp": np.random.randint(0, 30, size=len(d))}, index=d)
print(df)
temp
2000-01-01 0
2000-01-02 8
2000-01-03 3
2000-01-04 22
2000-01-05 3
...
2010-12-27 16
2010-12-28 10
2010-12-29 28
2010-12-30 1
2010-12-31 28
[4018 rows x 1 columns]
df = df.groupby(df.index.strftime('%m-%d'))['temp'].agg(['min','max'])
print (df)
min max
01-01 0 28
01-02 0 29
01-03 3 21
01-04 1 28
01-05 0 26
... ...
12-27 3 29
12-28 4 27
12-29 0 29
12-30 1 29
12-31 2 28
[366 rows x 2 columns]
Finally, to get a DatetimeIndex back, you can add a year (be careful with leap years; 2000 is a leap year, so 02-29 parses):
df.index = pd.to_datetime('2000-' + df.index, format='%Y-%m-%d')
print (df)
min max
2000-01-01 0 28
2000-01-02 0 29
2000-01-03 3 21
2000-01-04 1 28
2000-01-05 0 26
... ...
2000-12-27 3 29
2000-12-28 4 27
2000-12-29 0 29
2000-12-30 1 29
2000-12-31 2 28
[366 rows x 2 columns]
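The string key can also be avoided altogether by grouping on the month and day attributes of the index. A minimal sketch, using randomly generated temperatures as a stand-in for the real data (the level names "month" and "day" are chosen here for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in data: one random temperature per day over 2000-2010.
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", "2010-12-31")
df = pd.DataFrame({"temp": rng.integers(0, 30, size=len(idx))}, index=idx)

# Group by calendar month and day directly, ignoring the year.
out = df.groupby([df.index.month, df.index.day])["temp"].agg(["min", "max"])
out.index.names = ["month", "day"]
print(out)
```

The result has a (month, day) MultiIndex, so a single calendar day can be looked up with, for example, out.loc[(2, 29)].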

pandas sort by multiple columns

I want to sort by column C in ascending order and, within each value of C, by column B in the order "April", "August", "December", followed by any remaining values (e.g. NaN in the current example). Can anyone help?
before
A B C
0 354.7 April 4
1 278.8 NaN 4
2 283.5 December 2
3 249.6 NaN 2
4 95.5 April 2
5 85.6 August 2
6 55.4 August 4
7 176.5 December 4
8 104.8 August 8
9 278.8 NaN 10
10 238.7 April 8
11 278.8 April 5
12 152 December 8
After :
A B C
0 95.5 April 2
1 85.6 August 2
2 283.5 December 2
3 249.6 NaN 2
4 354.7 April 4
5 55.4 August 4
6 176.5 December 4
7 278.8 NaN 4
8 278.8 April 5
9 238.7 April 8
10 104.8 August 8
11 152 December 8
12 278.8 NaN 10
Is this what you need?
df.B=pd.Categorical(df.B,['December','April','August'])
df.sort_values(['C','B'])
Out[284]:
A B C
2 283.5 December 2
4 95.5 April 2
5 85.6 August 2
3 249.6 NaN 2
7 176.5 December 4
0 354.7 April 4
6 55.4 August 4
1 278.8 NaN 4
11 278.8 April 5
12 152.0 December 8
10 238.7 April 8
8 104.8 August 8
9 278.8 NaN 10
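The category list above puts December first. For the order asked for in the question (April, August, December, then NaN), the same technique can be used with the categories listed in that order. A sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [354.7, 278.8, 283.5, 249.6, 95.5, 85.6, 55.4,
          176.5, 104.8, 278.8, 238.7, 278.8, 152.0],
    "B": ["April", np.nan, "December", np.nan, "April", "August", "August",
          "December", "August", np.nan, "April", "April", "December"],
    "C": [4, 4, 2, 2, 2, 2, 4, 4, 8, 10, 8, 5, 8],
})

# An ordered categorical sorts by the listed category order, not alphabetically.
df["B"] = pd.Categorical(df["B"], categories=["April", "August", "December"],
                         ordered=True)

# Sort by C first, then by B; NaN values are placed last by default.
out = df.sort_values(["C", "B"]).reset_index(drop=True)
print(out)
```

Here the alphabetical and the requested order happen to coincide, but the categorical approach keeps working for orders that are not alphabetical.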

How can we keep the index while calculating a 3-day moving average

I have a data set like the one below and want to calculate the maximum of the 3-day moving average. I tried this code:
pd.rolling_mean(data['prec'], 3).max()
This code gives the maximum of the moving average, but without the date.
year month day prec
0 1981 1 1 1.5
1 1981 1 2 0.0
2 1981 1 3 0.0
3 1981 1 4 0.4
4 1981 1 5 0.0
5 1981 1 6 1.0
6 1981 1 7 1.9
7 1981 1 8 0.6
8 1981 1 9 3.7
9 1981 1 10 0.0
10 1981 1 11 0.0
11 1981 1 12 0.0
12 1981 1 13 0.0
13 1981 1 14 12.2
14 1981 1 15 1.7
15 1981 1 16 0.6
16 1981 1 17 0.9
17 1981 1 18 0.6
18 1981 1 19 0.4
19 1981 1 20 0.2
20 1981 1 21 1.4
21 1981 1 22 3.2
22 1981 1 23 0.0
The format I want is
year month day prec
.... .. .. ...
Can anyone help solve this problem?
Assign the result of a rolling mean or rolling max to a DataFrame column (pd.rolling_mean and pd.rolling_max have been removed from pandas; use Series.rolling instead):
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
df['moving average'] = df['prec'].rolling(3).mean()
df['max of moving average'] = df['moving average'].rolling(3).max()
yields
In [32]: df
Out[32]:
year month day prec moving average max of moving average
0 1981 1 1 1.5 NaN NaN
1 1981 1 2 0.0 NaN NaN
2 1981 1 3 0.0 5.000000e-01 NaN
3 1981 1 4 0.4 1.333333e-01 NaN
4 1981 1 5 0.0 1.333333e-01 0.500000
5 1981 1 6 1.0 4.666667e-01 0.466667
6 1981 1 7 1.9 9.666667e-01 0.966667
7 1981 1 8 0.6 1.166667e+00 1.166667
8 1981 1 9 3.7 2.066667e+00 2.066667
9 1981 1 10 0.0 1.433333e+00 2.066667
10 1981 1 11 0.0 1.233333e+00 2.066667
11 1981 1 12 0.0 1.480297e-16 1.433333
12 1981 1 13 0.0 1.480297e-16 1.233333
13 1981 1 14 12.2 4.066667e+00 4.066667
14 1981 1 15 1.7 4.633333e+00 4.633333
15 1981 1 16 0.6 4.833333e+00 4.833333
16 1981 1 17 0.9 1.066667e+00 4.833333
17 1981 1 18 0.6 7.000000e-01 4.833333
18 1981 1 19 0.4 6.333333e-01 1.066667
19 1981 1 20 0.2 4.000000e-01 0.700000
20 1981 1 21 1.4 6.666667e-01 0.666667
21 1981 1 22 3.2 1.600000e+00 1.600000
22 1981 1 23 0.0 1.533333e+00 1.600000
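To get the date at which the moving average peaks, one option is to build a DatetimeIndex from the year, month and day columns and then use idxmax. A minimal sketch on the first six rows of the question's data (the column name "ma3" is chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [1981] * 6,
    "month": [1] * 6,
    "day": [1, 2, 3, 4, 5, 6],
    "prec": [1.5, 0.0, 0.0, 0.4, 0.0, 1.0],
})

# pd.to_datetime can assemble dates from columns named year, month and day.
df.index = pd.to_datetime(df[["year", "month", "day"]])

# 3-day moving average; the first two rows are NaN (window not yet full).
df["ma3"] = df["prec"].rolling(3).mean()

peak_date = df["ma3"].idxmax()   # index label (date) of the largest average
peak_value = df["ma3"].max()
print(peak_date, peak_value)
```

Filtering with df[df["ma3"] == df["ma3"].max()] would instead return the full row or rows, including year, month, day and prec.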