I have an Excel file in O365 that has been shared, and I want to import it into pandas, but every attempt has failed. The file is shared so that anyone with the link can edit it. I've gone through many related posts but keep getting different error messages. The link to the shared Excel file is in the code below. How can I import it? Thanks in advance!
import urllib.request
url = 'https://botizrt-my.sharepoint.com/:x:/g/personal/bauko_botizrt_onmicrosoft_com/ETu8P8uY5EJChvYhi4tMaqABnOLyU97Ijw2v-pEmM-jXaA'
urllib.request.urlretrieve(url, "test.xlsx")
You have to append &download=1 to your url:
import pandas as pd
import requests
from io import BytesIO
url = 'https://botizrt-my.sharepoint.com/:x:/g/personal/bauko_botizrt_onmicrosoft_com/ETu8P8uY5EJChvYhi4tMaqABnOLyU97Ijw2v-pEmM-jXaA?rtime=GhHyWm4O20g&download=1'
response = requests.get(url)
with BytesIO() as buf:
    buf.write(response.content)
    df = pd.read_excel(buf)
Output:
>>> df
date tavg tmin tmax prcp snow wdir wspd wpgt pres tsun
0 2023-01-01 9.1 5.6 13.5 0.0 NaN 152 12.6 25.9 1029.6 NaN
1 2023-01-02 7.2 1.9 12.9 0.0 NaN 148 10.2 27.8 1027.4 NaN
2 2023-01-03 5.2 1.8 6.7 0.3 NaN 207 5.1 20.4 1027.7 NaN
3 2023-01-04 5.5 3.4 6.9 0.4 NaN 213 7.5 25.9 1029.5 NaN
4 2023-01-05 5.9 3.4 8.6 0.7 NaN 200 17.6 35.2 1019.1 NaN
5 2023-01-06 6.2 2.7 9.0 0.0 NaN 241 13.7 35.2 1022.7 NaN
6 2023-01-07 5.6 1.2 11.1 0.0 NaN 145 7.8 27.8 1023.0 NaN
7 2023-01-08 4.8 -0.5 8.5 0.0 NaN 156 12.0 27.8 1018.6 NaN
8 2023-01-09 7.9 6.5 9.4 3.4 NaN 146 14.7 33.3 1009.3 NaN
9 2023-01-10 6.7 5.2 7.8 5.8 NaN 76 13.7 42.6 1008.0 NaN
10 2023-01-11 5.6 2.8 8.5 2.1 NaN 347 14.4 40.8 1017.8 NaN
11 2023-01-12 5.0 1.5 8.9 0.0 NaN 203 7.7 22.2 1022.4 NaN
12 2023-01-13 4.1 1.6 5.7 0.0 NaN 148 9.8 27.8 1021.7 NaN
13 2023-01-14 3.3 1.0 5.4 2.9 NaN 211 5.6 18.5 1022.8 NaN
14 2023-01-15 5.0 3.6 8.5 0.0 NaN 158 12.0 31.5 1018.0 NaN
15 2023-01-16 5.7 3.7 7.6 2.4 NaN 156 15.0 33.3 1004.5 NaN
16 2023-01-17 8.1 5.5 10.6 11.1 NaN 152 18.9 44.5 995.2 NaN
17 2023-01-18 9.7 6.8 12.0 3.4 NaN 159 19.5 44.5 995.5 NaN
18 2023-01-19 7.8 3.2 11.8 10.5 NaN 182 15.8 40.8 1002.4 NaN
19 2023-01-20 2.5 0.2 4.0 14.1 NaN 312 15.2 35.2 1007.3 NaN
20 2023-01-21 2.4 0.9 3.9 0.7 NaN 35 11.8 25.9 1013.5 NaN
21 2023-01-22 5.4 1.0 11.0 0.3 NaN 10 9.6 22.2 1020.7 NaN
22 2023-01-23 6.5 2.2 12.0 0.0 NaN 10 10.4 24.1 1025.9 NaN
23 2023-01-24 3.7 2.7 5.4 0.0 NaN 254 6.9 20.4 1031.5 NaN
24 2023-01-25 2.6 1.8 3.5 0.0 NaN 287 6.2 22.2 1030.8 NaN
25 2023-01-26 2.0 0.9 3.5 2.1 NaN 41 7.1 24.1 1021.5 NaN
26 2023-01-27 1.0 -0.4 2.6 0.5 NaN 348 14.7 24.1 1014.6 NaN
27 2023-01-28 1.9 0.9 3.2 1.7 NaN 342 15.1 25.9 1018.0 NaN
28 2023-01-29 0.7 -0.9 1.7 0.0 NaN 328 13.1 25.9 1024.5 NaN
29 2023-01-30 -0.2 -4.2 2.5 0.4 NaN 207 14.6 33.3 1018.7 NaN
30 2023-01-31 2.0 -2.4 6.3 0.3 NaN 260 11.8 33.3 1016.1 NaN
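If a link may or may not already carry a query string, the download=1 parameter can be added with the standard library instead of by hand. This is only a minimal sketch: the helper name with_download_flag and the example.com URL are made up for illustration, and the claim that a SharePoint link honors download=1 comes from the answer above and is not verified here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_download_flag(url):
    """Return url with download=1 set, keeping any existing query parameters."""
    parts = urlsplit(url)
    # Parse the existing query so download=1 is merged in rather than duplicated.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["download"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical link, used only to show the transformation.
print(with_download_flag("https://example.com/:x:/g/file?rtime=abc"))
```

The returned URL can then be passed to requests.get, and the response content wrapped in BytesIO for pd.read_excel, as in the answer above.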
Hi there. I was wondering whether someone could help me combine plots generated with the example from the Depth Profiling Visualization page. I have analyzed salinity and depth data, with a categorical variable that divides three estuaries by whether the mouth is "closed", "open", or "semi-closed". I used the Depth Profiling Visualization code, but each plot gets its own salinity legend scale.
Here is the data.
State Distance Depth pH DO Chla Salinity Max.depth
1 Closed 0.60 0.0 8.66 10.64 0.8880000 18.49 -1.3
2 Closed 0.60 0.5 8.68 10.79 1.4800000 18.51 -1.3
3 Closed 0.60 1.3 8.73 11.26 1.1840000 18.51 -1.3
4 Closed 1.00 0.0 8.48 9.07 5.3280000 18.18 -0.8
5 Closed 1.00 0.8 8.47 8.30 2.9600000 18.35 -0.8
6 Closed 1.60 0.0 8.38 9.70 1.1840000 18.38 -2.0
7 Closed 1.60 0.5 8.40 9.33 NA 18.39 -2.0
8 Closed 1.60 1.0 8.40 9.27 1.1840000 18.39 -2.0
9 Closed 1.60 1.5 8.41 9.27 NA 18.41 -2.0
10 Closed 1.60 2.0 8.47 9.23 1.4800000 18.57 -2.0
11 Closed 2.15 0.0 8.40 9.85 2.6640000 18.26 -3.5
12 Closed 2.15 0.5 8.41 9.95 NA 18.27 -3.5
13 Closed 2.15 1.0 8.42 9.16 1.1840000 18.28 -3.5
14 Closed 2.15 2.0 8.42 9.82 NA 18.28 -3.5
15 Closed 2.15 3.5 8.38 9.17 0.5920000 18.30 -3.5
16 Closed 3.50 0.0 8.30 9.82 2.0720000 17.71 -5.0
17 Closed 3.50 0.5 8.31 9.78 NA 17.71 -5.0
18 Closed 3.50 1.0 8.32 9.75 1.4800000 17.72 -5.0
19 Closed 3.50 2.0 8.32 9.73 NA 17.78 -5.0
20 Closed 3.50 3.0 8.30 9.20 NA 17.95 -5.0
21 Closed 3.50 4.0 8.29 8.80 NA 18.00 -5.0
22 Closed 3.50 5.0 8.24 7.47 1.4800000 18.06 -5.0
23 Closed 4.85 0.0 8.21 10.10 2.9600000 17.33 -1.6
24 Closed 4.85 0.5 8.21 9.90 2.0720000 17.33 -1.6
25 Closed 4.85 1.0 8.21 9.73 NA 17.32 -1.6
26 Closed 4.85 1.6 8.22 9.60 1.1840000 17.32 -1.6
27 Closed 6.00 0.0 8.07 9.07 4.4400000 16.65 -1.5
28 Closed 6.00 0.5 8.06 8.98 5.6240000 16.65 -1.5
29 Closed 6.00 1.0 8.06 8.81 NA 16.67 -1.5
30 Closed 6.00 1.5 8.10 8.80 4.1440000 16.67 -1.5
31 Closed 6.70 0.0 7.83 9.25 0.0000000 13.90 -0.5
32 Open 0.60 0.0 7.56 8.42 1.1840000 1.62 -0.5
33 Open 0.60 0.5 7.62 8.40 1.9733333 1.79 -0.5
34 Open 1.00 0.0 7.67 8.55 1.1840000 1.10 -0.4
35 Open 1.00 0.4 7.62 8.49 1.5786667 1.10 -0.4
36 Open 1.60 0.0 7.48 8.40 1.5786667 0.98 -1.0
37 Open 1.60 0.5 7.47 8.33 NA 0.98 -1.0
38 Open 1.60 1.0 7.45 8.33 2.7626667 0.99 -1.0
39 Open 2.15 0.0 7.19 7.99 1.1840000 0.85 -1.0
40 Open 2.15 0.5 7.19 7.96 NA 0.86 -1.0
41 Open 2.15 1.0 7.18 7.98 1.1840000 0.89 -1.0
42 Open 3.50 0.0 7.14 7.56 0.3946667 0.55 -4.8
43 Open 3.50 0.5 7.20 7.50 NA 0.55 -4.8
44 Open 3.50 1.0 7.28 7.38 1.9733333 0.55 -4.8
45 Open 3.50 2.0 7.38 7.10 NA 0.55 -4.8
46 Open 3.50 3.0 7.56 6.15 NA 0.56 -4.8
47 Open 3.50 4.0 7.20 4.43 NA 2.65 -4.8
48 Open 3.50 4.8 6.93 2.25 1.9733333 6.76 -4.8
49 Open 4.85 0.0 6.90 8.29 1.1840000 0.26 -1.2
50 Open 4.85 0.5 6.77 8.20 0.7893333 0.27 -1.2
51 Open 4.85 1.2 6.55 8.20 0.7893333 0.39 -1.2
52 Open 6.00 0.0 6.49 9.53 1.1840000 0.13 -1.0
53 Open 6.00 0.5 6.59 9.53 NA 0.13 -1.0
54 Open 6.00 1.0 6.79 9.53 1.1840000 0.13 -1.0
55 Open 6.70 0.0 6.48 9.48 0.7893333 0.11 -0.5
56 Semi-closed 0.60 0.0 8.05 6.30 19.7300000 18.86 -1.4
57 Semi-closed 0.60 0.5 8.04 5.19 19.7300000 24.07 -1.4
58 Semi-closed 0.60 1.0 8.00 5.98 NA 30.50 -1.4
59 Semi-closed 0.60 1.4 7.87 6.19 5.1300000 31.18 -1.4
60 Semi-closed 1.00 0.0 7.99 5.75 22.8900000 18.81 -0.9
61 Semi-closed 1.00 0.5 7.95 5.10 NA 19.08 -0.9
62 Semi-closed 1.00 0.9 7.86 3.42 11.8400000 26.60 -0.9
63 Semi-closed 1.60 0.0 7.88 6.05 11.4500000 17.29 -1.7
64 Semi-closed 1.60 0.5 7.87 5.78 NA 17.32 -1.7
65 Semi-closed 1.60 1.0 7.86 4.74 8.6800000 17.44 -1.7
66 Semi-closed 1.60 1.5 7.84 3.90 NA 19.65 -1.7
67 Semi-closed 1.60 1.7 7.91 3.75 9.0800000 21.07 -1.7
68 Semi-closed 2.15 0.0 7.91 6.95 22.8900000 16.50 -1.3
69 Semi-closed 2.15 0.5 7.92 6.76 26.4400000 16.50 -1.3
70 Semi-closed 2.15 1.0 7.91 5.99 NA 17.40 -1.3
71 Semi-closed 2.15 1.3 7.97 4.10 7.1000000 18.79 -1.3
72 Semi-closed 3.50 0.0 7.75 6.13 18.5500000 15.86 -4.5
73 Semi-closed 3.50 0.5 7.72 5.90 NA 15.86 -4.5
74 Semi-closed 3.50 1.0 7.65 4.38 9.0800000 16.38 -4.5
75 Semi-closed 3.50 1.5 7.56 1.59 NA 20.09 -4.5
76 Semi-closed 3.50 2.0 7.55 0.38 NA 22.11 -4.5
77 Semi-closed 3.50 3.0 7.53 0.42 NA 30.36 -4.5
78 Semi-closed 3.50 4.0 7.52 0.52 NA 31.50 -4.5
79 Semi-closed 3.50 4.5 7.54 0.68 1.1800000 31.84 -4.5
80 Semi-closed 4.85 0.0 7.66 6.31 21.7100000 15.41 -1.6
81 Semi-closed 4.85 0.5 7.65 6.18 NA 15.44 -1.6
82 Semi-closed 4.85 1.0 7.65 5.57 21.3100000 15.54 -1.6
83 Semi-closed 4.85 1.6 7.52 0.76 6.7100000 22.60 -1.6
84 Semi-closed 6.00 0.0 7.74 8.50 87.6200000 13.11 -1.0
85 Semi-closed 6.00 0.5 7.66 7.38 NA 13.92 -1.0
86 Semi-closed 6.00 1.0 7.60 3.20 7.5000000 15.42 -1.0
87 Semi-closed 6.70 0.0 8.55 6.94 0.0000000 0.25 -0.5
I was hoping someone could help me unify the scales of the three legends (one per mouth condition) so that a single salinity legend describes all the plots.
I have a NumPy array with a shape of 178 rows x 14 columns, like this:
0 1 2 3 4 5 6 7 8 9 10 11 \
0 1.0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04
1 1.0 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05
2 1.0 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03
3 1.0 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86
4 1.0 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04
.. ... ... ... ... ... ... ... ... ... ... ... ...
173 3.0 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64
174 3.0 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70
175 3.0 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59
176 3.0 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60
177 3.0 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61
12 13
0 3.92 1065.0
1 3.40 1050.0
2 3.17 1185.0
3 3.45 1480.0
4 2.93 735.0
.. ... ...
173 1.74 740.0
174 1.56 750.0
175 1.56 835.0
176 1.62 840.0
177 1.60 560.0
[178 rows x 14 columns]
Printing all the rows of only the first column (index 0) as a dataframe worked, and the output looked like this:
0
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
.. ...
173 3.0
174 3.0
175 3.0
176 3.0
177 3.0
[178 rows x 1 columns]
Using the same logic, I want to select the rows whose value in the first column is 2 or below. I tried it like this, but it doesn't work:
reduced = data[data[:,0:1]<=2]
I got an IndexError like this:
IndexError Traceback (most recent call last)
<ipython-input-159-7eab0abd8f99> in <module>()
----> 1 reduced = data[data[:,0:1]<=2]
IndexError: boolean index did not match indexed array along dimension 1; dimension is 14 but corresponding boolean dimension is 1.
Can anybody help me?
Thanks in advance.
Solved it.
It is as simple as converting the NumPy array to a dataframe and then selecting rows based on a condition in the dataframe:
reduced = data[data['class'] <= 2]
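The same selection also works on the NumPy array directly. The error comes from data[:, 0:1], which keeps a second dimension of length 1, so the mask has shape (178, 1) and is matched against both axes of data. Indexing the column with a plain integer gives a 1-D mask with one entry per row. A small sketch, using the first three columns of four rows from the question as a stand-in for the full array:

```python
import numpy as np

# Stand-in for the 178 x 14 array; the first column holds the class label.
data = np.array([
    [1.0, 14.23, 1.71],
    [1.0, 13.20, 1.78],
    [3.0, 13.71, 5.65],
    [3.0, 13.40, 3.91],
])

mask = data[:, 0] <= 2   # 1-D boolean array, one entry per row
reduced = data[mask]     # keeps the whole rows where the mask is True
print(reduced)
```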
I have a dataframe that looks like this:
df_segments =
id seg_length
15 000b994d-1a6b-4698-a270-b0f671b1e612 16.3
11 000b994d-1a6b-4698-a270-b0f671b1e612 1.1
3 000b994d-1a6b-4698-a270-b0f671b1e612 1.1
7 000b994d-1a6b-4698-a270-b0f671b1e612 16.3
31 016490a8-8740-4205-bfe4-c9fe45e853d3 1.0
27 016490a8-8740-4205-bfe4-c9fe45e853d3 1.4
19 016490a8-8740-4205-bfe4-c9fe45e853d3 1.4
23 016490a8-8740-4205-bfe4-c9fe45e853d3 1.0
39 05290fe1-ead2-462b-bbec-a7669eed7883 1.1
35 05290fe1-ead2-462b-bbec-a7669eed7883 1.4
47 05290fe1-ead2-462b-bbec-a7669eed7883 1.1
43 05290fe1-ead2-462b-bbec-a7669eed7883 1.4
63 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.1
59 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.4
51 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.4
55 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.1
71 05577c2e-da7d-4753-bba6-66762385e159 1.0
67 05577c2e-da7d-4753-bba6-66762385e159 5.4
79 05577c2e-da7d-4753-bba6-66762385e159 1.0
75 05577c2e-da7d-4753-bba6-66762385e159 5.4
1475 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
1479 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
1487 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
1483 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
2287 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
2283 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
2279 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
2275 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.5
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
Every id has four rows. For most ids, dropping duplicates results in two rows. But for a few ids, one of two things can happen:
Either all rows are equal, in which case drop_duplicates() results in a single row.
Or drop_duplicates() results in three or four rows because the seg_length values differ.
However, all seg_length values are the side lengths of rectangles (or shapes very close to rectangles) and squares. So I would like to do the following:
A. If all rows by id have the same seg_length value, keep two rows.
B. Replace the two largest (resp. smallest) values (by id) with their average. In other words:
df_segments =
id seg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.5
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
would become:
df_segments =
id seg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
Alternatively, use min/max if it is easier:
df_segments =
id seg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
I have tried to use np.where and define conditions but without any luck. I also tried to create a separate dataframe with the ids whose count was not 2 after dropping duplicates from the original dataframe, df_segments but I ended up in the same situation.
If anyone has an idea, I would be thankful for insights.
If I understand correctly, you want to average the values two by two within each id. This also does what you want when all four values are the same.
>>> averages = df.groupby('id')['seg_length'].apply(
... lambda s: s.sort_values().groupby([0, 0, 1, 1]).mean()
... )
>>> averages
id
000b994d-1a6b-4698-a270-b0f671b1e612 0 1.10
1 16.30
016490a8-8740-4205-bfe4-c9fe45e853d3 0 1.00
1 1.40
05290fe1-ead2-462b-bbec-a7669eed7883 0 1.10
1 1.40
0537a9e3-09c4-459c-a6e4-25694cfbacbd 0 1.10
1 1.40
05577c2e-da7d-4753-bba6-66762385e159 0 1.00
1 5.40
5a104c86-327e-466f-b14a-6953cacddcbb 0 0.50
1 0.50
8e853797-a7f3-4605-8848-f6b211f9b055 0 2.10
1 2.10
c1120018-c626-4c1b-81a5-476ce38f346b 0 0.55
1 1.20
Name: seg_length, dtype: float64
If you want to keep the original shape, you can use transform (on both groupbys):
>>> replaced_seglengths = df.groupby('id')['seg_length'].transform(
... lambda s: s.sort_values().groupby([0, 0, 1, 1]).transform('mean')
... )
>>> replaced_seglengths
15 1.10
11 1.10
3 16.30
7 16.30
31 1.00
27 1.00
19 1.40
23 1.40
39 1.10
35 1.10
47 1.40
43 1.40
63 1.10
59 1.10
51 1.40
55 1.40
71 1.00
67 1.00
79 5.40
75 5.40
1475 0.50
1479 0.50
1487 0.50
1483 0.50
2287 2.10
2283 2.10
2279 2.10
2275 2.10
3351 0.55
3347 0.55
3359 1.20
3355 1.20
Finally replace the column in the dataframe:
>>> df['seg_length'] = replaced_seglengths
>>> df
id seg_length
15 000b994d-1a6b-4698-a270-b0f671b1e612 1.10
11 000b994d-1a6b-4698-a270-b0f671b1e612 1.10
3 000b994d-1a6b-4698-a270-b0f671b1e612 16.30
7 000b994d-1a6b-4698-a270-b0f671b1e612 16.30
31 016490a8-8740-4205-bfe4-c9fe45e853d3 1.00
27 016490a8-8740-4205-bfe4-c9fe45e853d3 1.00
19 016490a8-8740-4205-bfe4-c9fe45e853d3 1.40
23 016490a8-8740-4205-bfe4-c9fe45e853d3 1.40
39 05290fe1-ead2-462b-bbec-a7669eed7883 1.10
35 05290fe1-ead2-462b-bbec-a7669eed7883 1.10
47 05290fe1-ead2-462b-bbec-a7669eed7883 1.40
43 05290fe1-ead2-462b-bbec-a7669eed7883 1.40
63 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.10
59 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.10
51 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.40
55 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.40
71 05577c2e-da7d-4753-bba6-66762385e159 1.00
67 05577c2e-da7d-4753-bba6-66762385e159 1.00
79 05577c2e-da7d-4753-bba6-66762385e159 5.40
75 05577c2e-da7d-4753-bba6-66762385e159 5.40
1475 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
1479 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
1487 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
1483 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
2287 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
2283 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
2279 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
2275 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3347 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3359 c1120018-c626-4c1b-81a5-476ce38f346b 1.20
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.20
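If each averaged value should stay on the row it came from, as in the desired output in the question, one option is to map the pair means back through the sort order. This is a sketch that assumes exactly four rows per id; the function name average_pairs and the short ids "a" and "b" are placeholders, while the seg_length values and index labels are taken from the question.

```python
import numpy as np
import pandas as pd


def average_pairs(s):
    """Replace the two smallest and the two largest values by their pair means."""
    values = s.to_numpy()
    order = np.argsort(values, kind="stable")   # positions in ascending order
    pair_means = values[order].reshape(2, 2).mean(axis=1).repeat(2)
    out = np.empty_like(pair_means)
    out[order] = pair_means                     # undo the sort
    return pd.Series(out, index=s.index)


df = pd.DataFrame(
    {"id": ["a"] * 4 + ["b"] * 4,
     "seg_length": [0.6, 1.2, 0.5, 1.2, 2.1, 2.1, 2.1, 2.1]},
    index=[3351, 3347, 3359, 3355, 2287, 2283, 2279, 2275],
)
df["seg_length"] = df.groupby("id")["seg_length"].transform(average_pairs)
print(df)
```

Because the result is built with the group's original index, the assignment back to the column lines up row by row.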
Use np.select([conditions], [solutions]).
Conditions
condition1=df2.groupby('id')['seg_length'].apply(lambda x:x.duplicated(keep=False))
condition2=df2.groupby('id')['seg_length'].apply(lambda x:~x.duplicated(keep=False))
Solutions
sol1=df2['seg_length']
sol2=(df2.loc[condition2,'seg_length'].sum(0))/2
df2['newseg_length']=np.select([condition1,condition2],[sol1,sol2])
id seg_length newseg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6 0.55
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2 1.20
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.5 0.55
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2 1.20
I have a df called "YMp" and have made a boxplot of the data for the years 1991 - 2019, but I need to show the current year's (2020) values over-plotted on the boxplot as colored points or values, with a legend entry for 2020.
The data looks like this -
month 01 02 03 04 05 06 07 08 09 10 11 12
year
1991 -4.9 12.2 -11.1 -18.0 -27.5 1.7 7.4 22.7 38.3 4.2 -0.9 5.3
1992 -10.9 -17.1 -7.7 14.8 14.8 -9.6 17.0 24.7 32.3 0.3 -21.6 15.3
1993 -1.8 -2.3 -3.8 0.4 -4.8 -7.7 11.7 26.3 17.1 2.6 4.4 2.4
1994 2.6 2.5 -6.2 -3.2 2.2 -3.0 13.8 3.9 30.4 -25.7 -1.8 -2.2
1995 -8.6 -3.3 -18.4 -14.0 -19.3 13.2 9.8 -23.2 16.0 -15.2 0.6 -8.5
1996 -5.5 -10.4 -0.3 7.2 13.0 3.6 5.2 1.4 -10.3 -2.9 15.4 -0.6
1997 -11.1 8.9 -1.1 -12.7 3.0 -4.0 27.1 32.6 -4.5 -15.5 -5.5 -20.9
1998 -22.0 -16.6 3.2 0.7 15.4 16.0 18.5 2.7 -32.3 16.3 -5.4 12.9
1999 -1.0 0.3 -8.5 9.9 7.4 -2.1 10.9 -5.5 18.5 17.4 17.5 11.1
2000 5.4 12.9 -24.8 15.7 -9.3 20.7 18.2 23.2 16.6 26.8 -17.7 17.3
2001 -3.9 14.5 -4.7 18.6 5.6 22.4 -3.3 18.2 5.3 31.2 6.0 -4.0
2002 -9.0 19.5 12.5 24.5 27.6 -9.3 3.7 13.7 -32.7 -19.5 0.7 -6.1
2003 23.6 -11.7 -16.5 -2.1 6.5 -13.7 0.4 8.0 -13.7 -16.1 7.3 13.1
2004 6.6 4.7 36.8 12.8 29.5 6.4 -12.2 -0.6 -7.7 -15.2 -1.1 12.7
2005 6.3 1.1 -14.6 9.4 -7.5 6.1 -9.2 -1.3 36.1 -4.9 10.8 -11.7
2006 7.3 8.3 1.7 11.8 -14.7 33.3 9.1 -0.0 3.0 1.4 -2.8 8.8
2007 5.4 0.2 7.2 -3.9 6.6 -8.3 -28.2 -7.6 3.3 -7.4 25.0 -7.3
2008 5.0 -5.6 7.6 -0.4 -1.2 13.9 -11.3 -29.7 16.7 43.1 2.4 3.5
2009 -2.2 17.1 9.8 8.9 -9.2 -14.4 6.1 21.7 -0.2 -26.7 -9.1 -18.2
2010 -2.6 -12.1 0.8 -16.5 4.1 3.9 -21.5 -3.3 -18.9 22.8 -6.5 -5.3
2011 -12.4 -3.8 1.2 -14.9 -2.0 6.8 -12.6 -16.9 8.3 10.7 -0.7 4.6
2012 0.5 -3.0 -1.0 -6.5 7.5 -17.9 -4.3 -26.3 -2.6 3.0 12.3 -15.3
2013 -1.7 -15.1 18.8 -8.3 7.5 -4.5 -19.3 0.9 -33.9 -10.6 -0.4 4.4
2014 7.2 -20.0 -8.4 2.0 10.1 -20.2 7.8 -14.9 -11.4 -6.9 -0.3 6.4
2015 18.4 6.2 10.5 -16.5 -11.9 7.0 -7.3 -6.7 -20.8 -13.9 -3.3 -14.8
2016 -11.3 28.5 -9.2 -4.2 -9.7 1.0 -5.1 -18.9 -3.3 19.1 -1.1 10.1
2017 -8.6 -8.1 21.2 4.5 -21.2 -28.5 -6.8 -30.8 -19.7 13.3 7.2 9.9
2018 26.9 3.1 -7.1 -3.4 -8.7 -15.5 12.8 3.9 -16.4 -7.9 -25.7 -9.2
2019 2.1 -17.3 10.2 1.0 -13.5 -3.4 -14.0 -20.7 -1.5 -28.4 -5.4 -13.9
The current year 2020 data looks like this -
2020 0.3 6.2 2.0 -17.9 -0.4 6.0 -24.5 2.5 -12.1 4.6 NaN NaN
My current boxplot does not yet show the 2020 data plotted or highlighted. Thank you for any ideas on how to do this.
Try catching the axis instance and plot again:
ax = df.boxplot()
ax.scatter(np.arange(df.shape[1])+1, df.loc[2000], color='r')
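Applied to the question, the same idea might look like the sketch below, which draws the boxes from the historical years and overlays the 2020 row with a legend entry. It uses only the first three years from the question to stay short, and the variable names history and current are assumptions, as is the Agg backend, which is chosen only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import numpy as np
import pandas as pd

months = [f"{m:02d}" for m in range(1, 13)]

# A few rows copied from the question; the real frame holds 1991 - 2019.
history = pd.DataFrame(
    [[-4.9, 12.2, -11.1, -18.0, -27.5, 1.7, 7.4, 22.7, 38.3, 4.2, -0.9, 5.3],
     [-10.9, -17.1, -7.7, 14.8, 14.8, -9.6, 17.0, 24.7, 32.3, 0.3, -21.6, 15.3],
     [-1.8, -2.3, -3.8, 0.4, -4.8, -7.7, 11.7, 26.3, 17.1, 2.6, 4.4, 2.4]],
    index=[1991, 1992, 1993],
    columns=months,
)
current = pd.Series(
    [0.3, 6.2, 2.0, -17.9, -0.4, 6.0, -24.5, 2.5, -12.1, 4.6, np.nan, np.nan],
    index=months,
    name=2020,
)

ax = history.boxplot()
# Boxes sit at x = 1..12, so the points use the same positions.
ax.scatter(np.arange(1, len(months) + 1), current.to_numpy(),
           color="r", zorder=3, label="2020")
ax.legend()
ax.figure.savefig("boxplot_2020.png")
```

The two NaN months of 2020 are simply not drawn, so the partial year needs no special handling.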
I'm trying to convert an entire column of date values in a Pandas dataframe into a column of day numbers running from 1 to the last day of the month.
The code has to handle columns of 28, 29, 30 or 31 values, depending on the month concerned.
So my df:
DAY TX TN
0 20190201 4.9 -0.6
1 20190202 2.7 0.0
2 20190203 4.6 -0.3
3 20190204 2.9 -0.5
4 20190205 6.2 1.3
5 20190206 7.5 2.4
6 20190207 8.6 4.6
7 20190208 8.6 5.0
8 20190209 9.2 6.7
9 20190210 9.1 3.8
10 20190211 6.9 0.7
11 20190212 7.0 -0.5
12 20190213 7.8 -0.5
13 20190214 13.4 0.0
14 20190215 16.4 2.0
15 20190216 14.8 2.0
16 20190217 15.7 1.2
17 20190218 15.4 1.2
18 20190219 9.8 4.3
19 20190220 11.1 2.8
20 20190221 13.1 5.8
21 20190222 10.7 4.1
22 20190223 12.9 1.5
23 20190224 14.5 1.2
24 20190225 16.1 2.2
25 20190226 17.2 0.3
26 20190227 19.3 1.1
27 20190228 11.3 5.1
should become
DAY TX TN
0 1 4.9 -0.6
1 2 2.7 0.0
2 3 4.6 -0.3
3 4 2.9 -0.5
4 5 6.2 1.3
5 6 7.5 2.4
6 7 8.6 4.6
7 8 8.6 5.0
8 9 9.2 6.7
9 10 9.1 3.8
10 11 6.9 0.7
11 12 7.0 -0.5
12 13 7.8 -0.5
13 14 13.4 0.0
14 15 16.4 2.0
15 16 14.8 2.0
16 17 15.7 1.2
17 18 15.4 1.2
18 19 9.8 4.3
19 20 11.1 2.8
20 21 13.1 5.8
21 22 10.7 4.1
22 23 12.9 1.5
23 24 14.5 1.2
24 25 16.1 2.2
25 26 17.2 0.3
26 27 19.3 1.1
27 28 11.3 5.1
I have to process each value of this column so that I can also check that no day is missing, and so that the generated numbers adapt to each monthly df I provide.
I searched in the Pandas documentation for an instruction that could help but I didn't find it.
Any help would be appreciated.
Use to_datetime with Series.dt.day:
df['DAY'] = pd.to_datetime(df['DAY'], format='%Y%m%d').dt.day
Another solution is to cast the values to strings, take the last two characters by indexing, and cast them to integers:
df['DAY'] = df['DAY'].astype(str).str[-2:].astype(int)
print (df)
DAY TX TN
0 1 4.9 -0.6
1 2 2.7 0.0
2 3 4.6 -0.3
3 4 2.9 -0.5
4 5 6.2 1.3
5 6 7.5 2.4
6 7 8.6 4.6
7 8 8.6 5.0
8 9 9.2 6.7
9 10 9.1 3.8
10 11 6.9 0.7
11 12 7.0 -0.5
12 13 7.8 -0.5
13 14 13.4 0.0
14 15 16.4 2.0
15 16 14.8 2.0
16 17 15.7 1.2
17 18 15.4 1.2
18 19 9.8 4.3
19 20 11.1 2.8
20 21 13.1 5.8
21 22 10.7 4.1
22 23 12.9 1.5
23 24 14.5 1.2
24 25 16.1 2.2
25 26 17.2 0.3
26 27 19.3 1.1
27 28 11.3 5.1
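The question also asks for a check that no day is missing. One way to sketch that is to compare the extracted day numbers with the month length reported by Series.dt.days_in_month. The example frame below is invented for illustration and deliberately incomplete, and the snippet assumes that each df covers a single month.

```python
import pandas as pd

# Illustrative frame: 3 February and every day after the 4th are left out on purpose.
df = pd.DataFrame({"DAY": [20190201, 20190202, 20190204]})

dates = pd.to_datetime(df["DAY"].astype(str), format="%Y%m%d")
df["DAY"] = dates.dt.day

# Month length (28, 29, 30 or 31) taken from the first date in the frame.
n_days = int(dates.dt.days_in_month.iloc[0])
expected = set(range(1, n_days + 1))
missing = sorted(expected - set(df["DAY"].tolist()))
print(n_days, missing)
```

An empty missing list means every day of the month is present.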
You can just slice the column to get the last 2 digits and cast to int:
In[85]:
df['DAY'] = df['DAY'].str[-2:].astype(int)
df
Out[85]:
DAY TX TN
0 1 4.9 -0.6
1 2 2.7 0.0
2 3 4.6 -0.3
3 4 2.9 -0.5
4 5 6.2 1.3
5 6 7.5 2.4
6 7 8.6 4.6
7 8 8.6 5.0
8 9 9.2 6.7
9 10 9.1 3.8
10 11 6.9 0.7
11 12 7.0 -0.5
12 13 7.8 -0.5
13 14 13.4 0.0
14 15 16.4 2.0
15 16 14.8 2.0
16 17 15.7 1.2
17 18 15.4 1.2
18 19 9.8 4.3
19 20 11.1 2.8
20 21 13.1 5.8
21 22 10.7 4.1
22 23 12.9 1.5
23 24 14.5 1.2
24 25 16.1 2.2
25 26 17.2 0.3
26 27 19.3 1.1
27 28 11.3 5.1
If the dtype is int already then you just need to cast to str first:
df['DAY'] = df['DAY'].astype(str).str[-2:].astype(int)