Using .loc to categorize continuous data for range of values

0, 10.65
1, 15.27
2, 15.96
3, 13.49
4, 12.69
5, 7.90
6, 15.96
7, 18.64
8, 21.28
9, 12.69
10, 14.65
11, 12.69
12, 13.49
13, 9.91
14, 10.65
15, 16.29
The code I wrote is:
data2.loc[data2['int_rate'] <= 8.00, 'int_rate'] = "low"
data2.loc[8.00 < data2['int_rate'] <= 30.00, 'int_rate'] = "medium"
data2.loc[15.00 < data2['int_rate'] < 30.00, 'int_rate'] = "high"
As a result, every value at or below 8.0 is labeled low, but the other values are unchanged.

The answer to my problem is:
data2['int_rate'] = (pd.cut(data2.int_rate, bins=[0, 8.00, 15.00, 30.00], labels=['low', 'medium', 'high']))
The code above labels values in (0, 8.00] as low, values in (8.00, 15.00] as medium, and values in (15.00, 30.00] as high, because pd.cut closes each bin on the right by default.
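As a rough sketch of why pd.cut is the simpler route, the snippet below labels the first six rates from the question in two ways: with pd.cut, and with explicit boolean masks passed to .loc. It assumes the 8/15/30 bin edges from the answer. The column names category and category_cut are made up for illustration. Note that a chained comparison such as 8.00 < s <= 30.00 is not supported on a Series, so the masks combine two comparisons with &, and the labels go into a new column so the numeric column is never overwritten mid-way.

```python
import pandas as pd

# First six interest rates from the question's sample data
data2 = pd.DataFrame({"int_rate": [10.65, 15.27, 15.96, 13.49, 12.69, 7.90]})
rate = data2["int_rate"]

# Option 1: pd.cut with right-closed bins (0, 8], (8, 15], (15, 30]
data2["category_cut"] = pd.cut(
    rate, bins=[0, 8.00, 15.00, 30.00], labels=["low", "medium", "high"]
)

# Option 2: explicit masks with .loc, written to a separate (hypothetical) column
data2["category"] = ""
data2.loc[rate <= 8.00, "category"] = "low"
data2.loc[(rate > 8.00) & (rate <= 15.00), "category"] = "medium"
data2.loc[(rate > 15.00) & (rate <= 30.00), "category"] = "high"

print(data2)
```

Both columns should agree row by row; pd.cut additionally returns an ordered categorical, which can be convenient for sorting or grouping.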

Related

Selecting the value at a given date for each lat/lon point in xarray

I have an xr.DataArray object that holds a day of 2015 (as a cftime.DatetimeNoLeap object) for each lat-lon point on the grid.
date_matrix2015
<xarray.DataArray (lat: 160, lon: 320)>
array([[cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0)],
[cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0)],
[cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 12, 11, 12, 0, 0, 0)],
...,
[cftime.DatetimeNoLeap(2015, 3, 14, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 3, 14, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 3, 14, 12, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0)],
[cftime.DatetimeNoLeap(2015, 9, 15, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 15, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 15, 12, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 15, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 15, 12, 0, 0, 0)],
[cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0), ...,
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0),
cftime.DatetimeNoLeap(2015, 9, 16, 12, 0, 0, 0)]], dtype=object)
Coordinates:
year int64 2015
* lat (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
* lon (lon) float64 0.0 1.125 2.25 3.375 4.5 ... 355.5 356.6 357.8 358.9
I have another xr.DataArray on the same lat-lon grid for vertical velocity (omega) that has data for every day in 2015. At each lat-lon point I would like to select the velocity value on the corresponding day given in date_matrix2015. Ideally I would like to do something like this:
omega.sel(time=date_matrix2015)
I have tried constructing the new dataarray manually with iteration, but I haven't had much luck.
Does anyone have any ideas? Thank you in advance!
------------EDIT---------------
Here is a minimal reproducible example for the problem. To clarify what I am looking for: I have two DataArrays, one for daily precipitation values, and one for daily omega values. I want to determine for each lat/lon point the day that saw the maximum precipitation (I think I have done this part correctly). From there I want to select at each lat/lon point the omega value that occurred on the day of maximum precipitation. So ultimately I would like to end up with a DataArray of omega values that has two dimensions, lat and lon, where the value at each lat/lon point is the omega value on the day of maximum rainfall at that location.
import numpy as np
import xarray as xr
import pandas as pd
precip = np.abs(8*np.random.randn(10,10,10))
omega = 15*np.random.randn(10,10,10)
lat = np.arange(0,10)
lon = np.arange(0, 10)
# Note: actual data resolution is 160x320
dates = pd.date_range('01-01-2015', '01-10-2015')
precip_da = xr.DataArray(precip).rename({'dim_0':'time', 'dim_1':'lat', 'dim_2':'lon'}).assign_coords({'time':dates, 'lat':lat, 'lon':lon})
omega_da = xr.DataArray(omega).rename({'dim_0':'time', 'dim_1':'lat', 'dim_2':'lon'}).assign_coords({'time':dates, 'lat':lat, 'lon':lon})
#Find Date of maximum precip for each lat lon point and store in an array
maxDateMatrix = precip_da.idxmax(dim='time')
#For each lat lon point, select the value from omega_da on the day of maximum precip (i.e. the date given at that location in the maxDateMatrix)

You can pair da.sel with da.idxmax: idxmax returns the time coordinate of the maximum at each lat/lon point, and passing that DataArray to sel picks the matching value at each point:
In [10]: omega_da.sel(time=precip_da.idxmax(dim='time'))
Out[10]:
<xarray.DataArray (lat: 10, lon: 10)>
array([[ 17.72211193, -16.20781517, 9.65493368, -28.16691093,
18.8756182 , 16.81924325, -20.55251804, -18.36625778,
-19.57938236, -10.77385357],
[ 3.95402784, -5.28478105, -8.6632994 , 2.46787932,
20.53981254, -4.74908659, 9.5274101 , -1.08191372,
9.4637305 , -10.91884369],
[-31.30033085, 6.6284144 , 8.15945444, 5.74849304,
12.49505739, 2.11797825, -18.12861347, 7.27497695,
5.16197504, -32.99882591],
...
[-34.73945635, 24.40515233, 14.56982584, 12.16550083,
-8.3558104 , -20.16328749, -33.89051472, -0.09599935,
2.65689584, 29.54056082],
[-18.8660847 , -7.58120994, 15.57632568, 4.19142695,
8.71046261, 9.05684805, 8.48128361, 0.34166869,
8.41090015, -2.31386572],
[ -4.38999926, 17.00411671, 16.66619606, 24.99390669,
-14.01424591, 19.85606151, -16.87897 , 12.84205521,
-16.78824975, -6.33920671]])
Coordinates:
time (lat, lon) datetime64[ns] 2015-01-01 2015-01-01 ... 2015-01-10
* lat (lat) int64 0 1 2 3 4 5 6 7 8 9
* lon (lon) int64 0 1 2 3 4 5 6 7 8 9
See the great section of the xarray docs on Indexing and Selecting Data for more info, especially the section on Advanced Indexing, which goes into using DataArrays as indexers for powerful reshaping operations.
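To make the pointwise selection concrete, here is a small sketch that repeats the idxmax/sel pairing on a tiny random grid and cross-checks it against plain NumPy indexing. The grid size, the seed, and the names omega_at_max and expected are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", "2015-01-10")
dims = ("time", "lat", "lon")
coords = {"time": dates, "lat": np.arange(4), "lon": np.arange(5)}

# Small synthetic fields standing in for the real precipitation and omega data
precip_da = xr.DataArray(np.abs(rng.normal(size=(10, 4, 5))), coords=coords, dims=dims)
omega_da = xr.DataArray(rng.normal(size=(10, 4, 5)), coords=coords, dims=dims)

# Date of maximum precipitation at each (lat, lon) point
max_dates = precip_da.idxmax(dim="time")

# Use the 2-D DataArray of dates as an indexer: one omega value per point
omega_at_max = omega_da.sel(time=max_dates)

# Cross-check with NumPy: take omega at the argmax position along the time axis
t_idx = precip_da.values.argmax(axis=0)
expected = np.take_along_axis(omega_da.values, t_idx[np.newaxis, ...], axis=0)[0]

print(omega_at_max.dims)
print(np.allclose(omega_at_max.transpose("lat", "lon").values, expected))
```

The NumPy comparison is only there to show what the selection means; in practice the sel call alone is enough.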

Outliers in data

I have a dataset like so -
15643, 14087, 12020, 8402, 7875, 3250, 2688, 2654, 2501, 2482, 1246, 1214, 1171, 1165, 1048, 897, 849, 579, 382, 285, 222, 168, 115, 92, 71, 57, 56, 51, 47, 43, 40, 31, 29, 29, 29, 29, 28, 22, 20, 19, 18, 18, 17, 15, 14, 14, 12, 12, 11, 11, 10, 9, 9, 8, 8, 8, 8, 7, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Based on domain knowledge, I know that only the larger values should be included in our analysis. How do I determine where to cut off? For example, should we exclude values of 15 and lower, or 50 and lower?

You can check the distribution with a quantile function, then remove values below the 1st or 2nd percentile. For example:
import numpy as np
data = np.array(data)
print(np.quantile(data, (.01, .02)))
Another method is to calculate the interquartile range (IQR) and set the floor for the analysis at Q1 - 1.5 * IQR:
Q1, Q3 = np.quantile(data, (0.25, 0.75))
data_floor = Q1 - 1.5 * (Q3 - Q1)
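The two cutoffs can be tried side by side before committing to either. The sketch below uses a short, hand-picked subset of the question's values (an assumption; it is not the full dataset) and the hypothetical names kept_by_percentile and kept_by_iqr. On strongly right-skewed counts like these, the IQR floor can come out negative, in which case it excludes nothing, so it is worth printing the thresholds before filtering.

```python
import numpy as np

# Illustrative subset of the question's values, not the full dataset
data = np.array([15643, 2688, 1246, 285, 92, 43, 29, 18, 11, 8, 5, 3, 2, 1, 1, 1])

# Percentile-based thresholds
p01, p02 = np.quantile(data, (0.01, 0.02))

# IQR-based floor: Q1 - 1.5 * IQR
q1, q3 = np.quantile(data, (0.25, 0.75))
data_floor = q1 - 1.5 * (q3 - q1)

# Keep only values at or above each threshold
kept_by_percentile = data[data >= p02]
kept_by_iqr = data[data >= data_floor]

print("1st/2nd percentile:", p01, p02)
print("IQR floor:", data_floor)
print("kept:", kept_by_percentile.size, kept_by_iqr.size, "of", data.size)
```

If neither threshold removes anything useful, that suggests the cutoff has to come from domain knowledge rather than from these rules alone.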

Increasing the label size in matplotlib in pie chart

I have the following dictionary
{'Electronic Arts': 66,
'GT Interactive': 1,
'Palcom': 1,
'Fox Interactive': 1,
'LucasArts': 5,
'Bethesda Softworks': 9,
'SquareSoft': 3,
'Nintendo': 142,
'Virgin Interactive': 4,
'Atari': 7,
'Ubisoft': 28,
'Konami Digital Entertainment': 11,
'Hasbro Interactive': 1,
'MTV Games': 1,
'Sega': 11,
'Enix Corporation': 4,
'Capcom': 13,
'Warner Bros. Interactive Entertainment': 7,
'Acclaim Entertainment': 1,
'Universal Interactive': 1,
'Namco Bandai Games': 7,
'Eidos Interactive': 9,
'THQ': 7,
'RedOctane': 1,
'Sony Computer Entertainment Europe': 3,
'Take-Two Interactive': 24,
'Square Enix': 5,
'Microsoft Game Studios': 22,
'Disney Interactive Studios': 2,
'Vivendi Games': 2,
'Sony Computer Entertainment': 52,
'Activision': 45,
'505 Games': 4}
The problem I am facing is reading the labels: they are so small that they are effectively invisible.
Can anyone suggest how to increase the label size?
I have tried the code below:
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(),labels=vg_dict.keys())
plt.show()

Add the textprops argument to plt.pie:
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(), labels=vg_dict.keys(), textprops={'fontsize': 30})
plt.show()
textprops accepts any property of the matplotlib Text object, so the same dictionary can also set things like color or weight.
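As a small illustration of passing several Text properties at once, the sketch below draws a pie from four of the publishers in the question and then reads the label font sizes back from the axes. The figure size, the specific property values, and the name label_sizes are assumptions chosen for the example; the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Four entries taken from the question's dictionary
vg_dict = {
    "Nintendo": 142,
    "Electronic Arts": 66,
    "Sony Computer Entertainment": 52,
    "Activision": 45,
}

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(
    list(vg_dict.values()),
    labels=list(vg_dict.keys()),
    # Any Text property can go here; these values are only examples
    textprops={"fontsize": 14, "color": "navy", "fontweight": "bold"},
)

# The label Text objects live on the axes, so their size can be inspected
label_sizes = [text.get_fontsize() for text in ax.texts]
print(label_sizes)
plt.close(fig)
```

With a more modest figsize, a moderate font size is usually enough; the very large figure in the question is what makes the default labels look tiny.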
Updated
I don't know whether the order of your labels matters. If it does not, you can avoid overlapping labels by changing the start angle (matplotlib draws the pie counterclockwise starting from the x-axis) and re-ordering the "crowded" labels:
vg_dict = {
'Palcom': 1,
'Electronic Arts': 66,
'GT Interactive': 1,
'LucasArts': 5,
'Bethesda Softworks': 9,
'SquareSoft': 3,
'Nintendo': 142,
'Virgin Interactive': 4,
'Atari': 7,
'Ubisoft': 28,
'Hasbro Interactive': 1,
'Konami Digital Entertainment': 11,
'MTV Games': 1,
'Sega': 11,
'Enix Corporation': 4,
'Capcom': 13,
'Acclaim Entertainment': 1,
'Warner Bros. Interactive Entertainment': 7,
'Universal Interactive': 1,
'Namco Bandai Games': 7,
'Eidos Interactive': 9,
'THQ': 7,
'RedOctane': 1,
'Sony Computer Entertainment Europe': 3,
'Take-Two Interactive': 24,
'Vivendi Games': 2,
'Square Enix': 5,
'Microsoft Game Studios': 22,
'Disney Interactive Studios': 2,
'Sony Computer Entertainment': 52,
'Fox Interactive': 1,
'Activision': 45,
'505 Games': 4}
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(), labels=vg_dict.keys(), textprops={'fontsize': 35}, startangle=-35)
plt.show()

Creating empty pandas dataframe with Multi-Index

I'm trying to create an empty pandas.DataFrame with a Multi-Index that I can later fill columnwise with my data. I've looked at other answers (here and here), but they all work with data that is not filled in columnwise, or that is somehow connected across the different columns.
The information I want to be contained in the Multi-Index looks like this:
GCM_list = ['BCC-CSM2-MR', 'CAMS-CSM1-0', 'CESM2', 'CESM2-WACCM', 'CMCC-CM2-SR5', 'EC-Earth3', 'EC-Earth3-Veg', 'FGOALS-f3-L', 'GFDL-ESM4', 'INM-CM4-8', 'INM-CM5-0', 'MPI-ESM1-2-HR', 'MRI-ESM2-0', 'NorESM2-MM', 'TaiESM1']
SSP_list = ['SSP_126', 'SSP_245', 'SSP_370', 'SSP_585']
index_years = [2030, 2040, 2050, 2060, 2070, 2080, 2090, 2100]
And I want it to look somewhat like this (for the three first items in GCM_list):
BCC-CSM2-MR CAMS-CSM1-0 CESM2
SSP_126 SSP_245 SSP_370 SSP_585 SSP_126 SSP_245 SSP_370 SSP_585 SSP_126 SSP_245 SSP_370 SSP_585
2030 | |
2040 | |
2050 V V
2060 1 2
2070
2080
2090
2100
The "arrows" in the first two columns represent how and in what order I want to fill the DataFrame after the index is created, in case that is important for this question.
I've tried building the index like this, but I'm not sure what to make of the result. How should I proceed? Is there a way to build this empty dataframe so that I can fill it column after column?
arrays = [GCM_list, SSP_list]
index = pd.MultiIndex.from_arrays(arrays, names=('GCM', 'SSP'))
>>> index
MultiIndex(levels=[[u'BCC-CSM2-MR', u'CAMS-CSM1-0', u'CESM2', u'CESM2-WACCM', u'CMCC-CM2-SR5', u'EC-Earth3', u'EC-Earth3-Veg', u'FGOALS-f3-L', u'GFDL-ESM4', u'INM-CM4-8', u'INM-CM5-0', u'MPI-ESM1-2-HR', u'MRI-ESM2-0', u'NorESM2-MM', u'TaiESM1'], [u'SSP_126', u'SSP_245', u'SSP_370', u'SSP_585']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]],
names=[u'GCM', u'SSP'])

Use MultiIndex.from_product, which builds every (GCM, SSP) combination, and pass the result as the columns:
arrays = [GCM_list, SSP_list]
mux = pd.MultiIndex.from_product(arrays, names=('GCM', 'SSP'))
df = pd.DataFrame(columns=mux, index=index_years)
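Once the empty frame exists, each column can be filled by addressing it with a (GCM, SSP) tuple. The sketch below assumes only the first three models from GCM_list and uses made-up placeholder numbers in column_values; the real data would replace them.

```python
import pandas as pd

GCM_list = ["BCC-CSM2-MR", "CAMS-CSM1-0", "CESM2"]  # first three models only
SSP_list = ["SSP_126", "SSP_245", "SSP_370", "SSP_585"]
index_years = [2030, 2040, 2050, 2060, 2070, 2080, 2090, 2100]

# Every (GCM, SSP) pair becomes one column; rows are the years
mux = pd.MultiIndex.from_product([GCM_list, SSP_list], names=("GCM", "SSP"))
df = pd.DataFrame(columns=mux, index=index_years)

# Hypothetical placeholder values standing in for one model/scenario series
column_values = [0.1 * i for i in range(len(index_years))]

# Fill a single column by its (GCM, SSP) tuple
df[("BCC-CSM2-MR", "SSP_126")] = column_values

filled = df[("BCC-CSM2-MR", "SSP_126")]
still_empty = df[("BCC-CSM2-MR", "SSP_245")]
print(df.iloc[:3, :4])
```

Columns that have not been assigned yet stay as NaN, so the frame can be filled one column after another in any order.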

Chart Axes in VB.NET

My requirement is to graph (scatter graph) data from 2 arrays. I can now connect the data from the array and use it on the chart. My question is, how do I set the graph's X- and Y- axes to show consistency in their intervals?
For example, I have points from X = {1, 3, 4, 6, 8, 9} and Y = {7, 10, 11, 15, 18, 19}. What I would like to see is that these points are graphed in a scatter manner, but, the intervals for x-axis should be (intervals of) 2 up to 10 (such that it will show 0, 2, 4, 6, 8, 10 on x-axis) and intervals of 5 for the y-axis (such that it will show 5, 10, 15, 20 on y-axis). What code/property should I use/manipulate?
ADDED PART:
I currently have this data:
x_column = {12, 24, 1, 7, 29, 28, 25, 24, 15, 19}
y_column = {3, 5, 8, 3, 3, 3, 3, 3, 19, 15}
Each y_column element pairs with the x_column element at the same position.
Now, I want MyChart to display a scatter graph of the x_column and y_column data in such a way that the x-axis will show 5, 10, 15, 20, 25, 30 and the y-axis will show 2, 4, 6, 8, 10, 12, 14, 16, 18, 20.
My current code is:
' add points
MyChart.Series("Scatter Plot").Points.DataBindXY(x_Column, y_Column)
The code above only adds points.

Try setting the Interval property of each axis:
Chart1.ChartAreas("Default").AxisX.Interval = 2
Chart1.ChartAreas("Default").AxisY.Interval = 5
