Suppose that I have emission data with shape (21600, 43200), which corresponds to lat and lon, i.e.,
lat = np.arange(21600)*(-0.008333333)+90
lon = np.arange(43200)*0.00833333-180
I also have a scaling factor with shape (720, 1440, 7), which corresponds to lat, lon, and day of week, with
lat = np.arange(720)*0.25-90
lon = np.arange(1440)*0.25-180
For now, I want to apply the factor to the emission data. I think I need to interpolate the factor from (720, 1440) to (21600, 43200); after that I can multiply the interpolated factor by the emission data to get the new emission output.
But I'm having difficulty with the interpolation method.
Could anyone give me some suggestions?
Here's a complete example of the kind of interpolation you're trying to do. For example purposes I used emission data with shape (10, 20) and scale data with shape (5, 10). It uses scipy.interpolate.RectBivariateSpline, which is the recommended method for interpolating on regular grids:
import numpy as np
import scipy.interpolate as sci

def latlon(res):
    # lat/lon vectors for a grid with `res` rows and 2*res columns
    return (np.arange(res)*(180/res) - 90,
            np.arange(2*res)*(360/(2*res)) - 180)

# fine grid: emission data with shape (10, 20)
lat_fine, lon_fine = latlon(10)
emission = np.ones(10*20).reshape(10, 20)

# coarse grid: scale data with shape (5, 10)
lat_coarse, lon_coarse = latlon(5)
scale = np.linspace(0, .5, num=5).reshape(-1, 1) + np.linspace(0, .5, num=10)

# fit a spline on the coarse grid, then evaluate it on the fine grid
f = sci.RectBivariateSpline(lat_coarse, lon_coarse, scale)
scale_interp = f(lat_fine, lon_fine)

with np.printoptions(precision=1, suppress=True, linewidth=9999):
    print('original emission data:\n%s\n' % emission)
    print('original scale data:\n%s\n' % scale)
    print('interpolated scale data:\n%s\n' % scale_interp)
    print('scaled emission data:\n%s\n' % (emission*scale_interp))
which outputs:
original emission data:
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
original scale data:
[[0. 0.1 0.1 0.2 0.2 0.3 0.3 0.4 0.4 0.5]
[0.1 0.2 0.2 0.3 0.3 0.4 0.5 0.5 0.6 0.6]
[0.2 0.3 0.4 0.4 0.5 0.5 0.6 0.6 0.7 0.8]
[0.4 0.4 0.5 0.5 0.6 0.7 0.7 0.8 0.8 0.9]
[0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1. ]]
interpolated scale data:
[[0. 0. 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5]
[0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6]
[0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6]
[0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7]
[0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8]
[0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8 0.8 0.8]
[0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.9]
[0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]]
scaled emission data:
[[0. 0. 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5]
[0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6]
[0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6]
[0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7]
[0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8]
[0.3 0.3 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8 0.8 0.8]
[0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.9]
[0.4 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]
[0.5 0.5 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.9 0.9 0.9 0.9 1. 1. 1. ]]
Notes
The interpolation methods in scipy.interpolate expect both x and y to be strictly increasing, so you'll have to make sure that your emission data is arranged in a grid such that:
lat = np.arange(21600)*0.008333333 - 90
instead of:
lat = np.arange(21600)*(-0.008333333) + 90
like you have above. You can flip your emission data like so:
emission = emission[::-1, :]
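To handle the day-of-week dimension of your actual (720, 1440, 7) factor with this spline approach, one option is to fit and apply one spline per day-of-week slice. Here's a minimal sketch under that assumption, with random placeholder arrays standing in for your real data (each full-resolution float64 field is roughly 7 GB, so you'll probably want to process and save one day at a time rather than keep all seven in memory):
import numpy as np
import scipy.interpolate as sci

# coarse grid and a placeholder scaling factor with the shapes from the question
lat_coarse = np.arange(720)*0.25 - 90
lon_coarse = np.arange(1440)*0.25 - 180
scale = np.random.rand(720, 1440, 7)        # (lat, lon, day of week)

# fine grid, already flipped so latitude is strictly increasing, and placeholder emissions
lat_fine = np.arange(21600)*0.008333333 - 90
lon_fine = np.arange(43200)*0.00833333 - 180
emission = np.random.rand(21600, 43200)     # arranged on the flipped grid

for day in range(7):
    # spline fit on the coarse grid for this day, evaluated on the fine grid
    f = sci.RectBivariateSpline(lat_coarse, lon_coarse, scale[:, :, day])
    scaled_day = emission * f(lat_fine, lon_fine)
    # ... write scaled_day to disk here before moving on to the next day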
If you're just looking for nearest neighbor or linear interpolation, you can use xarray's native da.interp method:
scaling_interped = scaling_factor.interp(
    lon=emissions.lon,
    lat=emissions.lat,
    method='nearest')  # or 'linear'
Note that this will dramatically increase the size of the array. Assuming these are 64-bit floats, the result will be approximately (21600*43200*7)*8/(1024**3), or about 48.7 GB. You could cut the in-memory size by a factor of 7 by chunking the array by day of week and doing the computation out of core with dask.
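For concreteness, here's a minimal, self-contained sketch of this interp approach with small made-up grids; the names emissions and scaling_factor, the dayofweek dimension, and the toy coordinates are stand-ins for your real data:
import numpy as np
import xarray as xr

# fine-grid emissions (1-degree cells here instead of 1/120 degree, to keep the example small)
emissions = xr.DataArray(
    np.ones((180, 360)),
    coords={'lat': np.arange(180)*1.0 - 89.5, 'lon': np.arange(360)*1.0 - 179.5},
    dims=['lat', 'lon'])

# coarse-grid scaling factor with a day-of-week dimension (10-degree cells here)
scaling_factor = xr.DataArray(
    np.random.rand(19, 37, 7),
    coords={'lat': np.arange(19)*10.0 - 90, 'lon': np.arange(37)*10.0 - 180,
            'dayofweek': np.arange(7)},
    dims=['lat', 'lon', 'dayofweek'])

# regrid the factor onto the emission coordinates, then multiply;
# broadcasting over dayofweek gives a (lat, lon, dayofweek) result
scaling_interped = scaling_factor.interp(
    lon=emissions.lon,
    lat=emissions.lat,
    method='nearest')  # or 'linear'
scaled_emissions = emissions * scaling_interped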
If you want to use an interpolation scheme other than nearest or linear, use the method suggested by tel.
Related
For example, if I have a dataset like the one in the first table below....
Name  Code  Thickness
CH1   3     0.5
CH1   3     0.3
CH1   4     0.4
CH1   3     0.2
CH1   5     0.6
CH1   5     0.4
.... and I want to achieve the result in the next table by grouping by the "Code" column and summing the "Thickness" column:
Name  Code  Thickness  Grp_Thickness
CH1   3     0.5        0.8
CH1   3     0.3        0.8
CH1   4     0.4        0.4
CH1   3     0.2        0.2
CH1   5     0.6        1.0
CH1   5     0.4        1.0
How do I go about this?
It's a gaps-and-islands problem: any time the Code changes, a new island starts. You can solve these problems with a cumulative sum:
s = (df["Code"] != df["Code"].shift()).cumsum()
df["Grp_Thickness"] = df.groupby(s)["Thickness"].transform("sum")
First time posting.
I'm working on a project in BigQuery where I want to round weights up to the next 0.5 kg slab.
For example:
0.4 kg should round up to 0.5 kg
2.1 kg should round up to 2.5 kg
Even if the weight is only 50 or 100 grams above the current slab, I want it rounded up to the next slab.
I tried
WHEN WKG.Weight_KG <= 0.5 THEN WKG.Weight_KG = 0.5
but the output comes back in boolean format.
I also tried
WHEN WKG.Weight_KG <= 0.5 THEN ROUND(WKG.Weight_KG / .5, 0) * .5
but a few of the numbers were rounded to 0.0 instead of 0.5.
This should work:
case
  when mod(cast(round(weight * 10) as int64), 10) = 0 then round(weight)       -- already lands on a whole-kg slab
  when mod(cast(round(weight * 10) as int64), 10) <= 5 then floor(weight) + 0.5 -- up to x.5 -> next half-kg slab
  else floor(weight) + 1                                                        -- above x.5 -> next whole-kg slab
end as weight
You can use
TRUNC(value) + CEIL(MOD(value, 1.0) / 0.5) * 0.5
For instance,
SELECT value, TRUNC(value) + CEIL(MOD(value, 1.0) / 0.5) * 0.5 AS rounded_up
FROM UNNEST(generate_array(CAST(-0.6 AS NUMERIC), 2.6, 0.1)) AS value
returns
value rounded_up
-0.6 -0.5
-0.5 -0.5
-0.4 0
-0.3 0
-0.2 0
-0.1 0
0 0
0.1 0.5
0.2 0.5
0.3 0.5
0.4 0.5
0.5 0.5
0.6 1
0.7 1
0.8 1
0.9 1
1 1
1.1 1.5
1.2 1.5
1.3 1.5
1.4 1.5
1.5 1.5
1.6 2
1.7 2
1.8 2
1.9 2
2 2
2.1 2.5
2.2 2.5
2.3 2.5
2.4 2.5
2.5 2.5
2.6 3
I have a data frame as shown below
B_ID Session no_show cumulative_no_show u_no_show
1 s1 0.4 0.4 0.4
2 s1 0.6 1.0 1.0
3 s1 0.2 1.2 0.2
4 s1 0.1 1.3 0.3
5 s1 0.4 1.7 0.7
6 s1 0.2 1.9 0.9
7 s1 0.3 2.2 0.2
10 s2 0.3 0.3 0.3
11 s2 0.4 0.7 0.7
12 s2 0.3 1.0 1.0
13 s2 0.6 1.6 0.6
14 s2 0.2 1.8 1.8
15 s2 0.5 2.3 0.3
From the above I would like to derive a new column slot_num from u_no_show as follows: within each Session, whenever u_no_show increases relative to the previous row, increase slot_num by one; otherwise keep it the same.
Expected Output
B_ID Session no_show cumulative_no_show u_no_show slot_num
1 s1 0.4 0.4 0.4 1
2 s1 0.6 1.0 1.0 2
3 s1 0.2 1.2 0.2 2
4 s1 0.1 1.3 0.3 3
5 s1 0.4 1.7 0.7 4
6 s1 0.2 1.9 0.9 5
7 s1 0.3 2.2 0.2 5
10 s2 0.3 0.3 0.3 1
11 s2 0.4 0.7 0.7 2
12 s2 0.3 1.0 1.0 3
13 s2 0.6 1.6 0.6 3
14 s2 0.2 1.8 0.8 4
15 s2 0.5 2.3 0.3 4
I would do this with two groupby operations:
s = df.groupby('Session').u_no_show.diff().gt(0).astype(int)
df['slot_num'] = s.groupby(df.Session).cumsum().add(1)
Output:
B_ID Session no_show cumulative_no_show u_no_show slot_num
0 1 s1 0.4 0.4 0.4 1
1 2 s1 0.6 1.0 1.0 2
2 3 s1 0.2 1.2 0.2 2
3 4 s1 0.1 1.3 0.3 3
4 5 s1 0.4 1.7 0.7 4
5 6 s1 0.2 1.9 0.9 5
6 7 s1 0.3 2.2 0.2 5
7 10 s2 0.3 0.3 0.3 1
8 11 s2 0.4 0.7 0.7 2
9 12 s2 0.3 1.0 1.0 3
10 13 s2 0.6 1.6 0.6 3
11 14 s2 0.2 1.8 1.8 4
12 15 s2 0.5 2.3 0.3 4
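To see what the two steps are doing, here's a small sketch on the u_no_show values of session s1 alone:
import pandas as pd

# u_no_show for session s1, taken from the table above
u = pd.Series([0.4, 1.0, 0.2, 0.3, 0.7, 0.9, 0.2])

increase = u.diff().gt(0).astype(int)    # 1 where the value went up vs. the previous row
print(increase.tolist())                 # [0, 1, 0, 1, 1, 1, 0]
print((increase.cumsum() + 1).tolist())  # [1, 2, 2, 3, 4, 5, 5]  -> slot_num for s1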
OK, as a Python beginner I'm finding matrix multiplication with pandas DataFrames very difficult to carry out.
I have two tables that look like:
df1
Id lifetime 0 1 2 3 4 5 .... 30
0 1 4 0.1 0.2 0.1 0.4 0.5 0.4... 0.2
1 2 7 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
2 3 8 0.5 0.2 0.1 0.4 0.5 0.4... 0.6
.......
9 6 10 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
df2
Group lifetime 0 1 2 3 4 5 .... 30
0 2 4 0.9 0.8 0.9 0.8 0.8 0.8... 0.9
1 2 7 0.8 0.9 0.9 0.9 0.8 0.8... 0.9
2 3 8 0.9 0.7 0.8 0.8 0.9 0.9... 0.9
.......
9 5 10 0.8 0.9 0.7 0.7 0.9 0.7... 0.9
I want to perform Excel's SUMPRODUCT function in my code, where the number of columns to be multiplied and summed is determined by the lifetime in column 1 of both dfs, e.g.,
for row 0 in df1&df2, lifetime=4:
sumproduct(df1 row 0 from column 0 to column 3,
df2 row 0 from column 0 to column 3)
for row 1 in df1&df2, lifetime=7:
sumproduct(df1 row 1 from column 0 to column 6,
df2 row 1 from column 0 to column 6)
.......
How can I do this?
You can use .iloc to access rows and columns by integer position.
The row where lifetime == 4 is row 0, and counting column positions (starting from Id at position 0), the column labeled 0 sits at position 2 and the column labeled 3 sits at position 5, so that interval is the slice 2:6.
Once you have the correct data from both data frames with .iloc[0, 2:6], you run np.dot.
See below:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
Just to make sure you have the right data, try just running
df1.iloc[0,2:6]
Then try the np.dot product. You can read up on "pandas iloc" and "slicing" for more info.
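To do this for every row at once, here is a minimal sketch along the same lines, assuming both frames are aligned row by row and the period columns start at iloc position 2, with small made-up data in place of your 30 columns:
import numpy as np
import pandas as pd

# small stand-ins for df1/df2: an id column, a lifetime column, then period columns 0..4
df1 = pd.DataFrame([[1, 3, 0.1, 0.2, 0.1, 0.4, 0.5],
                    [2, 5, 0.3, 0.2, 0.5, 0.4, 0.5]],
                   columns=['Id', 'lifetime', 0, 1, 2, 3, 4])
df2 = pd.DataFrame([[2, 3, 0.9, 0.8, 0.9, 0.8, 0.8],
                    [2, 5, 0.8, 0.9, 0.9, 0.9, 0.8]],
                   columns=['Group', 'lifetime', 0, 1, 2, 3, 4])

# for each row, dot the first `lifetime` period columns of df1 with those of df2;
# a lifetime of n spans iloc positions 2 to 2+n
sumprod = [np.dot(df1.iloc[i, 2:2 + n], df2.iloc[i, 2:2 + n])
           for i, n in enumerate(df1['lifetime'].astype(int))]
print(sumprod)   # one SUMPRODUCT value per row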
I am finding this issue quite complex:
I have the following df:
values_1 values_2 values_3 id name
0.1 0.2 0.3 1 AAAA_living_thing
0.1 0.2 0.3 1 AAA_mammals
0.1 0.2 0.3 1 AA_dog
0.2 0.4 0.6 2 AAAA_living_thing
0.2 0.4 0.6 2 AAA_something
0.2 0.4 0.6 2 AA_dog
The output should be:
values_1 values_2 values_3 id name
0.3 0.6 0.9 3 AAAA_living_thing
0.1 0.2 0.3 1 AAA_mammals
0.1 0.2 0.3 1 AA_dog
0.2 0.4 0.6 2 AAA_something
0.2 0.4 0.6 2 AA_dog
It would be like a groupby().sum(), but only for AAAA_living_thing, as the rows below it are children of AAAA_living_thing.
First separate the DataFrame using query into the rows with AAAA_living_thing and the rows without. Then use groupby on the first part, and finally concat them back together:
temp = df.query('name.str.startswith("AAAA")').groupby('name', as_index=False).sum()
temp2 = df.query('~name.str.startswith("AAAA")')
final = pd.concat([temp, temp2])
Output
id name values_1 values_2 values_3
0 3 AAAA_living_thing 0.3 0.6 0.9
1 1 AAA_mammals 0.1 0.2 0.3
2 1 AA_dog 0.1 0.2 0.3
4 2 AAA_something 0.2 0.4 0.6
5 2 AA_dog 0.2 0.4 0.6
Another way would be to use np.where to give every row that is not AAAA_living_thing a unique identifier, and then group by name plus that identifier:
s = np.where(df['name'].str.startswith('AAAA'), 0, df.index)
final = df.groupby(['name', s], as_index=False).sum()
Output
name values_1 values_2 values_3 id
0 AAAA_living_thing 0.3 0.6 0.9 3
1 AAA_mammals 0.1 0.2 0.3 1
2 AAA_something 0.2 0.4 0.6 2
3 AA_dog 0.1 0.2 0.3 1
4 AA_dog 0.2 0.4 0.6 2
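To see why the helper identifier works, here is the second approach as a self-contained sketch with the frame rebuilt from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'values_1': [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],
    'values_2': [0.2, 0.2, 0.2, 0.4, 0.4, 0.4],
    'values_3': [0.3, 0.3, 0.3, 0.6, 0.6, 0.6],
    'id': [1, 1, 1, 2, 2, 2],
    'name': ['AAAA_living_thing', 'AAA_mammals', 'AA_dog',
             'AAAA_living_thing', 'AAA_something', 'AA_dog'],
})

# AAAA_* rows all get identifier 0, so they collapse into a single group;
# every other row keeps its own index and therefore stays a group of one
s = np.where(df['name'].str.startswith('AAAA'), 0, df.index)
print(s)  # [0 1 2 0 4 5]
print(df.groupby(['name', s], as_index=False).sum())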