I have a dataframe column that I want to split into equal-sized buckets. The values in this column are floats between 0 and 1. The data is heavily skewed, so most values fall in the 0.90s or at exactly 1. I want:
Bucket 1: all 0's (the size of this bucket will differ from buckets 2-9 and 10)
Buckets 2-9: any values > 0 and < 1, split into equal-sized buckets
Bucket 10: all 1's (the size of this bucket will differ from buckets 2-9 and 1)
Example:
continuous_number_col | Bucket
0.001                   2
0.95                    9
1                       10
0                       1
This should be how it looks when I groupBy("Bucket"). The counts of buckets 1 and 10 aren't significant here; each will just be its own bucket. And the 75 counts will be different in practice; they are just an example.
Bucket | Count | Values
1        1000    0
2        75      0.01 - 0.50
3        75      0.51 - 0.63
4        75      0.64 - 0.71
5        75      0.72 - 0.83
6        75      0.84 - 0.89
7        75      0.90 - 0.92
8        75      0.93 - 0.95
9        75      0.95 - 0.99
10       2000    1
I've tried using the QuantileDiscretizer() function, like this:
val df = {
  rawDf
    // Taking 1's and 0's out for the moment
    .filter(col("continuous_number_col") =!= 1 && col("continuous_number_col") =!= 0)
}

val discretizer = new QuantileDiscretizer()
  .setInputCol("continuous_number_col")
  .setOutputCol("bucket_result")
  .setNumBuckets(8)

val result = discretizer.fit(df).transform(df)
However, this gives me the following bucket counts, which are not equal:
bucket_result | count
7.0             20845806
6.0             21096698
5.0             21538813
4.0             21222511
3.0             21193393
2.0             21413413
1.0             21032666
0.0             21681424
Hopefully this gives enough context for what I'm trying to do. Thanks in advance.
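For concreteness, here is a minimal sketch of the pipeline being described (written in PySpark rather than the Scala above, since pyspark.ml.feature exposes the same QuantileDiscretizer API; raw_df stands in for the question's rawDf): fit the splits on the strictly interior values only, transform the full frame, and override the 0 and 1 rows afterwards.

from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql import functions as F

# Fit the quantile splits on the strictly interior values only
mid_df = raw_df.filter(
    (F.col("continuous_number_col") != 0) & (F.col("continuous_number_col") != 1)
)

discretizer = QuantileDiscretizer(
    numBuckets=8,
    inputCol="continuous_number_col",
    outputCol="bucket_result",
    relativeError=0.0,  # exact quantiles instead of the approximate default
)
model = discretizer.fit(mid_df)

# Apply the fitted model to the full frame, then override the edge cases:
# 0 -> bucket 1, 1 -> bucket 10, everything else -> discretizer bucket + 2
result = model.transform(raw_df).withColumn(
    "bucket",
    F.when(F.col("continuous_number_col") == 0, F.lit(1))
     .when(F.col("continuous_number_col") == 1, F.lit(10))
     .otherwise(F.col("bucket_result").cast("int") + 2),  # 0..7 -> 2..9
)

Note that relativeError=0.0 requests exact rather than approximate quantiles; even then, heavily tied values (e.g. many identical 0.99s) can keep the middle buckets from coming out perfectly equal, since identical values always land in the same bucket.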
I have 2 different DataFrames of unequal shape:
pack_df:
pack_id | item_size | temperature
pck1      M           7.3
pck2      S           10.0
pck3      L           5.2
pck4      S           15.3
pck5      M           3.3
pck6      L           9.3
pck7      L           20.3
pck8      M           8.1
pck9      M           21.3
pck10     S           9.7
temperature_range_df:
min_temp | max_temp | S | M | L
-4         10         2   1   1
10.1       20         4   3   2
20.1       30         6   4   2
I need to check whether pack_df['temperature'] falls within the range [temperature_range_df['min_temp'], temperature_range_df['max_temp']]; when a match is found, I need to assign the packet count (temperature_range_df['S'] / ['M'] / ['L']) based on pack_df['item_size']. How can we achieve this without iterating over each row and comparing them (i.e. without dataframe.iterrows())? The dataframes are likely to grow a lot over time.
My final DataFrame should look like this:
final_dataframe:
pack_id | item_size | temperature | pack_count
pck1      M           7.3           1
pck2      S           10.0          2
pck3      L           5.2           1
pck4      S           15.3          4
pck5      M           3.3           1
pck6      L           9.3           1
pck7      L           20.3          2
pck8      M           8.1           1
pck9      M           21.3          4
pck10     S           9.7           2
Calculation: pck1 has item size 'M' and temperature 7.3; this falls in the temperature range -4 to 10, and for item size 'M' that range assigns a count of 1 packet.
Thanks for the help!
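For the record, one vectorised way to do this is sketched below, under two assumptions of mine: the ranges never overlap, and every temperature falls inside some range. It builds a pd.IntervalIndex from the min/max columns, finds each temperature's range row with get_indexer, and picks the S/M/L column per row with a NumPy lookup.

import pandas as pd

# Intervals closed on both ends, matching the sample ranges
# (-4..10, 10.1..20, 20.1..30); they must not overlap
bins = pd.IntervalIndex.from_arrays(
    temperature_range_df["min_temp"],
    temperature_range_df["max_temp"],
    closed="both",
)
row_pos = bins.get_indexer(pack_df["temperature"])  # -1 would mean "no range matched"

# Pick the S/M/L column for each pack in one vectorised lookup
counts = temperature_range_df[["S", "M", "L"]].to_numpy()
col_pos = pack_df["item_size"].map({"S": 0, "M": 1, "L": 2}).to_numpy()

final_dataframe = pack_df.assign(pack_count=counts[row_pos, col_pos])

For pck1 this finds row 0 of the range table (temperature 7.3 lies in [-4, 10]) and column M, giving pack_count 1, as in the expected output.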
So I have this query where I need to select the difference between each child's values and its parent's values for the Q columns.
For example, let's say that in the Container table the parent value is 0.7 and the child value is 0.39; then the value from the selection (child minus parent) would be -0.31.
Now I need to multiply this value (-0.31) by the value of the Quality column, which is found in another table. Then I need the top 3 values, which means ordering descending, of course.
But of course it should only be multiplied when NetNames matches the BetNames name and the Codes value matches one of the columns in the Container table (Q_1, Q_2, Q_3).
I'm lost here guys.
Below is info about my tables.
/*Table Container*/
BetNamesID | Parent_B_Id | Year | Month | Q_1 | Q_2 | Q_3
1 null 2020 5 0.36 0.3 0.21
6 2 2020 8 0.39 0.64 1.0
7 1 2020 9 0.76 0.65 0.29
8 3 2020 13 0.62 0.34 0.81
9 2 2020 2 0.28 0.8 1.0
/*Table Configuration*/
NetNames | Codes | Quality
Test 1 Q_1 5
Test 2 Q_5 7
Test 3 Q_2 24
Test 4 Q_3 98
Test 5 Q_4 22
/*Table BetNames Info*/
ID | Parent_B_Id | Name
1 null Test 1
6 2 Test 2
7 1 Test 3
8 3 Test 4
9 2 Test 5
What I have done so far is this query:
SELECT
    child.[BetNamesID],
    child.[Parent_B_Id],
    child.[Q_1] - parent.[Q_1] AS Q_1,
    child.[Q_2] - parent.[Q_2] AS Q_2,
    child.[Q_3] - parent.[Q_3] AS Q_3,
    -- just a test case; this is how it is supposed to work in my mind:
    -- (child.[Q_3] - parent.[Q_3]) * qualityvalue AS Q_3
    n.name
FROM [dbo].[Container] child
JOIN [dbo].[Container] parent ON child.Parent_B_Id = parent.BetNamesID
JOIN dbo.NetNames n ON n.id = parent.Parent_B_Id -- with this I get the names for BetNamesID
And this is the result of my query so far:
BetNamesID | Parent_B_Id | Q_1 | Q_2 | Q_3
3 2 0.21 -0.3 -0.1
5 4 -0.39 0.64 -0.9
8 5 0.99 0.65 0.59
What I need now is to multiply the values of the Q_1, Q_2, Q_3 columns with the values found in the Configuration table (Quality column), but only when the BetNames name is equal to NetNames and the Codes row value matches the Q_1, Q_2, or Q_3 column.
These are the expected values.
BetNamesID | Parent_B_Id | Q_1               | Q_2               | Q_3
3            2             1.05 (0.21 * 5)     -7.2 (-0.3 * 24)    -9.8 (-0.1 * 98)
5            4             -1.95 (-0.39 * 5)   15.36 (0.64 * 24)   -88.2 (-0.9 * 98)
How does the new table come into play? How can I join it? How does the WHERE condition work in this case?
I have a query that pulls some aggregate stats by age group:
Agegroup Freq
0-5 2.3
6-10 3.2
11-15 3.6
For various reasons, I need the output data to be a lookup table for every age 1-100 of the following format
Age Agegroup Freq
1 0-5 2.3
2 0-5 2.3
3 0-5 2.3
4 0-5 2.3
5 0-5 2.3
6 6-10 3.2
7 6-10 3.2
8 6-10 3.2
9 6-10 3.2
10 6-10 3.2
...
How could I go about doing this? I'm not able to create tables, so I'm wondering if there's a way to write some kind of SELECT statement that will produce all ages 1-100 with their age group, and then join it to the original query that has the calculated frequencies by age group - something like this:
SELECT t1.age, [CASE WHEN statement that assigns the correct age group from t1.Age] "Agegroup"
FROM ([statement that generates numbers 1-100] "age") t1
JOIN (original query that creates the aggregated age group data) t2 ON t1.Agegroup = t2.Agegroup
So I have two questions:
1. Is this an approach that makes sense at all?
2. Is it possible to generate the t1 I'm looking for? I.e. a SELECT statement that will create a t1 of the form
Age Agegroup
1 0-5
2 0-5
3 0-5
4 0-5
5 0-5
6 6-10
7 6-10
8 6-10
9 6-10
10 6-10
...
that could then be joined with the query that has the frequency by agegroup?
Something like this... I included age 0 (it can be excluded if need be), and I only went through age 15. That is hard-coded; with a little extra work, it can be made to match the highest age in the ranges.
This version does unnecessary work, because it computes the substrings repeatedly. It may still execute in less than a second, but if performance becomes important, it can be written to compute those substrings in a CTE first, so they are not computed repeatedly. (Not shown here.)
with
  inputs (agegroup, freq) as (
    select '0-5',   2.3 from dual union all
    select '6-10',  3.2 from dual union all
    select '11-15', 3.6 from dual
  )
select c.age, i.agegroup, i.freq
from (select level - 1 as age from dual connect by level <= 16) c
inner join inputs i
  on age between to_number(substr(i.agegroup, 1, instr(i.agegroup, '-') - 1))
             and to_number(substr(i.agegroup, instr(i.agegroup, '-') + 1))
order by age
;
Output:
AGE AGEGROUP FREQ
---- --------- ----------
0 0-5 2.3
1 0-5 2.3
2 0-5 2.3
3 0-5 2.3
4 0-5 2.3
5 0-5 2.3
6 6-10 3.2
7 6-10 3.2
8 6-10 3.2
9 6-10 3.2
10 6-10 3.2
11 11-15 3.6
12 11-15 3.6
13 11-15 3.6
14 11-15 3.6
15 11-15 3.6
16 rows selected.
Here is a different solution, using a hierarchical query. It doesn't need "magic numbers" anymore: the ages are logically determined by the ranges, and there's no join (other than whatever the query engine does behind the scenes in the hierarchical query). On the admittedly very small sample you provided, the optimizer cost is about 20% less than the join-based solution I provided - that may result in slightly faster execution.
(NOTE - I posted two different solutions so I believe these are separate Answers - as opposed to editing my earlier post. I wasn't sure which action is appropriate.)
Also, another note to acknowledge that @AlexPoole mentioned this approach in his post; I didn't see it until now, or I would have acknowledged it from the outset.
with
  inputs (agegroup, freq) as (
    select '0-5',   2.3 from dual union all
    select '6-10',  3.2 from dual union all
    select '11-15', 3.6 from dual
  )
select to_number(substr(agegroup, 1, instr(agegroup, '-') - 1)) + level - 1 as age,
       agegroup, freq
from inputs
connect by level <= 1 + to_number(substr(agegroup, instr(agegroup, '-') + 1)) -
                        to_number(substr(agegroup, 1, instr(agegroup, '-') - 1))
       and prior agegroup = agegroup
       and prior sys_guid() is not null
order by age
;
An alternative approach, if you're on 11gR2 or higher, is to use recursive subquery factoring with a regular expression to extract the lower and upper age in each range from your string value:
with original_query (agegroup, freq) as (
  -- Original query that creates the aggregated agegroup data
  select '0-5', 2.3 from dual
  union all select '6-10', 3.2 from dual
  union all select '11-15', 3.6 from dual
),
r (age, agegroup, freq) as (
  select to_number(regexp_substr(agegroup, '\d+', 1, 1)), agegroup, freq
  from original_query
  union all
  select age + 1, agegroup, freq
  from r
  where age < to_number(regexp_substr(agegroup, '\d+', 1, 2))
)
select age, agegroup, freq
from r
order by age;
AGE AGEGR FREQ
---------- ----- ----------
0 0-5 2.3
1 0-5 2.3
2 0-5 2.3
3 0-5 2.3
4 0-5 2.3
5 0-5 2.3
6 6-10 3.2
7 6-10 3.2
8 6-10 3.2
9 6-10 3.2
10 6-10 3.2
11 11-15 3.6
12 11-15 3.6
13 11-15 3.6
14 11-15 3.6
15 11-15 3.6
The anchor member gets each original row from your existing result set, and extracts the lower-bound number (0, 6, 11, ...) using a simple regular expression - that could also be done with substr/instr.
The recursive member then repeats each of those anchor rows, adding one to the age each time, until it reaches the upper-bound number of the range.
You could use connect by as well, but it's a bit more awkward with multiple source rows.
Answers to your questions:
Yes, using the join approach with a generated table "t1" is a good idea.
To generate table "t1" you can use the following query:
SELECT age as "Age",
CASE l_age WHEN 0 THEN 0 ELSE l_age + 1 END || '-' || r_age AS "Agegroup"
FROM (
SELECT lvl age,
CASE m5 WHEN 0 THEN (t5-1)*5 ELSE t5 *5 END l_age,
CASE m5 WHEN 0 THEN t5 *5 ELSE (t5+1)*5 END r_age
FROM (
SELECT /*+ cardinality(100) */
level lvl, mod(level, 5) m5, TRUNC(level/5) t5
FROM dual
CONNECT BY level <= 100
)
);
Output:
Age Agegroup
1 0-5
2 0-5
3 0-5
4 0-5
5 0-5
6 6-10
7 6-10
8 6-10
9 6-10
10 6-10
11 11-15
12 11-15
13 11-15
14 11-15
15 11-15
16 16-20
17 16-20
18 16-20
19 16-20
20 16-20
21 21-25
22 21-25
23 21-25
24 21-25
25 21-25
26 26-30
27 26-30
28 26-30
29 26-30
30 26-30
.........
80 76-80
81 81-85
82 81-85
83 81-85
84 81-85
85 81-85
86 86-90
87 86-90
88 86-90
89 86-90
90 86-90
91 91-95
92 91-95
93 91-95
94 91-95
95 91-95
96 96-100
97 96-100
98 96-100
99 96-100
100 96-100
I have a pandas dataframe:
index A
1 3.4
2 4.5
3 5.3
4 2.1
5 4.0
6 5.3
...
95 3.4
96 1.2
97 8.9
98 3.4
99 2.7
100 7.6
From this I would like to create a dataframe B:
1-5 sum(1-5)
6-10 sum(6-10)
...
96-100 sum(96-100)
Any ideas how to do this elegantly rather than brute-force?
Cheers, Mike
This will give you a series with the partial sums:
df['bin'] = (df.index - 1) // 5   # ages 1-5 -> bin 0, 6-10 -> bin 1, ...
bin_sums = df.groupby('bin')['A'].sum()
Then, if you want to rename the index:
bin_sums.index = ['%s-%s' % (5*i + 1, 5*(i + 1)) for i in bin_sums.index]
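An equivalent approach, if you'd rather let pandas label the intervals for you, is to bin the index with pd.cut. A minimal self-contained sketch, assuming the 1..100 index from the question (the random data is just a stand-in):

import numpy as np
import pandas as pd

# Hypothetical frame shaped like the question: index 1..100, one column A
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(100)}, index=range(1, 101))

# Bins (0, 5], (5, 10], ..., (95, 100], so rows 1-5, 6-10, ... land together
bins = pd.cut(df.index, bins=range(0, 101, 5))
bin_sums = df.groupby(bins, observed=True)["A"].sum()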