Using a CASE statement to do binning in SQL

We have a small table with the following structure:
act_id | ratio
123    | 0
234    | 0.001
235    | 0.05
I am trying to use a CASE statement to add a new column that does bucketing, something like:
SELECT act_id,
ratio,
CASE WHEN 0 < ratio <= 0.04 THEN '(0,0.04]'
WHEN ratio > 0.04 THEN ratio END AS new_col
FROM table
But it gives the following error: SQL compilation error: error line 50 at position 56 invalid identifier '"(0,0.04]"'. The desired output is:
act_id | ratio | new_col
123    | 0     | 0
234    | 0.001 | (0,0.04]
235    | 0.05  | 0.05
How can we use the CASE statement here to put the desired strings in new_col, using open or closed intervals? Help appreciated.

A CASE expression returns a single value, so every branch needs to return the same type -- here, strings. Otherwise, if one branch returns a number and another a string, the string is converted to a number and you get an error. Cast the numeric branch to a string:
SELECT act_id,
ratio,
(CASE WHEN 0 < ratio <= 0.04 THEN '(0,0.04]'
WHEN ratio > 0.04 THEN ratio::string
END) AS new_col
FROM table
I'm not sure if 0 < ratio <= 0.04 does what you expect. I would recommend:
SELECT act_id,
ratio,
(CASE WHEN ratio > 0 and ratio <= 0.04 THEN '(0,0.04]'
WHEN ratio > 0.04 THEN ratio::string
END) AS new_col
FROM table
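As a hedged cross-check outside Snowflake, the same bucketing can be sketched in pandas (the frame and column names mirror the question; note that a float zero prints as '0.0' here):
import numpy as np
import pandas as pd

df = pd.DataFrame({"act_id": [123, 234, 235],
                   "ratio":  [0, 0.001, 0.05]})

# mirror the CASE logic: every branch must yield the same type, so cast to string
df["new_col"] = np.where((df.ratio > 0) & (df.ratio <= 0.04),
                         "(0,0.04]",
                         df.ratio.astype(str))
print(df)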

Related

Achieve incremental values for a month based on value in another column and date

I have a scenario where I have to increment numbers within a month.
Condition 1 : If the value in col2 is greater than 0 then expected output is 0.
Condition 2: If value in col1 is 0 then expected output should be 999.
Condition 3: If the value in col2 is 0 then increment the numbers from 1.
Note: if either condition 1 or condition 2 is satisfied while incrementing, then we must start incrementing again from 1.
Id  Date  Col1 Col2 Expected Output
101 01/01 28   1    0
101 01/02 43   0    1
101 01/03 46   0    2
101 01/04 0    0    999
101 01/05 56   0    1
101 01/06 95   5    0
101 01/07 0    0    999
101 01/08 65   0    1
101 01/09 1    0    2
101 01/10 2    0    3
Please suggest how this can be achieved.
A cumulative count plus Teradata's RESET WHEN option:
-- similar to ROW_NUMBER, but counts only the rows where col2 = 0 and col1 <> 0
case
   when col1 = 0 then 999
   else count(case when col2 > 0 or col1 = 0 then NULL else 1 end)
        over (partition by id
              order by date_
              reset when col2 > 0 or col1 = 0
              rows unbounded preceding)
end
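RESET WHEN is Teradata-specific. For readers elsewhere, a rough pandas emulation of the same restart-the-count idea (a sketch assuming, as in the sample, that every run of increments is opened by a reset row):
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "id":   [101] * 10,
    "col1": [28, 43, 46, 0, 56, 95, 0, 65, 1, 2],
    "col2": [1, 0, 0, 0, 0, 5, 0, 0, 0, 0],
})

reset = (df.col2 > 0) | (df.col1 == 0)       # rows that restart the count
runs = reset.cumsum()                        # every reset opens a new run
pos = df.groupby([df.id, runs]).cumcount()   # position within the run; the reset row is 0

df["expected"] = np.where(df.col1 == 0, 999,
                 np.where(df.col2 > 0, 0, pos))
print(df)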

Difference between two values in a pandas dataframe that are variable lengths apart

I am trying to automatically calculate the profit/loss from my trades. Currently I have my pandas DataFrame set up to return a Hold column that contains 1's while the purchase is active and a -1 once I have sold. The Price column records the price of the stock, while the Hold_Time and count columns keep track of how long the trade has been held in two different ways.
What I am struggling to do is calculate how much money I have made/lost. I need it to calculate (as a percentage) the difference between the purchase price (the first non-zero value in a series) and the sold value (the last non-zero value in the series). The challenge comes from the trades being of variable length, so df.shift doesn't work.
Below is a sample dataset; thank you, and if anything is unclear please ask.
Date Hold Price Hold_Time count
148 20190801 0 0.00 0 0
149 20190802 0 0.00 0 0
150 20190805 0 0.00 0 0
151 20190806 1 21.50 1 1
152 20190807 1 22.48 1 2
153 20190808 1 22.78 1 3
154 20190809 1 24.17 1 4
155 20190812 1 23.72 1 5
156 20190813 -1 23.39 0 0
157 20190814 0 0.00 0 0
158 20190815 0 0.00 0 0
159 20190816 0 0.00 0 0
160 20190819 0 0.00 0 0
161 20190820 0 0.00 0 0
162 20190821 0 0.00 0 0
163 20190822 0 0.00 0 0
164 20190823 1 24.80 1 1
165 20190826 1 24.00 1 2
166 20190827 -1 24.65 0 0
167 20190828 0 0.00 0 0
168 20190829 0 0.00 0 0
Thank you for providing an easy-to-work-with dataset. Assuming it is named 'data', I propose the following solution:
import pandas as pd
import numpy as np
data = pd.read_clipboard()
df = data.copy() # copy data on another dataframe
# keep only the rows where you bought or sold:
df['transaction_id'] = df.Hold_Time - df.Hold_Time.shift()
df = df.query('transaction_id!=0').dropna()
# calculate profit/loss for each time you sold
df['profit'] = np.where(df.Hold == -1, df.Price - df.Price.shift(), 0)
# calculate total profit (or anything else you want, I hope it will be easy at this point)
TOTAL_PROFIT = df.profit.sum()
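The question asks for the result as a percentage rather than an absolute difference; one extra line on top of the frame above covers that (same assumptions as the code it extends, reusing the np import):
# percent gain/loss per sale, relative to the purchase price
df['profit_pct'] = np.where(df.Hold == -1,
                            (df.Price - df.Price.shift()) / df.Price.shift() * 100,
                            0)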
groupby is your friend here, albeit in a somewhat roundabout way. You can use it to put each individual "holding" series into a separate bin by comparing the values to 0 and to the previous value - the all-zero runs also create groups here, which we have to drop subsequently.
blocks = df["Price"].groupby(((df["Price"] != 0) != (df["Price"] != 0).shift()).cumsum())
buy_values = blocks.first()
buy_values = buy_values[buy_values != 0]
sell_values = blocks.last()
sell_values = sell_values[sell_values != 0]
difference = sell_values - buy_values
percent_difference = difference / buy_values * 100
This only uses the "Price" column of your dataset. Using the other columns could make an easier / clearer solution, but this should do what you want!
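To see the run-splitting idea end to end, a minimal self-contained sketch using just the sample prices; with this data it should print roughly 8.79 and -0.60:
import pandas as pd

prices = pd.Series([0, 0, 0, 21.50, 22.48, 22.78, 24.17, 23.72, 23.39,
                    0, 0, 0, 0, 0, 0, 0, 24.80, 24.00, 24.65, 0, 0])

# label consecutive zero/non-zero runs, as in the answer above
runs = ((prices != 0) != (prices != 0).shift()).cumsum()
blocks = prices.groupby(runs)

buy = blocks.first()                          # first price in each run
sell = blocks.last()                          # last price in each run
pct = ((sell - buy) / buy * 100)[buy != 0]    # drop the all-zero runs
print(pct.round(2))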

Separating a column data into multiple columns in hive

I have sample data for devices, each of which contains two controllers and their versions. The sample data is as follows:
device_id controller_id versions
123 1 0.1
123 2 0.15
456 2 0.25
143 1 0.35
143 2 0.36
The above data should be in the below format:
device_id 1st_ctrl_id_ver 2nd_ctrl_id_ver
123 0.1 0.15
456 NULL 0.25
143 0.35 0.36
I used the below code, which is not working:
select
device_id,
case when controller_id="1" then versions end as 1st_ctrl_id_ver,
case when controller_id="2" then versions end as 2nd_ctrl_id_ver
from device_versions
The output which I got is:
device_id 1st_ctrl_id_ver 2nd_ctrl_id_ver
123 0.1 NULL
123 NULL 0.15
456 NULL 0.25
143 0.35 NULL
143 NULL 0.36
I don't want the NULL values in each row. Can someone help me write the correct code?
To "fold" all lines with a given key to a single line, you have to run an aggregation. Even if you don't really aggregate values in practise.
Something like
select device_id,
MAX(case when controller_id="1" then versions end) as 1st_ctrl_id_ver,
MAX(case when controller_id="2" then versions end) as 2nd_ctrl_id_ver
from device_versions
GROUP BY device_id
But be aware that this code will work if and only if you have at most one entry per controller per device, and any controller_id other than 1 or 2 will be ignored. In other words, it is rather brittle (but you can't do much better in plain SQL anyway).
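For comparison only, the same fold is a one-line pivot in pandas (a hedged sketch with the sample data, not Hive code; pivot raises if a device/controller pair is duplicated, matching the at-most-one-entry caveat above):
import pandas as pd

df = pd.DataFrame({
    "device_id":     [123, 123, 456, 143, 143],
    "controller_id": [1, 2, 2, 1, 2],
    "versions":      [0.1, 0.15, 0.25, 0.35, 0.36],
})

# one row per device, one column per controller; missing pairs become NaN
wide = df.pivot(index="device_id", columns="controller_id", values="versions")
print(wide)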

Ensure percentages are between 0 and 1, inclusive (using a single function)

I have percentages in a condition table:
create table condition (percent_decimal number(3,2));
insert into condition values (-0.01);
insert into condition values (0.1);
insert into condition values (1);
insert into condition values (1.1);
commit;
PERCENT_DECIMAL
---------------
-0.01
.1
1
1.1
I want to select the values, but modify them to present them as percentages between 0 and 1 (inclusive):
Convert -0.01 to 0
Leave .1 as is
Leave 1 as is
Convert 1.1 to 1
I can successfully do this using the greatest and least functions:
select
percent_decimal,
least(1,greatest(0,percent_decimal)) as percent_modified
from
condition
PERCENT_DECIMAL PERCENT_MODIFIED
--------------- ----------------
-0.01 0
.1 .1
1 1
1.1 1
However, I'm wondering if there is a more succinct way of doing this--with a single function.
You could use a single case expression:
select
percent_decimal,
case when percent_decimal < 0 then 0
when percent_decimal > 1 then 1
else percent_decimal
end as percent_modified
from
condition
/
PERCENT_DECIMAL PERCENT_MODIFIED
--------------- ----------------
-0.01 0
.1 .1
1 1
1.1 1
which is longer, but uses no functions, and I think it's clearer to someone coming along later what your logic is.
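Outside SQL, the same clamp is a one-liner in most languages; a minimal Python sketch of the least/greatest idiom, for illustration only:
def clamp(x, lo=0.0, hi=1.0):
    """Clamp x into [lo, hi] -- the least/greatest idiom."""
    return min(hi, max(lo, x))

for value in (-0.01, 0.1, 1, 1.1):
    print(value, "->", clamp(value))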

Tabulate Command Stata

I don't know if Stata can do this, but I use the tabulate command a lot in order to find frequencies. For instance, I have a success variable which takes on values 0 or 1, and I would like to know the success rate for a certain group of observations, i.e. tab success if group==1. I was wondering if I can do sort of the inverse of this operation. That is, I would like to know if I can find a value of "group" for which the frequency is greater than or equal to 15%, for example.
Is there a command that does this?
Thanks
As an example
sysuse auto
gen success=mpg<29
Now I want to find the value of price such that the frequency of the success variable is greater than 75%, for example.
According to @Nick:
ssc install groups
sysuse auto
count
  74
* return list                          // optional: lists the saved r() results
local nobs = r(N)                      // r(N) gives the total observation count
groups rep78, sel(f > (0.15*`nobs'))   // the groups for which freq > 15%
+---------------------------------+
| rep78 Freq. Percent % <= |
|---------------------------------|
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
+---------------------------------+
groups rep78, sel(f > (0.10*`nobs'))   // more than 10%
+----------------------------------+
| rep78 Freq. Percent % <= |
|----------------------------------|
| 2 8 11.59 14.49 |
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
| 5 11 15.94 100.00 |
+----------------------------------+
I'm not sure if I fully understand your question/situation, but I believe this might be useful. You can egen a variable that is equal to the mean of success, by group, and then see which observations have the value for mean(success) that you're looking for.
egen avgsuccess = mean(success), by(group)
tab group if avgsuccess >= 0.15
list group if avgsuccess >= 0.15
Does that accomplish what you want?
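For readers outside Stata, the egen mean-by-group idea maps directly onto a pandas transform; a hedged sketch with made-up data:
import pandas as pd

# made-up data: a 'group' column and a 0/1 'success' flag, as in the question
df = pd.DataFrame({
    "group":   [1, 1, 2, 2, 2, 3],
    "success": [1, 0, 1, 1, 0, 0],
})

# per-group success rate broadcast back to every row,
# like: egen avgsuccess = mean(success), by(group)
df["avgsuccess"] = df.groupby("group")["success"].transform("mean")
print(df.loc[df["avgsuccess"] >= 0.15, "group"].unique())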