Is there a way to reference the tuple below in a calculation? - sql

I have this view here:
  x  |  y  |  z
-----+-----+-----
  a  | 645 |
  b  |  46 |
  c  | 356 |
  d  | 509 |
Is there a way to write a query in which a z value references a different row?
For example, I want z to be the y value of the tuple below, minus 1.
So:
z.a = y.b - 1 = 46 - 1 = 45
z.b = y.c - 1 = 356 - 1 = 355
z.c = y.d - 1 = 509 - 1 = 508

You are describing the window function lead(), which lets you access any column on the "next" row (given a partition and an order by criterion):
select
    x,
    y,
    lead(y) over(order by x) - 1 as z
from mytable
Note that lead() returns NULL for the last row (there is no "next" row) unless you pass a default value as its third argument.
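For comparison only (not part of the SQL answer), the same next-row lookup can be sketched in pandas with shift(-1), using the question's data; the last row ends up NaN, just as lead() yields NULL there:
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "c", "d"], "y": [645, 46, 356, 509]})
# shift(-1) pulls y from the next row, mirroring lead(y) over (order by x)
df["z"] = df["y"].shift(-1) - 1
#    x    y      z
# 0  a  645   45.0
# 1  b   46  355.0
# 2  c  356  508.0
# 3  d  509    NaN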

Related

Pandas - how to get the minimum value for each row from values across several columns

I have a pandas dataframe in the following structure:
| index | a  | b  | c | d | e |
| ----- | -- | -- | - | - | - |
| 0     | -1 | -2 | 5 | 3 | 1 |
How can I get the minimum value for each row using only the positive values in columns a-e?
For the example row above, the minimum of (5,3,1) should be 1 and not (-2).
You can loop over all rows and apply your condition to each row.
For example:
import pandas as pd

df = pd.DataFrame([{"a": -2, "b": 2, "c": 5}, {"a": 3, "b": 0, "c": -1}])
#    a  b  c
# 0 -2  2  5
# 1  3  0 -1

def my_condition(li):
    # keep only the non-negative values, then take the minimum
    li = [i for i in li if i >= 0]
    return min(li)

min_cel = []
for k, r in df.iterrows():
    li = r.to_dict().values()
    min_cel.append(my_condition(li))
df["min"] = min_cel
#    a  b  c  min
# 0 -2  2  5    2
# 1  3  0 -1    0
You can also write the same thing in one line:
df['min'] = df.apply(lambda row: min([i for i in row.to_dict().values() if i >= 0]), axis=1)
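As a side note, a vectorized sketch (assuming all the columns are numeric): mask the negative cells with where, which turns them into NaN, then take the row-wise minimum, which skips NaN by default.
import pandas as pd

df = pd.DataFrame([{"a": -2, "b": 2, "c": 5}, {"a": 3, "b": 0, "c": -1}])
# negative cells become NaN, then min(axis=1) ignores them
df["min"] = df.where(df >= 0).min(axis=1)
#    a  b  c  min
# 0 -2  2  5  2.0
# 1  3  0 -1  0.0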

Python/Pandas: Transformation of column within a list of columns

I'd like to select a subset of columns from a DataFrame while applying a transformation to some of those columns at the same time. Is it possible to transform a column when that column is selected as one in a list of columns?
For example, I have a column StartDate that is of type np.datetime64 that I'd like to extract the month from.
When dealing with that Series on its own, I'd do something like
print(df['StartDate'].transform(lambda x: x.month))
to see the transformed data. Can I accomplish the same thing when the above expression is part of a list of columns? Something like:
print(df[['ColumnA', 'ColumnB', 'StartDate'.transform(lambda x: x.month)]])
Of course the above gives the error
AttributeError: 'str' object has no attribute 'month'
So, if my data looks like:
Metadata | Metadata | 2020-01-01
Metadata | Metadata | 2020-02-06
Metadata | Metadata | 2020-02-25
I'd like to see:
Metadata | Metadata | 1
Metadata | Metadata | 2
Metadata | Metadata | 2
Without appending a new separate "Month" column to the DataFrame. Is this possible?
If you have some data like below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.randint(10, size=366),
                   'col2': np.random.randint(10, size=366),
                   'StartDate': pd.date_range('2018', '2019')})
which looks like
col1 col2 StartDate
0 0 2 2018-01-01
1 8 0 2018-01-02
2 0 5 2018-01-03
3 3 4 2018-01-04
4 8 6 2018-01-05
... ... ... ...
361 8 8 2018-12-28
362 9 9 2018-12-29
363 4 1 2018-12-30
364 2 4 2018-12-31
365 0 9 2019-01-01
You could redefine the column, or you could use assign to create a temporary view, like:
df.assign(StartDate = df['StartDate'].dt.month)
which outputs:
col1 col2 StartDate
0 0 2 1
1 8 0 1
2 0 5 1
3 3 4 1
4 8 6 1
... ... ... ...
361 8 8 12
362 9 9 12
363 4 1 12
364 2 4 12
365 0 9 1
This also doesn't change the original dataframe. If you want to create a permanent version, then just reassign.
df = df.assign(StartDate = df['StartDate'].dt.month)
You could also take this further, such as:
df.assign(StartDate = df['StartDate'].dt.month, col1 = df['col1'] + 100)[['col1', 'StartDate']]
You can apply whatever transform you need and then access any columns you want after assigning these transforms.
col1 StartDate
0 105 1
1 109 1
2 108 1
3 101 1
4 108 1
... ... ...
361 104 12
362 102 12
363 109 12
364 102 12
365 100 1
I guess you could use the name attribute of the Series.
Something like:
dt_to_month = lambda x: [d.month for d in x] if x.name == 'StartDate' else x
df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month)
will do the trick.

How to assign the multiple values of an output to multiple new columns of a dataframe?

I have the following function:
def sum(x):
    oneS = x.iloc[0:len(x)//10].agg('sum')
    twoS = x.iloc[len(x)//10:2*len(x)//10].agg('sum')
    threeS = x.iloc[2*len(x)//10:3*len(x)//10].agg('sum')
    fourS = x.iloc[3*len(x)//10:4*len(x)//10].agg('sum')
    fiveS = x.iloc[4*len(x)//10:5*len(x)//10].agg('sum')
    sixS = x.iloc[5*len(x)//10:6*len(x)//10].agg('sum')
    sevenS = x.iloc[6*len(x)//10:7*len(x)//10].agg('sum')
    eightS = x.iloc[7*len(x)//10:8*len(x)//10].agg('sum')
    nineS = x.iloc[8*len(x)//10:9*len(x)//10].agg('sum')
    tenS = x.iloc[9*len(x)//10:len(x)].agg('sum')
    return [oneS, twoS, threeS, fourS, fiveS, sixS, sevenS, eightS, nineS, tenS]
How can I assign the outputs of this function to columns of a dataframe (which already exists)?
The dataframe I am applying the function to is as below:
Cycle Type Time
1 1 101
1 1 102
1 1 103
1 1 104
1 1 105
1 1 106
9 1 101
9 1 102
9 1 103
9 1 104
9 1 105
9 1 106
The dataframe I want to add the columns to is something like below, and the new columns OneS, TwoS, ... should be added as shown and filled with the results of the function.
Cycle Type OneS TwoS ThreeS
1 1
9 1
8 1
10 1
3 1
5 2
6 2
7 2
If I write a function for just one value and apply it like the following, it is possible:
grouped_data['fm']= data_train_bel1800.groupby(['Cycle', 'Type'])['Time'].apply( lambda x: fm(x))
But I want to do it all at once so that it is neat and clear.
You can use:
def f(x):
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)

df1 = (data_train_bel1800.groupby(['Cycle', 'Type'])['Time']
         .apply(f)
         .unstack()
         .add_prefix('new_')
         .reset_index())
print (df1)
Cycle Type new_0 new_1 new_2 new_3 new_4 new_5 new_6 new_7 new_8 \
0 1 1 0 101 102 205 207 209 315 211 211
1 9 1 0 101 102 205 207 209 315 211 211
new_9
0 106
1 106
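If you already have a summary dataframe with one row per (Cycle, Type) pair, as shown in the question, a merge should attach the new columns to it; a minimal sketch, assuming that frame is called summary_df:
# summary_df is a hypothetical frame with one row per (Cycle, Type) pair
summary_df = summary_df.merge(df1, on=['Cycle', 'Type'], how='left')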

R or SQL select records having different value for different category

I have a dataframe or table like this:
RowNumber Category Value
1         A        12
2         A        3
3         B        24
4         B        32
5         B        11
6         C        30
7         D        2
8         D        33
...
Using SQL (Hive) or R (I hope to get guidance for both languages):
Select records using a different cutoff point for each Category.
For Category A, I want to choose the rows with Value >= 10,
but for all the other categories (B, C, D) I need to choose the rows with Value >= 20.
The result:
RowNumber Category Value
1         A        12
3         B        24
4         B        32
6         C        30
8         D        33
How could I do this?
Thank you!!
In base R it can be done using:
df <- data.frame(RowNumber = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Category = c("A", "A", "B", "B", "B", "C", "D", "D"),
                 Value = c(12, 3, 24, 32, 11, 30, 2, 33))
df[df$Category == "A" & df$Value >= 10 | df$Category != "A" & df$Value >= 20, ]
You'll get the desired results:
RowNumber Category Value
1 1 A 12
3 3 B 24
4 4 B 32
6 6 C 30
8 8 D 33
Here are some alternatives.
library(sqldf)
# 1
sqldf("select * from DF
where (Category = 'A' and Value >= 10) or (not Category = 'A' and Value >= 20)")
# 2
sqldf("select * from DF where Value >= (case when Category = 'A' then 10 else 20 end)")
# 3
sqldf("select * from DF where Value >= (10 * (not Category = 'A') + 10)")
# 4
subset(DF, (Category == "A" & Value >= 10) | (Category != "A" & Value >= 20))
# 5
subset(DF, Value >= ifelse(Category == "A", 10, 20))
# 6
subset(DF, Value >= 10 * (Category != "A") + 10)
Any of the above give:
RowNumber Category Value
1 1 A 12
2 3 B 24
3 4 B 32
4 6 C 30
5 8 D 33
Note
The input in reproducible form is:
Lines <- "RowNumber Category Value
1 A 12
2 A 3
3 B 24
4 B 32
5 B 11
6 C 30
7 D 2
8 D 33"
DF <- read.table(text = Lines, header = TRUE)
A simple query:
select c1,c2 from tbl where c2 >= 10 and c1 = 'A'
union all
select c1,c2 from tbl where c2 >= 20 and c1 != 'A'
+---------+---------+--+
| _u1.c1 | _u1.c2 |
+---------+---------+--+
| A | 12 |
| B | 24 |
| B | 32 |
| C | 30 |
| D | 33 |
+---------+---------+--+

Implementing Convolution in SQL

I have a table d with fields x, y, f, (PK is x,y) and would like to implement convolution, where a new column, c, is defined, as the Convolution (2D) given an arbitrary kernel. In a procedural language, this is easy to define (see below). I'm confident it can be defined in SQL using a JOIN, but I'm having trouble doing so.
In a procedural language, I would do:
def conv(x, y):
    c = 0
    # x_ and y_ are pronounced "x prime" and "y prime",
    # and take on *all* x and y values in the table;
    # that is, we iterate through *all* rows
    for all x_, y_:
        c += f(x_, y_) * kernel(x_ - x, y_ - y)
    return c
kernel can be any arbitrary function. In my case, it's 1/k^(sqrt(x_dist^2 + y_dist^2)), which gives kernel(0,0) = 1.
For performance reasons, we don't need to look at every x_, y_. We can filter it where the distance < threshold.
I think this can be done using a Cartesian product join, followed by aggregate SQL SUM, along with a WHERE clause.
One additional challenge of doing this in SQL is NULLs. A naive implementation would treat them as zeroes. What I'd like to do is instead treat the kernel as a weighted average, and just leave out NULLs. That is, I'd use a function wkernel as my kernel, and modify the code above to be:
def conv(x, y):
    c = 0
    w = 0
    for all x_, y_:
        c += f(x_, y_) * wkernel(x_ - x, y_ - y)
        w += wkernel(x_ - x, y_ - y)
    return c / w
That would make NULLs work great.
To clarify: You can't have a partial observation, where x=NULL and y=3. However, you can have a missing observation, e.g. there is no record where x=2 and y=3. I am referring to this as NULL, in the sense that the entire record is missing. My procedural code above will handle this fine.
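For concreteness, a small runnable Python version of that weighted-average sketch, with the sample table given below stored sparsely so that missing rows simply never enter the loop:
from math import sqrt

k = 2.0
# f is stored sparsely: a missing (x, y) key plays the role of a missing row
f_table = {(0, 0): 1.4, (1, 0): 2.3, (0, 1): 1.7, (1, 1): 1.2}

def wkernel(dx, dy):
    return 1.0 / k ** sqrt(dx * dx + dy * dy)

def conv(x, y):
    c = 0.0
    w = 0.0
    for (x_, y_), f in f_table.items():  # only rows that actually exist contribute
        c += f * wkernel(x_ - x, y_ - y)
        w += wkernel(x_ - x, y_ - y)
    return c / w

print(conv(0, 0))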
I believe the above can be done in SQL (assuming wkernel is already implemented as a function), but I can't figure out how. I'm using Postgres 9.4.
Sample table:
Table d
x | y | f
0 | 0 | 1.4
1 | 0 | 2.3
0 | 1 | 1.7
1 | 1 | 1.2
Output (just showing one row):
x | y | c
0 | 0 | 1.4*1 + 2.3*1/k + 1.7*1/k + 1.2*1/k^1.414
Convolution (https://en.wikipedia.org/wiki/Convolution) is a standard algorithm used throughout image processing and signal processing, and I believe it can be done in SQL, which would be very useful given the large data sets we're now using.
I assumed a function wkernel, for example:
create or replace function wkernel(k numeric, xdist numeric, ydist numeric)
returns numeric language sql as $$
select 1. / pow(k, sqrt(xdist*xdist + ydist*ydist))
$$;
The following query gives what you want but without restricting to close values:
select d1.x, d1.y, SUM(d2.f*wkernel(2, d2.x-d1.x, d2.y-d1.y)) AS c
from d d1 cross join d d2
group by d1.x, d1.y;
x | y | c
---+---+-------------------------
0 | 0 | 3.850257072695778143380
1 | 0 | 4.237864186319019036455
0 | 1 | 3.862992722666908108145
1 | 1 | 3.725299918145074500610
(4 rows)
With some arbitrary restriction:
select d1.x, d1.y, SUM(d2.f*wkernel(2, d2.x-d1.x, d2.y-d1.y)) AS c
from d d1 cross join d d2
where abs(d2.x-d1.x)+abs(d2.y-d1.y) < 1.1
group by d1.x, d1.y;
x | y | c
---+---+-------------------------
0 | 0 | 3.400000000000000000000
1 | 0 | 3.600000000000000000000
0 | 1 | 3.000000000000000000000
1 | 1 | 3.200000000000000000000
(4 rows)
For the weighted average point:
select d1.x, d1.y, SUM(d2.f*wkernel(2, d2.x-d1.x, d2.y-d1.y)) / SUM(wkernel(2, d2.x-d1.x, d2.y-d1.y)) AS c
from d d1 cross join d d2
where abs(d2.x-d1.x)+abs(d2.y-d1.y) < 1.1
group by d1.x, d1.y;
Now onto the missing information thing. In the following code, please replace 2 by the maximum distance to be considered.
The idea is the following: We find the bounds of the considered image and we generate all the information that could be needed. With your example and with a maximum scope of 1, we need all the couples (x, y) such that (-1 <= x <= 2) and (-1 <= y <= 2).
Finding bounds and fixing scope=1 and k=2 (call this relation cfg):
SELECT MIN(x), MAX(x), MIN(y), MAX(y), 1, 2
FROM d;
min | max | min | max | ?column? | ?column?
-----+-----+-----+-----+----------+----------
0 | 1 | 0 | 1 | 1 | 2
Generating completed set of values (call this relation completed):
SELECT x.*, y.*, COALESCE(f, 0)
FROM cfg
CROSS JOIN generate_series(minx - scope, maxx + scope) x
CROSS JOIN generate_series(miny - scope, maxy + scope) y
LEFT JOIN d ON d.x = x.* AND d.y = y.*;
x | y | coalesce
----+----+----------
-1 | -1 | 0
-1 | 0 | 0
-1 | 1 | 0
-1 | 2 | 0
0 | -1 | 0
0 | 0 | 1.4
0 | 1 | 1.7
0 | 2 | 0
1 | -1 | 0
1 | 0 | 2.3
1 | 1 | 1.2
1 | 2 | 0
2 | -1 | 0
2 | 0 | 0
2 | 1 | 0
2 | 2 | 0
(16 rows)
Now we just have to compute the values with the query given before and the cfg and completed relations. Note that we do not compute convolution for the values on the borders:
SELECT d1.x, d1.y, SUM(d2.f*wkernel(k, d2.x-d1.x, d2.y-d1.y)) / SUM(wkernel(k, d2.x-d1.x, d2.y-d1.y)) AS c
FROM cfg cross join completed d1 cross join completed d2
WHERE d1.x BETWEEN minx AND maxx
AND d1.y BETWEEN miny AND maxy
AND abs(d2.x-d1.x)+abs(d2.y-d1.y) <= scope
GROUP BY d1.x, d1.y;
x | y | c
---+---+-------------------------
0 | 0 | 1.400000000000000000000
0 | 1 | 1.700000000000000000000
1 | 0 | 2.300000000000000000000
1 | 1 | 1.200000000000000000000
(4 rows)
All in one, this gives:
WITH cfg(minx, maxx, miny, maxy, scope, k) AS (
SELECT MIN(x), MAX(x), MIN(y), MAX(y), 1, 2
FROM d
), completed(x, y, f) AS (
SELECT x.*, y.*, COALESCE(f, 0)
FROM cfg
CROSS JOIN generate_series(minx - scope, maxx + scope) x
CROSS JOIN generate_series(miny - scope, maxy + scope) y
LEFT JOIN d ON d.x = x.* AND d.y = y.*
)
SELECT d1.x, d1.y, SUM(d2.f*wkernel(k, d2.x-d1.x, d2.y-d1.y)) / SUM(wkernel(k, d2.x-d1.x, d2.y-d1.y)) AS c
FROM cfg cross join completed d1 cross join completed d2
WHERE d1.x BETWEEN minx AND maxx
AND d1.y BETWEEN miny AND maxy
AND abs(d2.x-d1.x)+abs(d2.y-d1.y) <= scope
GROUP BY d1.x, d1.y;
I hope this helps :-)