How to alias a list of values in SQL - sql

I need to see if any of a set of columns contains a value in a list.
E.g.
...
SELECT *
FROM Account
WHERE
NOT (
AccountWarningCode1 IN (02, 05, 15, 20, 21, 24, 31, 36, 40, 42, 45, 47, 50, 51, 52, 53, 55, 56, 62, 65, 66, 78, 79, 84, 110, 119, 120, 121, 125, 202)
OR AccountWarningCode2 IN (02, 05, 15, 20, 21, 24, 31, 36, 40, 42, 45, 47, 50, 51, 52, 53, 55, 56, 62, 65, 66, 78, 79, 84, 110, 119, 120, 121, 125, 202)
OR AccountWarningCode3 IN (02, 05, 15, 20, 21, 24, 31, 36, 40, 42, 45, 47, 50, 51, 52, 53, 55, 56, 62, 65, 66, 78, 79, 84, 110, 119, 120, 121, 125, 202)
...
)
The above does work, but what I'd like to do instead is alias the list somehow so I don't repeat myself quite as much.
For example (this doesn't actually work):
WITH bad_warnings AS (02, 05, 15, 20, 21, 24, 31, 36, 40, 42, 45, 47, 50, 51, 52, 53, 55, 56, 62, 65, 66, 78, 79, 84, 110, 119, 120, 121, 125, 202)
SELECT *
FROM Account
WHERE
NOT (
AccountWarningCode1 IN bad_warnings
OR AccountWarningCode2 IN bad_warnings
OR AccountWarningCode3 IN bad_warnings
...
)
Is this possible in T-SQL?

Your second version is actually close. You can use a common table expression:
WITH bad_warnings(code) AS(
SELECT * FROM(VALUES
('02'), ('05'), ('15'), ('20'), ('21'), ('24'),
('31'), ('36'), ('40'), ('42'), ('45'), ('47'),
('50'), ('51'), ('52'), ('53'), ('55'), ('56'),
('62'), ('65'), ('66'), ('78'), ('79'), ('84'),
('110'), ('119'), ('120'), ('121'), ('125'), ('202')
) a(b)
)
SELECT *
FROM Account
WHERE
NOT (
AccountWarningCode1 IN (SELECT code FROM bad_warnings)
OR AccountWarningCode2 IN (SELECT code FROM bad_warnings)
OR AccountWarningCode3 IN (SELECT code FROM bad_warnings)
)
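
The same pattern can be tried out end to end in a small sketch. This one uses Python's built-in sqlite3 module as a stand-in for SQL Server, which is an assumption made only so the example runs anywhere: SQLite accepts VALUES directly as the CTE body, whereas T-SQL needs the SELECT * FROM (VALUES ...) a(b) wrapper shown above. The table contents and the shortened code list are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical, cut-down Account table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Account (
    Id INTEGER PRIMARY KEY,
    AccountWarningCode1 INTEGER,
    AccountWarningCode2 INTEGER,
    AccountWarningCode3 INTEGER
);
INSERT INTO Account VALUES
    (1, 2, 10, 11),   -- code 2 is in the list
    (2, 1, 3, 4),     -- no listed code
    (3, 7, 8, 202),   -- code 202 is in the list
    (4, 9, 9, 9);     -- no listed code
""")

# The list of codes is written once, in the CTE, and reused three times.
query = """
WITH bad_warnings(code) AS (VALUES (2), (5), (15), (202))
SELECT Id FROM Account
WHERE NOT (
       AccountWarningCode1 IN (SELECT code FROM bad_warnings)
    OR AccountWarningCode2 IN (SELECT code FROM bad_warnings)
    OR AccountWarningCode3 IN (SELECT code FROM bad_warnings)
)
ORDER BY Id
"""
clean_ids = [row[0] for row in conn.execute(query)]
print(clean_ids)
```

Only the accounts with no listed warning code in any of the three columns come back.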

This is the way to define a derived table with your values as CTE.
WITH bad_warnings AS
(SELECT val FROM (VALUES(02),(05),(15),(20),(21),(24),(31),(36),(40),(42),(45),(47),(50),(51),(52),(53),(55),(56),(62),(65),(66),(78),(79),(84),(110),(119),(120),(121),(125),(202)) AS tbl(val)
)
SELECT *
FROM bad_warnings
You can use this like any table in your query.
Your check would be something like
WHERE SomeValue IN(SELECT val FROM bad_warnings)
With NOT IN you would negate the check.

Is this possible in T-SQL?
Yes, use either a table variable or a temporary table. Populate the table variable with the IN-list values and reuse it in as many places within your procedure as you want.
Example:
declare @inlist1 table(elem int);
insert into @inlist1
select 02
union
select 05
union
select 15
union
select 20
union
select 21
union
select 24
Now use it:
WHERE
NOT (
AccountWarningCode1 IN (select elem from @inlist1)
)
Or you can perform a JOIN instead:
FROM Account a
JOIN @inlist1 i ON a.AccountWarningCode1 = i.elem

You can do it like this:
with bad_warnings as
(select '02' as code
union
select '15'
etc
)
select * from account
where not
(AccountWarningCode1 IN (SELECT code FROM bad_warnings)
etc)

Related

how to do list_agg with a character limit of 1440 characters in Snowflake

I have a table as below. It has 1775 ids, and the id column is 10 characters long. I want to distribute the 1775 ids into multiple list_agg groups, each no longer than 1440 characters.
id          distributor_name
1234567890  Sample_name1
2345678901  Sample_name1
3456789012  Sample_name1
4567890123  Sample_name2
5678901234  Sample_name2
6789012345  Sample_name3
7890123456  Sample_name3
8901234567  Sample_name3
Required output is:
group  id_count  list_agg
1      120       1234567890,2345678901,3456789012...
2      122       7890123456,5678901234,8901234567...
Very much appreciate your help!
If your ID space is evenly distributed, you can use WIDTH_BUCKET:
with data1 as (
select
row_number() over (order by true)-1 as rn
from table(generator(rowcount=>100))
)
select
width_bucket(rn, 0, 100, 5) as group_id,
count(*) as c,
array_agg(rn) within group(order by rn) as bucket_values
from data1
group by 1
order by 1;
GROUP_ID  C   BUCKET_VALUES
1         20  [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ]
2         20  [ 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 ]
3         20  [ 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59 ]
4         20  [ 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79 ]
5         20  [ 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 ]
If your data is not evenly distributed, you can allocate row numbers to each row and then bucket on those instead.
You can also be data driven:
width_bucket(rn, (select min(rn) from data1), (select max(rn) from data1)+1, 5) as group_id,
It's easier done with array_agg, but if you must use listagg, here is a spin on Simeon's answer. The basic idea is to keep track of the running length of the numbers when stitched together, and also to account for the number of commas, so we don't go over the 1440-character limit.
create or replace temporary table t as
select row_number() over (order by true)-1 as id,
uniform(1000000000, 1999999999, random()) as num
from table(generator(rowcount=>1775));
with cte as
(select *, ceil((sum(len(num)) over (order by id) + count(num) over (order by id) -1)/1440) as group_id
from t)
select group_id,
count(num) as id_count,
listagg(num,',') as id_list,
len(id_list) as len_check
from cte
group by group_id
order by group_id;
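
For checking the grouping outside the database, the same length bookkeeping can be sketched in plain Python. This is a greedy variant rather than a translation of the query above, and the function name pack_ids and the generated sample ids are assumptions made for illustration: each id is appended to the current group unless doing so (plus one comma) would push the joined string past the limit.

```python
def pack_ids(ids, limit=1440, sep=","):
    """Greedily pack ids into sep-joined strings of at most `limit` characters.

    An id that is longer than `limit` on its own still gets a group to itself.
    """
    groups = []
    current = []       # ids in the group being built
    current_len = 0    # length of sep.join(current)
    for item in ids:
        text = str(item)
        # A separator is only needed when the group already has an id.
        extra = len(text) + (len(sep) if current else 0)
        if current and current_len + extra > limit:
            groups.append(sep.join(current))
            current, current_len = [text], len(text)
        else:
            current.append(text)
            current_len += extra
    if current:
        groups.append(sep.join(current))
    return groups


# 1775 made-up ids, each 10 characters long, as in the question.
sample_ids = [str(1000000000 + i) for i in range(1775)]
groups = pack_ids(sample_ids)
for number, group in enumerate(groups, start=1):
    print(number, group.count(",") + 1, len(group))
```

With fixed-width 10-character ids, a full group holds 131 ids, because 131 ids plus 130 commas is exactly 1440 characters.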

Null values at the end of rows after INSERT INTO

I am currently trying to INSERT a row of 144 columns INTO my SQL database.
The problem is that the last 10 values of the new row are NULL, while they are supposed to be float and int.
Here is an example of what I have in my DB after the INSERT INTO:
First column  Before last column  Last column
1             NULL                NULL
This is the SQL statement I am using:
INSERT INTO "historic_data2"
VALUES (28438, 163, 156, 1, 'FIST 2', 91, 81, 82, 84, 90, 6, '2 Pts Int M', 'Offensive', 0, '91_81_82_84_90', 86, 85, 0, 36, 62, 24, 0, 132, 86, 0, 83, 0, 0, 0, 0, 42, 77, 24, 0, 173, 107, 0, 204, 0, 0, 0, 0, 42, 77, 24, 0, 173, 107, 0, 204, 0, 0, 0, 81, 62, 34, 23, 19, 45, 32, 18, 9, 19, 0.5555555555555556, 0.5161290322580645, 0.5294117647058824, 0.391304347826087, 1.0, 82, 54, 34, 18, 28, 49, 27, 17, 8, 28, 0.5975609756097561, 0.5, 0.5, 0.4444444444444444, 1.0, 302, 233, 132, 89, 69, 168, 116, 69, 35, 69, 0.5562913907284768, 0.4978540772532189, 0.5227272727272727, 0.39325842696629215, 1.0, 214, 161, 84, 73, 53, 119, 79, 39, 36, 53, 0.5560747663551402, 0.4906832298136646, 0.4642857142857143, 0.4931506849315068, 1.0, 717, 544, 298, 233, 173, 416, 285, 175, 97, 173, 0.5801952580195258, 0.5238970588235294, 0.587248322147651, 0.41630901287553645, 1.0, 466, 315, 183, 128, 151, 357, 233, 138, 91, 151, 0.7660944206008584, 0.7396825396825397, 0.7540983606557377, 0.7109375,1.0,112)
I can't figure out how to solve this issue. My guess would be that there is a hard limit on how many columns you can insert at once, but I don't know how to get around that.
Thank you in advance for your help

I need to filter the column from the beginning of a sentence

In my code, I can filter a column by exact text, and it works without problems. However, I also need to filter another column by how its values begin.
The values in this column are:
A_2020.092222
A_2020.090787
B_2020.983898
B_2020.209308
So, I need to keep every row whose value starts with A_20 or B_20.
Thanks in advance
My code:
from bs4 import BeautifulSoup
import pandas as pd
import zipfile, urllib.request, shutil, time, csv, datetime, os, sys, os.path
#location
dt = datetime.datetime.now()
file_csv = "/home/Downloads/source.CSV"
file_csv_new = "/var/www/html/Data/Test.csv"
#open CSV
with open(file_csv, 'r', encoding='CP1251') as file:
    reader = csv.reader(file, delimiter=';')
    data = list(reader)
#list to dataframe
df = pd.DataFrame(data)
#filter UF
df = df.loc[df[9].isin(['PR','SC','RS'])]
#filter key
# A_ & B_
df = df.loc[df[35].isin(['A_20','B_20'])]
#print (df)
#Empty DataFrame
#Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
#Index: []
#[0 rows x 119 columns]
Give the following a try:
lst1 = ['A_2020.092222', 'A_2020.090787 ', 'B_2020.983898', 'B_2020.209308', 'C_2020.209308', 'D_2020.209308']
df = pd.DataFrame(lst1, columns =['Name'])
df.loc[df.Name.str.startswith(('A_20','B_20'))]
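
The reason the original isin filter returns an empty frame is that isin tests for exact equality with the listed values, so 'A_2020.092222' never equals 'A_20'. A minimal stdlib-only sketch of the difference follows; the rows and the column index are made up for illustration, and plain lists stand in for the DataFrame.

```python
# Hypothetical rows: a state code and the key column from the question.
rows = [
    ["PR", "A_2020.092222"],
    ["SC", "A_2020.090787"],
    ["RS", "B_2020.983898"],
    ["PR", "B_2020.209308"],
    ["SC", "C_2020.209308"],
]
prefixes = ("A_20", "B_20")

# Exact membership, which is what isin does: no value equals a bare prefix.
exact_matches = [row for row in rows if row[1] in prefixes]

# Prefix test: str.startswith accepts a tuple of candidate prefixes.
prefix_matches = [row for row in rows if row[1].startswith(prefixes)]

print(len(exact_matches), len(prefix_matches))
```

In pandas, the prefix test corresponds to the .str.startswith call in the answer above, applied to column 35 instead of Name.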

Performance decrease with function call

For the following function:
func CycleClock(c *ballclock.Clock) int {
	for i := 0; i < fiveMinutesPerDay; i++ {
		c.TickFive()
	}
	return 1 + CalculateBallCycle(append([]int{}, c.BallQueue...))
}
where c.BallQueue is defined as []int and CalculateBallCycle is defined as func CalculateBallCycle(s []int) int. I am having a huge performance decrease between the for loop and the return statement.
I wrote the following benchmarks to test. The first benchmarks the entire function, the second benchmarks the for loop, while the third benchmarks the CalculateBallCycle function:
func BenchmarkCycleClock(b *testing.B) {
	for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
		j := i
		b.Run("BallCount="+strconv.Itoa(i), func(b *testing.B) {
			for n := 0; n < b.N; n++ {
				c, _ := ballclock.NewClock(j)
				CycleClock(c)
			}
		})
	}
}

func BenchmarkCycle24(b *testing.B) {
	for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
		j := i
		b.Run("BallCount="+strconv.Itoa(i), func(b *testing.B) {
			for n := 0; n < b.N; n++ {
				c, _ := ballclock.NewClock(j)
				for k := 0; k < fiveMinutesPerDay; k++ {
					c.TickFive()
				}
			}
		})
	}
}

func BenchmarkCalculateBallCycle123(b *testing.B) {
m := []int{8, 62, 42, 87, 108, 35, 17, 6, 22, 75, 116, 112, 39, 119, 52, 60, 30, 88, 56, 36, 38, 26, 51, 31, 55, 120, 33, 99, 111, 24, 45, 21, 23, 34, 43, 41, 67, 65, 66, 85, 82, 89, 9, 25, 109, 47, 40, 0, 83, 46, 73, 13, 12, 63, 15, 90, 121, 2, 69, 53, 28, 72, 97, 3, 4, 94, 106, 61, 96, 18, 80, 74, 44, 84, 107, 98, 93, 103, 5, 91, 32, 76, 20, 68, 81, 95, 29, 27, 86, 104, 7, 64, 113, 78, 105, 58, 118, 117, 50, 70, 10, 101, 110, 19, 1, 115, 102, 71, 79, 57, 77, 122, 48, 114, 54, 37, 59, 49, 100, 11, 14, 92, 16}
for n := 0; n < b.N; n++ {
CalculateBallCycle(m)
}
}
Using 123 balls, this gives the following result:
BenchmarkCycleClock/BallCount=123-8 200 9254136 ns/op
BenchmarkCycle24/BallCount=123-8 200000 7610 ns/op
BenchmarkCalculateBallCycle123-8 3000000 456 ns/op
Looking at this, there is a huge disparity between the benchmarks. I would expect the first benchmark to take roughly 8000 ns/op, since that would be the sum of the parts.
Here is the github repository.
EDIT:
I discovered that the result from the benchmark and the result from the running program are wildly different. I took what @yazgazan found and modified the benchmark function in main.go to somewhat mimic BenchmarkCalculateBallCycle123 from main_test.go:
func Benchmark() {
for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
if i != 123 {
continue
}
start := time.Now()
t := CalculateBallCycle([]int{8, 62, 42, 87, 108, 35, 17, 6, 22, 75, 116, 112, 39, 119, 52, 60, 30, 88, 56, 36, 38, 26, 51, 31, 55, 120, 33, 99, 111, 24, 45, 21, 23, 34, 43, 41, 67, 65, 66, 85, 82, 89, 9, 25, 109, 47, 40, 0, 83, 46, 73, 13, 12, 63, 15, 90, 121, 2, 69, 53, 28, 72, 97, 3, 4, 94, 106, 61, 96, 18, 80, 74, 44, 84, 107, 98, 93, 103, 5, 91, 32, 76, 20, 68, 81, 95, 29, 27, 86, 104, 7, 64, 113, 78, 105, 58, 118, 117, 50, 70, 10, 101, 110, 19, 1, 115, 102, 71, 79, 57, 77, 122, 48, 114, 54, 37, 59, 49, 100, 11, 14, 92, 16})
duration := time.Since(start)
fmt.Printf("Ballclock with %v balls took %s;\n", i, duration)
}
}
This gave the output of:
Ballclock with 123 balls took 11.86748ms;
As you can see, the total time was 11.86 ms, all of which was spent in the CalculateBallCycle function. What would cause the benchmark to run in 456 ns/op while the running program runs in around 11867480 ns/op?
You wrote that CalculateBallCycle() modifies the slice by design.
I can't speak to the correctness of that approach, but it is why the benchmark time of BenchmarkCalculateBallCycle123 is so different.
On the first run it does the expected thing, but on subsequent runs it does something completely different, because you're passing different data as input.
Benchmark this modified code:
func BenchmarkCalculateBallCycle123v2(b *testing.B) {
m := []int{8, 62, 42, 87, 108, 35, 17, 6, 22, 75, 116, 112, 39, 119, 52, 60, 30, 88, 56, 36, 38, 26, 51, 31, 55, 120, 33, 99, 111, 24, 45, 21, 23, 34, 43, 41, 67, 65, 66, 85, 82, 89, 9, 25, 109, 47, 40, 0, 83, 46, 73, 13, 12, 63, 15, 90, 121, 2, 69, 53, 28, 72, 97, 3, 4, 94, 106, 61, 96, 18, 80, 74, 44, 84, 107, 98, 93, 103, 5, 91, 32, 76, 20, 68, 81, 95, 29, 27, 86, 104, 7, 64, 113, 78, 105, 58, 118, 117, 50, 70, 10, 101, 110, 19, 1, 115, 102, 71, 79, 57, 77, 122, 48, 114, 54, 37, 59, 49, 100, 11, 14, 92, 16}
for n := 0; n < b.N; n++ {
tmp := append([]int{}, m...)
CalculateBallCycle(tmp)
}
}
This works around the behavior by making a copy of m, so that CalculateBallCycle modifies a local copy.
The running time becomes more like the others:
BenchmarkCalculateBallCycle123-8 3000000 500 ns/op
BenchmarkCalculateBallCycle123v2-8 100 10483347 ns/op
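
The effect is not specific to Go. As a language-neutral illustration, here is a small Python sketch with a hypothetical in-place function (a bubble sort that reports how many passes it needed, standing in for CalculateBallCycle, whose real logic is not shown here). Counting passes instead of timing keeps the result deterministic.

```python
def sort_in_place(values):
    """Hypothetical stand-in: bubble-sorts `values` in place.

    Returns the number of passes over the list, as a proxy for work done.
    """
    passes = 0
    swapped = True
    while swapped:
        swapped = False
        passes += 1
        for i in range(len(values) - 1):
            if values[i] > values[i + 1]:
                values[i], values[i + 1] = values[i + 1], values[i]
                swapped = True
    return passes


data = [5, 4, 3, 2, 1]

# Reusing one list: only the first call sees the original input.
shared = list(data)
work_when_reused = [sort_in_place(shared) for _ in range(3)]

# Copying before each call: every call sees the original input.
work_when_copied = [sort_in_place(list(data)) for _ in range(3)]

print(work_when_reused, work_when_copied)
```

Averaged over many iterations, the reused input makes the function look far cheaper than it is on fresh data, which is the same shape as the 500 ns/op versus 10483347 ns/op gap above.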
In your CycleClock function, you are copying the c.BallQueue slice. You can improve performance significantly by using CalculateBallCycle(c.BallQueue) instead (assuming CalculateBallCycle doesn't modify the slice).
For example:
// m is assumed to be a package-level []int, such as the slice used in the question's benchmark.

func Sum(values []int) int {
	sum := 0
	for _, v := range values {
		sum += v
	}
	return sum
}

func BenchmarkNoCopy(b *testing.B) {
	for n := 0; n < b.N; n++ {
		Sum(m)
	}
}

func BenchmarkWithCopy(b *testing.B) {
	for n := 0; n < b.N; n++ {
		Sum(append([]int{}, m...))
	}
}

// BenchmarkNoCopy-4 20000000 73.5 ns/op
// BenchmarkWithCopy-4 5000000 306 ns/op
// PASS
There is a subtle bug in your tests.
Both BenchmarkCycleClock and BenchmarkCycle24 run the benchmark in a for loop, passing a closure to b.Run. Inside those closures you initialize the clocks using the loop variable i, like this: ballclock.NewClock(i).
The problem is that all instances of your anonymous function share the same variable. By the time the function is run by the test runner, the loop will be finished, and all of the clocks will be initialized using the same value: ballclock.MaxBalls.
You can fix this using a local variable:
for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
	i := i
	b.Run("BallCount="+strconv.Itoa(i), func(b *testing.B) {
		for n := 0; n < b.N; n++ {
			c, _ := ballclock.NewClock(i)
			CycleClock(c)
		}
	})
}
The line i := i stores a copy of the current value of i (different for each instance of your anonymous function).

What is the difference between subquery and values when passed to NOT IN on Postgresql?

In Rails4 app (versions: rails 4.2.3, postgresql 9.3.5), I have model classes like below
class Message < ActiveRecord::Base
  belongs_to :receiver, class_name: 'User'
  belongs_to :sender, class_name: 'User'
  validates :receiver, presence: true
  validates :sender, presence: true
end
class User < ActiveRecord::Base
  has_many :received_messages, class_name: 'Message', foreign_key: :receiver_id
  has_many :sent_messages, class_name: 'Message', foreign_key: :sender_id
end
I want to get the collection of users who have NOT received a message from a specific user, so I wrote these scopes:
class User < ActiveRecord::Base
...
scope :received_messages_from, -> (user) {
includes(:received_messages).
where('messages.sender_id': user.id).
references(:received_messages)
}
scope :not_received_messages_from, -> (user) {
includes(:received_messages).
where.not(id: received_messages_from(user).select(:id)).
references(:received_messages)
}
end
I have these rows in messages table:
message_00:
sender_user_id: 11
receiver_user_id: 12
message_01:
sender_user_id: 11
receiver_user_id: 12
message_02:
sender_user_id: 12
receiver_user_id: 11
message_11:
sender_user_id: 17
receiver_user_id: 11
message_12:
sender_user_id: 11
receiver_user_id: 17
message_13:
sender_user_id: 18
receiver_user_id: 12
message_14:
sender_user_id: 12
receiver_user_id: 18
message_15:
sender_user_id: 17
receiver_user_id: 12
message_16:
sender_user_id: 17
receiver_user_id: 13
message_17:
sender_user_id: 17
receiver_user_id: 14
So, User.received_messages_from(User.find(17)).pluck(:id) returns [11, 12, 13, 14], and the results of User.not_received_messages_from(User.find(17)).pluck(:id) shouldn't contain these ids.
But the not_received_messages_from scope doesn't work: it returns users who have received messages from the specified user. It generates SQL like this (in this example, the user's id is 17):
SELECT "users"."id" FROM "users"
LEFT OUTER JOIN "messages" ON "messages"."receiver_id" = "users"."id"
WHERE ("users"."id"
NOT IN (
SELECT "users"."id" FROM "users"
WHERE "messages"."sender_id" = 17))
User.not_received_messages_from(User.find(17)).pluck(:id) results:
[11, 12, 12, 12, 15, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 17, 18, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
So, I tried changing .select(:id) to .pluck(:id) in the where clause of the not_received_messages_from scope, and this works.
scope :not_received_messages_from, -> (user) {
includes(:received_messages).
where.not(id: received_messages_from(user).pluck(:id)).
references(:received_messages)
}
SQL:
SELECT "users"."id" FROM "users"
LEFT OUTER JOIN "messages" ON "messages"."receiver_id" = "users"."id"
WHERE ("users"."id" NOT IN (11, 12, 13, 14))
User.not_received_messages_from(User.find(17)).pluck(:id) results:
[15, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 17, 18, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
I think the only difference between the two SQL statements is whether a subquery or a static array of ids is passed to 'NOT IN'. Why do the results differ?
This is likely because your sub-select is not returning the expected response.
SELECT "users"."id" FROM "users"
LEFT OUTER JOIN "messages" ON "messages"."receiver_user_id" = "users"."id"
WHERE ("users"."id"
NOT IN (
SELECT "users"."id" FROM "users"
WHERE "messages"."sender_user_id" = 17))
It's been a while since I've looked at PostgreSQL joins, but I don't know what, if anything, that sub-select would produce. It's operating with a join, but... which one? There's no reference in the documentation that explains that.
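
One plausible reading, offered as a sketch rather than a confirmed diagnosis: the sub-select has no messages table of its own, so "messages"."sender_id" resolves to the row of the outer join, which makes it a correlated subquery. For an outer row whose message was sent by 17 it returns every user id, and for any other row it returns nothing, so NOT IN only drops the individual joined rows sent by 17. The example below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption; both follow the standard scoping rule for outer references), with a made-up, cut-down data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE messages (sender_id INTEGER, receiver_id INTEGER);
INSERT INTO users (id) VALUES (11), (12), (13), (14), (15);
INSERT INTO messages (sender_id, receiver_id) VALUES
    (11, 12), (12, 11),                      -- messages not sent by 17
    (17, 11), (17, 12), (17, 13), (17, 14);  -- messages sent by 17
""")

# Shape of the generated SQL: messages.sender_id refers to the OUTER join row.
correlated = [row[0] for row in conn.execute("""
SELECT users.id FROM users
LEFT OUTER JOIN messages ON messages.receiver_id = users.id
WHERE users.id NOT IN (
    SELECT users.id FROM users
    WHERE messages.sender_id = 17)
ORDER BY users.id
""")]

# A self-contained sub-select that reads the messages table itself.
uncorrelated = [row[0] for row in conn.execute("""
SELECT users.id FROM users
WHERE users.id NOT IN (
    SELECT receiver_id FROM messages
    WHERE sender_id = 17)
ORDER BY users.id
""")]

print(correlated)    # users 11 and 12 survive via their non-17 messages
print(uncorrelated)  # only the user who never received a message from 17
```

Under this reading, the .pluck(:id) version works because the inner query is executed separately, with its own join, before its ids are interpolated into the outer statement.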