Query for many-to-many association - sql

There are 3 models in my project: Users, Games, GameVisits
class User < ActiveRecord::Base
has_many :game_visits
has_many :games, through: :game_visits
end
class Game < ActiveRecord::Base
has_many :game_visits
has_many :users, through: :game_visits
end
class GameVisit < ActiveRecord::Base
self.table_name = :users_games
enum status: [:visited, :not_visited, :unknown]
belongs_to :user
belongs_to :game
end
My frontend is written in angularjs, so I want a function that for a list of games returns array of hashes each containing name of user and status or nil. Data is presented as table, cols - players, rows - games, cell - presence of player on game.
This is example of such function:
def team_json(game_ids = nil)
game_ids ||= Game.pluck(:id)
games = Game.find(game_ids)
users = self.users
result = []
games.each do |game|
record = {name: game.name, date: game.date }
users_array = []
users.each do |user|
users_array << {
name: user.name,
status: user.game_visits.find_by_game_id(game.id)
}
end
record[:users] = users_array
result << record
end
result
end
Output:
[{:name=>"game 0", :date=>Fri, 19 Dec 2014 11:16:20 UTC +00:00, :users=>[{:name=>"Bob", :status=>nil}]},
{:name=>"game 1", :date=>Sat, 20 Dec 2014 11:16:20 UTC +00:00, :users=>[{:name=>"Bob", :status=>nil}]},
{:name=>"game 2", :date=>Fri, 19 Dec 2014 11:16:20 UTC +00:00, :users=>[{:name=>"Bob", :status=>nil}]}]
But this function produces tons of sql queries for GameVisits:
(1.0ms) SELECT "games"."id" FROM "games"
Game Load (1.3ms) SELECT "games".* FROM "games" WHERE "games"."id" IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101)
User Load (0.6ms) SELECT "users".* FROM "users" WHERE "users"."team_id" = $1 [["team_id", 1]]
GameVisit Load (0.6ms) SELECT "users_games".* FROM "users_games" WHERE "users_games"."user_id" = $1 AND "users_games"."game_id" = $2 LIMIT 1 [["user_id", 7], ["game_id", 1]]
GameVisit Load (0.6ms) SELECT "users_games".* FROM "users_games" WHERE "users_games"."user_id" = $1 AND "users_games"."game_id" = $2 LIMIT 1 [["user_id", 7], ["game_id", 2]]
GameVisit Load (0.5ms) SELECT "users_games".* FROM "users_games" WHERE "users_games"."user_id" = $1 AND "users_games"."game_id" = $2 LIMIT 1 [["user_id", 7], ["game_id", 3]]
..............
How can I optimise it?

Your user.game_visits.find_by_game_id(game.id) is what's firing all those queries. You can use eager-loading to get around this problem.
Change your assignment of users to the following:
users = self.users.includes(:game_visits)
Then change the value assignment of status in users_array to this:
status: user.game_visits.find{ |gv| gv.game_id == game.id }
The above line is using the enumerable's find, not ActiveRelation's. If you stick with find_by_game_id or some other find of AR, it'll still fire a query.
An additional optimization could be creating a hash for game_visits so that you can get around find altogether to do something like this:
user_game_visits_hash = \
user.game_visits.each_with_object({}) do |gv, hash|
hash[gv.game_id] = gv
end
# more code here
status: user_game_visits_hash[game.id]
And a minor point: In case your game_ids in the method's arguments are nil, the first two lines of your method will unnecessarily fire two queries, because you pluck all ids from games just to fetch all games. You can change that to this:
games = game_ids ? Game.find(game_ids) : Game.all

Related

How to merge classes in multiclass image segmentation

I am performing an image segmentation with a u-net model.
My mask has classes from 0-50.
I also have a text file dictionary with codes representing each class.
For example -
{1: '1234', 2:'5678', 3:'1245'} etc.
How do I combine when the 2 first string characters are the same so for example above key 1 and 3 are the same because they both start with "12".
How can I do this for all classes?
firstTwoCharDict = {}
for key, value in dictionary.items():
if key == 0:
value == value
firstTwoCharDict[key] = value
else:
value = value[:2]
firstTwoCharDict[key] = value
newDict = {}
for key, value in firstTwoCharDict.items():
if value not in newDict:
newDict[value] = [key]
else:
newDict[value].append(key)
This provides this
{'62': [1, 39],
'90': [2, 5, 9, 20, 32, 42, 47, 72, 88, 91, 95],
'97': [3, 49, 55],
'98': [4, 24, 34, 40, 53, 76, 81, 90, 96],
'31': [6, 17, 30, 48, 83],
'69': [7, 13, 15, 16, 27, 44, 51, 54, 56, 75],
'79': [8, 50],
'71': [10, 19, 22, 35, 61, 63, 65],
'99': [11, 12, 21, 46, 52, 69, 78, 84, 89],
'48': [14, 36, 74],
'60': [18],
'64': [23, 38, 66, 97]
```
Now i have an 2d array with integers, how do I replace them with they keys if the array values are equal to the values in the dict?

I need to filter the column from the beginning of a sentence

In my code, I can filter a column from exact texts, and it works without problems. However, it is necessary to filter another column with the beginning of a sentence.
The phrases in this column are:
A_2020.092222
A_2020.090787
B_2020.983898
B_2020.209308
So, I need to receive everything that starts with A_20 and B_20.
Thanks in advance
My code:
from bs4 import BeautifulSoup
import pandas as pd
import zipfile, urllib.request, shutil, time, csv, datetime, os, sys, os.path
#location
dt = datetime.datetime.now()
file_csv = "/home/Downloads/source.CSV"
file_csv_new = "/var/www/html/Data/Test.csv"
#open CSV
with open(file_csv, 'r', encoding='CP1251') as file:
reader = csv.reader(file, delimiter=';')
data = list(reader)
#list to dataframe
df = pd.DataFrame(data)
#filter UF
df = df.loc[df[9].isin(['PR','SC','RS'])]
#filter key
# A_ & B_
df = df.loc[df[35].isin(['A_20','B_20'])]
#print (df)
#Empty DataFrame
#Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
#Index: []
#[0 rows x 119 columns]```
Give the following a try:
lst1 = ['A_2020.092222', 'A_2020.090787 ', 'B_2020.983898', 'B_2020.209308', 'C_2020.209308', 'D_2020.209308']
df = pd.DataFrame(lst1, columns =['Name'])
df.loc[df.Name.str.startswith(('A_20','B_20'))]

Performance decrease with function call

For the following function:
func CycleClock(c *ballclock.Clock) int {
for i := 0; i < fiveMinutesPerDay; i++ {
c.TickFive()
}
return 1 + CalculateBallCycle(append([]int{}, c.BallQueue...))
}
where c.BallQueue is defined as []int and CalculateBallCycle is defined as func CalculateBallCycle(s []int) int. I am having a huge performance decrease between the for loop and the return statement.
I wrote the following benchmarks to test. The first benchmarks the entire function, the second benchmarks the for loop, while the third benchmarks the CalculateBallCycle function:
func BenchmarkCycleClock(b *testing.B) {
for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
j := i
b.Run("BallCount="+strconv.Itoa(i), func(b *testing.B) {
for n := 0; n < b.N; n++ {
c, _ := ballclock.NewClock(j)
CycleClock(c)
}
})
}
}
func BenchmarkCycle24(b *testing.B) {
for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
j := i
b.Run("BallCount="+strconv.Itoa(i), func(b *testing.B) {
for n := 0; n < b.N; n++ {
c, _ := ballclock.NewClock(j)
for k := 0; k < fiveMinutesPerDay; k++ {
c.TickFive()
}
}
})
}
}
func BenchmarkCalculateBallCycle123(b *testing.B) {
m := []int{8, 62, 42, 87, 108, 35, 17, 6, 22, 75, 116, 112, 39, 119, 52, 60, 30, 88, 56, 36, 38, 26, 51, 31, 55, 120, 33, 99, 111, 24, 45, 21, 23, 34, 43, 41, 67, 65, 66, 85, 82, 89, 9, 25, 109, 47, 40, 0, 83, 46, 73, 13, 12, 63, 15, 90, 121, 2, 69, 53, 28, 72, 97, 3, 4, 94, 106, 61, 96, 18, 80, 74, 44, 84, 107, 98, 93, 103, 5, 91, 32, 76, 20, 68, 81, 95, 29, 27, 86, 104, 7, 64, 113, 78, 105, 58, 118, 117, 50, 70, 10, 101, 110, 19, 1, 115, 102, 71, 79, 57, 77, 122, 48, 114, 54, 37, 59, 49, 100, 11, 14, 92, 16}
for n := 0; n < b.N; n++ {
CalculateBallCycle(m)
}
}
Using 123 balls, this gives the following result:
BenchmarkCycleClock/BallCount=123-8 200 9254136 ns/op
BenchmarkCycle24/BallCount=123-8 200000 7610 ns/op
BenchmarkCalculateBallCycle123-8 3000000 456 ns/op
Looking at this, there is a huge disparity between benchmarks. I would expect that the first benchmark would take roughly ~8000 ns/op since that would be the sum of the parts.
Here is the github repository.
EDIT:
I discovered that the result from the benchmark and the result from the running program are widely different. I took what #yazgazan found and modified the benchmark function in main.go mimic somewhat the BenchmarkCalculateBallCycle123 from main_test.go:
func Benchmark() {
for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
if i != 123 {
continue
}
start := time.Now()
t := CalculateBallCycle([]int{8, 62, 42, 87, 108, 35, 17, 6, 22, 75, 116, 112, 39, 119, 52, 60, 30, 88, 56, 36, 38, 26, 51, 31, 55, 120, 33, 99, 111, 24, 45, 21, 23, 34, 43, 41, 67, 65, 66, 85, 82, 89, 9, 25, 109, 47, 40, 0, 83, 46, 73, 13, 12, 63, 15, 90, 121, 2, 69, 53, 28, 72, 97, 3, 4, 94, 106, 61, 96, 18, 80, 74, 44, 84, 107, 98, 93, 103, 5, 91, 32, 76, 20, 68, 81, 95, 29, 27, 86, 104, 7, 64, 113, 78, 105, 58, 118, 117, 50, 70, 10, 101, 110, 19, 1, 115, 102, 71, 79, 57, 77, 122, 48, 114, 54, 37, 59, 49, 100, 11, 14, 92, 16})
duration := time.Since(start)
fmt.Printf("Ballclock with %v balls took %s;\n", i, duration)
}
}
This gave the output of:
Ballclock with 123 balls took 11.86748ms;
As you can see, the total time was 11.86 ms, all of which was spent in the CalculateBallCycle function. What would cause the benchmark to run in 456 ns/op while the running program runs in around 11867480 ms/op?
You wrote that CalcualteBallCycle() modifies the slice by design.
I can't speak to correctness of that approach, but it is why benchmark time of BenchmarkCalculateBallCycle123 is so different.
On first run it does the expected thing but on subsequent runs it does something completely different, because you're passing different data as input.
Benchmark this modified code:
func BenchmarkCalculateBallCycle123v2(b *testing.B) {
m := []int{8, 62, 42, 87, 108, 35, 17, 6, 22, 75, 116, 112, 39, 119, 52, 60, 30, 88, 56, 36, 38, 26, 51, 31, 55, 120, 33, 99, 111, 24, 45, 21, 23, 34, 43, 41, 67, 65, 66, 85, 82, 89, 9, 25, 109, 47, 40, 0, 83, 46, 73, 13, 12, 63, 15, 90, 121, 2, 69, 53, 28, 72, 97, 3, 4, 94, 106, 61, 96, 18, 80, 74, 44, 84, 107, 98, 93, 103, 5, 91, 32, 76, 20, 68, 81, 95, 29, 27, 86, 104, 7, 64, 113, 78, 105, 58, 118, 117, 50, 70, 10, 101, 110, 19, 1, 115, 102, 71, 79, 57, 77, 122, 48, 114, 54, 37, 59, 49, 100, 11, 14, 92, 16}
for n := 0; n < b.N; n++ {
tmp := append([]int{}, m...)
CalculateBallCycle(tmp)
}
}
This works-around this behavior by making a copy of m, so that CalculateBallCycle modifies a local copy.
The running time becomes more like the others:
BenchmarkCalculateBallCycle123-8 3000000 500 ns/op
BenchmarkCalculateBallCycle123v2-8 100 10483347 ns/op
In your CycleClock function, you are copying the c.BallQueue slice. You can improve performance significantly by using CalculateBallCycle(c.BallQueue) instead (assuming CalculateBallCycle doesn't modify the slice)
For example:
func Sum(values []int) int {
sum := 0
for _, v := range values {
sum += v
}
return sum
}
func BenchmarkNoCopy(b *testing.B) {
for n := 0; n < b.N; n++ {
Sum(m)
}
}
func BenchmarkWithCopy(b *testing.B) {
for n := 0; n < b.N; n++ {
Sum(append([]int{}, m...))
}
}
// BenchmarkNoCopy-4 20000000 73.5 ns/op
// BenchmarkWithCopy-4 5000000 306 ns/op
// PASS
There is a subtle bug in your tests.
Both methods BenchmarkCycleClock and BenchmarkCycle24 run the benchmark in a for loop, passing a closure to b.Run. Inside of those closures you initialize the clocks using the loop variable i like this:ballclock.NewClock(i).
The problem is, that all instances of your anonymous function share the same variable. And, by the time the function is run by the test runner, the loop will be finished, and all of the clocks will be initialized using the same value: ballclock.MaxBalls.
You can fix this using a local variable:
for i := ballclock.MinBalls; i <= ballclock.MaxBalls; i++ {
i := i
b.Run("BallCount="+strconv.Itoa(i), func(b *testing.B) {
for n := 0; n < b.N; n++ {
c, _ := ballclock.NewClock(i)
CycleClock(c)
}
})
}
The line i := i stores a copy of the current value of i (different for each instance of your anonymous function).

What is the difference between subquery and values when passed to NOT IN on Postgresql?

In Rails4 app (versions: rails 4.2.3, postgresql 9.3.5), I have model classes like below
class Message < ActiveRecord::Base
belongs_to :receiver, class_name: 'User'
belongs_to :sender, class_name: 'User'
validate :receiver, presence: true
validate :sender, presence: true
end
class User < ActiveRecord::Base
has_many :received_messages, class_name: 'Message', foreign_key: :receiver_id
has_many :sent_messages, class_name: 'Message', foreign_key: :sender_id
end
I want to get collection of users who are NOT received message from specific user, So I wrote these scopes:
class User < ActiveRecord::Base
...
scope :received_messages_from, -> (user) {
includes(:received_messages).
where('messages.sender_id': user.id).
references(:received_messages)
}
scope :not_received_messages_from, -> (user) {
includes(:received_messages).
where.not(id: received_messages_from(user).select(:id)).
references(:received_messages)
}
end
I have these rows in messages table:
message_00:
sender_user_id: 11
receiver_user_id: 12
message_01:
sender_user_id: 11
receiver_user_id: 12
message_02:
sender_user_id: 12
receiver_user_id: 11
message_11:
sender_user_id: 17
receiver_user_id: 11
message_12:
sender_user_id: 11
receiver_user_id: 17
message_13:
sender_user_id: 18
receiver_user_id: 12
message_14:
sender_user_id: 12
receiver_user_id: 18
message_15:
sender_user_id: 17
receiver_user_id: 12
message_16:
sender_user_id: 17
receiver_user_id: 13
message_17:
sender_user_id: 17
receiver_user_id: 14
So, User.received_messages_from(User.find(17)).pluck(:id) results: [11, 12, 13, 14], and User.not_received_messages_from(User.find(17)).pluck(:id) results sholdn't contain these ids.
But the not_received_messages_from scope dosen't work as it returning users who has received messages from specific user. This generates SQL like this (in this example, user's id is 17):
SELECT "users"."id" FROM "users"
LEFT OUTER JOIN "messages" ON "messages"."receiver_id" = "users"."id"
WHERE ("users"."id"
NOT IN (
SELECT "users"."id" FROM "users"
WHERE "messages"."sender_id" = 17))
User.not_received_messages_from(User.find(17)).pluck(:id) results:
[11, 12, 12, 12, 15, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 17, 18, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
So, I tried fixing .select(:id) to .pluck(:id) in where in not_received_messages_from scope and this works.
scope :not_received_messages_from, -> (user) {
includes(:received_messages).
where.not(id: received_messages_from(user).pluck(:id)).
references(:received_messages)
}
SQL:
SELECT "users"."id" FROM "users"
LEFT OUTER JOIN "messages" ON "messages"."receiver_id" = "users"."id"
WHERE ("users"."id" NOT IN (11, 12, 13, 14))
User.not_received_messages_from(User.find(17)).pluck(:id) results:
[15, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 17, 18, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
I think the defferece between two SQLs is only subquery or static ids array passed to 'NOT IN'. Why the results differ each other?
This is likely because your sub-select is not returning the expected response.
SELECT "users"."id" FROM "users"
LEFT OUTER JOIN "messages" ON "messages"."receiver_user_id" = "users"."id"
WHERE ("users"."id"
NOT IN (
SELECT "users"."id" FROM "users"
WHERE "messages"."sender_user_id" = 17))
It's been a while since I've looked at PostresQL joins, but I don't know what, if anything that sub-select would produce. It's operating with a join, but... which one? There's no reference in the documentation that explains that.

ActiveRecord where.not is not working/ weird behavior

User has_many Plans. I'm trying to find the IDs of all Users that do NOT have a Plan with status of "canceled". Would love to know what's explaining the behavior below.
For context, what should be returned is this:
User.select { |u| u.plans.select { |p| p.status != "canceled" }.count > 0 }.map(&:id)
# => [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 63]
Here's what I'm getting:
# statement 1
User.joins(:plans).where.not("plans.status" => "canceled").map(&:id)
# User Load (0.3ms) SELECT "users".* FROM "users" INNER JOIN "plans" ON "plans"."user_id" = "users"."id" WHERE ("plans"."status" != 'canceled')
# => [44]
# statement 2
User.joins(:plans).where("plans.status != ?", "canceled").map(&:id)
# User Load (0.3ms) SELECT "users".* FROM "users" INNER JOIN "plans" ON "plans"."user_id" = "users"."id" WHERE (plans.status != 'canceled')
# => [44]
# statement 3
User.joins(:plans).where("plans.status == ?", nil).map(&:id)
# User Load (0.3ms) SELECT "users".* FROM "users" INNER JOIN "plans" ON "plans"."user_id" = "users"."id" WHERE (plans.status == NULL)
# => []
# statement 4
User.joins(:plans).where("plans.status" => nil).map(&:id)
# User Load (0.7ms) SELECT "users".* FROM "users" INNER JOIN "plans" ON "plans"."user_id" = "users"."id" WHERE "plans"."status" IS NULL
# => [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 41, 44, 42, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 61, 60, 62, 63]
Questions:
Why are statement 3 and 4 not returning the same result?
Why are statement 1 and 2 (fortunately these are the same and are returning the same result) not returning the same result as statement 4? For context, I'd rather not search on nil, but on "canceled". And I can confirm that all plans have either a status of nil or "canceled"
UPDATE PER REQUEST
# plan with nil status
Plan.where(status: nil).first
# => <Plan id: 1, zipcode: "94282", selected_plan: 1, meal_type: "Chef's choice (mixed)", most_favorite: "", least_favorite: "", allergies: "", start_date: "2015-05-27 00:00:00", delivery_address: "d", delivery_instructions: "", phone1: "10", phone2: "222", phone3: "2222", agree_tos: true, user_id: 20, created_at: "2015-05-24 05:18:40", updated_at: "2015-06-21 04:54:31", stripe_subscription_id: nil, stripe_invoice_number: nil, cancel_reason: nil, cancel_reason_other: nil, nps: nil, nps_open: nil, cancel_open: nil, status: nil, referred_by_code: nil>
# plan with canceled status
Plan.where(status: "canceled").first
# => <Plan id: 20, zipcode: "12345", selected_plan: 5, meal_type: "Meat (with veggies)", most_favorite: "", least_favorite: "", allergies: "", start_date: "2015-06-08 00:00:00", delivery_address: "asdf", delivery_instructions: "", phone1: "333", phone2: "333", phone3: "3333", agree_tos: true, user_id: 38, created_at: "2015-06-01 21:39:54", updated_at: "2015-06-23 06:23:10", stripe_subscription_id: "sub_6OKkJoNx2u8ZXZ", stripe_invoice_number: 0, cancel_reason: nil, cancel_reason_other: "", nps: 6, nps_open: "", cancel_open: "", status: "canceled", referred_by_code: nil>
Answer to Question 1:
You missed the concept that in conditional sql, the arguments that follow the condition are replaced in place of ? not compared to. So, replace the double == with =.
User.joins(:plans).where("plans.status = ?", nil).map(&:id)