Comparing 2 tables in Rails 5 - SQL

I have 2 tables, "existing_practices" and "latest_practices", which both contain a column "practice_id".
What I want to do is compare latest_practices with existing_practices to find which practices are in the latest_practices table that I don't have in my existing_practices table (in other words, I need to find the new practices).
Example:
existing_practices    latest_practices
------------------    ----------------
practice_id           practice_id
A123                  A123
B123                  B123
C123                  C123
                      D123
So given the two tables above, I would need to identify that "D123" is a new practice.
I have tried the following, but it doesn't seem to work:
existing_practices = ExistingPractice.select(:practice_id).all
latest_practices = LatestPractice.select(:practice_id).all
new_practices = latest_practices.to_a - existing_practices.to_a
I'm thinking the easiest way is to just write the raw SQL, but I want to do it the Rails way (if there is one).
Can anyone help?

pluck fetches the column values as an array:
new_practices = LatestPractice.pluck(:practice_id) - ExistingPractice.pluck(:practice_id)

You can use SQL directly for better performance.
new_practice_ids = ActiveRecord::Base.connection.execute(<<~SQL)
  SELECT DISTINCT latest_practices.practice_id
  FROM latest_practices
  LEFT JOIN existing_practices
    ON latest_practices.practice_id = existing_practices.practice_id
  WHERE existing_practices.practice_id IS NULL
SQL
This returns the practice_ids that don't exist in the existing_practices table (note that execute returns the adapter's result object, e.g. PG::Result, rather than a plain Ruby array).

If you have a lot of data, the pluck-and-subtract approach performs poorly, since both full ID lists are loaded into Ruby memory. Instead I suggest:
Okish:
LatestPractice.where.not(practice_id: ExistingPractice.select(:practice_id))
better:
LatestPractice.where.not("EXISTS (SELECT 1 FROM existing_practices WHERE latest_practices.practice_id = existing_practices.practice_id)")
much better:
LatestPractice.where('practice_id NOT IN (SELECT DISTINCT practice_id FROM existing_practices)')
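One caveat with the NOT IN form: if existing_practices.practice_id can ever be NULL, NOT IN returns no rows at all, so the NOT EXISTS form is the safer default. For reference, this is the anti-join SQL these scopes are aiming for; a minimal sketch using the table and column names from the question:
SELECT lp.practice_id
FROM latest_practices lp
WHERE NOT EXISTS (
    SELECT 1
    FROM existing_practices ep
    WHERE ep.practice_id = lp.practice_id
)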

Best way to summarize a huge list of CASE WHEN

Suppose I have a table with a string column and I want to aggregate this table by grouping different strings into a 'category'.
To work out which category to assign each string to, I have a list of possibilities that I could sum up as follows:
CASE WHEN string = 'aaa' THEN 'cat_aaa'
     WHEN string = 'bbb' THEN 'cat_bbb'
     [...]
     WHEN string LIKE '%abc%' THEN 'cat_abc'
END
Now, the list may be very long and may need updates, so I don't want to maintain an endless list of CASE WHENs. I'd like instead to have a table with the string used for the comparison and the corresponding category.
So let's suppose to have a first table with all the strings:
TABLE A
=======
string
--------
aaa
bbb
aaa
aaa
aaa
dabc
fabc
------
and another table
TABLE B
=======
string_comparison | category
------------------|---------
aaa               | cat_aaa
bbb               | cat_bbb
%abc%             | cat_abc
If they were all = conditions, I could have just joined the two tables on the strings. However, depending on the type of string_comparison, I may need to perform a LIKE comparison.
Do you have any fresh ideas on how to solve this situation? I wouldn't like to join the two tables on a LIKE basis because of performance. Is there the possibility of using regular expressions on the string to solve this?
I am using Redshift.
A LIKE without a wildcard is effectively the same as an =, and any reasonable optimizer should handle it properly, so I wouldn't overthink things; just try joining with a LIKE:
SELECT category, COUNT(*)
FROM a
JOIN b ON string LIKE string_comparison
GROUP BY category
If you're really concerned about the performance of the LIKE operator, you could check whether string_comparison contains a wildcard and short-circuit to a plain equality when it doesn't, but I doubt it would be any faster than just using LIKE directly:
SELECT category, COUNT(*)
FROM a
JOIN b ON ((POSITION('%' IN string_comparison) > 0 OR
            POSITION('_' IN string_comparison) > 0) AND
           string LIKE string_comparison) OR
          string = string_comparison
GROUP BY category
Note: POSITION is shown here as in PostgreSQL; since Redshift derives from PostgreSQL, it supports it as well. Other RDBMSs have functions with the same functionality, although their names may differ.
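For a quick sanity check, using the sample data from TABLE A and TABLE B in the question, the first query should return:
category | count
---------|------
cat_aaa  | 4
cat_bbb  | 1
cat_abc  | 2
('aaa' occurs four times, 'bbb' once, and both 'dabc' and 'fabc' match '%abc%'.)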

Querying array of text in postgres

I have an array type I want to store in Postgres. One of the major use cases I have is to see if any of the records has an array which has a string in it.
eg.
| A | ["NY", "Paris", "Milan"] |
| B | ["Paris", "NY"] |
| C | [] |
| D | ["Milan"] |
Does there exist a row with Paris in the array? Which rows have Milan in the array? and so on.
I have 2 options for how to store the column: I can either make it of type text[], or convert it to JSON as {"cities": ["NY", "Paris", "Milan"]} and store it in a JSONB field.
However, I am not sure what would allow the fastest querying for the use case I have. Is there any one obviously better way of doing this? Am I tying myself down in any way by choosing one over the other? If I choose one over the other then how can I query the DB?
As you seem to be storing simple lists of values, I would recommend using the Array datatype over JSON, which is a better fit for more complex cases (nested data structures, associative arrays, ...).
To check for a value at any position in the array, you can use the ANY() construct.
Here is a query that will return all records where the array stored in column cities contains 'Paris':
SELECT t.* FROM mytable t WHERE 'Paris' = ANY(t.cities);
This yields:
id cities
---------------------------
A ["NY","Paris","Milan"]
B ["Paris","NY"]
Demo on DB Fiddle
For more information:
Postgres Arrays Documentation
Postgres Arrays Tutorial
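For completeness, if you went with the JSONB layout from the question ({"cities": ["NY", "Paris", "Milan"]}), the equivalent check uses the containment operator @>; a minimal sketch, assuming the JSONB column is named data:
SELECT t.* FROM mytable t WHERE t.data->'cities' @> '["Paris"]'::jsonb;
If you need an index, note that a GIN index supports @> on both jsonb and arrays, so cities @> ARRAY['Paris'] is the index-friendly spelling of the array test.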
I've noticed JSONB is the better choice when you need a simple key-value store, for instance when you want to store arbitrary info on a row and you're not sure in advance what the columns (keys) would be:
info = {"a": "apple", "b": "ball"}
For use cases like yours, it would be better to design the db with simple tables, so you can use JOINs and indexes to your advantage.
You could restructure the tables like :
Location
id | name
----------
1 | Paris
2 | NY
3 | Milan
Other Table (with foreign key on location table)
user | location_id
--------------------
A | 1
A | 3
B | 2
Using this set of tables, it would be easy to query all users with location Paris using JOINs; a sketch follows.
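A minimal sketch of that join, assuming the tables are named locations and user_locations (the "Location" and "Other Table" of the example above):
-- "user" is quoted below because it is a reserved word in SQL
SELECT ul."user"
FROM user_locations ul
JOIN locations l ON l.id = ul.location_id
WHERE l.name = 'Paris';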

How do you 'join' multiple SQL data sets side by side (that don't link to each other)?

How would I go about joining results from multiple SQL queries so that they are side by side (but unrelated)?
The reason I am thinking of this is so that I can run 1 query in Google BigQuery and have it return 1 single table, which I can import into Excel to do some charts.
e.g. Query 1 looks at dataset TableA and returns:
**Metric:** Sales
**Value:** 3,402
And then Query 2 looks at dataset TableB and returns:
**Name:** John
**DOB:** 13 March
They would both use different tables and different filters, etc.
What would I do to make it look like:
Sales | John
------|---------
3,402 | 13 March
Or alternatively:
Sales | 3,402
John  | 13 March
Or is there a totally different way to do this?
I can see the use case for the above; I've used something similar to create a single table from multiple tables with different metrics to query in Data Studio, so that filters apply to all data in the dataset. However, in that case the data did share some dimensions that made it worthwhile.
If you are going to put those together with no relationship between the tables, I'd have 4 columns, with a Type column describing the data in that row to make filtering easier:
Type | Sales | Name | DOB
Use UNION ALL to put the rows together, so you have something like:
"Sales" | 3402 | null | null
"Customer Details" | null | John | 13 March
However, like the others said, make sure you have a good reason to do that otherwise you're just creating a bigger table to query for no reason.
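A sketch of that UNION ALL in BigQuery standard SQL; the table and column names are assumptions based on the examples above:
SELECT 'Sales' AS type, sales, CAST(NULL AS STRING) AS name, CAST(NULL AS STRING) AS dob
FROM TableA
UNION ALL
SELECT 'Customer Details', CAST(NULL AS INT64), name, dob
FROM TableB
The explicit casts keep the column types consistent across the two branches, which UNION ALL requires.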

How can I perform a similar UPDATE-WHERE statement on SparkSQL2.0?

How can I implement an SQL query like this, in SparkSQL 2.0 using DataFrames and Scala language? I've read a lot of posts but none of them seems to achieve what I need, or if you can point me one, would do. Here's the problem:
UPDATE table SET value = 100 WHERE id = 2
UPDATE table SET value = 70 WHERE id = 4
.....
Suppose that you have a table table with two columns like this:
id | value
--- | ---
1 | 1
2 | null
3 | 3
4 | null
5 | 5
Is there a way to implement the above query using map, match cases, UDFs or if-else statements? The values that I need to store in the value field are not sequential, so I have specific values to put there. I'm also aware that it is not possible to modify immutable data when dealing with DataFrames. I have no code to share because I can't get it to work, nor reproduce any errors.
Yes you can; it's very simple. You can use when and otherwise:
import org.apache.spark.sql.functions.{when, lit}
import spark.implicits._  // for the $"col" syntax (already in scope in spark-shell)

val pf = df.select($"id",
  when($"id" === 2, lit(100))
    .when($"id" === 4, lit(70))
    .otherwise($"value")
    .as("value"))
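Equivalently, since the question starts from SQL, you can register the DataFrame as a temporary view with df.createOrReplaceTempView("t") and express the same logic as a single CASE through spark.sql (the view name t is just a placeholder):
SELECT id,
       CASE WHEN id = 2 THEN 100
            WHEN id = 4 THEN 70
            ELSE value
       END AS value
FROM t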

Getting distinct rows based on a certain field from a database in Django

I need to construct a query in Django, and I'm wondering if this is somehow possible (it may be really obvious but I'm missing it...).
I have a normal query Model.objects.filter(x=True)[:5] which can return results like this:
FirstName LastName Country
Bob Jones UK
Bill Thompson UK
David Smith USA
I need to only grab rows which are distinct based on the Country field, something like Model.objects.filter(x=True).distinct('Country')[:5] would be ideal but that's not possible with Django.
The rows I want the query to grab ultimately are:
FirstName LastName Country
Bob Jones UK
David Smith USA
I also need the query to use the same ordering as set in the model's Meta class (i.e. I can't override the ordering in any way).
How would I go about doing this?
Thanks a lot.
I haven't tested this, but it seems to me a dict should do the job, although ordering could be off then:
d = {}
for x in Model.objects.all():
    d[x.country] = x
records_with_distinct_countries = d.values()
countries = [f.country for f in Model.objects.all()]
for c in countries:
    try:
        print Model.objects.filter(country=c)
    except Model.DoesNotExist:
        pass
I think that #skrobul is on the right track, but a little bit off.
I don't think you'll be able to do this with a single query, because the distinct() method adds the SELECT DISTINCT modifier to the query, which acts on the entire row. You'll likely have to create a list of countries and then return limited QuerySets based on iterating that list.
Something like this:
maxrows = 5
countries = set([x.country for x in Model.objects.all()])
rows = []
count = 0
for c in countries:
    if count >= maxrows:
        break
    try:
        rows.append(Model.objects.filter(country=c)[0])
    except IndexError:  # filter() never raises DoesNotExist; an empty queryset raises IndexError on [0]
        pass
    count += 1
This is a very generic example, but it gives the intended result.
Can you post the raw SQL that returns what you want from the source database? I have a hunch that the actual problem here is the query/data structure...
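For reference, if the backing database is PostgreSQL, the result described in the question is exactly what DISTINCT ON produces; a minimal sketch, with app_model standing in for the real table and the column names taken from the example:
SELECT DISTINCT ON (country) first_name, last_name, country
FROM app_model
ORDER BY country, id;
Note that DISTINCT ON requires the ORDER BY to lead with the DISTINCT ON expression, so the model's Meta ordering would have to be re-applied in an outer query.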