bigquery mapping tables using LIKE with duplicate rows - google-bigquery

Based on this question: bigquery update table using LIKE returns "UPDATE/MERGE must match at most one source row for each target row", I came up with a follow-up question that might end up with a completely different solution. That's why I posted a new question rather than a comment.
I am referring to the solution posted by #jon-armstrong. After testing it with different data, I found that it still does not work if there are duplicate rows in 'table1'.
Of course this problem comes from the 'GROUP BY' statement, and without it the UPDATE query does not work at all, resulting in the error message stated in my original question. It doesn't work either if I 'GROUP' every value, or group nothing as suggested here. I also tried using 'PARTITION BY', but I get a syntax error in BigQuery.
There can be duplicates in both my 'table1' (data) and my mapping table 'table2'. To be precise, this is my goal:
Table1 (data table)
textWithFoundItemInIt | foundItem
-------------------------------------------
hallo Adam |
Bert says hello |
Bert says byebye |
Want to find "Caesar"bdjehg |
Want to find "Caesar"bdjehg |
Want to find "Caesar"again |
Want to find "Caesar"again and also Bert | <== It is no problem, if only MAX()=Caesar or MIN()=Bert name is found.
Want to find "CaesarCaesar"again and again | <== This is no problem, just finding one Caesar is enough
Table2 (mapping table)
mappingItem
------------
Adam
Bert
Caesar
Bert
Caesar
Adam
Expected result
textWithFoundItemInIt | foundItem
--------------------------------------------
hallo Adam | Adam
Bert says hello | Bert
Bert says byebye | Bert
Want to find "Caesar"bdjehg | Caesar
Want to find "Caesar"bdjehg | Caesar
Want to find "Caesar"again | Caesar
Want to find "Caesar"again and also Bert | Caesar [or Bert]
Want to find "CaesarCaesar"again and again | Caesar
It doesn't matter which Adam from Table2 is found and inserted into Table1, since they are all the same. So it is even okay if the first Adam is overwritten by the second Adam, or if the query just stops searching once one Adam is found.
If I execute Jon's 'SELECT' query, it would result in:
textWithFoundItemInIt | foundItem
--------------------------------------------
hallo Adam | Adam
Bert says hello | Bert
Bert says byebye | Bert
Want to find "Caesar"bdjehg | Caesar
Want to find "Caesar"again | Caesar
Want to find "Caesar"again and also Bert | Caesar (if MAX() chosen)
Want to find "CaesarCaesar"again and again | Caesar
It (correctly) omits the second "Want to find "Caesar"bdjehg", but that's unfortunately not what I need.
If it is easier, it would also be okay if, in cases where two names are found in one row, the result looked like this:
textWithFoundItemInIt | foundItem
---------------------------------------------
hallo Adam and Bert | Adam, Bert
Bert says hello to Caesar | Bert, Caesar
or
textWithFoundItemInIt | foundItem1 | foundItem2
---------------------------------------------------------------
hallo Adam and Bert | Adam | Bert
Bert says hello to Caesar | Bert | Caesar
I hope this helps to understand my issue. In easy words: "It's just a mapping with multiple equal rows" ;-)
Thanks a lot :)

Consider below approach
select textWithFoundItemInIt,
regexp_extract(textWithFoundItemInIt, r'(?i)' || mappingItems) foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)
if applied to sample data in your question - output is

The SELECT statement from #Mikhail works very well. But when I put it into the UPDATE statement, I get my well-known error:
"UPDATE/MERGE must match at most one source row for each target row".
The problem occurs because the SELECT statement correctly returns duplicates. A simple solution to this issue is to use SELECT DISTINCT. With that, the error no longer occurs.
If more than one regexp should be found, then this query is helpful:
select textWithFoundItemInIt,
ARRAY_TO_STRING(regexp_extract_all(textWithFoundItemInIt, r'(?i)' || mappingItems), " --- ") AS foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)
I hope my logic using the DISTINCT statement is fail-proof and feasible in all cases. If anyone has any remarks, I would be very happy to get feedback.
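To make the mapping logic discussed above easier to check outside BigQuery, here is a minimal Python sketch of the same idea: build one case-insensitive alternation from the mapping items and apply it to every data row. The lists `table1` and `table2` and the helpers `first_match` and `all_matches` are hypothetical stand-ins, and the use of `re.escape` and of deduplicating the mapping items is an assumption added here, not part of the original queries.

```python
import re

# Hypothetical stand-ins for table1 (data) and table2 (mapping), duplicates included.
table1 = [
    'hallo Adam',
    'Bert says hello',
    'Want to find "Caesar"bdjehg',
    'Want to find "Caesar"bdjehg',
    'Want to find "Caesar"again and also Bert',
]
table2 = ['Adam', 'Bert', 'Caesar', 'Bert', 'Caesar', 'Adam']

# Deduplicate the mapping items (keeping first-seen order) and escape them,
# so a name containing regex metacharacters is matched literally (assumption).
unique_items = list(dict.fromkeys(table2))
pattern = re.compile('|'.join(re.escape(item) for item in unique_items), re.IGNORECASE)


def first_match(text):
    """Return the leftmost mapping item found in text, or None."""
    found = pattern.search(text)
    return found.group(0) if found else None


def all_matches(text, separator=' --- '):
    """Return every mapping item found in text, joined by separator."""
    return separator.join(pattern.findall(text))


# One output row per input row, so duplicate rows in table1 are preserved.
mapped = [(text, first_match(text)) for text in table1]
```

Because each data row is mapped independently, duplicate rows simply produce identical results, which is why collapsing them with DISTINCT before an UPDATE does not lose information in this scenario.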

Related

Does using HAVING in SQL here compute an aggregate function a second time?

I saw this query as an answer to another question on this site:
SELECT MAX(date), thread_id
FROM table
GROUP BY thread_id
HAVING MAX(date) < 1555
With this database sample:
+-----------------------------+
| id | date | thread_id |
+-----+---------+-------------+
| 1 | 1111 | 4 |
| 2 | 1333 | 4 |
| 3 | 1444 | 5 |
| 4 | 1666 | 5 |
+-----------------------------+
Am I correct in assuming MAX(date) is computed twice here?
If so, this would definitely reduce the efficiency of this query. Is it possible to refactor the query so that MAX(date) is only computed once, so that performance can be maximised?
A peek into the query pipeline/execution plan will answer your question. During the GROUP BY aggregation step, MySQL will compute the max date for each thread_id. Then, during the HAVING filter, the max date will already be available to use. So, I would expect MAX(date) to be computed only once.
Note that MySQL actually permits using aliases in the HAVING clause, so you could have written your query as:
SELECT thread_id, MAX(date) AS max_date
FROM yourTable
GROUP BY thread_id
HAVING max_date < 1555;
Absolutely NOT!
The letters SQL stand for Structured Query Language. The most important word in this name is QUERY, which means it is not a procedural language. In a procedural language, you write the exact commands that you want the computer to execute. In SQL, a "query" language, you do not write program code but only describe the desired answer; the SQL algebrizer/optimizer then has to work out the program that will be executed by the query processor (known as the "query execution plan").
SQL is translated into relational algebra, which is a simple mathematical formula, and is then simplified by the algebrizer, much like the work you did at school when the teacher gave you a complex equation to solve: factorization, substitution...
The SQL engine will do the same, factorizing MAX(date) so that it is computed only once!
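As a quick sanity check of the two query forms discussed above, the following sketch runs both against the sample data. It assumes SQLite (via Python's `sqlite3` module) rather than MySQL, and a hypothetical table name `posts`, since `table` is a reserved word. It only shows that the repeated-aggregate form and the alias form return the same rows; it does not measure how many times the engine evaluates `MAX(date)`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# "posts" is a hypothetical name for the table from the question.
conn.execute('CREATE TABLE posts (id INTEGER PRIMARY KEY, date INTEGER, thread_id INTEGER)')
conn.executemany(
    'INSERT INTO posts (id, date, thread_id) VALUES (?, ?, ?)',
    [(1, 1111, 4), (2, 1333, 4), (3, 1444, 5), (4, 1666, 5)],
)

# Form 1: the aggregate is written out again in the HAVING clause.
repeated = conn.execute(
    'SELECT thread_id, MAX(date) FROM posts '
    'GROUP BY thread_id HAVING MAX(date) < 1555'
).fetchall()

# Form 2: the HAVING clause refers to the aggregate through its alias.
aliased = conn.execute(
    'SELECT thread_id, MAX(date) AS max_date FROM posts '
    'GROUP BY thread_id HAVING max_date < 1555'
).fetchall()
```

Both `repeated` and `aliased` contain only thread 4, whose latest date (1333) is below 1555. Thread 5 is filtered out because its latest date is 1666.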

How to make this very complicated query from three connected models with Django ORM?

Good day, everyone. Hope you're doing well. I'm a Django newbie, trying to learn the basics of RESTful development while helping in a small app project. We currently want some of our models to update based on the data we submit to them, using the Django ORM and the fields that some of them share through OneToMany relationships. There's a really difficult query that I must write so that one of my fields updates automatically based on that filter. First, let me explain the models. These are not the real ones, but doppelgangers that should work the same:
First we have a Report model that is a teacher's report of a student:
class Report(models.Model):
    status = models.CharField(max_length=32, choices=Statuses.choices, default=Statuses.created,)
    student = models.ForeignKey(Student, on_delete=models.CASCADE,)
    headroom_teacher = models.ForeignKey(TeacherStaff, on_delete=models.CASCADE,)
    # Various dates
    results_date = models.DateTimeField(null=True, blank=True)
    report_created = models.DateTimeField(null=True, blank=True)
    # ... other fields that don't matter
Here we have two related models, which are student and headroom_teacher. It's not necessary to show their models, but their relationship with the next two models is really important. We also have an Exams model:
class Exams(models.Model):
    student = models.ForeignKey(Student, on_delete=models.CASCADE,)
    headroom_teacher = models.ForeignKey(TeacherStaff, on_delete=models.CASCADE,)
    # Various dates
    results_date = models.DateTimeField(null=True, blank=True)
    initial_exam_date = models.DateTimeField(null=True, blank=True)
    # ... other fields that don't matter
As you can see, the purpose of this app is akin to reporting on the performance of students after they complete some exams, and every Report is made by a teacher for a specific student on how he did on those exams. Finally, we have a model called StudentMood that aims to show how a student should be feeling depending on the status of their exams:
class StudentMood(models.Model):
    report = models.ForeignKey(Report, on_delete=models.CASCADE,)
    student_status = models.CharField(
        max_length=32, choices=Status.choices,
        default=None, null=True, blank=False)
    headroom_teacher = models.ForeignKey(TeacherStaff, on_delete=models.CASCADE,)
With these three models we arrive at the crux of the issue. One of our possible student_status options is called Anxious for results, which we believe a student will feel during the time when he has already taken an exam and is waiting for the results.
I want to automatically set my student_status to that, using a custom manager that takes into account the date the report was made or the day the data was entered. I believe this can be done with a query that takes initial_exam_date into account.
I already have my custom manager set up, and the only thing missing is this query. I have no choice but to do it with Django's ORM. However, I've come up with an approximate raw SQL query, though I'm not sure it's OK:
SELECT student_mood.id AS student_mood_id FROM
school_student_mood student_mood LEFT JOIN
school_reports report
ON student_mood.report_id = report.id AND student_mood.headroom_teacher_id = report.headroom_teacher_id
JOIN school_exams exams
ON report.headroom_teacher_id = exams.headroom_teacher_id
AND report.student_id = exams.student_id
AND exams.results_date > date where the student_mood or report data is entered, I guess
And that's what I've come to ask for help. Could someone shed some light into how to transfer this into a single query?
I don't have an environment set up and don't know exactly what you want out of the data, but this should be a good start.
Generally speaking, the Django ORM is not great for these types of queries, and trying to use select_related or prefetches results in really complex and inefficient queries.
I've found the best way to achieve these types of queries in Django is to break each piece of your puzzle down into a query that returns a "list" of ids that you can then use in a subquery.
Then you keep working down until you have your final output:
from django.db.models import Subquery
from django.utils import timezone
# Grab the students of any exams where the results_date is greater than a specific date.
student_exam_subquery = Exams.objects.filter(
    results_date__gt=timezone.now()
).values_list('student__id', flat=True)
# Grab all the student moods related to reports that relate to our "exams" where the student is anxious
student_mood_subquery = StudentMood.objects.filter(
    student_status='anxious',
    report__student__in=Subquery(student_exam_subquery)
).values_list('report__id', flat=True)
# Get the final list of students
Student.objects.values_list('id', flat=True).filter(
    report__id__in=Subquery(student_mood_subquery)
)
Now I doubt this will work out of the box, but it's really to give you an understanding of how you might go about solving this in a way that is readable to future devs and the most efficient (db wise).
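For readers who want to see what this "list of ids feeding a subquery" pattern looks like at the SQL level, here is a rough sketch using Python's `sqlite3` module instead of Django. The table names (`exams`, `report`, `student_mood`), the sample rows, and the `cutoff` date are all hypothetical and heavily simplified from the models in the question. This is not the SQL Django would actually generate.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical, simplified versions of the three tables from the question.
conn.executescript('''
    CREATE TABLE exams (id INTEGER PRIMARY KEY, student_id INTEGER, results_date TEXT);
    CREATE TABLE report (id INTEGER PRIMARY KEY, student_id INTEGER);
    CREATE TABLE student_mood (id INTEGER PRIMARY KEY, report_id INTEGER, student_status TEXT);

    INSERT INTO exams VALUES (1, 1, '2020-01-10'), (2, 2, '2020-06-10');
    INSERT INTO report VALUES (1, 1), (2, 2);
    INSERT INTO student_mood VALUES (1, 1, 'anxious'), (2, 2, 'anxious'), (3, 2, 'calm');
''')

cutoff = '2020-03-01'  # stands in for "now" in the ORM version

# Each nested SELECT returns a list of ids that the enclosing query filters on:
# exams after the cutoff -> their students -> those students' reports -> anxious moods.
mood_ids = [row[0] for row in conn.execute('''
    SELECT id FROM student_mood
    WHERE student_status = 'anxious'
      AND report_id IN (
          SELECT id FROM report
          WHERE student_id IN (
              SELECT student_id FROM exams WHERE results_date > ?
          )
      )
    ORDER BY id
''', (cutoff,))]
```

Only the mood with id 2 survives here: it is anxious, and it belongs to the report of the one student whose exam results are due after the cutoff.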
So, the issue I was running into, is that the school has exam cycles each period, and it was difficult to retrieve only the students' report for this cycle. Let's assume we have the following database:
+-----------+-----------+----------------+-------------------+-------------------+------------------+
| Student | Report ID | StudentMood ID | Exam Cycle Status | Initial Exam Date | Report created a |
+-----------+-----------+----------------+-------------------+-------------------+------------------+
| Student 1 | 1 | 1 | Done | 01/01/2020 | 02/01/2020 |
| Student 2 | 2 | 2 | Done | 01/01/2020 | 02/01/2020 |
| Student 1 | 3 | 3 | On Going | 02/06/2020 | 01/01/2020 |
| Student 2 | 4 | 4 | On Going | 02/06/2020 | 01/01/2020 |
+-----------+-----------+----------------+-------------------+-------------------+------------------+
And obviously, I wanted to limit my query to just this cycle, like this:
+-----------+-----------+----------------+-------------------+-------------------+------------------+
| Student | Report ID | StudentMood ID | Exam Cycle Status | Initial Exam Date | Report created a |
+-----------+-----------+----------------+-------------------+-------------------+------------------+
| Student 1 | 3 | 3 | On Going | 02/06/2020 | 01/01/2020 |
| Student 2 | 4 | 4 | On Going | 02/06/2020 | 01/01/2020 |
+-----------+-----------+----------------+-------------------+-------------------+------------------+
Now, your answer, trent, was really useful, but I'm still having issues retrieving the data in the shape shown above:
qs_exams = Exams.objects.filter(initial_exam_date__gt=now()).values_list('student__id', flat=True)
qs_report = Report.objects.filter(student__id__in=qs_exams).values_list('id', flat=True)
qs_mood = StudentMood.objects.select_related('report') \
.filter(report__id__in=qs_report).order_by('report__student_id', '-created').distinct()
But this query is still giving me all the StudentMoods throughout the school year. Sooooo, any ideas?

0 results in MS Access totals query (w. COUNT) after applying criteria

A query I am working on is showing a rather interesting behaviour that I couldn't debug so far.
This is the query before it gets buggy:
QryCount
SELECT EmpId, [Open/Close], Count([Open/Close]) AS Occurences, Attribute1, Market, Tier, Attribute2, MtSWeek
FROM qrySource
WHERE (Venue="NewYork") AND (Type="TypeA")
GROUP BY EmpId, [Open/Close], Attribute1, Market, Tier, Attribute2, MtSWeek;
The query gives precisely the results that I would expect it to:
#01542 | Open | 5 | Call | English | Tier1 | Complain | 01/01/2017
#01542 | Closed | 2 | Call | English | Tier2 | ProdInfo | 01/01/2017
#01542 | Open | 7 | Mail | English | Tier1 | ProdInfo | 08/01/2017
etc...
However, in doing so it provides more records than are needed at a subsequent step, thereby creating Cartesian products.
qrySource.[Open/Close] is a string type field with possible attributes (you guessed) "open", "Closed" and null and it is actually provided by a mapping table at the creation stage of qrySource (not sure, but maybe this helps).
Now, the error comes in when I try to limit qryCount only to records where Open/Close = "Open".
I tried both using WHERE and HAVING to no avail. The query would result in 0 records, which is not what I would like to see.
I thought that maybe it is because "open" is a reserved term, but even by changing it to "D_open" in the source table didn't fix the issue.
Also tried to filter for the desired records in a subsequent query
SELECT *
FROM QryCount
WHERE [Open/Close] ="D_Open"
But nothing, still 0 records found.
I suspect it might somehow be related to some inherent property of the COUNT function, but I'm not sure. Any help would be appreciated.
Everyone who participated, thank you, and apologies for providing you with insufficient/confusing information. I reckon the question could have been drafted better.
Anyhow, I found that the problem was apparently caused by the "/" in the Open/Close field name. As soon as I removed it from the field name in the original mapping table, the query performed as expected.

sqlite variable and unknown number of entries in column

I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients | salad_cost |
| apple | fruity | apple | cheap |
| unlikely | meaty | sausages, chorizo | expensive |
| normal | standard | leaves, cucumber, tomatoes | mid |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
Just enter a single, comma-separated string and separate it at run-time. Seems hacky, and I couldn't search by salad_ingredients!
Have another table for each salad, such as "apple_ingredients", which could have a varying number of rows, one for each ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
Have a separate salad_ingredients table, where each row is a salad_name, and there is an arbitrary number of ingredient fields, say 10, so you could have up to 10 ingredients. Again, this seems slightly hacky, as I don't like unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
Based on my experience, the best solution is a normalized set of tables:
table salads
id
salad_name
salad_type
salad_cost
.
table ingredients
id
name
and
table salad_ingredients
id
id_salad
id_ingredients
where id_salad is the corresponding id from salads
and id_ingredients is the corresponding id from ingredients.
Using proper joins you can get (select) and filter (where) all the values you need.
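The normalized layout described above can be sketched with Python's `sqlite3` module as follows. The column types, the `UNIQUE` constraint on ingredient names, and the helper functions `add_salad` and `salads_containing` are assumptions made for illustration. Only the table and column names come from the answer.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE salads (
        id INTEGER PRIMARY KEY,
        salad_name TEXT, salad_type TEXT, salad_cost TEXT
    );
    CREATE TABLE ingredients (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Link table: one row per (salad, ingredient) pair, so a salad can have
    -- any number of ingredients without schema changes.
    CREATE TABLE salad_ingredients (
        id INTEGER PRIMARY KEY,
        id_salad INTEGER REFERENCES salads(id),
        id_ingredients INTEGER REFERENCES ingredients(id)
    );
''')


def add_salad(name, salad_type, cost, ingredient_names):
    """Insert a salad and link it to each of its ingredients (hypothetical helper)."""
    salad_id = conn.execute(
        'INSERT INTO salads (salad_name, salad_type, salad_cost) VALUES (?, ?, ?)',
        (name, salad_type, cost),
    ).lastrowid
    for ingredient in ingredient_names:
        # Reuse an existing ingredient row if the name is already known.
        conn.execute('INSERT OR IGNORE INTO ingredients (name) VALUES (?)', (ingredient,))
        ingredient_id = conn.execute(
            'SELECT id FROM ingredients WHERE name = ?', (ingredient,)
        ).fetchone()[0]
        conn.execute(
            'INSERT INTO salad_ingredients (id_salad, id_ingredients) VALUES (?, ?)',
            (salad_id, ingredient_id),
        )


def salads_containing(ingredient):
    """Return the names of all salads that use the given ingredient."""
    rows = conn.execute('''
        SELECT s.salad_name
        FROM salads s
        JOIN salad_ingredients si ON si.id_salad = s.id
        JOIN ingredients i ON i.id = si.id_ingredients
        WHERE i.name = ?
        ORDER BY s.salad_name
    ''', (ingredient,)).fetchall()
    return [row[0] for row in rows]


add_salad('apple', 'fruity', 'cheap', ['apple'])
add_salad('unlikely', 'meaty', 'expensive', ['sausages', 'chorizo'])
add_salad('normal', 'standard', 'mid', ['leaves', 'cucumber', 'tomatoes'])
```

With this layout, searching by ingredient is a plain join, e.g. `salads_containing('chorizo')`, and a salad with twenty ingredients just means twenty rows in `salad_ingredients`.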

How to use string_agg in SQL in Excel Query

I know that in SQL I can use the 'String_agg(Test1, ',')' function for grouping rows and concatenate values in a selected field ('Test1' in this case).
For Example:
I have a query that the result without using String_agg on 'Buyer' field is:
**Key** | **Buyer** | **MP**
1 | Josh | Gregory
1 | Bred | Gregory
2 | John | Ethan
The expected results when using String_agg is:
**Key** | **Buyer** | **MP**
1 | Josh, Bred | Gregory
2 | John | Ethan
But the problem is that I'm trying to execute it in an SQL query that retrieves data into an Excel file from another Excel file, and it fails with an error suggesting that the Excel query doesn't recognize the String_agg function.
The query is:
SELECT `Sheet1$`.Key, string_agg(`Sheet1$`.Buyer, `, `) AS `Buyer`, `Sheet1$`.MP
FROM `C:\Input\Data.xls`.`Sheet1$` `Sheet1$`
GROUP BY 2
ORDER BY `Sheet1$`.Key
Screenshot:
Query screenshot
Error:
Error Screenshot
Can someone help me and tell me how I should correct my query to make it work?
Thank you!
Problem: Excel is not a database.
You are trying to use advanced query functionality in a spreadsheet package. This is only sometimes supported in some versions of Excel, uses a lot of processor power, causes serious issues as soon as a user moves anything on the sheet or the file itself, and is not really what Excel was designed to do.
Solution: Use a database.
Have a bit of a look at the excel 'concatenate' function.
I believe you can use it as CONCAT() also.
Also see this SO question: Concatenation in SQL select query on Excel sheet
Hope this helps.
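If moving the data into a database is an option, as suggested above, the grouping itself is straightforward. The sketch below assumes the sheet's rows have already been loaded into an SQLite table (hypothetically named `sheet1`) and uses SQLite's `group_concat`, which plays the role of `string_agg` there. It does not address querying Excel directly.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# "sheet1" is a hypothetical table holding the rows from the Excel sheet.
# "Key" is quoted because KEY is an SQL keyword.
conn.execute('CREATE TABLE sheet1 ("Key" INTEGER, Buyer TEXT, MP TEXT)')
conn.executemany(
    'INSERT INTO sheet1 VALUES (?, ?, ?)',
    [(1, 'Josh', 'Gregory'), (1, 'Bred', 'Gregory'), (2, 'John', 'Ethan')],
)

# Group by the non-aggregated columns and concatenate the buyers per group.
rows = conn.execute('''
    SELECT "Key", group_concat(Buyer, ', ') AS Buyer, MP
    FROM sheet1
    GROUP BY "Key", MP
    ORDER BY "Key"
''').fetchall()
```

Note that the query groups by the non-aggregated columns (`Key` and `MP`) rather than by the aggregated one. SQLite does not guarantee the order of the concatenated names within a group.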