Optimizing PostgreSQL search queries on big texts ('like', full text search, ...)

We have a software solution that is used by more than 200 customers. We recently switched to PostgreSQL because our former database was too slow at handling the search queries our customers use.
Our database looks like this:
TABLE A
1. ID
(+ some other fields which aren't important here)
TABLE B
This table is used to store 'data' on the items in TABLE A. This is different for every customer. For example, 'TYPE' can be 'CLIENTNAME' and the value 'AZERTY'. One record in TABLE A can have any number of records in TABLE B. Typically, 1 record in TABLE A has between 5 and 10 records in TABLE B.
1. ID TABLE A
2. TYPE
3. VALUE
TABLE C
1. TABLE A ID
2. VERSIONNR
3. DESCRIPTION
This file has the different versions of the records in TABLE A. Each of these versions has an extended description, which can range from 0 characters to any length.
Our problem: our customers are used to 'Google-like' searching. For example: they type 'AZERTY' and we show all the records from TABLE A where, for the ID of TABLE A, either:
'AZERTY' is in the description of the most recent version in TABLE C
'AZERTY' is in one of the values in TABLE B
Additional problem: this search is a 'contains'. If they search for 'ZER', they should also find the records containing 'AZERTY'. Multiple arguments are combined with 'AND': if they search for 'ZER 123', we need to show all records where the description matches 'ZER' and '123' or the values match 'ZER' and '123'.
What we have done so far:
There is an option users can check or uncheck to choose whether to search the description. We mostly advise them to search only the values and to include the description only when needed.
We run several search threads against the database for one search query, because searching all documents at once would take too much time.
Some time ago, on our former slow database engine, a colleague of mine made 'search tables'. Basically, this is a table that contains all values for a TABLE A ID, so no join is needed in the SQL query when searching. It looks like this:
TABLE D
TABLE A ID
VALUES (all values from TABLE B for this TABLE A ID, separated by a ' ')
DESCRIPTION (the description of the most recent version for this TABLE A ID)
Example record:
- 1
- ZER 123 CLIENT NAME NUMBER 7856 jsdfjklf 4556423
- DESCRIPTION CAN BE VERY LONG.
If a customer searches for 'ZER 123' this becomes:
"select TABLE_A_ID from TABLE_D where values like '%ZER%' and values like '%123%'"
Important:
Some of our customers have a lot of records in TABLE A (more than 5,000,000), which means there are a lot of records in TABLE B (roughly 50,000,000). Most of our customers have between 300,000 and 500,000 records in TABLE A.
My questions:
Is there a better / faster way to search through all the values than that search table? Without the search table I would have to make a join for every ' ' in the customer's search argument, which would be too slow (I think?) if they have a lot of records in TABLE A. For example:
select TABLE_A.ID from TABLE_A
INNER JOIN TABLE_B Sub1 ON TABLE_A.ID = Sub1.TABLE_A_ID and Sub1.VALUE like '%ZER%'
INNER JOIN TABLE_B Sub2 on TABLE_A.ID = Sub2.TABLE_A_ID and Sub2.VALUE like '%123%'
I have taken a look at full text search in PostgreSQL. I don't think I can use it, since it can't be used for a 'like' (= 'contains') search?
Is there any index I can use on the values (FILE B or search file) and description (FILE C or search file) to make the searches faster? I've read up on it and I don't think there is one, because indexes aren't used when searching with "like '%ZER%'"?
I hope I've explained this clearly.
Thanks in advance!

Your terminology is confusing, but I assume you mean "tables" when you write "files".
You cannot reasonably search in several tables with a single query, but you can search in several columns of a single table at the same time.
Based on your description, I would say that you need a trigram index on the concatenation of the relevant string columns in the table.
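To illustrate why a trigram index can serve a 'contains' search at all, here is a minimal Python sketch of the idea. It is not PostgreSQL's implementation: the helper names (trigrams, build_index, search) and the sample rows are made up for illustration, and the matching is case-insensitive, which corresponds to ILIKE rather than LIKE. In PostgreSQL itself the feature comes from the pg_trgm extension, where a GIN or GiST index built with the gin_trgm_ops or gist_trgm_ops operator class can be used for LIKE and ILIKE patterns with a leading wildcard.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of text (lower-cased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(rows):
    """Map each trigram to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for gram in trigrams(text):
            index[gram].add(row_id)
    return index


def search(rows, index, terms):
    """Return sorted ids of rows containing every term (AND of 'contains')."""
    result = set(rows)
    for term in terms:
        grams = trigrams(term)
        if grams:
            # A matching row must contain every trigram of the term.
            candidates = set.intersection(*(index.get(g, set()) for g in grams))
        else:
            # Terms shorter than 3 characters yield no trigrams: scan all rows.
            candidates = set(rows)
        # Recheck the candidates, because trigram hits can be false positives.
        result &= {i for i in candidates if term.lower() in rows[i].lower()}
    return sorted(result)


# Hypothetical search-table rows: TABLE A ID -> concatenated values.
rows = {
    1: "ZER 123 CLIENT NAME NUMBER 7856",
    2: "AZERTY 999",
    3: "QWERTY 123",
}
index = build_index(rows)
print(search(rows, index, ["ZER", "123"]))
print(search(rows, index, ["zer"]))
```

The point of the sketch is that the index narrows the search to a small candidate set before any string comparison happens, so the expensive 'contains' check only runs on rows that could match. Very short search terms give the index little or nothing to work with, which is worth keeping in mind for one- or two-character searches.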

Related

How to get the differences between two - kind of - duplicated tables (sql)

Prolog:
I have two tables in two different databases; one is an updated version of the other. For example, imagine that one year ago I duplicated table 1 into the new database (say, as table 2), and since then I have been working on table 2 without ever updating table 1.
I would like to compare the two tables to get the differences that have accumulated over this period (the tables have preserved their structure, so the comparison is meaningful).
My approach was to create a third table, copy both table 1 and table 2 into it, and then count the number of repetitions of every entry.
In my opinion, this, together with a new attribute that specifies for every entry the table it came from, would do the job.
Problem:
When copying the two tables into the third table, I get the (obvious) error about duplicate key values in a unique or primary key constraint.
How could I bypass the error, or how could I do the same job better? Any idea is appreciated.

Something like this should do what you want if A and B have the same structure; otherwise, just select and rename the columns you want to compare:
SELECT
*
FROM
B
WHERE NOT EXISTS (SELECT * FROM A WHERE A.col = B.col and ....)
If NOT EXISTS doesn't work in your DBMS, you could also use a left outer join comparing the rows' column values and keep only the rows that found no match:
SELECT
A.*
from
A left outer join B
on A.col = B.col and ....
where B.col is null
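As a runnable sketch of both approaches, the following uses Python's sqlite3 module as a stand-in DBMS with two tiny made-up tables (the id and name columns and the sample rows are assumptions, not the asker's schema). Note that the NOT EXISTS subquery has to be correlated with the outer row, and the outer join needs an IS NULL filter, otherwise neither query isolates the differing rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE B (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO A VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    -- B is the updated copy: row 2 changed, row 3 deleted, row 4 added.
    INSERT INTO B VALUES (1, 'alpha'), (2, 'BETA'), (4, 'delta');
""")

# Rows of B that have no identical row in A (new or changed rows).
only_in_b = conn.execute("""
    SELECT * FROM B
    WHERE NOT EXISTS (
        SELECT 1 FROM A WHERE A.id = B.id AND A.name = B.name
    )
    ORDER BY id
""").fetchall()

# Rows of A that have no identical row in B, using the outer-join form.
only_in_a = conn.execute("""
    SELECT A.* FROM A
    LEFT OUTER JOIN B ON A.id = B.id AND A.name = B.name
    WHERE B.id IS NULL
    ORDER BY A.id
""").fetchall()

print(only_in_b)
print(only_in_a)
```

Running each query in both directions gives the full picture: rows only in the old table were deleted or changed, rows only in the new table were added or changed. No third table is needed, so the duplicate-key error never comes up.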

MS Access - Counting Occurrences of a word in multiple columns

I have a database with a couple tables that tracks personnel errors that require rework by another person. Basically, a person on the job could rework up to 10 different work packages by other people throughout their shift. To make it easy, I just have columns in the table for rework_1/original_worker_1/rework_comment_1 (repeated up to 10) and the person who had to rework it. All of my worker's names are in a separate table so I can add people and my forms update dynamically with their names. What I want to do is this:
Pull a person from my worker's name table.
Search for all occurrences of their name in another table in column original_worker_X (where X is 1 - 10).
Output the values: Workers Name / How Many Times I found it in the original_worker_X columns.
From here I would need to make a bar graph so that each person's name had a bar with how many times someone had to rework something they did originally.
If I could do this with PHP and MySQL I would be in the money, because I could brute force something with some PHP variables, queries, and loops, but I am an Access novice at best! I appreciate any help you wizards can provide.
Table 1:
Table 2:
Expected Output Numbers:

I suggest you do the following.
Create a new table, let's say table 3, with three fields:
A. ID, primary key, autonumber
B. original_worker, text field
C. Person_doing_rework, text field
You will need ten insert statements, one for each of original worker 1-10, each inserting the original worker together with the person doing the rework. This gives you a normalised table.
The current design of your table is a bit crude, and a select statement that groups by ten columns is not practical.
Below are samples of the insert statements
INSERT INTO Table3 (original_worker, Person_doing_rework)
SELECT original_worker1, Person_doing_rework
FROM table2 WHERE original_worker1 IS NOT NULL;
INSERT INTO Table3 (original_worker, Person_doing_rework)
SELECT original_worker2, Person_doing_rework
FROM table2 WHERE original_worker2 IS NOT NULL;
Replicate this for original_worker3 to original_worker10.
Third step
You need a delete statement that deletes everything from table 3. This ensures that the records in table 3 are not duplicated on each run, since there is no primary key/foreign key relationship between table 2 and table 3.
Fourth step
Place all the queries into a macro in the following order
A. Delete query to run first
B. Insert queries to run next
Fifth step
Add a MsgBox to the macro that runs last, to inform you that all the other macro steps, i.e. A and B above, have run successfully.
Sixth step
You can now write a select statement on table 3 that counts the number of times an original worker's work was reworked, because table 3 now has two main fields: original_worker and Person_doing_rework.
So any time you want to find out how many times someone's work has been reworked, you just click the macro button. This runs all the queries and puts the values you need into table 3, after which you can view the details via the query in step 6:
SELECT original_worker, Count(Person_doing_rework) FROM table3 GROUP BY original_worker;
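The delete, insert and count steps can be sketched end to end as follows. This is only an illustration under assumptions: it uses Python's sqlite3 module instead of Access, cuts the wide table down to two original_worker columns instead of ten, and the sample names and the helper functions rebuild_table3 and rework_counts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down stand-in for the wide table: 2 worker columns instead of 10.
    CREATE TABLE table2 (
        Person_doing_rework TEXT,
        original_worker1 TEXT,
        original_worker2 TEXT
    );
    INSERT INTO table2 VALUES
        ('Ann', 'Bob', 'Cid'),
        ('Dee', 'Bob', NULL),
        ('Ann', 'Cid', 'Bob');
    CREATE TABLE table3 (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        original_worker TEXT,
        Person_doing_rework TEXT
    );
""")

WORKER_COLUMNS = ["original_worker1", "original_worker2"]


def rebuild_table3(conn):
    """Empty table3, then refill it with one row per non-null worker column."""
    conn.execute("DELETE FROM table3")
    for col in WORKER_COLUMNS:  # fixed column names, never user input
        conn.execute(
            "INSERT INTO table3 (original_worker, Person_doing_rework) "
            f"SELECT {col}, Person_doing_rework FROM table2 "
            f"WHERE {col} IS NOT NULL"
        )


def rework_counts(conn):
    """Return (original_worker, times_reworked) pairs, sorted by name."""
    return conn.execute(
        "SELECT original_worker, COUNT(Person_doing_rework) FROM table3 "
        "GROUP BY original_worker ORDER BY original_worker"
    ).fetchall()


rebuild_table3(conn)
rebuild_table3(conn)  # running it twice does not duplicate rows
print(rework_counts(conn))
```

Because the delete always runs before the inserts, the rebuild can be repeated as often as needed and the counts stay the same, which is what the macro ordering (delete first, inserts next) is meant to guarantee.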

Best way to compare two tables in SQL by matching string?

I have a program where the goal is to take data from an API and capture the differences in the data from minute to minute. It involves three tables: Table 1 (for new data), Table 2 (for the previous minute's data), and a Results table (for the results).
The sequence of the program is like this:
Update table 1 -> Calculate the differences from table 2 and update a "Results" table with the differences -> Copy table 1 to table 2.
Then it repeats! It's simple and it works.
Here is my SQL query:
Insert into Results (symbol, bid, ask, description, Vol_Dif, Price_Dif, Time)
Select symbol, bid, ask, description, Vol_Dif, Price_Dif, '$now' as Time FROM (
Select t1.symbol, t1.bid, t1.ask, t1.description, (t1.volume - t2.volume) AS Vol_Dif, (t1.totalPrice - t2.totalPrice) AS Price_Dif
FROM `Table_1` t1
Inner Join (
Select id, volume, ask, totalPrice FROM Table_2) t2
ON t2.id = t1.id) as test
The tables are identical in structure, obviously. The primary key is the 'id' field that auto-increments. And as you can see, I am comparing both tables on the basis of these 'id' fields being equal.
The PROBLEM is that the API seems to be inconsistent. One API call will have 50,000 entries. The next one will have 51,000 entries. And the entries are not just added to the end or added to the beginning, they are mixed into the middle.
So, comparing on equal IDs means I am comparing entries for DIFFERENT data if the API calls return a different number of entries.
The data that I am trying to get the differences of is the 'bid', 'ask', 'Vol_Dif', 'Price_Dif' from minute to minute. There are many instances of the same 'symbol's, so I couldn't compare with this. The ONLY other way to compare entries from table to table, beside the matching ID's, would be matching the "description" fields.
I have tried this. The script is almost the same as above except the end of the query is
ON t2.description = t1.description
The problem is that looking for matching description fields takes 3 minutes for 50,000 entries, whereas looking for matching ID's takes 1 second.
Is there a better, faster way to do what I'm trying to do? Thanks in advance. Any help is appreciated.

Memo fields showing chinese characters or symbols instead of actual value when records number reach a certain point

This is my first post here and I am a database newbie. I tried to find an answer, but I'm not sure I understand how it applies to my case.
My Access program generates a query with different types of fields (AutoNumber, Short Text, Memo), which is then used to create a report.
It has been working fine until now, but since the DB has grown I have run into a problem.
I use the ID (primary key) with IN () in the WHERE condition to filter the report. I build a long string containing the IDs of all the selected records:
WHERE ID IN (1200,1201,1203,1226,1227,1228,1229,...)
When a certain amount of characters in the query is reached (around 4000), I get Chinese characters instead of all the memo fields, and only the memo fields, of the query results (and then in the report).
Is there a limit in the query size? Isn't it 32000 characters?
Why do these characters show up only if I select too many records?
Is there a substitute for IN () that could help me reduce the query length, or should I avoid memo fields completely?
EDIT : That's the query (stripped down a little to be readable) :
SELECT ObjetsLegislatifs.IDobjet, ObjetsLegislatifs.TitreObjet, ObjetsLegislatifs.ContenuObjet
FROM ObjetsLegislatifs
WHERE IDobjet IN(1200,1201,1203,1226,1227,1228,1229,1230,1231,1232,)
GROUP BY ObjetsLegislatifs.IDobjet, ObjetsLegislatifs.TitreObjet, ObjetsLegislatifs.ContenuObjet
ORDER BY ObjetsLegislatifs.IDobjet;
Basically, IDobjet is an autonumber, and "TitreObjet" and "ContenuObjet" are memo fields.
While IDobjet always shows the proper number, the memo fields start showing Chinese characters once the query length passes a certain threshold. I tried using a text field in the query instead, and that works fine.

The best thing for you to do is create a temporary table (which is a special type of table) with a single ID column.
When generating the query, you can then insert your IDs into this table and join to it instead of using an IN list, something like this:
select a.*
from table a
INNER JOIN temp_table b ON a.ID = b.ID
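A minimal sketch of the join approach, using Python's sqlite3 module as a stand-in (behaviour in Access may differ in detail). The table and column names of ObjetsLegislatifs come from the question; the temporary table name selected_ids and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ObjetsLegislatifs (
        IDobjet INTEGER PRIMARY KEY,
        TitreObjet TEXT,
        ContenuObjet TEXT
    );
    -- Hypothetical single-column table holding the selected IDs.
    CREATE TEMP TABLE selected_ids (ID INTEGER PRIMARY KEY);
""")
conn.executemany(
    "INSERT INTO ObjetsLegislatifs VALUES (?, ?, ?)",
    [(i, f"Titre {i}", f"Contenu {i}") for i in range(1200, 1240)],
)

# Instead of building "WHERE IDobjet IN (1200, 1201, ...)" as one long string,
# store the selected IDs as rows and join against them.
selected = [1200, 1201, 1203, 1226]
conn.executemany("INSERT INTO selected_ids VALUES (?)", [(i,) for i in selected])

rows = conn.execute("""
    SELECT a.IDobjet, a.TitreObjet, a.ContenuObjet
    FROM ObjetsLegislatifs AS a
    INNER JOIN selected_ids AS b ON a.IDobjet = b.ID
    ORDER BY a.IDobjet
""").fetchall()
print([r[0] for r in rows])
```

The query text stays the same length no matter how many records are selected, so it never gets near whatever length threshold triggers the problem with the memo fields.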

SQL Server 2008 Array Query

I have a table structure
ID [integer]
Name
RecoveryID [integer]
date
I want to search on the RecoveryID with an array and reveal all those in the array without a corresponding record.
so, if my table contains
1,'John',1,20-10-2013
2,'John',4,20-10-2013
3,'John',5,20-10-2013
And I search on the RecoveryID with the array [1,2,3,4,5,6] I would want the result [2,3,6]
I have tried using various IN, NOT IN statements, but I always get what I have, not what I don't have.
To try and explain further, I am trying to Outer Join without a second table. I have a list of users, a list of things that CAN be done (1,2,3,4,5,6,7) and a list of things that NEED to be done by a specific user. {[John],(1,2,7)} For example.
If John completes action 1, my work table now contains ('John',1,20-10-2013) actions 2 & 7 are remaining. I have the list (1,2,7) how can I query the work table so that it returns (2,7) ?

You can use the EXCEPT set operation:
SELECT n
FROM (VALUES(1),(2),(3),(4),(5),(6)) AS Nums(n)
EXCEPT
SELECT RecoveryID from table1
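For a quick check of the idea against the sample data from the question, here is a sketch using Python's sqlite3 module rather than SQL Server. SQLite does not accept the AS Nums(n) column-alias form, so the list of numbers is supplied through a common table expression instead; the table name table1 follows the answer, and the date values are rewritten in ISO format for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (
        ID INTEGER PRIMARY KEY,
        Name TEXT,
        RecoveryID INTEGER,
        date TEXT
    );
    INSERT INTO table1 VALUES
        (1, 'John', 1, '2013-10-20'),
        (2, 'John', 4, '2013-10-20'),
        (3, 'John', 5, '2013-10-20');
""")

wanted = [1, 2, 3, 4, 5, 6]
# One "(?)" row per wanted value, bound as parameters rather than pasted in.
placeholders = ", ".join("(?)" for _ in wanted)
cursor = conn.execute(
    f"WITH nums(n) AS (VALUES {placeholders}) "
    "SELECT n FROM nums EXCEPT SELECT RecoveryID FROM table1",
    wanted,
)
missing = sorted(n for (n,) in cursor)
print(missing)
```

With the three sample rows this yields the values 2, 3 and 6, which is the result the question asks for. To restrict the check to one user, a WHERE Name = ... condition would go on the second SELECT.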
