SQL optimisation - Word count in string - PostgreSQL

I am trying to update a large table (about 1M rows) with the count of words in a field in PostgreSQL.
This query works; it sets the token_count field to the number of words (tokens) in longtext in table my_table:
UPDATE my_table mt SET token_count =
    (select count(token)
     from (select unnest(regexp_matches(t.longtext, E'\\w+','g')) as token
           from my_table as t
           where mt.myid = t.myid) as tokens);
myid is the primary key of the table.
\\w+ is necessary because I want to count words, ignoring special characters.
For example, "A test . ; )" would return 5 with a space-based count, while 2 is the right value.
The issue is that it's horribly slow, and 2 days are not enough to complete it on 1M rows.
What would you do to optimise it? Are there ways to avoid the join?
How can I split the batch into blocks, using for example limit and offset?
Thanks for any tips,
Mulone
UPDATE: I measured the performance of the array_split approach, and the update is going to be slow anyway, so maybe a solution would be to parallelise it. If I run different queries from psql, only one query works and the others wait for it to finish. How can I parallelise an update?

Have you tried using array_length?
UPDATE my_table mt
SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1)
http://www.postgresql.org/docs/current/static/functions-array.html
# select array_length(regexp_split_to_array(trim(' some long text '), E'\\W+'), 1);
array_length
--------------
3
(1 row)

UPDATE my_table
SET token_count = array_length(regexp_split_to_array(longtext, E'\\s+'), 1)
Or your original query without the correlated self-join:
UPDATE my_table
SET token_count = (
select count(*)
from (select unnest(regexp_matches(longtext, E'\\w+','g'))) s
);

Using tsvector and ts_stat to get statistics of a tsvector column:
SELECT *
FROM ts_stat($$
SELECT to_tsvector(t.longtext)
FROM my_table AS t
$$);
I have no sample data to try it on, but it should work.
Sample Data
CREATE TEMP TABLE my_table
AS
SELECT $$A paragraph (from the Ancient Greek παράγραφος paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.$$::text AS longtext;
SELECT *
FROM ts_stat($$
SELECT to_tsvector(t.longtext)
FROM my_table AS t
$$);
word | ndoc | nentry
--------------+------+--------
παράγραφος | 1 | 1
written | 1 | 1
write | 1 | 2
unit | 1 | 1
sentenc | 1 | 1
self-contain | 1 | 1
self | 1 | 1
point | 1 | 1
particular | 1 | 1
paragrapho | 1 | 1
paragraph | 1 | 2
one | 1 | 1
idea | 1 | 1
greek | 1 | 1
discours | 1 | 1
deal | 1 | 1
contain | 1 | 1
consist | 1 | 1
besid | 1 | 2
ancient | 1 | 1
(20 rows)

Make sure myid is indexed, being the first field in the index.
Consider doing this outside the DB in the first place. It's hard to say without benchmarking, but the counting may be more costly than the select+update, so it may be worth it:
Use the COPY command (the BCP equivalent for Postgres) to copy the table data out to a file efficiently in bulk.
Run a simple Perl script to count. One million rows should take from a couple of minutes to an hour for Perl, depending on how slow your IO is.
Use COPY to copy the data back into the DB (possibly into a temp table, then update from that temp table; or better yet, truncate the main table and COPY straight into it if you can afford the downtime).
For both your approach AND the last step of my approach #2, update the token_count in batches of 5000 rows (e.g. set rowcount to 5000, and loop the updates, adding WHERE token_count IS NULL to the query).
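As a rough sketch of that batching idea (table and column names are taken from the question; the 5000-row block size and the assumption that token_count starts out NULL are mine), each pass fills in another block, and you simply re-run the statement until it reports UPDATE 0:
UPDATE my_table mt
SET token_count = (SELECT count(*)
                   FROM (SELECT unnest(regexp_matches(mt.longtext, E'\\w+', 'g'))) s)
WHERE mt.token_count IS NULL
  AND mt.myid IN (SELECT myid
                  FROM my_table
                  WHERE token_count IS NULL
                  LIMIT 5000);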

Related

Rebuild tables from joined table

I am facing an issue where a data supplier is generating a dump of his multi-tenant databases in a single table. Recreating the original tables is not impossible; the problem is that I am receiving millions of rows every day. Recreating everything, every day, is out of the question.
Until now, I was using SSIS to do so, with a lookup-intensive approach. In the past year, my virtual machine went from having 2 GB of RAM to 128 GB, and it is still growing.
Let me explain the disgrace:
Imagine a database where users have posts, and posts have comments. In my real scenario, I am talking about 7 distinct tables. Analyzing a few rows, I have the following:
+-----+------+------+--------+------+-----------+------+----------------+
| Id* | T_Id | U_Id | U_Name | P_Id | P_Content | C_Id | C_Content |
+-----+------+------+--------+------+-----------+------+----------------+
| 1 | 1 | 1 | john | 1 | hello | 1 | hello answer 1 |
| 2 | 1 | 2 | maria | 2 | cake | 2 | cake answer 1 |
| 3 | 2 | 1 | pablo | 1 | hello | 1 | hello answer 3 |
| 4 | 2 | 1 | pablo | 2 | hello | 2 | hello answer 2 |
| 5 | 1 | 1 | john | 3 | nosql | 3 | nosql answer 1 |
+-----+------+------+--------+------+-----------+------+----------------+
The Id is from my table.
T_Id is the "tenant" Id, which identifies the multiple source databases.
I have imagined the following possible solution:
I make a query that selects non-existent Ids for each table, such as:
SELECT DISTINCT n.t_id,
                n.c_id,
                n.c_content
FROM mytable n
WHERE n.id > 4
  AND NOT EXISTS (SELECT 1
                  FROM mytable o
                  WHERE o.id <= 4
                    AND n.t_id = o.t_id
                    AND n.c_id = o.c_id)
This way, I am able to select only the new occurrences whenever a new Id of a table is found. Although it works, it may perform badly when working with 100s of millions of rows.
Could anyone share a suggestion? I am quite lost.
Thanks in advance.
EDIT > my question is vague
My final intent is to rebuild the tables from the dump, incrementally, avoiding lookups outside the database. Every now and then I am going to run a script that will select new tenants, users, posts and comments and add them to their corresponding tables.
My previous solution worked as follows:
Cache the whole database
For each new row, search for the columns inside the cache
If it doesn't exist, then insert it
I know it sounds dumb, but it made sense as a new developer working with ETLs
First, if you have a full flat DB dump, I'd suggest working on the file before even importing it into your DB (low-level file processing is cheap and nearly instantaneous).
Following "Removing lines in one file that are present in another file using python", you can remove all the lines already parsed since your last run.
with open('new.csv','r') as source:
    lines_src = source.readlines()
with open('old.csv','r') as f:
    lines_f = f.readlines()
destination = open('diff_add.csv',"w")
for data in lines_src:
    if data not in lines_f:
        destination.write(data)
destination.close()
This takes less than five seconds on a 900 MB => 1.2 GB dump. With this you'll only work with lines that actually change one of your new tables.
Now you can import this flat file into a working table.
As you'll have to search for the needle in each line, some indexes on the ids may be a good idea (go for a composite index that uses your Tenant_id first).
For the last part, I don't know exactly what your data looks like: do you also have some updates to apply?
The EXCEPT and INTERSECT operators can also help with this kind of problem.
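For illustration, a minimal EXCEPT sketch, assuming the new load has been imported into a working table named staging with the same layout as the dump (the staging and comments table names are made up; the column names come from the sample above):
-- tenant/comment rows present in the new load but not yet in the comments table
SELECT s.t_id, s.c_id, s.c_content
FROM staging s
EXCEPT
SELECT c.t_id, c.c_id, c.c_content
FROM comments c;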

Access 2007 select first value of query results

I am running into a rather annoying thingy in Access (2007) and I am not sure if this is a feature or if I am asking for the impossible.
Although the actual database structure is more complex, my problem boils down to this:
I have a table with data about Units for specific years. This data comes from different sources and might overlap.
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 80 | 1 |
A | 2010 | 101 | 2 |
A | 2010 | 150 | 3 |
A | 2011 | 90 | 1 |
...
Now I would like the user to select certain sources, order them by priority and then extract one data value for each year.
For example, if the user selects source 1, 2 and 3 and orders them by (3, 1, 2), then I would like the following result:
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 150 | 3 |
A | 2011 | 90 | 1 |
I am able to order the initial table, based on a specific order. I do this with the following query
SELECT Unit, IYR, X1, Source
FROM TestTable
WHERE Source In (1,2,3)
ORDER BY Unit, IYR,
IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))
This gives me the following intermediate result:
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 150 | 3 |
A | 2010 | 80 | 1 |
A | 2010 | 101 | 2 |
A | 2011 | 90 | 1 |
Next step is to only get the first value of each year. I was thinking to use the following query:
SELECT X.Unit, X.IYR, first(X.X1) as FirstX1
FROM (...) AS X
GROUP BY X.Unit, X.IYR
Where (…) is the above query.
Now Access goes bananas. Whatever order I give to the intermediate results, the result of this query is:
Unit | IYR | X1 |
--------------------
A | 2009 | 55 |
A | 2010 | 80 |
A | 2011 | 90 |
In other words, for year 2010 it shows the value of source 1 instead of 3. It seems that Access does not care about the ordering of the nested query when it applies the FIRST() function and sticks to the original ordering of the data.
Is this a feature of Access or is there a different way of achieving the desired results?
Ps: Next step would be to use a self join to add the source column to the results again, but I first need to resolve above problem.
Rather than using FIRST(), it may be better to determine the MIN priority and then join back, e.g.:
SELECT
    t.UNIT,
    t.IYR,
    t.X1,
    t.Source,
    t.PrioritySource
FROM
    (SELECT
        Unit,
        IYR,
        X1,
        Source,
        SWITCH([Source]=3, 1,
               [Source]=1, 2,
               [Source]=2, 3) AS PrioritySource
     FROM
        TestTable
     WHERE
        Source In (1,2,3)
    ) AS t
INNER JOIN
    (SELECT
        Unit,
        IYR,
        MIN(SWITCH([Source]=3, 1,
                   [Source]=1, 2,
                   [Source]=2, 3)) AS PrioritySource
     FROM
        TestTable
     WHERE
        Source In (1,2,3)
     GROUP BY
        Unit,
        IYR
    ) AS MinPriority
    ON  t.Unit = MinPriority.Unit
    AND t.IYR = MinPriority.IYR
    AND t.PrioritySource = MinPriority.PrioritySource
which will produce this result (Note I include Source and priority source for demonstration purposes only)
UNIT | IYR | X1 | Source | PrioritySource
----------------------------------------------
A | 2009 | 55 | 1 | 2
A | 2010 | 150 | 3 | 1
A | 2011 | 90 | 1 | 2
Note that the first subquery is needed because Access won't let you join on a Switch() expression.
Yes, FIRST() does use an arbitrary ordering. From the Access Help:
These functions return the value of a specified field in the first or
last record, respectively, of the result set returned by a query. If
the query does not include an ORDER BY clause, the values returned by
these functions will be arbitrary because records are usually returned
in no particular order.
I don't know whether FROM (...) AS X means you are using an inline ORDER BY (assuming that is actually possible) or a VIEW ('stored Query object') here, but either way I assume the ORDER BY is being disregarded (an ORDER BY should only apply to the final result).
The alternative is to use MIN() (or possibly MAX()).
This is the most concise way I have found to write such queries in Access that require pulling back all columns that correspond to the first row in a group of records that are ordered in a particular way.
First, I added a UniqueID to your table. In this case, it's just an AutoNumber field. You may already have a unique value in your table, in which case you can use that.
This will choose the row with a Source 3 first, then Source 1, then Source 2. If there is a tie, it picks the one with the higher X1 value. If there is a further tie, it is broken by the UniqueID value:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=
(SELECT TOP 1 [UniqueID] FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, UniqueID)
This yields:
Unit IYR X1 Source UniqueID
A 2009 55 1 1
A 2010 150 3 4
A 2011 90 1 5
I recommend (1) you create an index on the IYR field -- this will dramatically increase your performance for this type of query, and (2) if you have a lot (>~100K) records, this isn't the best choice. I find it works quite well for tables in the 1-70K range. For larger datasets, I like to use my GroupIncrement function to partition each group (similar to SQL Server's ROW_NUMBER() OVER statement).
The Choose() function is a VBA function and may not be clear here. In your case, it sounds like there is some interactivity required. For that, you could create a second table called "Choices", like so:
Rank Choice
1 3
2 1
3 2
Then, you could substitute the following:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=(SELECT TOP 1 [UniqueID] FROM
[TestTable] t2 INNER JOIN [Choices] c
ON t2.Source=c.Choice
WHERE t.IYR=t2.IYR ORDER BY c.[Rank], t2.X1 DESC, t2.UniqueID);
Indexing Source on TestTable and Choice on the Choices table may be helpful here, too, depending on the number of choices required.
Q:
Can you get this to work without the need for surrogate key? For
example what if the unique key is the composite of
{Unit,IYR,X1,Source}
A:
If you have a compound key, you can do it like this-- however I think that if you have a large dataset, it will totally kill the performance of the query. It may help to index all four columns, but I can't say for sure because I don't regularly use this method.
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.Unit & t.IYR & t.X1 & t.Source =
(SELECT TOP 1 Unit & IYR & X1 & Source FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, Unit, IYR)
In certain cases, you may have to coalesce some of the individual parts of the key as follows (though Access generally will coalesce values automatically):
t.Unit & CStr(t.IYR) & CStr(t.X1) & CStr(t.Source)
You could also use a query in your FROM statements instead of the actual table. The query itself would build a composite of the four fields used in the key, and then you'd use the new key name in the WHERE clause of the top SELECT statement, and in the SELECT TOP 1 [key] of the subquery.
In general, though, I will either: (a) create a new table with an AutoNumber field, (b) add an AutoNumber field, (c) add an integer and populate it with a unique number using VBA - this is useful when you get a MaxLocks error when trying to add an AutoNumber, or (d) use an already indexed unique key.

Finding the difference between two sets of data from the same table

My data looks like:
run | line | checksum | group
-----------------------------
1 | 3 | 123 | 1
1 | 7 | 123 | 1
1 | 4 | 123 | 2
1 | 5 | 124 | 2
2 | 3 | 123 | 1
2 | 7 | 123 | 1
2 | 4 | 124 | 2
2 | 4 | 124 | 2
and I need a query that returns me the new entries in run 2
run | line | checksum | group
-----------------------------
2 | 4 | 124 | 2
2 | 4 | 124 | 2
I tried several things, but I never got to a satisfying answer.
In this case I'm using H2, but of course I'm interested in a general explanation that would help me to wrap my head around the concept.
EDIT:
OK, it's my first post here so please forgive if I didn't state the question precisely enough.
Basically, given two run values (r1, r2, with r2 > r1), I want to determine which rows having run = r2 have a different line, checksum or group from any row where run = r1.
select * from yourtable
where run = 2 and checksum = (select max(checksum)
from yourtable)
Assuming your last run has a higher run value than the others, the SQL below will help:
select * from table1 t1
where t1.run in
(select max(t2.run) from table1 t2)
Update:
The SQL above may not give you exactly the right rows because the requirement is not entirely clear, but the overall idea is to fetch the rows based on the latest run.
SELECT line, checksum, group
FROM TableX
WHERE run = 2
EXCEPT
SELECT line, checksum, group
FROM TableX
WHERE run = 1
or (with slightly different result):
SELECT *
FROM TableX x
WHERE run = 2
AND NOT EXISTS
( SELECT *
FROM TableX x2
WHERE run = 1
AND x2.line = x.line
AND x2.checksum = x.checksum
AND x2.group = x.group
)
A slightly different approach:
select min(run) run, line, checksum, group
from mytable
where run in (1,2)
group by line, checksum, group
having count(*)=1 and min(run)=2
Incidentally, I assume that the "group" column in your table isn't actually called group - this is a reserved word in SQL and would need to be enclosed in double quotes (or backticks or square brackets, depending on which RDBMS you are using).
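For example, if the column really were called group, it would have to be quoted like this in H2 or standard SQL (square brackets would be the SQL Server form; the table name mytable is taken from the answer above):
SELECT run, line, checksum, "group"
FROM mytable
WHERE run = 2;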

Select multiple rows by max

I have a simple table with a "versioning" scheme:
Version | PartKey1 | PartKey2 | Value
1 | 0 | 0 | foo
2 | 0 | 0 | bar
1 | 1 | 0 | foobar
This table is medium (~100 000 lines for a full version). At the start it is loaded with a version 1 which contains a full snapshot, and over time incremental updates are added, but we want to preserve the old versions, thus they are added with an incremented "Version" number (2 here).
When reading the data, I want to be able to specify a maximum version, and I would like, if possible, to only retrieve the "rows" I am interested in.
For example: specifying 2 as the maximum version, I would like a query that retrieves only 2 rows from the table above:
Version | PartKey1 | PartKey2 | Value
2 | 0 | 0 | bar
1 | 1 | 0 | foobar
The row:
1 | 0 | 0 | foo
is discarded because version 2 of this row is more recent.
I was wondering if such a selection was possible / advisable in a SQL query. I can do the filtering on the application side, but obviously it means pulling in useless resources from the DB so if it's possible (and cheap on the DB side) I'd rather offload this work to the DB.
You can do:
SELECT v1.*
FROM versioningscheme v1
LEFT JOIN versioningscheme v2
ON v2.partkey1 = v1.partkey1 AND v2.partkey2 = v1.partkey2
AND v2.version > v1.version
WHERE v2.version IS NULL
Left Join with NULL detection is very powerful and underused. Null values are returned when there is no match (and obviously, when you have the max row in v1, you can't get a row in v2 that satisfies the join condition).
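If you also need the maximum-version cap from the question, here is a sketch of the same idea with the cap applied on both sides (2 stands in for the requested maximum version):
SELECT v1.*
FROM versioningscheme v1
LEFT JOIN versioningscheme v2
ON v2.partkey1 = v1.partkey1 AND v2.partkey2 = v1.partkey2
AND v2.version > v1.version AND v2.version <= 2
WHERE v1.version <= 2
AND v2.version IS NULL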
select t.*
from MyTable t
inner join (
select PartKey1, PartKey2, max(Version) as MaxVersion
from MyTable
where Version <= 2
group by PartKey1, PartKey2
) tm on t.PartKey1 = tm.PartKey1
and t.PartKey2 = tm.PartKey2
and t.Version = tm.MaxVersion
This is common with time-varying data (where you choose to find the most recent value within a specific window of time), and is completely reasonable.
In your case, ROW_NUMBER() allows the data to be scanned just once, rather than multiple times. With an appropriate INDEX such as (PartKey1, PartKey2, Version), this should be exceptionally quick (a possible definition for that index is sketched after the query)...
SELECT
*
FROM
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY PartKey1, PartKey2 ORDER BY Version DESC) AS reversed_version
FROM
MyTable
WHERE
Version <= <MaxVersionParameter>
)
AS data
WHERE
reversed_version = 1
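For completeness, the supporting index mentioned above could be created along these lines (the index name is made up):
CREATE INDEX ix_mytable_partkeys_version
ON MyTable (PartKey1, PartKey2, Version);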

How to optimize an update SQL that runs on a Oracle table with 700M rows

UPDATE [TABLE] SET [FIELD]=0 WHERE [FIELD] IS NULL
[TABLE] is an Oracle database table with more than 700 million rows. I cancelled the SQL execution after it had been running for 6 hours.
Is there any SQL hint that could improve performance? Or any other solution to speed that up?
EDIT: This query will be run once and then never again.
First of all, is it a one-time query or a recurrent query? If you only have to do it once, you may want to look into running the query in parallel mode. You will have to scan all rows anyway; you could either divide the workload yourself with ranges of ROWID (do-it-yourself parallelism) or use Oracle's built-in features.
Assuming you want to run it frequently and want to optimize this query, the number of rows where the field column is NULL will eventually be small compared to the total number of rows. In that case an index could speed things up. Oracle doesn't index rows that have all indexed columns NULL, so an index on field alone won't get used by your query (since you want to find all rows where field is NULL).
Either:
Create an index on (FIELD, 0); the constant 0 acts as a non-NULL second column, so all rows of the table end up in the index.
Create a function-based index on (CASE WHEN field IS NULL THEN 1 END); this will only index the rows where field is NULL (the index would therefore be very compact). In that case you would have to rewrite your query:
UPDATE [TABLE] SET [FIELD]=0 WHERE (CASE WHEN field IS NULL THEN 1 END)=1
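For illustration, the two index definitions might look like this (the index names and the table name my_big_table are placeholders for [TABLE]):
-- option 1: a constant second column forces the NULLs of FIELD into the index
CREATE INDEX ix_field_with_zero ON my_big_table (field, 0);
-- option 2: a function-based index containing only the NULL rows
CREATE INDEX ix_field_is_null ON my_big_table (CASE WHEN field IS NULL THEN 1 END);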
Edit:
Since this is a one-time scenario, you may want to use the PARALLEL hint:
SQL> EXPLAIN PLAN FOR
2 UPDATE /*+ PARALLEL(test_table 4)*/ test_table
3 SET field=0
4 WHERE field IS NULL;
Explained
SQL> select * from table( dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 4026746538
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time
--------------------------------------------------------------------------------
| 0 | UPDATE STATEMENT | | 22793 | 289K| 12 (9)| 00:00:
| 1 | UPDATE | TEST_TABLE | | | |
| 2 | PX COORDINATOR | | | | |
| 3 | PX SEND QC (RANDOM)| :TQ10000 | 22793 | 289K| 12 (9)| 00:00:
| 4 | PX BLOCK ITERATOR | | 22793 | 289K| 12 (9)| 00:00:
|* 5 | TABLE ACCESS FULL| TEST_TABLE | 22793 | 289K| 12 (9)| 00:00:
--------------------------------------------------------------------------------
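Note that the hint by itself parallelises the scan; for the update step itself to run in parallel, the session must also have parallel DML enabled, along these lines (a sketch reusing the same test_table):
ALTER SESSION ENABLE PARALLEL DML;
UPDATE /*+ PARALLEL(test_table 4) */ test_table
SET field = 0
WHERE field IS NULL;
COMMIT;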
Are other users updating the same rows in the table at the same time?
If so, you could be hitting lots of concurrency issues (waiting for locks) and it may be worth breaking it into smaller transactions.
DECLARE
  v_cnt number := 1;
BEGIN
  WHILE v_cnt > 0 LOOP
    UPDATE [TABLE] SET [FIELD]=0 WHERE [FIELD] IS NULL AND ROWNUM < 50000;
    v_cnt := SQL%ROWCOUNT;
    COMMIT;
  END LOOP;
END;
/
The smaller the ROWNUM limit the less concurrency/locking issues you'll hit, but the more time you'll spend in table scanning.
Vincent already answered your question perfectly, but I'm curious about the "why" behind this action. Why are you updating all NULLs to 0?
Regards,
Rob.
Some suggestions:
Drop any indexes that contain FIELD before running your UPDATE statement, and then re-add them later.
Write a PL/SQL procedure to do this that commits after every 1000 or 10000 rows.
Hope this helps.
You could achieve the same result for newly inserted rows, without an UPDATE, by using ALTER TABLE to set the column's DEFAULT value to 0.