I have a big merge that involves many inexact renames, but it fails with the output below:
Performing inexact rename detection: 100% (169817200/169817200), done.
Performing inexact rename detection: 100% (2106881938/2106881938), done.
Performing inexact rename detection: 100% (120035958/120035958), done.
Segmentation fault
I tried to restart my VDI but it didn't help. Any idea how to solve it?
From the discussion, this happens only during a complex merge involving a renamed folder and many files.
That is a job for the new merge strategy ORT ("Ostensibly Recursive's Twin").
That merge strategy will become the default in Git 2.34, but in the meantime you can select it explicitly with git:
git merge -s ort
The primary difference noticeable here is that the updating of the working tree and index is not done simultaneously with the merge algorithm, but is a separate post-processing step.
The new API is designed so that one can do repeated merges (e.g. during a rebase or cherry-pick) and only update the index and working tree one time at the end instead of updating it with every intermediate result.
Also, one can perform a merge between two branches, neither of which match the index or the working tree, without clobbering the index or working tree.
The "ort" backend does the complete merge in memory, and only updates the index and working copy as a post-processing step.
It handles file conflicts and file/folder renames much more efficiently than before (with the default "recursive" strategy).
I have a child table with 100 million rows and need to update a column in 50 million of them using the value from the parent table. I have read that, assuming we have enough space, the fastest approach would be "create table as select", but I want to know whether anyone disagrees or whether other factors need to be considered to make a better guess. Would it be better to go this route than to use PL/SQL's BULK COLLECT / FORALL UPDATE feature?
If you have a lot of data then CREATE TABLE AS SELECT is definitely faster because it does not require UNDO tablespace. However, re-creating all the indices on the new table can be quite a hassle due to name conflicts.
Good news is: 50 million rows is not really a lot of data. If you have a modern midrange server it should not cause problems, so it is not worth the extra work. The best way to find out is to make a copy of the original table (including all indices) and try the update there. Then you get a rough idea of how long it will take.
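To make the two routes concrete, here is a rough sketch that puts a plain UPDATE next to the "create table as select, then swap" approach. It uses SQLite through Python's sqlite3 module purely as a stand-in for Oracle, and the schema (parent, child, parent_id, val) is made up for illustration. On a real system the rebuild would also have to re-create indexes, grants and constraints on the new table, which this sketch ignores.

```python
import sqlite3


def build_demo(conn):
    """Create a tiny parent/child pair (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE parent (id INTEGER PRIMARY KEY, val TEXT);
        CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER, val TEXT);
        INSERT INTO parent VALUES (1, 'a'), (2, 'b');
        INSERT INTO child  VALUES (10, 1, NULL), (11, 2, NULL), (12, 1, 'keep');
    """)


def update_in_place(conn):
    """Route 1: correlated UPDATE of the rows that need the parent's value."""
    conn.execute("""
        UPDATE child
           SET val = (SELECT p.val FROM parent p WHERE p.id = child.parent_id)
         WHERE val IS NULL
    """)


def rebuild_with_ctas(conn):
    """Route 2: build a new table holding the final values, then swap it in."""
    conn.executescript("""
        CREATE TABLE child_new AS
            SELECT c.id, c.parent_id, COALESCE(c.val, p.val) AS val
              FROM child c LEFT JOIN parent p ON p.id = c.parent_id;
        DROP TABLE child;
        ALTER TABLE child_new RENAME TO child;
    """)


def child_rows(conn):
    """Return the child table contents in a stable order for comparison."""
    return conn.execute(
        "SELECT id, parent_id, val FROM child ORDER BY id"
    ).fetchall()
```

Both routes end with the same rows. What the sketch cannot show is the cost difference (UNDO/REDO, index rebuilds), which is exactly why trying it on a copy of the real table is worthwhile.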
Parallel UPDATE is probably the best option for a large change to a child table. (If you have Enterprise Edition, sufficient resources, a sane configuration, etc.)
alter session enable parallel dml;
update /*+ parallel */ ...;
(You might want to play with different parallel numbers, like parallel(8). The default degree of parallelism is usually good enough. But some platforms like SPARC inflate their "CPU_COUNT", leading to ridiculous degrees of parallelism.)
Parallel UPDATE is likely not the fastest possible solution, though. Recreating the objects can be faster because it can almost completely avoid generating REDO and UNDO. But re-creating objects is usually buggy and getting that optimal performance is tricky.
Here are things to consider before you decide to simply drop and recreate a table:
Grants. Save and re-apply the object grants after the objects are recreated.
Dependent objects. The process needs to re-create all objects, and dependent objects, in the exact same way. This can be painfully difficult depending on how complex your schema is. DBMS_METADATA can be tricky, and in some cases still won't make the objects exactly the same way. If you decide to hard-code the DDL instead you have to remember to update the process whenever the objects change.
Invalid objects. Most objects will automatically recompile when necessary. But you probably don't want to wait for that because it always looks bad to have invalid objects. And even if they do compile correctly, some programs may still get those pesky ORA-04068: existing state of packages has been discarded errors. (Because most PL/SQL programmers are unaware of session state and make every package variable public by default.)
Statistics. Simply re-gathering them after the table is re-created is not always sufficient. Histograms depend on whether columns were used in a predicate. If the table is re-created all the columns are new and no histograms will be initially created.
Direct-path writes are elusive. A parent-child table implies a foreign key, which normally prevents direct-path writes. The process needs to disable or drop the foreign key. And also set the table and index to NOLOGGING, and then remember to set them back to LOGGING at the end. When you re-create the foreign key, if you want to do it in parallel you have to initially create it as NOVALIDATE, set the table to parallel, enable validate the constraint, and then set the table back to NOPARALLEL.
In a large data warehouse it's worth going through all those steps and building code for dealing with all the issues. If this is your only large table UPDATE I suggest you avoid that work and accept a slightly non-optimal solution.
Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through those query files and passes each one to a standalone Spark application. The latter first parses the query, extracts the tables it uses, registers them (using registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and then actually executes the query (spark.sql()).
Challenge. Since registering the tables can be time consuming, I would like to register the tables lazily, i.e. only once when they are first used, and keep that in the form of metadata that can readily be used by subsequent queries without re-registering the tables for each query. It's a sort of intra-job caching, but not any of the caching options Spark offers (table caching), as far as I know.
Is that possible? If not, can anyone suggest another approach to accomplish the same goal (iterating through separate query files and running a querying Spark application without registering tables that have already been registered)?
In general, registering a table should not take time (except if you have lots of files it might take time to generate the list of file sources). It is basically just giving the dataframe a name. What would take time is reading the dataframe from disk.
So the basic question is how the dataframes (tables) are written to disk. If the data is written as a large number of small files, or in a file format which is slow (e.g. csv), this can take some time (having lots of files means it takes time to generate the file list, and having a "slow" file format means the actual reading is slow).
So the first thing you can try to do is read your data and resave it.
Let's say for the sake of example that you have a large number of csv files in some path. You can do something like:
df = spark.read.csv("path/*.csv")
Now that you have a dataframe, you can write it back out as fewer files and in a better format (a columnar format such as Parquet, for instance).
If the above is not enough, and your cluster is large enough to cache everything, you might put everything in a single job, go over all tables in all queries, register all of them and cache them. Then run your sql queries one after the other (and time each one separately).
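The "register each table once, then run the queries one after the other" idea can be sketched in plain Python as below. This is only an outline under assumptions: TableRegistry and run_queries are hypothetical names, register stands for whatever you do today to read a table and create its temp view, extract_tables is your existing query parser, and execute would be spark.sql in the real job. None of the Spark calls appear in the sketch itself.

```python
from typing import Callable, Iterable, Set


class TableRegistry:
    """Remembers which tables were already registered in this job."""

    def __init__(self, register: Callable[[str], None]) -> None:
        # `register` is supplied by the caller, e.g. a function that reads
        # the data and creates the temp view for the given table name.
        self._register = register
        self._seen: Set[str] = set()

    def ensure(self, tables: Iterable[str]) -> None:
        """Register only the tables that have not been seen before."""
        for name in tables:
            if name not in self._seen:
                self._register(name)
                self._seen.add(name)


def run_queries(queries, extract_tables, registry, execute):
    """Run (name, sql) pairs in order, registering each table at most once."""
    results = {}
    for name, sql in queries:
        registry.ensure(extract_tables(sql))
        results[name] = execute(sql)
    return results
```

This only works if all queries run inside the same Spark application, since temp views live in the session; that is the "single job" part of the suggestion above.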
If all of this fails you can try to use something like alluxio (http://www.alluxio.org/) to create an in memory file system and try to read from that.
Suppose you have a reasonably large (for local definitions of “large”), but relatively stable table.
Right now, I want to take a checksum of some kind (any kind) of the contents of the entire table.
The naïve approach might be to walk the entire table, taking the checksum (say, MD5) of the concatenation of every column on each row, and then perhaps concatenate them and take its MD5sum.
From the client side, that might be optimized a little by progressively appending columns' values into the MD5 sum routine, progressively mutating the value.
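For reference, the client-side variant described above might look roughly like this in Python. It is a sketch, not a recommendation: table_checksum is a made-up helper, rows is assumed to be whatever your driver returns for a SELECT with a stable ORDER BY (for example on the primary key), and repr() is used as a crude way to serialise each value.

```python
import hashlib
from typing import Iterable, Sequence


def table_checksum(rows: Iterable[Sequence[object]]) -> str:
    """Fold every column of every row into one running MD5 digest."""
    digest = hashlib.md5()
    for row in rows:
        # Column count marks the row boundary, so one row (a, b) does not
        # hash the same as two rows (a,) and (b,).
        digest.update(len(row).to_bytes(4, "big"))
        for value in row:
            data = repr(value).encode("utf-8")
            # Length prefix keeps ('ab', 'c') distinct from ('a', 'bc').
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()
```

The result depends on row order, so the query feeding it must always sort the same way. It still reads the whole table on every check, which is the cost the question is trying to avoid.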
The reason for this is that at some point in the future we want to reconnect to the database and ensure that no other users have mutated the table; that includes INSERT, UPDATE, and DELETE.
Is there a nicer way to determine if any change/s have occurred to a particular table? Or a more efficient/faster way?
We are not able/permitted to make any alterations to the table itself (e.g. adding a “last-updated-at” column or triggers or so forth)
(This is for Postgres, if it helps. I'd prefer to avoid poking transaction journals or anything like that, but if there's a way to do so, I'm not against the idea.)
Adding columns and triggers is really quite safe
While I realise you've said it's a large table in a production DB so you say you can't modify it, I want to explain how you can make a very low impact change.
In PostgreSQL, an ALTER TABLE ... ADD COLUMN of a nullable column takes only moments and doesn't require a table re-write. It does require an exclusive lock, but the main consequence of that is that it can take a long time before the ALTER TABLE can actually proceed; it won't hold anything else up while it waits for a chance to get the lock.
The same is true of creating a trigger on the table.
This means that it's quite safe to add a modified_at or created_at column and an associated trigger function to maintain them to a live table that's in intensive real-world use. Rows added before the column was created will be null, which makes perfect sense since you don't know when they were added/modified. Your trigger will set the modified_at field whenever a row changes, so they'll get progressively filled in.
For your purposes it's probably more useful to have a trigger-maintained side-table that tracks the timestamp of the last change (insert/update/delete) anywhere in the table. That'll save you from storing a whole bunch of timestamps on disk and will let you discover when deletes have happened. A single-row side-table with a row you update on each change using a FOR EACH STATEMENT trigger will be quite low-cost. It's not a good idea for most tables because of contention - it essentially serializes all transactions that attempt to write to the table on the row update lock. In your case that might well be fine, since the table is large and rarely updated.
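A small sketch of the single-row side-table idea, using SQLite via Python's sqlite3 as a stand-in for PostgreSQL. The names big_table and big_table_changes are invented for the example. Two things differ from what is described above and are my assumptions: SQLite only has row-level triggers (in PostgreSQL you would use FOR EACH STATEMENT), and a change_count column is added so that two changes within the same second can still be told apart.

```python
import sqlite3

SCHEMA = """
CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT);

-- Single-row side table: when big_table last changed, and how many times.
CREATE TABLE big_table_changes (last_change TEXT, change_count INTEGER NOT NULL);
INSERT INTO big_table_changes VALUES (NULL, 0);

CREATE TRIGGER big_table_ins AFTER INSERT ON big_table BEGIN
    UPDATE big_table_changes
       SET last_change = CURRENT_TIMESTAMP, change_count = change_count + 1;
END;
CREATE TRIGGER big_table_upd AFTER UPDATE ON big_table BEGIN
    UPDATE big_table_changes
       SET last_change = CURRENT_TIMESTAMP, change_count = change_count + 1;
END;
CREATE TRIGGER big_table_del AFTER DELETE ON big_table BEGIN
    UPDATE big_table_changes
       SET last_change = CURRENT_TIMESTAMP, change_count = change_count + 1;
END;
"""


def change_marker(conn):
    """Return (last_change, change_count); store it and compare on reconnect."""
    return conn.execute(
        "SELECT last_change, change_count FROM big_table_changes"
    ).fetchone()
```

The client keeps the marker it saw last time and re-reads the main table only if the marker differs when it reconnects.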
A third alternative is to have the side table accumulate a running log of the timestamps of insert/update/delete statements or even the individual rows. This allows your client to read the change-log table instead of the main table and make small changes to its cached data rather than invalidating and re-reading the whole cache. The downside is that you have to have a way to periodically purge old and unwanted change log records.
So... there's really no operational reason why you can't change the table. There may well be business policy reasons that prevent you from doing so even though you know it's quite safe, though.
... but if you really, really, really can't:
Another option is to use the existing "md5agg" extension: http://llg.cubic.org/pg-mdagg/ . Or, if you built PostgreSQL from source, you could apply to your install the patch currently circulating on pgsql-hackers that adds an "md5_agg" to the next release.
Logical replication
The bi-directional replication for PostgreSQL project has produced functionality that allows you to listen for and replay logical changes (row inserts/updates/deletes) without requiring triggers on tables. The pg_receivellog tool would likely suit your purposes well when wrapped with a little scripting.
The downside is that you'd have to run a patched PostgreSQL 9.3, so I'm guessing if you can't change a table, running a bunch of experimental code that's likely to change incompatibly in future isn't going to be high on your priority list ;-) . It's included in the stock release of 9.4 though, see "changeset extraction".
Testing the relfilenode timestamp won't work
You might think you could look at the modified timestamp(s) of the file(s) that back the table on disk. This won't be very useful:
The table is split into extents, individual files that by default are 1GB each. So you'd have to find the most recent timestamp across them all.
Autovacuum activity will cause these timestamps to change, possibly quite a while after corresponding writes happened.
Autovacuum must periodically do an automatic 'freeze' of table contents to prevent transaction ID wrap-around. This involves progressively rewriting the table and will naturally change the timestamp. This happens even if nothing's been added for potentially quite a long time.
Hint-bit setting results in small writes during SELECT. These writes will also affect the file timestamps.
Examine the transaction logs
In theory you could attempt to decode the transaction logs with pg_xlogreader and find records that affect the table of interest. You'd have to try to exclude activity caused by vacuum, full page writes after hint bit setting, and of course the huge amount of activity from every other table in the entire database cluster.
The performance impact of this is likely to be huge, since every change to every database on the entire system must be examined.
All in all, adding a trigger on a table is trivial in comparison.
What about creating a trigger on insert/update/delete events on the table? The trigger could call a function that inserts a timestamp into another table which would mark the time for any table-changing event.
The only concern would be an UPDATE that writes the same data that is already in the table. The trigger would fire, even though the table didn't really change. If you're concerned about this case, you could make the trigger call a function that generates a checksum against just the updated rows and compares it against a previously generated checksum, which would usually be more efficient than scanning and checksumming the whole table.
Postgres documentation on triggers here: http://www.postgresql.org/docs/9.1/static/sql-createtrigger.html
If you simply just want to know when a table has last changed without doing anything to it, you can look at the actual file(s) timestamp(s) on your database server.
SELECT relfilenode FROM pg_class WHERE relname = 'your_table_name';
If you need more detail on exactly where it's located, you can use:
select t.relname, t.relfilenode, ns.nspname
from pg_class t
join pg_namespace ns on ns.oid = t.relnamespace
where t.relname = 'your_table_name';
Since you mentioned that it's quite a big table, it will definitely be broken into segments and have TOAST storage, but you can use the relfilenode as your base point and do an ls -ltr relfilenode.* or relfilenode_*, where relfilenode is the actual relfilenode value from above.
These files get updated at every checkpoint if something occurred on that table, so when you see the timestamps update depends on how often your checkpoints occur; if you haven't changed the default checkpoint interval, that is within a few minutes.
Another trivial, but imperfect way to check if INSERTS or DELETES have occurred is to check the table size:
SELECT pg_total_relation_size('your_table_name');
I'm not entirely sure why a trigger is out of the question though, since you don't have to make it retroactive. If your goal is to ensure nothing changes in it, a trivial trigger that just catches an insert, update, or delete event could be routed to another table just to timestamp an attempt but not cause any activity on the actual table. Note, though, that just knowing something changed doesn't ensure that nothing changes.
Anyway, hope this helps you with this wacky problem you have...
A common practice would be to add a modified column. If it were MySQL, I'd use timestamp as the datatype for the field (it updates to the current date on each update). PostgreSQL must have something similar.
I have quite a complex scenario where the same package can be run in parallel. In some situations both executions can end up trying to insert the same row into the destination, which causes a primary key violation error.
There is currently a lookup that checks the destination table to see if the record exists, so the insert is done on its "no match" output. It doesn't prevent the error because the lookup is loaded when the package starts, so both packages get the same data in it; if a row comes in to both of them, each will consider it a "new" row, so the first one succeeds and the second one fails.
Is there anything that can be done to avoid this scenario? Pretty much ignore the "duplicate rows" on the OLE DB destination? I can't use the MAX ERROR COUNT because the duplicate row is in a batch among other rows that were not in the first package and should be inserted.
The default lookup behaviour is to employ Full Cache mode. As you have observed, during the package validation stage it will pull all the lookup values into a local memory cache and use that, which results in it missing updates to the table.
For your scenario, I would try changing the cache mode to None (Partial is the other option). None indicates that an actual query should be fired off to the target database for every row that passes through. Depending on your data volume, or with a poorly performing query, that can have a not-insignificant impact on the destination. It still won't guarantee that the parallel instance isn't trying to load the exact same record (or that the parallel run hasn't already satisfied its lookup and is ready to write to the target table), but it should improve the situation.
If you cannot control the package executions so that the concurrent data flows don't fire at the same time, then you should look at re-architecting the approach (write to partitions and swap them in, use something to lock resources, stage all the data and use a T-SQL MERGE, etc.).
Just a thought ... How about writing the new records to a temp table and merging it intermittently? This will give an opportunity to filter out duplicates.
Pretty straightforward question: which tables are affected by the Catalog URL Rewrites index in Magento?
Each time I run this index it takes a long time to run and the admin status for the index gets stuck on PROCESSING.
I have tried to find locked tables with SHOW FULL PROCESSLIST and have TRUNCATED core_url_rewrite, and now I am waiting for the rebuild to run again while I watch for errors in system.log.
It would be nice to know exactly which tables are used and if it is just core_url_rewrite and catalogsearch_fulltext, which I have also truncated....
Just found these files:
They seem to match times when I tried to run the index, but do they stop the index creation like a mysql lock file would do?
It's not about what the process does, it's about how it does it. It will load up products one by one and do processing. Try to run
php indexer.php --reindex catalog_url
in your magento/shell directory. With max_execution_time set to zero and enough memory, it will eventually finish.
As long as the lock file is there, no other reindex process can start. The question about tables is a little more complex; try turning on the MySQL general log and watching for updates. The time spent in MySQL is not a big concern; instantiating product objects is both slow and leaky. Make sure you have this patch.