Which tables are affected by the Catalog URL Rewrites index in Magento?

A fairly straightforward question: which tables are affected by the Catalog URL Rewrites index in Magento?
Each time I run this index it takes a long time, and its status in the admin gets stuck on PROCESSING.
I have looked for locked tables with SHOW FULL PROCESSLIST and have truncated core_url_rewrite. Now I am waiting for the rebuild to run again while I watch system.log for errors.
It would be nice to know exactly which tables are used, and whether it is just core_url_rewrite and catalogsearch_fulltext, which I have also truncated.
Just found these files:
magento/var/locks/index_process_1.lock
magento/var/locks/index_process_2.lock
magento/var/locks/index_process_3.lock
magento/var/locks/index_process_4.lock
magento/var/locks/index_process_5.lock
magento/var/locks/index_process_6.lock
magento/var/locks/index_process_7.lock
magento/var/locks/index_process_8.lock
magento/var/locks/index_process_9.lock
magento/var/locks/index_process_10.lock
They seem to match the times when I tried to run the index, but do they stop the index creation like a MySQL lock file would?

It's not about what the process does, it's about how it does it. It loads products one by one and processes each of them. Try running
php indexer.php --reindex catalog_url
in your magento/shell directory. With max_execution_time set to zero and enough memory, it will eventually finish.
As long as the lock file is there, no other reindex process can start. The question about tables is a little more complex; try turning on the MySQL general log and watching for updates. The time spent in MySQL is not a big concern: instantiating product objects is both slow and leaky. Make sure you have this patch.
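To illustrate how a lock held on such a file can keep a second reindex from starting, here is a minimal Python sketch of non-blocking advisory file locking. It assumes the index processes take an flock-style lock on their lock files, which is an assumption about the implementation and not something shown in this thread; the try_lock helper and the temporary path are hypothetical, and fcntl is only available on Unix-like systems.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns the open file object on success (keep it open to hold the lock),
    or None if another holder already has the lock.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical stand-in for magento/var/locks/index_process_3.lock
lock_path = os.path.join(tempfile.mkdtemp(), "index_process_3.lock")

first = try_lock(lock_path)    # the running reindex holds the lock
second = try_lock(lock_path)   # a second attempt is refused while it is held

first.close()                  # the first holder finishes and releases the lock
third = try_lock(lock_path)    # the file still exists, but can be locked again
```

Under this assumption, what blocks a new run is a lock that is still held by a live process, so it is worth checking whether an earlier reindex is still running before deleting anything.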

Redis write back cache still a manual task?

I am working on an assignment. The REST API (developed in Spring) has a method m() which simulates a person cleaning windows. Towards the end, the cleaner has to write a unique phrase (a string) on the window. Phrases written by all cleaners are eventually saved in the MySQL DB. Each time m() is executed, a query fetches all phrases written to the DB so far today. m() then generates a random string as a phrase, checks it against the queried phrases to make sure it is unique, and writes it to the DB. So there is one query per m() call to fetch all phrases and one to write the phrase. Both happen on the same table.
This scenario can take advantage of caching, so I turned to Redis. I also think a write-back cache is the best solution: every write goes to the cache instead of the DB, and every read comes from the cache as well. The cache can be copied to the DB by a background thread every hour (or at some configurable interval). I was reading "Can Redis write out to a database like PostgreSQL?" and it seems that some years ago you had to do this manually.
My questions:
Is doing this manually still the way to go? If not, can someone point me to a Redis resource I can make use of?
If manual is the way to go this is how I plan to implement it. Is it ideal?
Phrases written each hour will be appended to a list of objects (userid, phrase) in Redis. The list for midnight to 1 am will be called phrases_1, the one for 1 to 2 am phrases_2, and so on. Each hour a background thread will write the entire hour's list to the DB. Whenever all phrases need to be fetched for checking, I will load all of the day's lists from the cache (e.g. phrases_1, phrases_2) in a loop and consolidate them. (Later, when the number of users grows, I will have to shard, but that is not my immediate concern.)
Thanks.
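The hourly-bucket plan in the question could be sketched roughly as follows. Plain Python containers stand in for the Redis lists and the MySQL table so that the logic runs without either server; WriteBehindBuffer, add_phrase, phrases_today and flush_hour are hypothetical names, and buckets are numbered from 1 for the midnight-to-1-am hour as described in the question.

```python
from collections import defaultdict


class WriteBehindBuffer:
    """Hourly buckets of (user_id, phrase), copied to a backing store later."""

    def __init__(self):
        self.buckets = defaultdict(list)  # stands in for Redis lists phrases_1..phrases_24
        self.flushed = set()              # bucket names already copied to the database
        self.database = []                # stands in for the MySQL table

    @staticmethod
    def bucket_name(hour):
        # hour is 0-23; midnight to 1 am maps to phrases_1
        return f"phrases_{hour + 1}"

    def phrases_today(self):
        # Consolidate the phrases of every bucket of the day
        return {phrase for bucket in self.buckets.values() for _, phrase in bucket}

    def add_phrase(self, hour, user_id, phrase):
        # Reject the phrase if any bucket of the day already contains it
        if phrase in self.phrases_today():
            return False
        self.buckets[self.bucket_name(hour)].append((user_id, phrase))
        return True

    def flush_hour(self, hour):
        # Copy one hour's bucket to the database exactly once
        name = self.bucket_name(hour)
        if name in self.flushed:
            return 0
        rows = list(self.buckets.get(name, []))
        self.database.extend(rows)
        self.flushed.add(name)
        return len(rows)


buffer = WriteBehindBuffer()
accepted = buffer.add_phrase(0, "alice", "sparkling")
rejected = buffer.add_phrase(1, "bob", "sparkling")   # duplicate phrase
buffer.add_phrase(1, "bob", "spotless")
written = buffer.flush_hour(0)
```

Note that the check-then-append step here is not atomic; with several concurrent writers against a real cache, two cleaners could still pick the same phrase between the check and the write.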
Check https://github.com/RedisGears/rgsync (and https://redislabs.com/solutions/use-cases/caching/), which tries to address both the write-back and write-through cases.
I have yet to test its functionality.
It is also interesting to note that a 2020 CMU paper (https://www.pdl.cmu.edu/PDL-FTP/Storage/2020.apocs.writeback.pdf) claims that "writeback-aware caching is NP-complete and Max-SNP hard".
Instead of going to Redis for uniqueness of data, you should create a unique index on the field you want to be unique, and MySQL will take care of the rest for you.
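As a minimal sketch of the unique-index approach, the snippet below lets the database reject duplicates instead of checking them in application code. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table, column and index names and the save_phrase helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (user_id TEXT NOT NULL, phrase TEXT NOT NULL)")
conn.execute("CREATE UNIQUE INDEX idx_phrases_phrase ON phrases (phrase)")


def save_phrase(user_id, phrase):
    """Insert the phrase; return False if the unique index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO phrases (user_id, phrase) VALUES (?, ?)",
                (user_id, phrase),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = save_phrase("alice", "sparkling")
duplicate = save_phrase("bob", "sparkling")   # rejected by the unique index
row_count = conn.execute("SELECT COUNT(*) FROM phrases").fetchone()[0]
```

On a rejected insert, the caller can simply generate a new random phrase and retry. If phrases only need to be unique within a day, a composite unique index over a date column and the phrase would be one option.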

MS SQL Server Query caching

One of my projects has a very large database on which I can't edit indexes etc.; I have to work with it as it is.
While testing some queries that I will be running on their database via a service I am writing in .NET, I saw that they are quite slow when run for the first time.
Here is what they used to do before: they have 2 main (large) tables that are used the most. They showed me that they open SQL Server Management Studio and run
SELECT *
FROM table1
JOIN table2
a query that takes around 5 minutes to run the first time, but then takes about 30 seconds if you run it again without closing SQL Server Management Studio. So they keep SQL Server Management Studio open 24/7, so that when one of their programs executes queries related to these 2 tables (which seems to be almost all queries run by their program), it gets the 30-second run time instead of the 5 minutes.
I assume this happens because the 2 tables get cached, so there are few or no disk reads.
Is it a good idea to have a service that runs a query every now and then to cache these 2 tables? Or is there a better solution, given that I can't edit indexes, split the tables, etc.?
Edit:
Sorry, I was possibly unclear: the DB hopefully has indexes already, I am just not allowed to edit them or anything.
Edit 2:
Query plan
This could be a candidate for an indexed view (if you can persuade your DBA to create it!), something like:
CREATE VIEW dbo.transhead_transdata
WITH SCHEMABINDING
AS
SELECT
    <columns of interest>
FROM
    dbo.transhead th
    JOIN dbo.transdata td
        ON th.GID = td.HeadGID;
GO
-- SCHEMABINDING requires two-part object names; the dbo schema is assumed here.
CREATE UNIQUE CLUSTERED INDEX transjoined_uci ON dbo.transhead_transdata (<something unique>);
This will "precompute" the JOIN (and keep it in sync as transhead and transdata change).
You can't create indexes? That is your biggest performance problem. A better solution would be to create the proper indexes and address any remaining performance issues by checking wait stats, resource contention, etc. I'd start with Brent Ozar's blog and open source tools, and move forward from there.
Keeping SSMS open doesn't prevent the plan cache from being cleared. I would start with a few links.
Understanding the query plan cache
Check your current plan cache
Understanding why the cache would clear (memory constraints, too many plans to hold them all, index rebuild operations, etc.). Brent talks about this in this answer
How to clear it manually
Aside from that, that query is suspect. I wouldn't expect your application to use those results; that is, I wouldn't expect you to load every row and column from two tables into your application every time it is called. Understand that a different query on those same tables (selecting fewer columns, adding a predicate, etc.) could, and likely would, cause SQL Server to generate a new, better-optimized query plan. The current query, without predicates, selecting every column, and with no indexes as you stated, would simply do two table scans. Any increase in performance going forward wouldn't be because the plan was cached, but because the data was held in memory, so subsequent runs wouldn't incur physical reads. That is, it is reading from memory rather than from disk.
There's a lot more that could be said, but I'll stop here.
You might also consider putting this query into a stored procedure and scheduling it through SQL Agent to run at a regular interval, which will keep the required pages cached.
Thanks to both @scsimon and @Branko Dimitrijevic for their answers. I think they were really useful and guided me in the right direction.
In the end, it turned out that the 2 biggest issues were hardware resources (RAM, no SSD) and the Auto Close option, which was set to True.
Other fixes that I have made (written here for anyone else trying to improve a similar setup):
A helper service tool will reorganize (defragment) indexes once every week and rebuild them once a month.
Created a view which has all the columns from the 2 tables in question, to eliminate the JOIN cost.
Advised that a DBA can probably help with better tables/indexes.
Advised improving the server hardware.
I will accept @Branko Dimitrijevic's answer, as I can't accept both.

Select LIMIT 1 takes long time on postgresql

I'm running a simple query on a local PostgreSQL database and it takes too long:
SELECT * FROM features LIMIT 1;
I expect such a query to finish in a fraction of a second, as it basically says "peek anywhere in the database and pick one row". Or doesn't it?
table size is 75GB with estimated row count 1.84405e+008
I'm the only user of the database
the database server was just started, so I guess nothing is cached in memory
I totally agree with what @larwa1n said in his comment on your post.
I guess the underlying problem is simply that the SELECT itself is too slow.
In my experience there may be other reasons as well. I list them below:
The table is too big, so add a WHERE clause and an index.
Your server or disk drive is too slow.
Other processes are taking most of the resources.
Another cause may be maintenance: check whether autovacuum is running. If it is not, check whether this table has been vacuumed already; if it has not, run VACUUM FULL on it. When you do a lot of inserts, updates, and deletes on a large table without vacuuming, the table ends up stored in fragmented disk blocks, which makes queries take longer.
Hopefully this answer will help you find the actual cause.
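One way a LIMIT 1 without ORDER BY can still be slow on an unvacuumed table is when a front-to-back scan has to pass over many pages holding only dead rows before it reaches the first live one. The toy model below illustrates that effect; it is a simulation with made-up page contents, not PostgreSQL code, and pages_read_for_limit_1 is a hypothetical helper.

```python
def pages_read_for_limit_1(pages):
    """Count the pages a front-to-back scan reads before finding one live row.

    Each page is a list of booleans: True for a live row, False for a dead one.
    Returns the number of pages read, or None if no live row exists.
    """
    for pages_read, page in enumerate(pages, start=1):
        if any(page):
            return pages_read
    return None


# A compact table: the very first page already holds a live row
compact_table = [[True, True], [True, True]]

# A bloated table: 1000 pages of dead rows left behind, then the live data
bloated_table = [[False, False]] * 1000 + [[True, True]]

compact_cost = pages_read_for_limit_1(compact_table)
bloated_cost = pages_read_for_limit_1(bloated_table)
```

In this model the compact table answers after one page while the bloated one reads 1001 pages for the same single row, which is the kind of difference vacuuming is meant to remove.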

How to load 15,000,000 records into a table with Pentaho?

I have created an ETL process with Pentaho that selects data from a table in one database and loads it into another database.
The main problem I am facing is that 1,500,000 rows take 6 hours. The full table has 15,000,000 rows, and I have to load 5 tables like that.
Can anyone explain how one is supposed to load a large amount of data with Pentaho?
Thank you.
I have never had a problem with volume in Pentaho PDI. Check the following, in order.
Check whether the problem really comes from Pentaho: what happens if you run the query in SQL Developer, Toad, or SQL-IDE-Fancy-JDBC-Compliant?
In principle, PDI is meant to import data with a SELECT * FROM ... WHERE ... and do all the rest in the transformation. I have a set of transformations here that took hours to execute because they ran complex queries. The problem was not PDI but the complexity of the query. The solution is to move the GROUP BY and SELECT FROM (SELECT...) into PDI steps, which can start before the query result is complete. The result went from around 4 hours to 56 seconds. No joke.
What is your memory size? It is defined in spoon.bat / spoon.sh.
Near the end you have a line which looks like PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m" "-XX:MaxPermSize=256m". The important parameter is -Xmx.... If it is -Xmx256K, your JVM has only 256KB of RAM to work with.
Change it to 1/2 or 3/4 of the available memory, in order to leave room for other processes.
Is the output step the bottleneck? Check by disabling it and watching the clock during the run.
If it is slow, increase the commit size and allow batch inserts.
Disable all indexes and constraints and restore them once the data is loaded. There are nice SQL script executor steps to automate that, but check it manually first and then put it in a job; otherwise the index restore may trigger before the load begins.
You also have to check that you do not lock yourself out: as PDI launches all the steps together, you may have truncates waiting for another truncate to release its lock. Even if you are not in a never-ending block, it may take quite a while before the DB is able to cascade everything.
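The advice about commit size and batch inserts can be illustrated outside PDI with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the target database; the target table, its columns and the load_rows helper are hypothetical, and the commit size of 1,000 is only an example value.

```python
import sqlite3


def load_rows(conn, rows, commit_size):
    """Insert rows in batches, committing once per batch of commit_size rows.

    Returns the number of commits issued.
    """
    commits = 0
    for start in range(0, len(rows), commit_size):
        batch = rows[start:start + commit_size]
        # One batched statement per chunk instead of one round trip per row
        conn.executemany("INSERT INTO target (id, payload) VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, payload TEXT)")
rows = [(i, f"row-{i}") for i in range(10_000)]

commits = load_rows(conn, rows, commit_size=1_000)
loaded = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

With these values, 10,000 rows are written with 10 commits rather than 10,000; against a networked database the saving comes mostly from fewer round trips and fewer transaction flushes.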
There's no fixed answer covering all possible performance issues. You'll need to identify the bottlenecks and solve them in your environment.
If you look at the Metrics tab while running the job in Spoon, you can often see at which step the rows/s rate drops. It will be the one with the full input buffer and empty output buffer.
To get some idea of the maximum performance of the job, you can test each component individually.
Connect the Table Input to a dummy step only and see how many rows/s it reaches.
Define a Generate Rows step with all the fields that go to your destination and some representative data and connect it to the Table Output step. Again, check the rows/s to see the destination database's throughput.
Start connecting more steps/transformations to your Table Input and see where performance goes down.
Once you know your bottlenecks, you'll need to figure out the solutions. Bulk load steps often help the output rate. If network lag is holding you back, you might want to dump data to compressed files first and copy those locally. If your Table Input has joins or WHERE clauses, make sure the source database has the correct indexes to use, or change your query.
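The rule of thumb above, that the bottleneck is the step with a full input buffer and an empty output buffer, can be written down as a small sketch. The StepMetrics structure, the fill fractions and the step names are hypothetical illustrations, not something exported by Spoon.

```python
from dataclasses import dataclass


@dataclass
class StepMetrics:
    name: str
    input_fill: float    # fraction of the input buffer in use, 0.0 to 1.0
    output_fill: float   # fraction of the output buffer in use, 0.0 to 1.0


def find_bottleneck(steps):
    """Return the step whose input buffer is fullest relative to its output buffer."""
    return max(steps, key=lambda step: step.input_fill - step.output_fill)


# Made-up snapshot of three steps in a transformation
snapshot = [
    StepMetrics("Table Input", input_fill=0.0, output_fill=1.0),
    StepMetrics("Lookup", input_fill=1.0, output_fill=0.05),
    StepMetrics("Table Output", input_fill=0.1, output_fill=0.0),
]

bottleneck = find_bottleneck(snapshot)
```

Here the middle step is picked: rows pile up in front of it while the steps after it sit idle.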

Temp Table cannot load content without timing out

When I try to load a page, it times out after some time, and I have no idea how to edit the query to fix this problem.
The SQL is as follows:
<createTempTable nml-type="String">
DECLARE GLOBAL TEMPORARY TABLE temp_tdt
(subject_id_num bigint)
ON COMMIT PRESERVE ROWS
NOT LOGGED
WITH REPLACE
</createTempTable>
The dataset is not large at all, but this still times out.
The missing index has never been a problem in the past; the data simply loaded without trouble. Now, instead of loading the data, it times out the first time and then goes back to loading fast. This SQL is defined in an XML file in WebSphere. I am looking for a more efficient way.
We're missing a lot of detail here. But either your temporary table is created for every user (every new connection) or on every app restart, and on top of that there are no indexes on this temp table.
Do the timeouts occur on the first invocation or after a while? It would be best to change to a proper table with indexes and, if you still get timeouts, analyse the queries run against it.