Reindex Magento 2 with more than 500,000 products - e-commerce

I have a website with more than 500,000 products, and because the product count is much higher than usual I cannot complete reindexing.
I get a timeout error over SSH. The indexer mode is set to Update by Schedule.
How do I run the reindex in this case?
Looking for solutions :)
Thanks

Magento's indexers are scoped and multi-threaded to support reindexing in parallel mode. Reindexing is parallelized by the indexer's dimensions and executed across multiple threads, which reduces processing time.
Check this:
https://devdocs.magento.com/guides/v2.4/config-guide/cli/config-cli-subcommands-index.html
Maybe it can help :)
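As a sketch of what that looks like from the CLI, assuming a standard Magento 2.4 install (the thread count of 4 and the price indexer are just examples; pick values for your hardware):

```shell
# Run a single heavy indexer with multiple threads.
# MAGE_INDEXER_THREADS_COUNT enables parallel mode for dimension-aware indexers.
MAGE_INDEXER_THREADS_COUNT=4 bin/magento indexer:reindex catalog_product_price

# Optionally raise the PHP memory limit and remove the CLI time limit
# so the process is not killed mid-reindex (values are examples):
php -d memory_limit=4G -d max_execution_time=0 bin/magento indexer:reindex
```

Running the command under nohup or screen also keeps it alive if the SSH session itself times out.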

Related

How to troubleshoot suspended queries in Azure Synapse?

I am encountering suspended queries in Azure Synapse when executing stored procedure calls from ADF.
I also followed the suggestions in the link below for troubleshooting the issue:
(Link deleted due to sensitive information.)
The troubleshooting queries returned the results below:
I checked whether a transaction lock was the issue by killing a few suspended or running queries that had been running for more than 15 hours. I also checked the rest of the running queries, but nothing there would cause a transaction lock. I tried running the stored procedure manually from Azure Data Studio (the one that is blocked, as mentioned above), and it took 40 seconds to complete.
When the same query runs from ADF and gets suspended, it takes nearly an hour to finish.
Any suggestion to troubleshoot this issue is much appreciated.
Thanks
There are a number of factors you must always consider when tuning queries in Azure Synapse Analytics dedicated SQL pools:
DWU - what DWU is your pool at? Lower DWUs mean fewer concurrent queries and lower performance, and should not be used for any kind of performance tuning. Crank it up temporarily to rule this out as a problem, bearing in mind that changing it disconnects any active queries. Also bear in mind that not all queries respond to a higher DWU.
Resource class - what resource class is associated with the user executing these queries? Remember the default is smallrc, and the admin user always has smallrc. Understand static and dynamic resource classes. The DMV sys.dm_pdw_exec_requests will give you useful information on this. Experiment with your workload to find the sweet spot between performance and concurrency versus resource class. Encourage your dev team to use labels in their queries: OPTION ( LABEL = 'some informative label' )
Table geometry - this is the distribution (ROUND_ROBIN | HASH | REPLICATE) of your table and the indexing choice (CLUSTERED COLUMNSTORE | CLUSTERED INDEX | HEAP). Clustered columnstore and round-robin are the defaults, but they are not always appropriate. Consider what is appropriate for your tables.
If you work through those and still have an issue, you can start to look at statistics and workload classification for starters, but gathering information on the points above should give you a good idea.
If you are just doing single-value INSERTs, then don't. Dedicated SQL pools are terrible at these. Convert them to a load from a file in a single INSERT / COPY INTO.
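As a minimal sketch of the label-plus-DMV workflow described above (the table name dbo.FactSales and the label text are hypothetical; sys.dm_pdw_exec_requests and its columns are real):

```sql
-- Tag workload queries so they are easy to find in the DMVs
SELECT COUNT(*) FROM dbo.FactSales
OPTION ( LABEL = 'adf_sp_load' );

-- Find suspended/running requests, with the resource class they run under
SELECT request_id, status, submit_time, total_elapsed_time,
       resource_class, [label], command
FROM   sys.dm_pdw_exec_requests
WHERE  status IN ('Suspended', 'Running')
ORDER BY submit_time DESC;
```

A request sitting in Suspended with a small resource class and a long gap between submit_time and start_time usually points at concurrency-slot queueing rather than a lock.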

Presto Nodes with too much load

I'm running some queries over a TPC-H 100 GB dataset on Presto. I have 4 nodes: 1 master and 3 workers. When I try to run some queries (not all of them), the Presto web interface shows the nodes dying during execution, resulting in query failure with the following error:
com.facebook.presto.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.
I rebooted all nodes and the Presto service, but the error remains. This problem doesn't occur if I run the same queries over a smaller dataset. Can someone provide some help with this problem?
Thanks
There are 3 possible causes for this kind of error. You can SSH into one of the workers to find out what the problem is while the query is running.
High CPU
Tune down task.concurrency to, for example, 8.
High memory
In jvm.config, -Xmx should be no more than 80% of total memory. In config.properties, query.max-memory-per-node should be no more than half the -Xmx value.
Low open-file limit
Set a larger number for the Presto process in /etc/security/limits.conf. The default is definitely way too low.
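As a worked example of those two memory rules for a hypothetical worker with 16 GB of RAM (the exact values are illustrative, not recommendations):

```
# etc/jvm.config -- heap at most ~80% of the 16 GB box
-Xmx12G

# etc/config.properties -- per-node query memory at most half of -Xmx
query.max-memory-per-node=6GB
task.concurrency=8
```

If queries still fail, lower query.max-memory-per-node before lowering -Xmx, so the JVM keeps headroom for non-query allocations.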
It might be a configuration issue. For example, if the local maximum memory is not set appropriately and the query uses too much heap memory, full GC might happen and cause such errors. I would suggest asking in the Presto Google Group and describing a way to reproduce the issue :)
I was running Presto on a Mac with 16 GB of RAM; below is the configuration from my jvm.config file.
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
I was getting the following error even when running the query
Select now();
Query 20200817_134204_00005_ud7tk failed: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.
I changed the -Xmx16G value to -Xmx10G and it works fine.
I used the following link to install Presto on my system.
Link for Presto Installation

How to circumvent BigQuery's 20 concurrent queries limitation?

Wondering if anyone knows, or has run into this: there's a 20-concurrent-queries limit for BigQuery.
https://developers.google.com/bigquery/quota-policy#queries
Is there a way to disable the limit? Our MapReduce tasks need many concurrent queries in order to complete within a reasonable amount of time.
We have a similar problem. There is no way to change this from your side. Also, "upgrading your plan" as #Dominik suggests won't help.
You have to contact Google directly and explain your problem (business case); if it is valid, they can increase your quota limits (for a certain Google Cloud project).

Solr Index slow after a while

I use SolrJ to send data to my Solr server.
When I start my program, it indexes at a rate of about 1,000 docs per second (I commit every 250,000 docs).
I have noticed that once my index is filled with about 5 million docs, it starts crawling, not just at commit time but at add time too.
My Solr server and indexing program run on the same machine
Here are some of the relevant portions from my solrconfig:
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>150</mergeFactor>
Any suggestions about how to fix this?
That merge factor seems really, really (really) high.
Do you really want that?
If you aren't using compound files, it could easily lead to a ulimit problem (if you are on Linux).
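For comparison, a more conventional solrconfig fragment (10 is the usual Lucene/Solr default mergeFactor; the buffer size here is just an example, not a recommendation):

```xml
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>256</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>
```

With mergeFactor=150, Lucene keeps up to 150 segments per level open at once, which multiplies open file handles and makes each eventual merge enormous; a lower value trades slightly more frequent merging for a bounded segment count.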

Simultaneous queries in Solr

Hey,
I am deploying a Solr server containing more than 30 million docs. Currently, I am testing search performance, and the results are highly dependent on the number of simultaneous queries I execute:
1 simultaneous query: 2516ms
2 simultaneous queries: 4250, 4469 ms
3 simultaneous queries: 5781, 6219, 6219 ms
4 simultaneous queries: 6484, 7203, 7719, 7781 ms
...
The Jetty thread pool is configured with the defaults:
<New class="org.mortbay.thread.BoundedThreadPool">
  <Set name="minThreads">10</Set>
  <Set name="lowThreads">50</Set>
  <Set name="maxThreads">10000</Set>
</New>
I would like to know if there is any factor I can set to decrease the impact of simultaneous requests on response times.
solrconfig is also configured with the defaults, but without caches (to measure worst cases) and with mergeFactor=5 (searching will be requested more than updating).
Thanks in advance
Why are you trying to do this with caching turned off? What exactly are you trying to measure?
You have effectively forced Solr (Lucene) to perform every search from the disk. What you are actually measuring is concurrency of Java itself combined with your OS and disk throughput. This has nothing to do with Jetty or Solr.
Caches are your friend. You really should be using them in any sort of a production capacity. In my opinion, you should be measuring your throughput under load while varying the caches to see what the tradeoff is between cache size and throughput.
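As a starting point for that tuning, a cache section along the lines of the stock solrconfig might look like this (the class names are the standard Solr cache implementations of that era; the sizes and autowarm counts are illustrative and need tuning against a 30-million-doc index):

```xml
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```

Re-run the same concurrency test with caches enabled and warmed; the interesting numbers are the cache hit ratios in Solr's stats page versus the change in response times under load.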
Please check out this IBM Tutorial for Solr.
It was a great help to me.
Hope you will find your answer. :-)