I need to extract data from MongoDB and load it into another database. The MongoDB deployment is a replica set with two secondaries, and I read the data from a secondary.
I set up a simple transformation (MongoDB Input to a Dummy step) to test the speed, but the MongoDB read speed is usually only around 200 rows/s.
I think that is too slow.
Can anyone share their experience: what is a typical speed for the MongoDB Input step, and how can it be optimized?
Thanks!
I haven't worked with MongoDB lately, but I would suggest checking a few things.
Are you running the transformation from a local machine or a remote server? If it runs remotely, check the network speed while the transformation runs, since you may be pulling the data over a slow network link.
We usually configure a Carte server and run the transformation on the database server to improve performance.
Check the commit/batch size; you may need to adjust it for the stream you are working with. Increasing it usually helps.
See if this helps: https://help.pentaho.com/Documentation/6.1/0L0/0Y0/090/020#OptimizeDS
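Outside of PDI, a quick standalone read test can show whether the bottleneck is the network/server or the MongoDB Input step itself. This is only a hedged sketch: the hosts, replica set name, database, and collection are placeholders, and it assumes the MongoDB Java sync driver. It exercises the two settings that usually matter here, read preference (so the read really lands on a secondary) and cursor batch size (so documents come back in large network batches).

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCursor;
    import org.bson.Document;

    public class MongoReadTest {
        public static void main(String[] args) {
            // Placeholder hosts/replica set; readPreference sends the query to a secondary.
            String uri = "mongodb://host1:27017,host2:27017,host3:27017/"
                    + "?replicaSet=rs0&readPreference=secondaryPreferred";
            try (MongoClient client = MongoClients.create(uri)) {
                long start = System.currentTimeMillis();
                long count = 0;
                try (MongoCursor<Document> cursor = client.getDatabase("mydb")
                        .getCollection("mycollection")
                        .find()
                        .batchSize(1000)   // pull 1000 documents per round trip
                        .iterator()) {
                    while (cursor.hasNext()) {
                        cursor.next();
                        count++;
                    }
                }
                long elapsedMs = System.currentTimeMillis() - start;
                System.out.printf("%d rows in %d ms (%.0f rows/s)%n",
                        count, elapsedMs, count * 1000.0 / elapsedMs);
            }
        }
    }

If this standalone loop also reads around 200 rows/s, the limit is the network or the MongoDB server rather than anything in the transformation.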
I have a website that needs a database to store some user information and blob storage to save some files.
I want to minimize the cost as much as possible, so I played around in the Microsoft Azure Pricing Calculator with an Azure SQL Database. For the database, I think 2 GB of storage would be enough over its whole lifetime.
I arrived at two options that were dirt cheap, but I don't really understand what they give me.
The first is serverless compute with 3600 seconds (of runtime?).
Is that the time my database spends processing requests? For example, if I have a SELECT statement that takes 1 second to complete, will I be left with 3599 seconds for that month?
If that's the case, what happens if I run out of time?
The second option uses Hardware Type: Gen 4,
but for this one I don't have any other options to configure for my needs. Is this hardware generation obsolete? Can I rely on it for production?
If you need a very cheap one, use the Basic or S0 tier.
Keep in mind that Basic is very slow: try connecting to it through SSMS and you'll see.
Serverless is for databases that are paused for about three-quarters of the day. That might be your case, but keep in mind that while they are in use they cost a lot. I don't think it will be suitable for you.
I have a Singlestore (previously MemSQL) cloud database set up.
My software is running in the background, constantly writing to a table.
When I try to query this table, it takes 10+ seconds. When the software is shut off, the query takes milliseconds.
What would be the reason for this? And is there anything that can be done to mitigate against this?
From a high level, cluster resources are much more heavily utilized while the background software constantly writes to the table. The same resources that handle the constant writes are concurrently trying to serve the query, so it makes sense that it's faster when there is no writing.
A 'knob to turn' with respect to database ingest performance is partition count: you can try creating a test DB with more partitions than the current DB (say, 2x more), as sketched below. Then query the test DB both while the background software is running and while it is not, and compare the results against the DB with fewer partitions.
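A hedged sketch of that experiment, assuming a MySQL-wire-compatible JDBC connection to the cluster (SingleStore speaks the MySQL protocol); the host, credentials, database/table names, and partition count are placeholders rather than values from your setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionExperiment {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://singlestore-host:3306/", "admin", "password");
                 Statement st = con.createStatement()) {
                // Create a copy of the database with roughly double the partition count.
                st.execute("CREATE DATABASE test_db PARTITIONS 16");
                st.execute("CREATE TABLE test_db.events LIKE prod_db.events");
                st.execute("INSERT INTO test_db.events SELECT * FROM prod_db.events");
                // Now run the slow query against test_db.events, with the background
                // writer pointed at it and then stopped, and compare with prod_db.events.
            }
        }
    }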
For general guidance on troubleshooting query performance, see this section of the docs: https://docs.singlestore.com/managed-service/en/query-data/query-procedures/troubleshooting-poorly-performing-queries.html
If you're an active customer, you can also file a support ticket to get some additional analysis of what's happening on the backend.
I'd like to load data into a Google CloudSQL instance via Google Dataflow.
Since there is no built-in sink for CloudSQL, I decided to use org.apache.beam.sdk.io.jdbc.JdbcIO.
But the throughput into CloudSQL is very low (about 6 records/sec).
I suspected that the CloudSQL instance was underpowered, but there was no improvement when I upgraded it.
In the Dataflow logs, there are many entries like the following:
Proposing dynamic split of work unit my-project;2017-06-27_02_58_19-14077185378147382467;6703504927792172410 at {"fractionConsumed":0.9669782519340515}
Rejecting split request because custom reader returned null residual source.
What is happening, and how can I improve the performance?
It's resolved!
When generating the connection string, add a parameter as below:
JdbcIO.DataSourceConfiguration.create("com.mysql.jdbc.Driver", "jdbc:mysql://google/mydatabase?cloudSqlInstance=myproject:region:instance-name&socketFactory=com.google.cloud.sql.mysql.SocketFactory&rewriteBatchedStatements=true")
Adding "rewriteBatchedStatements=true", it's worked.
The throughput improved to 2000/sec about!
Notice: it workes only when using mysql, perhaps.
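For context, here is roughly how that configuration slots into a JdbcIO write. This is only a hedged sketch: the sample data, table, statement, and credentials are placeholders, and the key part is the rewriteBatchedStatements=true parameter, which lets the MySQL Connector/J driver rewrite the batched single-row INSERTs into multi-row INSERTs.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;

    public class LoadToCloudSql {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
            p.apply(Create.of(KV.of(1, "alice"), KV.of(2, "bob")))   // stand-in for the real source
             .apply(JdbcIO.<KV<Integer, String>>write()
                 .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                         "com.mysql.jdbc.Driver",
                         "jdbc:mysql://google/mydatabase"
                             + "?cloudSqlInstance=myproject:region:instance-name"
                             + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"
                             + "&rewriteBatchedStatements=true")     // the parameter that fixed it
                     .withUsername("user")
                     .withPassword("password"))
                 .withStatement("INSERT INTO mytable (id, name) VALUES (?, ?)")
                 .withPreparedStatementSetter((element, statement) -> {
                     statement.setInt(1, element.getKey());
                     statement.setString(2, element.getValue());
                 }));
            p.run().waitUntilFinish();
        }
    }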
Rejecting split request because custom reader returned null residual source.
Whatever custom source you implemented doesn't appear to support dynamic rebalancing.
I suspected that the CloudSQL instance was underpowered, but there was no improvement when I upgraded it.
Are you sure it's the throughput to Cloud SQL that is the issue? Have you measured the performance of your source to confirm that it isn't the actual bottleneck?
I'd like to load data into a Google CloudSQL instance via Google Dataflow.
Generally, I wouldn't recommend this. Cloud SQL is a single-machine database, so I suspect you don't get much benefit, and perhaps it's even a performance negative, from using a horizontally scalable tool like Dataflow. You should be able to ingest into Cloud SQL just as fast using a single VM instance to load the data.
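For example, a plain JDBC batch load running on one VM close to the instance can already push a large number of rows per second. A hedged sketch (the host, credentials, table, row count, and batch size are all placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class SingleVmLoader {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://10.0.0.5:3306/mydatabase"
                    + "?rewriteBatchedStatements=true";   // same batching trick as above
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO mytable (id, name) VALUES (?, ?)")) {
                con.setAutoCommit(false);
                for (int i = 0; i < 1_000_000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row-" + i);
                    ps.addBatch();
                    if ((i + 1) % 5000 == 0) {   // flush every 5,000 rows
                        ps.executeBatch();
                        con.commit();
                    }
                }
                ps.executeBatch();
                con.commit();
            }
        }
    }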
I need to access a Sybase database (12.5) from overseas. The high latency is definitely a problem.
I have already optimized the connection parameters to make better use of the network and achieved a 20x performance increase, but it's still not enough: 1 minute to get 3 MB of data.
We need another 10x or 20x increase for our application.
Technical details:
The data flows through a single TCP connection using the TDS protocol.
The client app is an Excel sheet with macros, using the default Sybase driver.
The corporate environment makes it difficult to push big changes to the 10+ year-old architecture, so solutions need to be as unintrusive as possible. But some changes may be negotiated given the importance of this project.
Can anyone give me pointers?
I have already thought of:
Splitting SQL requests over several concurrent connections to the database. The problem is data consistency: what if records are modified in the meantime, since the requests will not be executed at exactly the same time? Is there an existing mechanism to spread a request over several calls on different connections?
Using some kind of database "cache" or "local replication" overseas, but I don't know what is possible.
Thanks.
Try installing a local database (ASE or ASA) and synchronizing it with the main database using Sybase MobiLink (or Sybase Replication Server if you need low replication latency and have a lot of money).
(I know, I'm answering my own question.)
Eventually, we settled on designing our own remote database access protocol. It's not complicated, since we only use a basic subset of SQL (SELECT and UPDATE), and the protocol doesn't have to understand SQL anyway.
By using our own protocol, we'll be able to use compression, let the client use several TCP links at the same time, maximize network utilization, and add some functional caching specific to our application.
The client will be our app, and the server will be a "proxy" to the real database, sitting next to it (as @Tim suggested in the comments).
It's not the only solution, but we feel it strikes a good balance between the enormous cost of replication, development complexity, and expected benefits.
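For illustration only, a minimal sketch of the "proxy next to the database" idea: the proxy runs the query locally over jConnect (so the high-latency hop never carries TDS) and streams the whole result set back as one gzip-compressed payload. The class name, port, query, and connection details are made up for the example, and the multiple TCP links and caching mentioned above are not shown.

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.zip.GZIPOutputStream;

    public class DbProxy {
        public static void main(String[] args) throws Exception {
            // Assumes the Sybase jConnect driver jar is on the classpath.
            try (ServerSocket server = new ServerSocket(9000)) {
                while (true) {
                    try (Socket client = server.accept();
                         Connection db = DriverManager.getConnection(
                                 "jdbc:sybase:Tds:db-host:5000/mydb", "user", "password");
                         Statement st = db.createStatement();
                         ResultSet rs = st.executeQuery("SELECT id, name, amount FROM orders");
                         Writer out = new OutputStreamWriter(
                                 new GZIPOutputStream(client.getOutputStream()), "UTF-8")) {
                        int cols = rs.getMetaData().getColumnCount();
                        StringBuilder row = new StringBuilder();
                        while (rs.next()) {
                            row.setLength(0);
                            for (int i = 1; i <= cols; i++) {
                                if (i > 1) row.append(',');
                                row.append(rs.getString(i));
                            }
                            out.write(row.append('\n').toString());
                        }
                    }   // closing the writer finishes the gzip stream and the connection
                }
            }
        }
    }

On the other side of the ocean, the client opens one socket, wraps the input stream in a GZIPInputStream, and parses the rows, so a multi-megabyte result costs one round trip plus the compressed transfer time.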
Our database architecture consists of two SQL Server 2005 servers, each with an instance of the same database structure: one for all reads and one for all writes. We use transactional replication to keep the read database up to date.
The two servers are very high-spec indeed (the write server has 32GB of RAM), and are connected via a fibre network.
When deciding upon this architecture we were led to believe that the latency for data to be replicated to the read server would be on the order of a few milliseconds (depending on load, obviously). In practice we are seeing latency of around 2-5 seconds in even the simplest of cases, which is unsatisfactory. By the simplest case, I mean updating a single value in a single row in a single table on the write DB and seeing how long it takes to observe the new value in the read database.
What factors should we be looking at to achieve latency below 1 second? Is this even achievable?
Alternatively, is there a different mode of replication we should consider? What is the best practice for the locations of the data and log files?
Edit
Thanks to all for the advice and insight. I believe the latency we are experiencing is normal; we were misled by our DB hosting company as to what latency times to expect!
We're using the technique described near the bottom of this MSDN article (under the heading "scaling databases"), and we'd failed to deal properly with this warning:
The consequence of creating such specialized databases is latency: a write is now going to take time to be distributed to the reader databases. But if you can deal with the latency, the scaling potential is huge.
We're now looking at implementing a change to our caching mechanism that enforces reads from the write database when an item of data is considered to be "volatile".
No. It's highly unlikely you could achieve sub-1s latency times with SQL Server transactional replication even with fast hardware.
If you can get 1 - 5 seconds latency then you are doing well.
From here:
Using transactional replication, it is possible for a Subscriber to be a few seconds behind the Publisher. With a latency of only a few seconds, the Subscriber can easily be used as a reporting server, offloading expensive user queries and reporting from the Publisher to the Subscriber.
In the following scenario (using the Customer table shown later in this section) the Subscriber was only four seconds behind the Publisher. Even more impressive, 60 percent of the time it had a latency of two seconds or less. The time is measured from when the record was inserted or updated at the Publisher until it was actually written to the subscribing database.
I would say it's definitely possible.
I would look at:
Your network
Run ping commands between the two servers and see if there are any issues.
If the servers are next to each other, you should see < 1 ms.
Bottlenecks on the server
This could be network traffic (volume).
For example, network cards not being configured for 1 Gbit/sec.
Anti-virus or other things
Do some analysis on your queries and see whether you can identify indexes or locking that might be a problem.
See if any of the SELECTs on the read database might be blocking the writes.
Add WITH (NOLOCK) and see if it makes a difference on one or two of the queries you're analyzing (a quick timing check is sketched below).
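A hedged sketch of that check from the application side; the connection string, table, and query are placeholders, and the NOLOCK hint is used here purely as a diagnostic (it reads uncommitted data), not as something to leave in production:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class NolockCheck {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:sqlserver://read-server;databaseName=MyDb;user=app;password=secret")) {
                time(con, "SELECT COUNT(*) FROM dbo.Orders");
                time(con, "SELECT COUNT(*) FROM dbo.Orders WITH (NOLOCK)");
            }
        }

        // Run the query, drain the results, and print the elapsed time.
        private static void time(Connection con, String sql) throws SQLException {
            long start = System.nanoTime();
            try (Statement st = con.createStatement(); ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) { /* drain */ }
            }
            System.out.printf("%-55s %d ms%n", sql, (System.nanoTime() - start) / 1_000_000);
        }
    }

If the NOLOCK version is consistently faster while replication is applying changes, lock contention between readers and the replication writes is a likely suspect.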
Essentially, you have a complicated system with a problem; you need to determine which component is the cause and fix it.
Transactional replication is probably best if the reports/SELECTs you run need to be up to date. If they don't, you could look at log shipping, although that would add some downtime each time a log backup is restored.
For data/log files, make sure they're on separate drives so that performance is maximized.
Something to remember about transactional replication is that a single update now requires several operations before the change shows up on the subscriber.
First, you update the source table.
Next, the Log Reader Agent sees the change and writes it to the distribution database.
Then the Distribution Agent sees the new entry in the distribution database, reads the change, and runs the corresponding stored procedure on the subscriber to update the row.
If you monitor the statement run times on the two servers, you'll probably see that they execute in just a few milliseconds. However, it's the lag while the Log Reader and Distribution Agents wait to notice that they have work to do that is going to kill you.
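One way to see where that lag accumulates, assuming tracer tokens are available (they were added for transactional replication in SQL Server 2005), is to post a token on the publisher and watch it move through the Log Reader and Distribution Agents. A hedged sketch; the publication name and connection details are placeholders:

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    public class ReplicationLatencyProbe {
        public static void main(String[] args) throws Exception {
            // Connect to the publication database on the write (publisher) server.
            try (Connection pub = DriverManager.getConnection(
                    "jdbc:sqlserver://write-server;databaseName=MyDb;user=repl;password=secret")) {
                // Drop a tracer token into the transaction log for this publication.
                try (CallableStatement post = pub.prepareCall("{call sys.sp_posttracertoken(?)}")) {
                    post.setString(1, "MyPublication");
                    post.execute();
                }
                // List posted tokens; Replication Monitor (or sys.sp_helptracertokenhistory with a
                // token id from here) then reports publisher-to-distributor and
                // distributor-to-subscriber latency for each token.
                try (CallableStatement list = pub.prepareCall("{call sys.sp_helptracertokens(?)}")) {
                    list.setString(1, "MyPublication");
                    try (ResultSet rs = list.executeQuery()) {
                        while (rs.next()) {
                            System.out.println("token " + rs.getLong("tracer_id")
                                    + " posted at " + rs.getTimestamp("publisher_commit"));
                        }
                    }
                }
            }
        }
    }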
If you truly need sub-second processing time, then you will want to look into writing your own processing engine to move data from one server to the other. I would recommend using SQL Service Broker for this, as that way everything is native to SQL Server and no third-party code has to be written.