Pentaho gives "Unexpected End of Stream" when querying a large dataset

I have a Pentaho transformation that reads several million rows of data with a Table Input step. With a few million rows it runs fine, but at around 15 million rows I get an "Unexpected End of Stream" exception ("x out of y bytes"). When this happens, several other Table Input steps feeding Stream Lookup steps still work fine; it is the input for my main stream that gets no rows. My database is MariaDB and my timeouts are set to 8 hours (don't ask :/). Has anyone encountered anything similar?
My query was not using an index on my date range when the range was large. I have forced that index and still have the same problem. In my process list the query is stuck in the "Writing to net" state.
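For reference, the forced index looks something like this; the table, index, and column names here are placeholders, not the real ones:

```sql
-- MariaDB: force the optimizer to use the date-range index (names are placeholders).
SELECT t.*
FROM fact_table t FORCE INDEX (idx_date_range)
WHERE t.created_date BETWEEN '2018-01-01' AND '2018-12-31';
```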

The problem was with my MariaDB connector. I changed this to a MySQL connector and it worked perfectly.

Related

Using PowerShell to copy a large Oracle table to SQL Server - memory issue

We are trying to copy data from a large Oracle table (about 50M rows) to SQL Server, using PowerShell and SqlBulkCopy. The issue with this particular Oracle table is that it contains a CLOB field, and unlike our other table loads, this one takes up more and more OS memory, eventually starving SQL Server, which sits on the same server PowerShell is running on. Oracle is external and the data is sent over the network. The maximum CLOB size is 6.4M bytes, while the average is about 2,000 bytes.
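(For reference, the CLOB size figures above can be gathered with a query along these lines; the table and column names are placeholders:)

```sql
-- Oracle: inspect how large the CLOB values actually are (names are placeholders).
SELECT MAX(DBMS_LOB.GETLENGTH(clob_col)) AS max_clob_len,
       AVG(DBMS_LOB.GETLENGTH(clob_col)) AS avg_clob_len
FROM   source_schema.big_table;
```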
Here is a snippet of the code being used; batch size does not seem to have any bearing on what is happening:
```powershell
# Open the Oracle source connection and prepare the extract query
$SourceConnection = New-Object Oracle.ManagedDataAccess.Client.OracleConnection($SourceConnectionnectionstring)
$SourceConnection.Open()
$SourceCmd = $SourceConnection.CreateCommand()
$SourceCmd.CommandType = "Text"
$SourceCmd.CommandText = $queryStatment

# Bulk-copy the data reader straight into the SQL Server destination table
$bulkCopy = New-Object Data.SqlClient.SqlBulkCopy($targetConnectionString, [System.Data.SqlClient.SqlBulkCopyOptions]::UseInternalTransaction)
$bulkCopy.DestinationTableName = $destTable
$bulkCopy.BulkCopyTimeout = 0
$bulkCopy.BatchSize = 500

$SourceReader = $SourceCmd.ExecuteReader()
Start-Sleep -Seconds 2
$bulkCopy.WriteToServer($SourceReader)
```
We tried different batch sizes, smaller and larger, with the same result.
We tried EnableStreaming set to both 1 and 0.
We tried using an internal transaction (as in the code sample above) and the default options, still specifying a batch size.
Is there anything else we can do to avoid the memory pressure?
Thank you in advance!
It turned out, after extensive research, that an obscure Oracle command property controls how CLOB data is sent, and that is what was saturating memory:
InitialLOBFetchSize
This property specifies the amount of data that the OracleDataReader initially fetches for LOB columns. It defaults to 0, which means "the entire CLOB".
I set it to 1M bytes, which is plenty, and the process never ate into memory.

What is the data limit that Google Data Studio can handle?

Does anyone have experience with large data sets in Data Studio?
I want to use a data set that is close to 40 million rows with a dozen columns. I tried to check this myself, but after connecting it to a BigQuery query I got a configuration error.
If you have a data set stored in BigQuery, Data Studio should have no problem handling it through BigQuery; size shouldn't really be a problem.
I've noticed that when Data Studio accesses a BigQuery table, it is limited to 20,000,000 rows.
Specifically, a LIMIT 20000000 clause is applied to the actual query in BigQuery, and there is no way to configure or change that (to the best of my knowledge).
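If the underlying table has more rows than that, one workaround is to aggregate or filter in BigQuery first and point Data Studio at the result, so that what it queries stays under the limit. A minimal sketch, with made-up project, dataset, table, and column names:

```sql
-- BigQuery: pre-aggregate into a view so the row count stays well under 20,000,000.
CREATE OR REPLACE VIEW `my_project.my_dataset.events_daily` AS
SELECT
  event_date,
  country,
  COUNT(*)     AS event_count,
  SUM(revenue) AS total_revenue
FROM `my_project.my_dataset.events`
GROUP BY event_date, country;
```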

Insertion in Leaf table in SQL Server MDS is very slow

I am using SSIS to move data from an existing database to an MDS database.
I am using the following control flow:
Truncate TableName_Leaf
Load Data to stg
The second step has the following data flow:
1. Load data from the source database (around 90,000 records).
2. Apply a Data Conversion task to convert string data types to Unicode (as MDS only supports Unicode).
3. Specify TableName_Leaf as the OLE DB destination.
Steps 1 and 2 complete quickly, but the insertion into the Leaf table is extremely slow (it took 40 seconds to move 100 rows end to end, and around 6 minutes to move 1,000 records).
I tried deleting extra constraints from the Leaf table, but that did not improve performance much.
Is there any other way to insert data into MDS that is quicker or better?
Using "Table or view - fast load" as the data access mode in the OLE DB destination helped resolve the issue. I used a batch size of 1,000 in my case and it worked fine.

Bulk Conversion of a Large SQL Database (100 GB Stored in 10 Files, 100 Tables per File) to SQLite

I am converting a large SQL Server database (100 GB stored in 10 files, with 100 tables per file) to SQLite. Right now I am using the CodeProject C# utility, as suggested in another thread (convert sql-server *.mdf file into sqlite file). However, this approach is not entirely satisfactory for two reasons:
1. The conversion process usually stops abruptly when converting one of my files, and then I have to go in and check which tables were successfully converted.
2. I could manually convert 10 tables at a time, but that would require 100 repetitions and my constant presence in front of the computer.
Thank you so much!
It is possible a "transaction log" is being created. This is the log used to rollback changes if something goes wrong. Since your job is so large, this log file can grow too large and the process will fail.
Try this:
1) Back up the data.
2) Turn off the log with this: PRAGMA database.journal_mode = OFF;
Caveat: I've never tried this with SQLite, but other databases work in a similar fashion.
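For what it's worth, in SQLite the schema name in that PRAGMA is the attached database name (main for the primary file), so the suggestion above would look roughly like this:

```sql
-- SQLite: disable the rollback journal for the main database.
-- Note: with the journal off there is no rollback protection if the load fails.
PRAGMA main.journal_mode = OFF;

-- Verify the setting; this should return 'off'.
PRAGMA main.journal_mode;
```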

How to resume data migration from the point where the error happened in SSIS?

I am migrating data from an Oracle database to a SQL Server 2008 R2 database using SSIS. My problem is that at a certain point the package fails, say at some 40,000 rows out of 100,000. What can I do so that the next time I run the package, after correcting the errors, it restarts from the 40,001st row, i.e. the row where the error occurred?
I have tried using checkpoints in SSIS, but the problem is that they only work between control flow tasks. I want something that works on the rows being transferred.
There's no native magic I'm aware of that is going to "know" it failed on row 40,000 and, when it restarts, start streaming from row 40,001. You are correct that checkpoints are not the answer; they have plenty of issues of their own (they can't serialize Object-typed variables, loops restart, etc.).
How you can address the issue is through good design. If your package is created with the expectation that it's going to fail, then you should be able to handle these scenarios.
There are two approaches I'm familiar with. The first is to add a Lookup Transformation in the Data Flow between your source and your destination. The goal is to identify which records already exist in the target system; only the rows with no match are sent on to the destination. This is a very common pattern and also lets you detect changes between source and destination (if that is needed). The downside is that you will always transfer the full data set out of the source system and then filter rows in the data flow: if the load failed on row 999,999 out of 1,000,000, you still need to stream all 1,000,000 rows back to SSIS for it to find the one that hasn't been sent.
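The Lookup Transformation itself is configured in the data flow rather than written as code, but the filter it implements is equivalent to something like the following T-SQL; the table and key names are only illustrative:

```sql
-- Conceptual equivalent of the Lookup pattern: keep only source rows
-- that do not already exist in the destination.
SELECT s.*
FROM SourceDb.dbo.MyTable AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM DestDb.dbo.MyTable AS d
    WHERE d.SomeId = s.SomeId
);
```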
The other approach is to use a dynamic filter in the WHERE clause of your source. If you can make assumptions such as rows being inserted in order, then you can structure your SSIS package with an Execute SQL Task that runs a query like `SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint FROM dbo.MyTable` against the destination database and assigns the result to an SSIS variable (`@[User::StartingId]`). You then build the source SELECT statement with an expression along the lines of `"SELECT * FROM dbo.MyTable T WHERE T.SomeId >= " + (DT_WSTR, 10) @[User::StartingId]`. Now when the data flow begins, it will start where it last loaded data. The challenge with this approach is identifying the scenarios where you know data hasn't been inserted out of order.
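Concretely, using the names from above (and 40,001 only as an illustration of where the previous run stopped), the seed query and the generated source query look like this:

```sql
-- Execute SQL Task: run against the destination; the single value it returns
-- is stored in the SSIS variable @[User::StartingId].
SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint
FROM dbo.MyTable;

-- Source query produced by the expression when the destination already holds
-- rows 1 through 40000 (so @[User::StartingId] resolves to 40001).
SELECT *
FROM dbo.MyTable AS T
WHERE T.SomeId >= 40001;
```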
Let me know if you have questions, need things better explained, pictures, etc. Also, the above code is freehand, so there could be syntax errors, but the logic should be correct.