Fetching records based on batch size in Apache Beam

I have 100k records to process. I want to fetch 10k at a time (what I call the batch size), process them, and then fetch the next 10k, until all 100k records are processed. The goal is to reduce the processing overhead of fetching all the records at once.
Any suggestions on how to achieve this using Apache Beam?
I am using the Spark runner.
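A minimal sketch of the fetch-and-process loop described above, written in plain Python rather than as a Beam pipeline. The names `batched` and `process_batch` are hypothetical and exist only for illustration. In a real Beam pipeline the runner decides how elements are read, and fixed-size grouping is normally expressed with a transform (the Python SDK has `BatchElements` and `GroupIntoBatches`), so treat this only as an illustration of the batching idea.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield consecutive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    # Hypothetical placeholder for the real per-batch work.
    return len(batch)


# 100k records processed 10k at a time, as in the question.
batch_sizes = [process_batch(b) for b in batched(range(100_000), 10_000)]
```

Because `batched` is a generator, only one batch is held in memory at a time.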


Does Google BigQuery charge by processing time?

So, this is somewhat of a "realtime" question. I ran a query and it has currently been running for almost 12K seconds, but BigQuery told me "This query will process 239.3 GB when run." Does BQ charge by time in addition to the data processed? Should I stop this now?
I assume you are using the on-demand pricing model: you are billed based on the number of bytes processed, and processing time is not involved.
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see data size calculation.
You aren't charged for queries that return an error, or for queries that retrieve results from the cache.
Charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query.
Cancelling a running query job may incur charges up to the full cost for the query were it allowed to run to completion.
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
See more at BigQuery pricing
For on-demand queries you are charged for the amount of data processed by the BigQuery engine; only for expensive queries are you charged extra for the complexity (which can manifest itself as increased query time).
The amount of data processed is reported as totalBytesProcessed, and also as totalBytesBilled, which is the same for ordinary queries. For complex/expensive queries you are charged extra for the complexity; technically, this is done by totalBytesBilled becoming larger than totalBytesProcessed.
More details: see this link
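As a rough illustration of the billing rules quoted above (rounding to the nearest MB, with a 10 MB minimum), here is a small sketch. It assumes 1 MB = 2**20 bytes and 1 TB = 2**40 bytes, handles only the per-query minimum, and takes the price per TB as a parameter supplied by the caller; no actual price is implied. The function names are hypothetical, not part of any BigQuery API.

```python
MB = 2**20  # assumption: binary megabytes
TB = 2**40  # assumption: binary terabytes


def billed_bytes(bytes_processed):
    """Round to the nearest MB, then apply the 10 MB per-query minimum."""
    rounded_mb = (bytes_processed + MB // 2) // MB
    return max(rounded_mb, 10) * MB


def estimate_cost(bytes_processed, price_per_tb):
    """Estimated on-demand cost; price_per_tb is supplied by the caller."""
    return billed_bytes(bytes_processed) / TB * price_per_tb
```

Note that elapsed time does not appear anywhere in the calculation, which matches the answers above.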

How to check Redshift COPY command performance from AWS S3?

I'm working on an application wherein I'll be loading data into Redshift.
I want to upload the files to S3 and use the COPY command to load the data into multiple tables.
For every such iteration, I need to load the data into around 20 tables.
I currently create 20 CSV files per iteration, one for each of the 20 tables. For the next iteration, 20 new CSV files are created and loaded into Redshift.
With the current system, each CSV file contains at most 1000 rows, so at most 20000 rows are loaded across the 20 tables per iteration.
I want to improve the performance further. I've gone through https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
At this point, I'm not sure how long it will take to load 1 file into 1 Redshift table. Is it really worth splitting every file into multiple files and loading them in parallel?
Is there any source or calculator that gives approximate performance metrics for loading data into Redshift tables based on the number of columns and rows, so that I can decide whether to split the files before moving to Redshift?
You should also read through the recommendations in the Load Data - Best Practices guide: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Regarding the number of files and loading data in parallel, the recommendations are:
Loading data from a single file forces Redshift to perform a serialized load, which is much slower than a parallel load.
Load data files should be split so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
The number of files should be a multiple of the number of slices in your cluster.
That last point is significant for achieving maximum throughput: if your cluster has 8 slices, you want n*8 files (e.g. 16, 32, 64), so that all slices are doing maximum work in parallel.
That said, 20,000 rows is such a small amount of data in Redshift terms that I'm not sure any further optimisation would make a significant difference to the speed of your process as it currently stands.
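The sizing recommendations above can be turned into a quick calculation. This is only a sketch: `plan_file_count` is a hypothetical helper, 1 MB is assumed to be 2**20 bytes, and the slice count has to come from your own cluster configuration.

```python
MB = 2**20  # assumption: binary megabytes


def plan_file_count(total_compressed_bytes, slices, max_file_bytes=125 * MB):
    """Smallest multiple of `slices` that keeps each file at or under max_file_bytes."""
    # Ceiling division without floating point.
    min_files = max(1, -(-total_compressed_bytes // max_file_bytes))
    return -(-min_files // slices) * slices


def avg_file_bytes(total_compressed_bytes, file_count):
    """Average size of each file after splitting."""
    return total_compressed_bytes / file_count
```

For a very small load such as the 1000-row files in the question, `avg_file_bytes` will come out well under the 1 MB lower bound, which is consistent with the remark that splitting is unlikely to help at that scale.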

Pentaho Data Integration (PDI) How to use postgresql bulk loader? My transformation running forever

I'm new to PDI. I'm using PDI 7. I have an Excel input with 6 rows and want to insert it into a PostgreSQL database. My transformation is: EXCEL INPUT --> Postgres Bulk Loader (2 steps only).
Condition 1: When I run the transformation, the Postgres Bulk Loader never stops and does not insert anything into my PostgreSQL database.
Condition 2: So I added an "Insert/Update" step after the Postgres Bulk Loader, and all the data was inserted into the database, which means success, but the bulk loader is still running.
According to all the sources I can find, only the input and Bulk Loader steps are needed, and after the transformation finishes, the bulk loader shows "finished" (mine shows "running"). So I want to ask how to do this properly for Postgres. Did I skip something important? Thanks.
The PostgreSQL bulk loader used to be only experimental. Haven't tried it in some time. Are you sure you need it? If you're loading from Excel, it's unlikely you'll have enough rows to warrant use of a bulk loader.
Try just the regular Table Output step. If you're only inserting, you shouldn't need the Insert/Update step either.
To insert just 6 rows you don't need the bulk loader.
The bulk loader is designed to load huge amounts of data. It uses the native psql client, which transfers data much faster because it uses all the features of the binary protocol without the restrictions of the JDBC specification. JDBC is used in other steps such as Table Output. Most of the time Table Output is sufficient.
The Postgres Bulk Loader step just builds in-memory data in CSV format from the incoming steps and passes it to the psql client.
I ran some experiments.
DB: Postgres v9.5 x64
PDI Kettle default JVM settings, 512 MB
Data source: DBF file with over 2_215_000 rows
Both PDI and the database on the same localhost
Table truncated on each run
PDI Kettle restarted on each run (to avoid heavy CPU load from GC runs due to the huge number of rows)
The results are below to help you make a decision.
Bulk loader: average over 150_000 rows per second. Total is around 13-15s
Table output (sql inserts): average 11_500 rows per second. Total is around 3min 18s
Table output (batch inserts, batch size 10_000): average 28_000 rows per second. Total is around 1min 30s
Table output (batch inserts in 5 threads, batch size 3_000): average 7_600 rows per second per thread, i.e. around 37_000 rows per second in total. Total is around 59s
The advantage of the bulk loader is that it doesn't fill the JVM's memory; all data is streamed into the psql process immediately.
Table Output fills JVM memory with data. In fact, after around 1_600_000 rows memory is full and GC starts. At that point CPU load goes up to 100% and throughput drops significantly. That is why it is worth experimenting with the batch size to find the value that gives the best performance (bigger is better, but beyond some level it causes GC overhead).
Last experiment: the memory given to the JVM is enough to hold the data. This can be tweaked with the PENTAHO_DI_JAVA_OPTIONS variable. I set the JVM heap size to 1024 MB and increased the batch size.
Table output (batch inserts in 5 threads, batch size 10_000): average 12_500 rows per second per thread, i.e. around 60_000 rows per second in total. Total is around 35s
Now it is much easier to make a decision. But note that Kettle PDI and the database were located on the same host. If the hosts are different, network bandwidth can play a role in performance.
Slow insert/update step
Why should you avoid using Insert/Update (when processing a huge amount of data or when you are limited by time)?
Let's look at the documentation:
The Insert/Update step first looks up a row in a table using one or
more lookup keys. If the row can't be found, it inserts the row. If it
can be found and the fields to update are the same, nothing is done.
If they are not all the same, the row in the table is updated.
As stated above, for each row in the stream the step executes 2 queries: a lookup first, and then an update or insert. The PDI Kettle source shows that a PreparedStatement is used for all queries: insert, update and lookup.
So if this step is the bottleneck, try to figure out what exactly is slow.
Is the lookup slow? (Manually run the lookup query on the database with sample data. Is it slow? Is there an index on the columns used to find the corresponding row in the database?)
Is the update slow? (Manually run the update query on the database with sample data. Is it slow? Does the update's WHERE clause use an index on the lookup fields?)
In any case this step is slow, since it requires a lot of network communication and data processing in Kettle.
The only way to make it faster is to load all the data into a "temp" table in the database and call a function that upserts the data. Or just use a simple SQL step in a job to do the same.
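A minimal sketch of the "load into a temp table, then upsert in one go" approach. To keep it self-contained it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the `target` and `staging` tables and their columns are hypothetical. The point is only that the lookup and update happen as two set-based statements inside the database instead of two round trips per row.

```python
import sqlite3


def upsert_via_staging(conn, rows):
    """Bulk-load rows into a temp table, then update and insert set-wise."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, name TEXT)")
    cur.executemany("INSERT INTO staging (id, name) VALUES (?, ?)", rows)
    # Update rows that already exist in the target table.
    cur.execute(
        """
        UPDATE target
        SET name = (SELECT s.name FROM staging s WHERE s.id = target.id)
        WHERE id IN (SELECT id FROM staging)
        """
    )
    # Insert rows that are not in the target table yet.
    cur.execute(
        """
        INSERT INTO target (id, name)
        SELECT s.id, s.name FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
        """
    )
    cur.execute("DROP TABLE staging")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "keep")])
upsert_via_staging(conn, [(1, "new"), (3, "added")])
result = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
```

On PostgreSQL 9.5 and later the two statements could likely be collapsed into a single INSERT ... ON CONFLICT ... DO UPDATE, but the two-statement form shown here does not depend on that feature.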

Select Query output in parallel streams

I need to spool over 20 million records into a flat file. A direct select query would be time-consuming. I feel the need to generate the output in parallel based on portions of the data, i.e. running 10 select queries in parallel, each over 10% of the data, and then sorting and merging on UNIX.
I can use rownum to do this; however, this would be tedious and static, and it would need to be updated every time my rownum changes.
Is there a better alternative available?
If the data in SQL is well spread out over multiple spindles and not all on one disk, and the IO and network channels are not saturated currently, splitting into separate streams may reduce your elapsed time. It may also introduce random access on one or more source hard drives which will cripple your throughput. Reading in anything other than cluster sequence will induce disk contention.
The optimal scenario here would be for your source table to be partitioned, that each partition is on separate storage (or very well striped), and each reader process is aligned with a partition boundary.
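If the table has a numeric key, one way to avoid hard-coding rownum boundaries is to compute contiguous key ranges from the current minimum and maximum key and generate one query per range. The sketch below assumes a hypothetical table `my_table` with an integer column `id`; contiguous ranges are used so that each reader scans a sequential slice of the key space, in the spirit of the partition-aligned readers described above. Whether this actually reduces elapsed time still depends on the storage layout.

```python
def split_ranges(lo, hi, parts):
    """Split the inclusive key range [lo, hi] into up to `parts` contiguous ranges."""
    total = hi - lo + 1
    base, extra = divmod(total, parts)
    ranges = []
    start = lo
    for i in range(parts):
        # The first `extra` ranges get one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def build_queries(lo, hi, parts):
    # my_table and id are hypothetical names used only for illustration.
    return [
        f"SELECT * FROM my_table WHERE id BETWEEN {a} AND {b}"
        for a, b in split_ranges(lo, hi, parts)
    ]
```

The minimum and maximum key can be fetched once at the start of each run, so nothing has to be edited by hand when the row count changes.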

SSIS DataFlowTask DefaultBufferSize and DefaultBufferMaxRows

I have a task which pulls records from an Oracle DB into our SQL using a data flow task. This package runs every day and takes around 45 minutes. It refreshes about 15 tables. Except for one, all are incremental updates, so almost every task runs for 2 to 10 minutes.
The one task that does a full replacement runs for up to 25 minutes. I want to tune this data flow task to run faster.
There are just 400k rows in the table. I read some articles about DefaultBufferSize and DefaultBufferMaxRows and have the following questions.
I can set DefaultBufferSize up to 100 MB. Is there any place to look at or analyse how much I should provide?
DefaultBufferMaxRows is set to 10k. If I set it to 50k but provide 10 MB for DefaultBufferSize, which can only hold some 20k rows, what will SSIS do? Will it just ignore the remaining 30k records, or will it still pull all 50k records (spooling)?
Can I use logging options to set proper limits?
As a general practice (and if you have enough memory), a smaller number of large buffers is better than a larger number of small buffers, but only up to the point where you start paging to disk (which is bad for obvious reasons).
To test it, you can log the event BufferSizeTuning, which will show you how many rows are in each buffer.
Also, before you begin adjusting the sizing of the buffers, the most important improvement that you can make is to reduce the size of each row of data by removing unneeded columns and by configuring data types appropriately.
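To make the interaction between the two properties concrete, here is a simplified sketch. As far as I understand the documented behaviour, the engine estimates the row size and, if DefaultBufferMaxRows rows would not fit in DefaultBufferSize, it puts fewer rows in each buffer; rows are not dropped, they simply go into more buffers. The function below is a hypothetical model of that rule only (it ignores the engine's internal minimum buffer size), and the 512-byte row is an assumption chosen so that 10 MB holds roughly 20k rows, as in the question.

```python
MB = 2**20  # assumption: binary megabytes


def rows_per_buffer(default_buffer_size, default_buffer_max_rows, estimated_row_bytes):
    """Rows per buffer are capped by both the row limit and the byte limit."""
    rows_that_fit = default_buffer_size // estimated_row_bytes
    return min(default_buffer_max_rows, rows_that_fit)


def buffers_needed(total_rows, rows_in_each_buffer):
    """Ceiling division: every row ends up in some buffer."""
    return -(-total_rows // rows_in_each_buffer)


per_buffer = rows_per_buffer(10 * MB, 50_000, 512)
buffer_count = buffers_needed(400_000, per_buffer)
```

This also shows why trimming unneeded columns helps: a smaller estimated row size means more rows fit in each buffer of the same size.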