I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, in case the number of files was influenced by the availability of nodes in the cluster, but it seems to be consistent.
So my questions are:
(a) what determines how many files are generated, and
(b) is there a way to specify the minimum number of files or (even better) the maximum size of each file?
The number of files generated during INSERT ... SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) plus the configured bytes per reducer.
If the table is partitioned and there is no DISTRIBUTE BY specified, then in the worst case each reducer creates files in every partition. This puts high pressure on the reducers and may cause an OOM exception.
To make sure each reducer writes files for only one partition, add DISTRIBUTE BY partition_column at the end of your query.
If the data volume is too big and you want more reducers, to increase parallelism and to create more files per partition, add a random number to the DISTRIBUTE BY, for example FLOOR(RAND()*100.0)%10 - this additionally distributes the data across 10 random buckets, so each partition will contain 10 files.
Finally your INSERT statement will look like:
INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; --10 files per partition
Also this configuration setting affects the number of files generated:
set hive.exec.reducers.bytes.per.reducer=67108864;
If you have too much data, Hive will start more reducers so that each reducer process handles no more than the specified bytes per reducer. The more reducers, the more files are generated. Decreasing this setting may increase the number of reducers running, and they will create a minimum of one file per reducer. If the partition column is not in the DISTRIBUTE BY, then each reducer may create files in every partition.
To make a long story short, use
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10 -- 10 files per partition
If you want 20 files per partition, use FLOOR(RAND()*100.0)%20 - this guarantees a minimum of 20 files per partition if you have enough data, but does not guarantee the maximum size of each file.
The bytes-per-reducer setting does not guarantee a fixed minimum number of files. The number of files will depend on total data size / bytes.per.reducer. This setting guarantees the maximum size of each file.
But it is much better to use some evenly distributed key, or a combination of keys with low cardinality, instead of a random number, because if containers are restarted, rand() may produce different values for the same rows, and that may cause data duplication or loss (data already present in some reducer output will be distributed one more time to another reducer). You can calculate a similar function on some available keys instead of rand() to get a more or less evenly distributed key with low cardinality.
You can use both methods combined - the bytes-per-reducer limit plus DISTRIBUTE BY - to control both the minimum number of files and the maximum file size.
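For example, a minimal sketch combining both controls (tgt, src, and some_key are placeholder names; a deterministic HASH() on a key is used instead of rand(), per the caveat above):
set hive.exec.reducers.bytes.per.reducer=67108864; -- caps each file around 64 MB
INSERT OVERWRITE TABLE tgt PARTITION(part_col)
SELECT *
FROM src
DISTRIBUTE BY part_col, ABS(HASH(some_key)) % 10; -- at most 10 files per partition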
Also read this answer about using distribute by to distribute data evenly between reducers: https://stackoverflow.com/a/38475807/2700344
Related
In TimescaleDB, how can I set a standard size in terms of GB/MB for a particular table/hypertable, so that when it reaches that size, it begins to delete the old rows in order to accommodate new rows of data?
From the documentation, it was evident that only a retention policy based on a time interval can be added for a hypertable. Is it possible to add a retention policy based on size?
A retention policy drops entire chunks, and chunks are measured by time intervals, so it makes no sense to define the policy by size rather than by time. The policy drops a chunk only after the entire chunk is older than the given interval; thus if the chunk size is 7 days and the retention policy is 3 days, the oldest dropped data will be 10 days old (the dropped chunk contains data from 10 to 3 days old). Chunks are represented by tables internally, so dropping a chunk is dropping a table, which is the most efficient way to delete data in PostgreSQL. Deleting by row is much more expensive than dropping or truncating a table and doesn't free space until VACUUM is run.
TimescaleDB expects that you know your application load well and can correctly estimate desired size in time interval.
The time dimension column is not required to have a time type; it can be a number. What is important is that the time dimension column increases over time, and that it is clear how to use it in queries and how to define the chunk size. So it is possible to use a counter for the time dimension column and increment it for each row by 1 or by the row size. Notice that syncing the counter can be a bottleneck.
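A minimal sketch of this counter-based approach, assuming TimescaleDB 2.x (the table and column names are illustrative):
-- The "time" dimension is a plain bigint counter.
CREATE TABLE events (
  seq  BIGINT NOT NULL, -- monotonically increasing, e.g. incremented by 1 per row
  data JSONB
);
-- Each chunk covers 1,000,000 counter ticks instead of a time span.
SELECT create_hypertable('events', 'seq', chunk_time_interval => 1000000);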
It is also possible to write a user-defined action, where your own procedure is defined to be executed on a regular basis as a custom policy.
Summary of three possible solutions:
Give a good estimate of the chunk size, which is the way expected by TimescaleDB.
Define a numerical time dimension column with a counter-like implementation (see the sketch above).
Write a custom policy using a user-defined action (see the sketch below).
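For the third option, a sketch of a user-defined action that drops the oldest chunks once a hypertable exceeds a size budget, assuming TimescaleDB 2.x (the hypertable name conditions and the max_bytes config key are assumptions):
CREATE OR REPLACE PROCEDURE drop_oldest_chunks_by_size(job_id INT, config JSONB)
LANGUAGE plpgsql AS $$
DECLARE
  max_bytes BIGINT := (config ->> 'max_bytes')::BIGINT;
  oldest TIMESTAMPTZ;
BEGIN
  -- Keep dropping the oldest chunk until the hypertable fits the budget.
  WHILE hypertable_size('conditions') > max_bytes LOOP
    SELECT range_end INTO oldest
    FROM timescaledb_information.chunks
    WHERE hypertable_name = 'conditions'
    ORDER BY range_end
    LIMIT 1;
    EXIT WHEN oldest IS NULL;
    PERFORM drop_chunks('conditions', older_than => oldest);
  END LOOP;
END
$$;

-- Schedule it as a custom policy, e.g. hourly with a 10 GB budget.
SELECT add_job('drop_oldest_chunks_by_size', '1 hour',
               config => '{"max_bytes": 10737418240}');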
I have a performance issue to ask your input on.
This is based solely on Databricks on Azure, with storage on Azure Data Lake Storage. The tech stack is not more than 2 years old and is all on the most recent release.
Say I have a datamart Delta table, 100 columns, 30,000,000 rows, grows 225,000 rows every calendar-quarter.
There isn't a data warehouse in this architecture, so the newest 225,000 rows are simply appended to the datamart; 30,000,000+ rows and growing every quarter.
Two columns are a dimension key Dim1_cd and a matching Dim1_desc.
There are 36 other dimension key-value pairs in the datamart, much like Dim1 is a key-value pair.
The datamart is a list of transactions and has a Period column, e.g. "2021Q3"; Period is the first-level and only partition of the datamart.
The partitioning currently divides the Delta table into 15 partition folders, each containing around 150 parquet files of roughly 100 MB each.
A calendar quarter later, a new set of files is delivered to be appended to the datamart. One of these is a Dim1_lookup.txt file, which is first read into a Dim1_deltaTable and has only two columns, Dim1_cd and Dim1_desc. Each row is distinct, in third normal form. On disk, the Dim1_lookup.txt file is only 55K.
Applying this dimension's newest version sometimes takes only 3-4 minutes, when there are no Dim1_desc values needing to be updated. Other times, there are 20,000 to 100,000 updates to be written across 100s to 1000s of parquet files, which can take an unpleasantly long time.
Of course, writing the code of a delta table update to apply the Dim1_deltaTable is no big challenge.
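For reference, a minimal sketch of that update in Databricks SQL (datamart is a placeholder for the Delta table's name; Dim1_deltaTable is the name from the question; the inequality in the ON clause is an optional optimisation so only changed rows are rewritten):
MERGE INTO datamart AS t
USING Dim1_deltaTable AS s
  ON t.Dim1_cd = s.Dim1_cd
 AND t.Dim1_desc <> s.Dim1_desc -- touch only rows whose description actually changed
WHEN MATCHED THEN
  UPDATE SET t.Dim1_desc = s.Dim1_desc;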
But what can you suggest how to optimize the updates?
Ideally you might have a data warehouse backing the datamart, but that is not the case in this architecture.
You might want to partition on Dim1_desc to take advantage of Delta's data skipping, but there are 36 other _desc fields with the same update concerns.
What do you consider possible to optimize the update/minimize update processing time?
I'm working on an application wherein I'll be loading data into Redshift.
I want to upload the files to S3 and use the COPY command to load the data into multiple tables.
For every such iteration, I need to load the data into around 20 tables.
I'm currently creating 20 CSV files for loading data into 20 tables, so for every iteration the 20 created files are loaded into the 20 tables, and for the next iteration 20 new CSV files are created and loaded into Redshift.
With the current system, each CSV file may contain a maximum of 1000 rows to be loaded into its table - a maximum of 20,000 rows per iteration across the 20 tables.
I wanted to improve the performance even more. I've gone through https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
At this point, I'm not sure how long it will take for 1 file to load into 1 Redshift table. Is it really worth splitting every file into multiple files and loading them in parallel?
Is there any source or calculator that gives approximate performance metrics for loading data into Redshift tables, based on the number of columns and rows, so that I can decide whether to go ahead with splitting files even before moving to Redshift?
You should also read through the recommendations in the Load Data - Best Practices guide: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Regarding the number of files and loading data in parallel, the recommendations are:
Loading data from a single file forces Redshift to perform a serialized load, which is much slower than a parallel load.
Load data files should be split so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
The number of files should be a multiple of the number of slices in your cluster.
That last point is significant for achieving maximum throughput - if you have 8 slices then you want n*8 files, e.g. 16, 32, 64 ... this is so all slices are doing maximum work in parallel.
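As a hedged example, once the data for one table is split into several compressed parts under a common S3 prefix, a single COPY loads them all in parallel (the bucket, prefix, and IAM role below are placeholders):
COPY my_table
FROM 's3://my-bucket/iteration-42/my_table/part-' -- every object with this prefix is loaded in parallel
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
CSV
GZIP;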
That said, 20,000 rows is such a small amount of data in Redshift terms that I'm not sure any further optimisation would make a significant difference to the speed of your process as it stands.
I need to spool over 20 million records to a flat file. A direct select query would be time-consuming. I feel the need to generate the output in parallel based on portions of the data - i.e. having 10 select queries over 10% of the data each, in parallel, then sorting and merging on UNIX.
I can use ROWNUM to do this, but that would be tedious and static, and would need updating every time my row counts change.
Is there a better alternative available?
If the data in SQL is well spread out over multiple spindles and not all on one disk, and the IO and network channels are not currently saturated, splitting into separate streams may reduce your elapsed time. It may also introduce random access on one or more source hard drives, which will cripple your throughput. Reading in anything other than cluster sequence will induce disk contention.
The optimal scenario here would be for your source table to be partitioned, that each partition is on separate storage (or very well striped), and each reader process is aligned with a partition boundary.
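As a sketch of that alignment, assuming an Oracle table range-partitioned into p1 ... p10 (the table and partition names are placeholders), each concurrent session spools one partition:
-- session 1 spools partition p1, session 2 spools p2, and so on
SELECT * FROM big_table PARTITION (p1);
SELECT * FROM big_table PARTITION (p2);
-- ... the 10 output files are then sorted and merged on UNIX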
How would you tackle the following storage and retrieval problem?
Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:
id (unique row identifier)
entity_id (takes on values between 1 and 2.000.000 inclusive)
date_id (incremented with one each day - will take on values between 1 and 3.650 (ten years: 1*365*10))
value_1 (takes on values between 1 and 1.000.000 inclusive)
value_2 (takes on values between 1 and 1.000.000 inclusive)
entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years worth of daily data (7.300.000.000 rows (3.650*2.000.000)).
What is described above is the write pattern. The read pattern is simple: all queries will be made on a specific entity_id, i.e. retrieve all rows describing entity_id = 12345.
Transactional support is not needed, but the storage solution must be open source. Ideally I'd like to use MySQL, but I'm open to suggestions.
Now - how would you tackle the described problem?
Update: I was asked to elaborate regarding the read and write patterns. Writes to the table will be done in one batch per day where the new 2M entries will be added in one go. Reads will be done continuously with one read every second.
"Now - how would you tackle the described problem?"
With simple flat files.
Here's why
"all queries will be made on a
specific entity_id. I.e. retrieve all
rows describing entity_id = 12345."
You have 2.000.000 entities. Partition based on entity number:
level1 = entity / 10000        -- integer division: 0..200
level2 = (entity / 100) % 100  -- 0..99
level3 = entity % 100          -- 0..99
Each file of data is then level1/level2/level3/batch_of_data.
You can then read all of the files in a given part of the directory to return samples for processing.
If someone wants a relational database, then load files for a given entity_id into a database for their use.
Edit on day numbers.
The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.
The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.
Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.
To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.
Edit on open/close
For writing, you have to leave the file(s) open. You do periodic flushes (or close/reopen) to ensure that the data really is going to disk.
You have two choices for the architecture of your writer.
Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.
Have several files open concurrently for writing. When querying, merge these files into a single result. This is helpful if queries are relatively rare. You pay for merging the data at query time.
Use partitioning. With your read pattern you'd want to partition by entity_id hash.
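A minimal sketch in MySQL, assuming the schema from the question (the table name and partition count are arbitrary choices; MySQL requires every unique key to include the partitioning column, so id is kept as a plain column here):
CREATE TABLE entity_values (
  id        BIGINT UNSIGNED NOT NULL,  -- unique row identifier (not enforced here)
  entity_id INT UNSIGNED NOT NULL,
  date_id   SMALLINT UNSIGNED NOT NULL,
  value_1   INT UNSIGNED NOT NULL,
  value_2   INT UNSIGNED NOT NULL,
  PRIMARY KEY (entity_id, date_id)     -- matches the read pattern: all rows for one entity
) ENGINE=InnoDB
PARTITION BY HASH (entity_id)
PARTITIONS 32;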
You might want to look at these questions:
Large primary key: 1+ billion rows MySQL + InnoDB?
Large MySQL tables
Personally, I'd also think about calculating your row width to give you an idea of how big your table will be (as per the partitioning note in the first link).
HTH,
S
Your application appears to have the same characteristics as mine. I wrote a MySQL custom storage engine to efficiently solve the problem. It is described here
Imagine your data is laid out on disk as an array of 2M fixed-length entries (one per entity), each containing 3650 rows (one per day) of 20 bytes (the row for one entity on one day).
Your read pattern reads one entity. It is contiguous on disk, so it takes 1 seek (about 8 milliseconds) and reads 3650 x 20 = about 80K at maybe 100MB/sec ... so it is done in a fraction of a second, easily meeting your 1-query-per-second read pattern.
The update has to write 20 bytes in 2M different places on disk. In the simplest case this would take 2M seeks, each of which takes about 8 milliseconds, so it would take 2M * 8ms = 4.5 hours. If you spread the data across 4 “raid0” disks it could take 1.125 hours.
However the places are only 80K apart, which means there are 200 such places within a 16MB block (a typical disk cache size), so it could operate at anything up to 200 times faster (1 minute). Reality is somewhere between the two.
My storage engine operates on that kind of philosophy, although it is a little more general purpose than a fixed length array.
You could code exactly what I have described. Putting the code into a MySQL pluggable storage engine means that you can use MySQL to query the data with various report generators etc.
By the way, you could eliminate the date and entity id from the stored row (because they are the array indexes) and maybe the unique id - if you don't really need it, since (entity id, date) is unique - and store the 2 values as 3-byte ints. Then your stored row is 6 bytes, and you have 700 updates per 16M, and therefore faster inserts and a smaller file.
Edit on comparison to flat files.
I notice that the comments generally favor flat files. Don't forget that directories are just indexes implemented by the file system, and they are generally optimized for relatively small numbers of relatively large items. Access to files is generally optimized so that it expects a relatively small number of files to be open, with a relatively high overhead for open and close and for each file that is open. All of those "relatively"s are relative to the typical use of a database.
Using file system names as an index for an entity_id, which I take to be a non-sparse integer from 1 to 2 million, is counter-intuitive. In a program you would use an array, not a hash table, for example, and you are inevitably going to incur a great deal of overhead for an expensive access path that could simply be an array indexing operation.
Therefore if you use flat files, why not use just one flat file and index it?
Edit on performance
The performance of this application is going to be dominated by disk seek times. The calculations I did above determine the best you can do (although you can make INSERT quicker by slowing down SELECT - you can't make them both better). It doesn't matter whether you use a database, flat files, or one flat file, except that you can add more seeks that you don't really need and slow it down further. For example, indexing (whether it's the file system index or a database index) causes extra I/Os compared to "an array lookup", and these will slow you down.
Edit on benchmark measurements
I have a table that looks very much like yours (or almost exactly like one of your partitions). It was 64K entities, not 2M (1/32 of yours), and 2788 'days'. The table was created in the same INSERT order that yours will be, and has the same index (entity_id, day). A SELECT on one entity takes 20.3 seconds to inspect the 2788 days, which is about 130 seeks per second as expected (on disks with an 8 millisecond average seek time). The SELECT time is proportional to the number of days and not much dependent on the number of entities. (It will be faster on disks with faster seek times. I'm using a pair of SATA2s in RAID0, but that isn't making much difference.)
If you re-order the table into entity order
ALTER TABLE x ORDER BY (ENTITY,DAY)
Then the same SELECT takes 198 milliseconds (because it is reading the whole entity in a single disk access).
However the ALTER TABLE operation took 13.98 DAYS to complete (for 182M rows).
There are a few other things the measurements tell you:
1. Your index file is going to be as big as your data file. It is 3GB for this sample table. That means (on my system) the whole index is accessed at disk speeds, not memory speeds.
2. Your INSERT rate will decline logarithmically. The INSERT into the data file is linear, but the insert of the key into the index is log. At 180M records I was getting 153 INSERTs per second, which is also very close to the seek rate. It shows that MySQL is updating a leaf index block for almost every INSERT (as you would expect, because it is indexed on entity but inserted in day order). So you are looking at 2M/153 secs = 3.6 hrs to do your daily insert of 2M rows (divided by whatever effect you can get by partitioning across systems or disks).
I had a similar problem (although on a much bigger scale - about your yearly usage every day).
Using one big table brought me screeching to a halt - you can pull a few months, but I guess you'll eventually partition it.
Don't forget to index the table, or else you'll be messing with a tiny trickle of data on every query; oh, and if you want to do mass queries, use flat files.
Your description of the read pattern is not sufficient. You'll need to describe what amounts of data will be retrieved, how often, and how much deviation there will be in the queries.
This will allow you to consider doing compression on some of the columns.
Also consider archiving and partitioning.
If you want to handle huge data with millions of rows, it can be treated much like a time series database, which logs the time and saves the data to the database. Some ways to store such data are InfluxDB and MongoDB.