How can I limit the memory used in a map join, and how do I control it?
Let's say I join two huge tables (each with 10 million records and 10 columns). Can I control the memory allotted to that job?
Thanks for answering.
By default, the maximum size of a table to be used in a map join (as the small table) is 1,000,000,000 bytes (about 1 GB).
If you want to increase this,
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=2000000000;
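For context, here is a minimal HiveQL sketch of how these settings are typically applied right before the join itself; the table and column names below are hypothetical:
-- raise the small-table threshold to 2 GB, then run the join as usual
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=2000000000;
SELECT a.*, b.*
FROM big_table a
JOIN small_table b
  ON a.id = b.id;
If the table Hive picks as the small side fits under this threshold, it is loaded into memory on each mapper and the shuffle phase is skipped.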
Related
One of the tables in my PostgreSQL database takes up 25 GB, and I want to understand which columns contribute to this size. I suspect that two columns (both allow text with no character limit) contribute heavily to the table size. I checked each column's size with the query below, but the sizes don't add up to 25 GB. Can someone advise me on how to see the 25 GB split by column?
select pg_size_pretty(sum(pg_column_size(insert_column_name))) as total_size
from insert_table_name
I suspect the above query is giving me the wrong size for each column, because the individual column sizes add up to only 4-5 GB (when they should add up to 25 GB).
Where is the remaining shortfall of roughly 20 GB coming from if the query above is correct?
Can you advise on what query I can use to see the 25 GB split by column?
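For comparison, a small sketch that puts the full on-disk size next to the summed column data, reusing the placeholder names from the query above (note that pg_total_relation_size counts indexes and TOAST storage as well):
select pg_size_pretty(pg_total_relation_size('insert_table_name')) as on_disk_size;
select pg_size_pretty(sum(pg_column_size(insert_column_name))) as column_data_size
from insert_table_name;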
The table simply has an id, a string of up to 400 characters, and a length column that records the length of the string. The problem is that when I run a query such as select * from table where length = <some number>, it never responds (or just keeps calculating...). I was wondering whether this is due to the large dataset. Should I somehow split the table into several? I also noticed that while such a query is executing, there are three postgresql processes, each occupying only about 2 MB of RAM, with a data transfer rate of 4-5 MB. Is that normal?
Environment: 12 GB RAM, PostgreSQL 12 on Windows 10.
Yes, that is perfectly normal.
Your query is performing a parallel sequential scan with two additional worker processes. Reading a large table from disk requires neither much RAM nor much CPU; you are probably I/O bound.
Two remarks:
Depending on the number of result rows, an index on the column or expression in the WHERE clause can speed up processing considerably (a sketch follows at the end of this answer).
Unless you really need it for speed, storing the length of the string in an extra column is bad practice; you can calculate it from the string itself.
Storing such redundant data not only wastes a little space, it also opens the door to inconsistencies (unless you have a CHECK constraint).
All this is not PostgreSQL specific, it will be the same with any database.
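A minimal sketch of both remarks, assuming a hypothetical table items with a text column str: drop the stored length and filter on the computed length instead, backed by an expression index so the query does not need to scan the whole table.
-- expression index on the computed length (hypothetical names)
CREATE INDEX items_str_length_idx ON items (length(str));
-- the planner can now use the index instead of a full sequential scan
SELECT * FROM items WHERE length(str) = 42;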
The official documentation says that if one table is small enough, a broadcast join can be used (which is faster than a shuffle join): https://cloud.google.com/bigquery/docs/best-practices-performance-compute#optimize_your_join_patterns
However, it does not indicate the size limit for the small table. Some (old) links on the internet say a broadcast join is no longer used if the small table is larger than 8 MB. That number seems quite small to me (compared with Hive, where the small table can be several hundred MB).
Does somebody know this limit?
The exact value is not published, and it is not fixed (it depends on the type of JOIN and on the query plan shape), but it is on the order of hundreds of MB.
I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, so in theory there shouldn't be an I/O hit from columns that are not referenced in a particular query.
More specifically, when TableName is 1600 columns, I found that the below query is substantially slower than if TableName were, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joins of multiple smaller tables seem to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, horizontally scalable column-store database with real-time performance that doesn't have the above limitations? All we're doing is count queries with simple WHERE restrictions against approximately 10M (rows) x 2500 (columns) of data.
I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1MB per column per node. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1 MB blocks are problematic because most of that space will be empty, yet it will still be read off the disk.
Having lots of blocks means that column data will not be located as close together, so Redshift has to do a lot more work to find it.
Also (it just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for the tables in your query, even blocks for columns that are not used. See: Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN, and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values using bit-wise functions. One example table went from 1,400 columns (~200 GB) to ~60 columns (~25 GB), and the query times improved more than 10x (from 30-40 seconds down to 1-2 seconds).
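As a rough sketch of that bit-packing idea (the table and column names here are hypothetical, and it assumes PostgreSQL-style bitwise operators on integers): up to 64 former BOOLEAN columns are packed into one BIGINT, and individual bits are tested in the WHERE clause.
-- flags packs former BOOLEAN columns one per bit: bit 0 = is_foo, bit 3 = is_bar, ...
SELECT COUNT(1)
FROM packed_table
WHERE (flags & 1) <> 0      -- former is_foo column (bit 0)
  AND (flags & 8) <> 0;     -- former is_bar column (bit 3)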
Is there a limit to the number of tables I can have in BigQuery? I'm trying to create multiple small tables to reduce query costs. Thanks!
There is no limit to the number of tables. You might have problems querying them all since there is a 10k limit to the length of a query string.
Now it's:
Maximum number of tables referenced per query — 1,000
Maximum unresolved legacy SQL query length — 256 KB
Maximum unresolved standard SQL query length — 1 MB
Maximum resolved legacy and standard SQL query length — 12 MB
https://cloud.google.com/bigquery/quotas
There is no limit on the number of tables you can create. If you have more than a few thousand tables, listing the dataset may be slow (and opening the UI might be slow), but otherwise you can create as many tables as you need.
See the BigQuery quotas documentation (linked above).
There is no maximum number of standard tables.
However, partitioned tables do have a limit: 4,000 partitions per table.
Also, with very many tables, the API can be slow and the console limits how many tables it can display.
When you use an API call, enumeration performance slows as you approach 50,000 tables in a dataset. The Cloud Console can display up to 50,000 tables for each dataset.