What table size is small enough for MAPJOIN? - hive

How do I decide whether a table is small enough for the MAPJOIN optimization?
My guess is that I should look at
du /misc/hdfs/user/hive/warehouse/my_table
and use MAPJOIN if that is below 50% (? 5%?) of RAM.
I am using hive 0.10.

hive-site.xml
hive.mapjoin.smalltable.filesize
Default Value: 25000000
The threshold for the input file size of the small tables; if the file size is smaller than this threshold, it will try to convert the common join into map join.
This is the current release Wiki, but I think this setting goes back to 0.10.
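For a concrete picture, here is a minimal sketch of the session-level settings involved (big_table, my_table, and the join columns are hypothetical; 25000000 is just the default quoted above):
set hive.auto.convert.join=true;                 -- let Hive convert eligible common joins into map joins
set hive.mapjoin.smalltable.filesize=25000000;   -- "small" means the table's files are under ~25 MB
SELECT /*+ MAPJOIN(s) */ b.*, s.label
FROM big_table b JOIN my_table s ON (b.id = s.id);
With auto-convert enabled the hint is not strictly needed: Hive compares the on-disk size of my_table against the threshold and picks the map join on its own.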

Related

Why is Spark reading more data than I expect it to read when using a read schema?

In my Spark job, I'm reading a huge table (Parquet) with more than 30 columns. To limit the amount of data read, I specify a schema with only one column (I need only that one). Unfortunately, the Spark UI tells me that the size of files read equals 1123.8 GiB, while the filesystem read data size total equals 417.0 GiB. I was expecting that if I take one of 30 columns, the filesystem read data size total would be around 1/30 of the initial size, not almost half.
Could you explain why that is happening?

PostgreSQL: Set a full column to NULL and the database size increased. Why?

I'm working with PostgreSQL. I have a database named db_as with 25,000,000 rows of data. I wanted to free some disk space, so I updated an entire column to NULL, thinking that would decrease the database size. It didn't happen; in fact, the opposite occurred and the database size increased, and I don't know why. It went from 700 MB to 1425 MB, which is a lot :(.
I used this query to get each column's size:
SELECT sum(pg_column_size(_column)) as size FROM _table
And this one to get the size of every database:
SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database;
The original values will still be on disk, just dead: the UPDATE wrote a new version of every row (PostgreSQL's MVCC), so the old versions remain as dead tuples until they are vacuumed away.
Run a vacuum on the database to remove them.
vacuum full
Documentation
https://www.postgresql.org/docs/12/sql-vacuum.html
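Note that a plain VACUUM only marks the dead row versions' space as reusable inside the file; VACUUM FULL rewrites the table and returns the space to the operating system, at the cost of an exclusive lock while it runs. A minimal sketch, reusing the _table placeholder from the question:
SELECT pg_size_pretty(pg_total_relation_size('_table'));  -- size before
VACUUM FULL _table;
SELECT pg_size_pretty(pg_total_relation_size('_table'));  -- size after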

Hive LLAP tuning: Memory per daemon and Heap Size Calculation

I am tuning my cluster, which has Hive LLAP. According to the link below, https://community.hortonworks.com/articles/215868/hive-llap-deep-dive.html, I need to calculate the value of the heap size, but I am not sure what * means.
I also have a question about how to calculate the value for hive.llap.daemon.yarn.container.mb, other than the default value given by Ambari.
I have tried calculating the value by treating * as multiplication and set the container value equal to yarn.scheduler.maximum-allocation-mb; however, HiveServer2 Interactive does not start after tuning.
Here's the excellent wiki article for setting up hive llap in HDP suite.
https://community.hortonworks.com/articles/149486/llap-sizing-and-setup.html
Your understanding of * is correct; it's used for multiplication.
The rule of thumb is to set hive.llap.daemon.yarn.container.mb to yarn.scheduler.maximum-allocation-mb, but if your service is not coming up with that value, I would recommend changing llap_heap_size to 80% of hive.llap.daemon.yarn.container.mb.
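As a worked example of that rule of thumb (the 24 GB figure is an assumption, not a value from your cluster):
yarn.scheduler.maximum-allocation-mb = 24576
hive.llap.daemon.yarn.container.mb   = 24576            (matched to the YARN maximum)
llap_heap_size                       = 0.8 * 24576 ≈ 19660
The remaining ~20% of the container leaves headroom for off-heap use and JVM overhead.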

datastax: Spark job fails: Removing BlockManager with no recent heart beats

I'm using DataStax 4.6. I have created a Cassandra table and stored 2 crore (20 million) records. I'm trying to read the data using Scala. The code works fine for a few records, but when I try to retrieve all 2 crore records it shows me the following error.
WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 172.20.98.17, 34224, 0) with no recent heart beats: 140948ms exceeds 45000ms
15/05/15 19:34:06 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(C15759,34224) not found
Any help?
This problem is often tied to GC pressure
Tuning your Timeouts
Increase the spark.storage.blockManagerHeartBeatMs so that Spark waits for the GC pause to end.
SPARK-734 recommends setting -Dspark.worker.timeout=30000 -Dspark.akka.timeout=30000 -Dspark.storage.blockManagerHeartBeatMs=30000 -Dspark.akka.retry.wait=30000 -Dspark.akka.frameSize=10000
Tuning your jobs for your JVM
spark.cassandra.input.split.size - will allow you to change the level of parallelization of your cassandra reads. Bigger split sizes mean that more data will have to reside in memory at the same time.
spark.storage.memoryFraction and spark.shuffle.memoryFraction - the fraction of the heap occupied by cached RDDs (as opposed to shuffle memory and Spark overhead). If you aren't doing any shuffles, you could increase the storage fraction. The Databricks folks suggest making it similar in size to your old generation.
spark.executor.memory - Obviously this depends on your hardware. Per DataBricks you can do up to 55gb. Make sure to leave enough RAM for C* and for your OS and OS page cache. Remember that long GC pauses happen on larger heaps.
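One way to apply all of the above together is through conf/spark-defaults.conf. The timeout and frame-size values are the ones from SPARK-734 above; the split size, memory fraction, and executor memory are placeholder values you would adapt to your own cluster:
spark.worker.timeout                    30000
spark.akka.timeout                      30000
spark.storage.blockManagerHeartBeatMs   30000
spark.akka.retry.wait                   30000
spark.akka.frameSize                    10000
spark.cassandra.input.split.size        10000
spark.storage.memoryFraction            0.5
spark.executor.memory                   8g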
Out of curiosity, are you frequently going to be extracting your entire C* table with Spark? What's the use case?

MDF file size much larger than actual data

For some reason my MDF file is 154 GB; however, I only loaded 7 GB worth of data from flat files. Why is the MDF file so much larger than the actual source data?
More info:
Only a few tables with ~25 million rows. No large varchar fields (the biggest is varchar(300); most are less than varchar(50)). The tables are not very wide, fewer than 20 columns. Also, none of the large tables are indexed yet; tables with indexes have fewer than 1 million rows. I don't use char, only varchar for strings. Datatype is not the issue.
It turned out it was the log file, not the MDF file. The MDF file is actually 24 GB, which seems more reasonable, though still big IMHO.
UPDATE:
I fixed the problem with the LDF (log) file by changing the recovery model from FULL to SIMPLE. This is okay because this server is only used for internal development and ETL processing. In addition, before changing to SIMPLE I had to shrink the log file. Shrinking is not recommended in most cases; however, this was one of those cases where the log file should never have grown so big and so fast. For further reading see this
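For anyone who needs the exact commands, a minimal sketch of that fix (MyDatabase and the logical log file name MyDatabase_log are placeholders for your own names):
USE [master];
ALTER DATABASE [MyDatabase] SET RECOVERY SIMPLE;
USE [MyDatabase];
DBCC SHRINKFILE (N'MyDatabase_log', 1024);  -- target size in MB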
There could be a lot of reasons: maybe you are using char(5000) instead of varchar(5000), bigint instead of int, or nvarchar when all you need is varchar, etc. Maybe you are using a lot of indexes per table; these all add up. Maybe your autogrow settings are wrong. Are you sure this is the MDF and not the LDF file?
Because the MDF was allocated at 154 GB, or has grown to 154 GB through various operations. A database file is at least the size of the data in it, but it can exceed the used amount by an arbitrary margin.
An obvious question is: how do you measure the amount of data in the database? Did you use sp_spaceused? Did you check sys.allocation_units? Did you guess?
If the used size is indeed 7 GB out of 154 GB, then you should leave it as it is. The database was sized at this value by somebody, or has grown to it, and it is likely to grow back. If you believe that the growth or pre-sizing was accidental, the previous point still applies and you should leave it as is.
If you are absolutely positive the overallocation is a mistake, you can shrink the database, with all the negative consequences of shrinking.
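On the measurement point, sp_spaceused is the simplest check (dbo.MyTable is a placeholder table name):
EXEC sp_spaceused;                 -- reserved, data, index and unused space for the whole database
EXEC sp_spaceused N'dbo.MyTable';  -- the same breakdown for a single table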
Just in case this is useful for someone out there: I found this query on dba.stackexchange. It uses sys.dm_db_database_page_allocations, which counts the number of pages per object; this includes internal storage and gives you a real overview of the space used by your database.
SELECT sch.[name], obj.[name], ISNULL(obj.[type_desc], N'TOTAL:') AS [type_desc],
COUNT(*) AS [ReservedPages],
(COUNT(*) * 8) AS [ReservedKB],
(COUNT(*) * 8) / 1024.0 AS [ReservedMB],
(COUNT(*) * 8) / 1024.0 / 1024.0 AS [ReservedGB]
FROM sys.dm_db_database_page_allocations(DB_ID(), NULL, NULL, NULL, DEFAULT) pa
INNER JOIN sys.all_objects obj
ON obj.[object_id] = pa.[object_id]
INNER JOIN sys.schemas sch
ON sch.[schema_id] = obj.[schema_id]
GROUP BY GROUPING SETS ((sch.[name], obj.[name], obj.[type_desc]), ())
ORDER BY [ReservedPages] DESC;
Thanks to Solomon Rutzky:
https://dba.stackexchange.com/questions/175649/sum-of-table-sizes-dont-match-with-mdf-size
Either AUTO_SHRINK is not enabled, or the initial size was set to a larger value.
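To check which of the two it is, a minimal sketch (run it in the database in question):
SELECT name, is_auto_shrink_on
FROM sys.databases
WHERE name = DB_NAME();              -- is AUTO_SHRINK on for this database?
SELECT name, size * 8 / 1024 AS size_mb, growth, is_percent_growth
FROM sys.database_files;             -- current file sizes (size is in 8 KB pages) and growth settings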