I am facing some performance issues in Pentaho. It is very slow.
The data set size is around 3,500,000 rows.
Does anyone have the same problem? I am not sure whether it is caused by Pentaho Analyzer or by Mondrian.
Has anyone used Analyzer or Mondrian with data sets of this size? How is the performance? Is it really a suitable tool for such a data set?
Any advice, please?
Thanks
I have a very high-level question.
Could indexes on a SQL Server table improve the loading performance of a Tableau dashboard?
If so, is there any best practice or guideline we could follow?
Thanks a lot
An index will speed up the extraction of the data into Tableau's database structure, but it will not speed up Tableau as you interact with it. There is a Tableau community website where you can find best practices and so on.
Yes, it will speed up Tableau. Just as indexes speed up any query (when applied properly), Tableau works the same way: it is just querying the data and displaying the results.
Best practice? As with everything else, analyse the usage to see where it is appropriate to apply indexes. Too few indexes is bad, and too many is bad. A sketch of what that can look like is below.
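Not an official guideline, just a minimal sketch of applying such an index from Java via JDBC; the connection string, the dbo.Sales table, and the Region/OrderDate columns are hypothetical placeholders for whatever fields your dashboard actually filters and joins on.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddDashboardIndex {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; replace with your own server and credentials.
        String url = "jdbc:sqlserver://localhost:1433;databaseName=ReportingDB;integratedSecurity=true";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Index the columns the dashboard filters and joins on most often.
            stmt.executeUpdate(
                "CREATE NONCLUSTERED INDEX IX_Sales_Region_OrderDate "
              + "ON dbo.Sales (Region, OrderDate)");
        }
    }
}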
I work in an ETL development team where we use Spark SQL to transform data by building several intermediate temporary views in sequence, ending with a final temp view whose data is then copied into the target table folder.
However, in several instances our queries take an excessive amount of time even when dealing with a small number of records (roughly 10K or fewer), and we scramble for help in all directions.
Hence I would like to learn about Spark SQL performance tuning in detail (e.g. what happens behind the scenes, the architecture, and most importantly how to interpret explain plans), which would help me build a solid foundation on the subject. I have past experience with performance tuning on RDBMSs (Teradata, Oracle, etc.).
Since I am very new to this, can anyone please point me in the right direction to books, tutorials, courses, etc. on this subject? I have searched the internet and several online learning platforms but couldn't find any comprehensive tutorial or resource for learning this.
Please help!
Thanks in advance.
I will not go into details, as they can be very extensive. Here are some concepts you should consider while tuning your job.
Number of Executors
Number of Executor Cores
Executor Memory
The three settings above directly determine the degree of parallelism your application achieves; a minimal configuration sketch follows below.
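As a rough illustration only (the numbers are placeholders, not recommendations, and spark.executor.instances is only honoured by resource managers such as YARN when dynamic allocation is off), these three knobs can be set when the session is built:

import org.apache.spark.sql.SparkSession;

public class ParallelismSketch {
    public static void main(String[] args) {
        // Placeholder values: tune to your cluster; executors x cores bounds concurrent tasks.
        SparkSession spark = SparkSession.builder()
                .appName("etl-tuning-sketch")
                .config("spark.executor.instances", "4")  // number of executors
                .config("spark.executor.cores", "4")      // cores per executor
                .config("spark.executor.memory", "8g")    // heap per executor
                .getOrCreate();

        // A trivial aggregation just to exercise the configured parallelism.
        spark.range(1_000_000).groupBy().count().show();

        spark.stop();
    }
}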
Shuffling
Spilling
Partitioning
Bucketing
The items above are important with respect to how your data is stored, laid out, and formatted; a short sketch follows below.
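A minimal sketch of the storage-side ideas; the events table and customer_id column are made-up names, and bucketed tables require a metastore-backed (Hive-enabled) catalog:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StorageLayoutSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("storage-layout-sketch")
                .enableHiveSupport()   // bucketed tables need a metastore-backed catalog
                .getOrCreate();

        // "events" and "customer_id" are hypothetical names.
        Dataset<Row> events = spark.table("events");

        // Repartitioning by the join/aggregation key reduces shuffle skew downstream.
        Dataset<Row> repartitioned = events.repartition(200, events.col("customer_id"));

        // Bucketing persists that layout, so later joins on customer_id can avoid a shuffle.
        repartitioned.write()
                .bucketBy(32, "customer_id")
                .sortBy("customer_id")
                .mode("overwrite")
                .saveAsTable("events_bucketed");

        spark.stop();
    }
}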
P.S.: This is just the tip of the iceberg! Good luck.
I am attaching a few links on scaling Spark jobs; they could be a nice starting point.
Scaling Spark Jobs At Facebook
Joins and Shuffling
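Since the question specifically asks about interpreting explain plans, here is a minimal sketch of how to get one out of Spark; the stage_orders view and its columns are made up to stand in for one of your intermediate temp views.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainPlanSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("explain-plan-sketch")
                .getOrCreate();

        // "stage_orders" is a hypothetical temp view standing in for one of your intermediate views.
        Dataset<Row> df = spark.sql(
                "SELECT order_id, SUM(amount) AS total FROM stage_orders GROUP BY order_id");

        // Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
        df.explain(true);

        // The same information is available from SQL directly.
        spark.sql("EXPLAIN EXTENDED SELECT order_id, SUM(amount) AS total "
                + "FROM stage_orders GROUP BY order_id").show(false);
    }
}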
I have been working on multidimensional analysis with the Pentaho community edition. The problem is that when I apply aggregations and filters, the output never contains more than 1000 records (rows). I want to know whether I am doing something wrong or whether the Pentaho analysis tool has a limitation.
If so, does the Power BI community edition have a higher limit? Or can you suggest another community tool I could continue the work with?
Are you using Saiku for OLAP analysis?
For Saiku, the limit is set by
TABLE_LAZY_SIZE = 1000 (default), which you can change as per your requirement.
Reference: http://saiku-documentation.readthedocs.io/en/latest/saiku_settings.html
I am trying to save a Spark DataFrame to Oracle. The save works, but the performance seems to be very poor.
I have tried two approaches:
dfToSave.write().mode(SaveMode.Append).jdbc(…), which I suppose internally uses the API below:
JdbcUtils.saveTable(dfToSave, ORACLE_CONNECTION_URL, "table", props)
Both seem to take very long, more than 3 minutes for a DataFrame of only 400-500 rows.
I came across JIRA SPARK-10040, but it says the issue was resolved in 1.6.0, which is the version I am using.
Has anyone faced this issue and knows how to resolve it?
I can tell you what happened to me. I turned down the number of partitions to query the database, and as a result my previously performant processing became quite slow. However, since my dataset is only actually computed when I post it back to the database, I (like you) thought there was a problem with the Spark API, the driver, the connection, the table structure, the server configuration, anything. But no, you just have to repartition after your query.
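Not a definitive fix, but a minimal sketch along those lines, assuming the dfToSave, ORACLE_CONNECTION_URL and props names from the question; the partition count and batch size are made-up values (batchsize is a standard option of the Spark JDBC writer).

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class OracleSaveSketch {
    // dfToSave, ORACLE_CONNECTION_URL and props are assumed to exist as in the question.
    static void save(Dataset<Row> dfToSave, String ORACLE_CONNECTION_URL, Properties props) {
        // For a few hundred rows, a handful of partitions is plenty;
        // hundreds of near-empty partitions each open their own JDBC connection.
        Dataset<Row> compacted = dfToSave.coalesce(4);

        // batchsize controls how many rows go into each JDBC batch insert.
        props.setProperty("batchsize", "1000");

        compacted.write()
                .mode(SaveMode.Append)
                .jdbc(ORACLE_CONNECTION_URL, "table", props);
    }
}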
Currently I am backing up my Derby database using the SYSCS_UTIL.SYSCS_BACKUP_DATABASE procedure. However, because of the database's size, this sometimes fails, and I am wondering whether it is possible to back up only data that is newer than x days, where x is a configurable value.
I would greatly appreciate any hints towards achieving this goal, if it is possible.
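For context, this is roughly how that full backup is usually invoked through JDBC; the connection URL and backup directory below are placeholders, not values from the question.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class DerbyBackup {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL and backup directory.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:myDB");
             CallableStatement cs = conn.prepareCall("CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE(?)")) {
            cs.setString(1, "/backups/derby");
            cs.execute();  // copies the whole database into the given directory
        }
    }
}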