Performance Issue with writing Spark Dataframes to Oracle Database - apache-spark-sql

I am trying to save a Spark DataFrame to Oracle. The save works, but the performance seems to be very poor.
I have tried two approaches:
dfToSave.write().mode(SaveMode.Append).jdbc(…) -- I suppose this uses the API below internally.
JdbcUtils.saveTable(dfToSave, ORACLE_CONNECTION_URL, "table", props)
Both take very long, more than 3 minutes for a DataFrame of only 400-500 rows.
I came across JIRA SPARK-10040, but it says the issue is resolved in 1.6.0, which is the version I am using.
Has anyone faced this issue and knows how to resolve it?

I can tell you what happened to me. I turned down my partitions to query the database, and so my previously performant processing (PPP) became quite slow. However, since my dataset only gets evaluated when I write it back to the database, I (like you) assumed there was a problem with the Spark API, the driver, the connection, the table structure, the server configuration, anything. But no: you just have to repartition after your query.
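For what it's worth, the fix looks roughly like this in PySpark-style pseudocode (the partition count, URL, table names, and connection properties are placeholders, not tested here):

```
# The JDBC read can collapse the data into very few partitions...
df = spark.read.jdbc(url=oracle_url, table="source_table", properties=props)

# ...so restore parallelism before writing back over JDBC.
df = df.repartition(8)
df.write.mode("append").jdbc(url=oracle_url, table="target_table", properties=props)
```

Checking df.rdd.getNumPartitions() before the write is a quick way to confirm whether everything has collapsed into a single partition.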

Related

Which is faster: doing a SQL join from inside R (DBI), or doing the join inside PGAdmin or DataGrip, saving a CSV, and importing to R?

Summary
I have a data analysis project that requires the use of a locally stored PostgreSQL database of fairly decent size for a desktop machine (~10 tables with up to ~90 million rows and ~20 columns, summing to around 20 GB of data).
I have some statistical models I want to execute on this data using R. First, though, I need to manipulate the data a little bit to get it into the form I want. The basic manipulations are fairly straightforward SELECT and JOIN operations, but they take a few minutes to run on my machine because of the amount of data. I'll need to refer to the manipulated tables again and again in the analysis, so I'd like to be able to save the results of these SELECT and JOIN operations for later use.
Question
Is it faster or computationally more efficient to
(a) execute the joins with R's DBI package (using, say, dbGetQuery) and save the resulting data frames on disk for later analysis,
or
(b) do the joins and selects inside PGAdmin or DataGrip, save the result to a .csv file, and bring that into R?
What I've tried
I've tried three of the operations I need to do in both RStudio (as in a) and DataGrip (as in b) and timed them with a stopwatch. In two instances the code ran faster inside the SQL environment in DataGrip, and in the third it was marginally faster in RStudio. I'm not sure why, other than that the third operation works on smaller tables than the first two. No, I don't know how to benchmark code on either platform, and yes, that may be part of my issue. Nor do I know much about big-O notation, but that may not be relevant here.
I can include some more concrete code if it's helpful, but my question (it seems to me) is a little more theoretical. I'm basically asking if connecting to a SQL server on my local machine should be any different if I'm doing it in R versus doing it in some "proper" database environment. Are there bottlenecks in one and not the other?
Thanks in advance for any insight!

Fetch_pandas vs Unload as Parquet - Snowflake data unloading using Python connector

I am new to Snowflake and Python. I am trying to figure out which would be faster and more efficient:
Read data from snowflake using fetch_pandas_all() or fetch_pandas_batches() OR
Unload data from Snowflake into Parquet files and then read them into a dataframe.
CONTEXT
I am working on a data layer regression testing tool that has to verify and validate datasets produced by different versions of the system.
Typically a simulation run produces around 40-50 million rows, each having 18 columns.
I don't know much about pandas or Python, but I am learning (I used to be a front-end developer).
Any help appreciated.
LATEST UPDATE (09/11/2020)
I used fetch_pandas_batches() to pull down data into manageable dataframes and then wrote them to the SQLite database. Thanks.
Based on your use case, you are likely better off just running a fetch_pandas_all() call to get the data into a DataFrame. The performance is likely better since it's a single hop for the data, and it's easier to code as well. I'm also a fan of leveraging the SQLAlchemy library and using pandas' read_sql command. That looks something like this:
import pandas as pd
from sqlalchemy import text

resultSet = pd.read_sql(text(sqlQuery), SnowEngine)
once you've established an engine connection (SnowEngine here). Same concept, but it leverages the SQLAlchemy library instead.
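The batched route the asker settled on (fetch_pandas_batches() into SQLite) can be sketched like this. The Snowflake connector's cursor yields one DataFrame per batch; that part is simulated below with an in-memory list so the sketch stays self-contained, and the table and column names are invented:

```python
import sqlite3
import pandas as pd

def batches_to_sqlite(batches, conn, table):
    """Append each DataFrame batch to a SQLite table; return rows written."""
    total = 0
    for df in batches:
        df.to_sql(table, conn, if_exists="append", index=False)
        total += len(df)
    return total

# Stand-in for cur.fetch_pandas_batches(); in real code each batch
# comes from the Snowflake Python connector.
fake_batches = [
    pd.DataFrame({"run_id": [1, 1], "value": [0.5, 0.7]}),
    pd.DataFrame({"run_id": [2], "value": [0.9]}),
]

conn = sqlite3.connect(":memory:")
rows = batches_to_sqlite(fake_batches, conn, "results")
print(rows)  # 3
```

Because each batch is bounded in size, memory use stays flat no matter how many of the 40-50 million rows come through.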

solutions for cleaning/manipulating big data (currently using Stata)

I'm currently using a 10% sample of a very large dataset (10 variables, over 300 million rows), which amounts to over 200 GB of data in .dta format for the full dataset. Stata is able to handle operations like egen, collapse, merging, etc. in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50 GB of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I suspect background operations are causing Stata to run into virtual memory.)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and utilize informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results afterwards without having to store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily such as comparing values across time won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have rewritten some of my Stata data-manipulation code in SQL, and the response time is much quicker. I believe I can make large optimization gains by using indexes correctly and by parallel processing via partitions/shards if necessary. Once all the data manipulation is done, I can import the data into Stata via ODBC.
Since you are familiar with Stata, there is a well-documented FAQ about large datasets in Stata, "Dealing with Large Datasets", which you might find helpful.
I would clean column-by-column: split the columns up, run any specific cleaning routines, and merge them back in later.
Depending on your machine's resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to keep only the variables or columns most relevant to your analysis should reduce the size of your set quite a lot.
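For comparison, here is what the split-process-combine idea looks like outside Stata, as a minimal Python/pandas sketch (the file contents and column names are invented): each chunk is aggregated independently, then the partial results are combined, which is valid because a sum decomposes cleanly across row partitions.

```python
import io
import pandas as pd

# Stand-in for a file far too big for memory; in practice this would be
# a path passed to pd.read_csv instead of a StringIO buffer.
csv_data = io.StringIO(
    "group,value\n" + "\n".join(f"{i % 3},{i}" for i in range(1000))
)

# Aggregate each chunk independently...
partials = []
for chunk in pd.read_csv(csv_data, chunksize=100):
    partials.append(chunk.groupby("group")["value"].sum())

# ...then combine the partial sums into the final per-group totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

Aggregations that don't decompose this way (medians, for example) need more care, but sums, counts, and means (via sum and count) all do.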

What Database for extensive logfile analysis?

The task is to filter and analyze a huge amount of logfiles (around 8 TB) from a finished research project. The idea is to fill a database with the data so we can run different analysis tasks later.
The values are stored comma-separated. In principle, each entry is a tuple of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first attempt using MySQL, I used one table with one log entry per row, so there is no direct relation between the log values. The downside is slow querying of subsets.
Because there is no relation, I looked into alternatives like NoSQL databases, and column-based stores like HBase or Cassandra seemed a perfect fit for this kind of data. But these systems are designed for huge distributed clusters, which we do not have. In our case the analysis will run on a single machine or perhaps a few VMs.
Which kind of database would fit this task? Is it worth setting up a single-machine instance with Hadoop+HBase, or is that all a bit oversized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe my question did not make clear that we cannot spend money on cloud services or new hardware. The question is whether there are benefits to using NoSQL approaches instead of MySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a NoSQL system is not worth the benefit, we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the problem. I did further experiments with MySQL and inserted just a quarter of all available data. The insert has now been running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single-table DB. With indexes this takes 211.2 GiB of disk space. And this is just a quarter of all the logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in the analysis I will have to fetch all entries in a time range and compare the datapoints.
I read about MongoDB and Redis, but the problem with them is that they are in-memory databases.
In the later analysis process there will be a very small amount of concurrent database access. In fact, the analysis will be run from one single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
Once the database is completely written, there would also be no need to update or add further rows.
What do you think about alternatives like Redis, MongoDB, and so on? If I understand correctly, I would need RAM in the dimension of my data...
Is this task even feasible with a single-node system, or with maybe two nodes?
Well, I personally would prefer the faster solution, since you said you need high-performance analysis. The problem is, if you have to set up a whole new system to do so and the performance improvement would be minor relative to the additional effort, then stay with SQL.
In our company, we have a quite small database containing not even half a GB of data on the VM. The problem is, as soon as you use a VM, you will have major performance issues; when opening the database on the VM you can go for a coffee in the meantime ;)
But if the time until the database is loaded into cache is not so important, it doesn't matter. It all depends on how much faster you think the new system will be and how much effort you will have to put into it, but as I said, I'd prefer the faster solution if you have to go for "high-performance analysis".
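One thing worth checking before giving up on SQL, whatever the engine: for a query like the one in the question (equality on logid and unit, a range on timestamp), a B-tree index generally helps most with the equality columns first and the range column last. A self-contained SQLite sketch of the idea (schema and values are invented, and SQLite here only stands in for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (logid INTEGER, unit TEXT, timestamp INTEGER, v1 REAL)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?, ?)",
    [(i % 100, f"UNIT{i % 5}", 1342105200000 + i, float(i)) for i in range(10000)],
)

# Equality columns first, range column last: the index narrows to the
# exact (logid, unit) run before scanning the timestamp range.
conn.execute("CREATE INDEX idx_log ON log (logid, unit, timestamp)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM log "
    "WHERE logid = 42 AND unit = 'UNIT2' AND timestamp BETWEEN ? AND ?",
    (1342105200000, 1342105210000),
).fetchall()
print(plan)  # shows a SEARCH using idx_log rather than a full table SCAN
```

With timestamp as the leading index column instead, only the range predicate can use the index, so the equality filters have to be applied row by row over the whole time window.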

How do you speed up CSV file process? (5 million or more records)

I wrote a VB.net console program to process CSV records that come in a text file. I'm using the FileHelpers library
along with the MSFT Enterprise Library 4 to read the records one at a time and insert them into the database.
It took about 3-4 hours to process the 5+ million records in the text file.
Is there any way to speed up the process? Has anyone dealt with such a large number of records before, and how would you update those records when new data arrives?
edit: Can someone recommend a profiler? Preferably open source or free.
read the record one at the time and insert into the database
Read them in batches and insert them in batches.
Use a profiler - find out where the time is going.
Short of a real profiler, try the following:
Time how long it takes to just read the files line by line, without doing anything with them
Take a sample line, and time how long it takes just to parse it and do whatever processing you need, 5+ million times
Generate random data and insert it into the database, and time that
My guess is that the database will be the bottleneck. You should look into doing a batch insert - if you're inserting just a single record at a time, that's likely to be a lot slower than batch inserting.
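The row-at-a-time versus batch difference is easy to sketch with SQLite standing in for the real database (the table name and data are invented; the same idea applies to any DB driver that accepts parameter batches):

```python
import sqlite3

rows = [(i, f"name{i}") for i in range(5000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")

# One-at-a-time: a round trip (and often a commit) per record.
for row in rows[:1000]:
    conn.execute("INSERT INTO people VALUES (?, ?)", row)

# Batched: one call hands the driver the whole list of parameter tuples.
conn.executemany("INSERT INTO people VALUES (?, ?)", rows[1000:])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(count)  # 5000
```

Against a remote server the gap is far larger than in-process SQLite suggests, because every per-row call also pays network latency.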
I have done many applications like this in the past, and there are a number of ways you can look at optimizing.
Ensure that the code you are writing is properly managing memory; with something like this, one little mistake can slow the process to a crawl.
Think about writing the database calls to be async, as they may be the bottleneck, so a bit of queuing could be OK.
Consider dropping indexes, doing the import, then re-creating the indexes.
Consider using SSIS to do the import; it is already optimized and does this kind of thing out of the box.
Why not just insert the data directly into a SQL Server database using Microsoft SQL Server Management Studio or the command line (SQLCMD)? It knows how to process CSV files.
BulkInsert property should be set to True on your database.
If the data has to be modified, you can insert it into a temporary table and then apply your modifications with T-SQL.
Best bet would be to try using a profiler with a relatively small sample -- this could identify where the actual hold-ups are.
Load it into memory and then insert into the DB. 5 million rows shouldn't tax your memory. The problem is you are essentially thrashing your disk--both reading the CSV and writing to the DB.
I'd speed it up the same way I'd speed anything up: by running it through a profiler and figuring out what's taking the longest.
There is absolutely no way to guess what the bottleneck here is -- maybe there is a bug in the code which parses the CSV file, resulting in polynomial runtimes? Maybe there is some very complex logic used to process each row? Who knows!
Also, for the "record", 5 million rows isn't all THAT heavy -- an off-the-top-of-my-head guess says a reasonable program should be able to churn through that in half an hour, and a good program in much less.
Finally, if you find that the database is your bottleneck, check to see if a transaction is being committed after each insert. That can lead to some nontrivial slowdown...
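That commit-per-insert effect can be demonstrated in miniature. In this sketch (SQLite standing in for the real database), isolation_level=None puts the connection in autocommit mode, so each insert in the first loop is its own transaction, while the second loop wraps the same work in one explicit transaction; on a real disk the per-row version pays a sync per commit:

```python
import sqlite3
import tempfile
import time

rows = [(i,) for i in range(2000)]

with tempfile.NamedTemporaryFile(suffix=".db") as f:
    conn = sqlite3.connect(f.name, isolation_level=None)  # autocommit mode
    conn.execute("CREATE TABLE t (x INTEGER)")

    # Autocommit: every INSERT is its own transaction.
    t0 = time.perf_counter()
    for row in rows:
        conn.execute("INSERT INTO t VALUES (?)", row)
    per_row = time.perf_counter() - t0

    # One explicit transaction around the same number of inserts.
    t0 = time.perf_counter()
    conn.execute("BEGIN")
    for row in rows:
        conn.execute("INSERT INTO t VALUES (?)", row)
    conn.execute("COMMIT")
    one_txn = time.perf_counter() - t0

    total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    print(total)  # 4000
    print(per_row, one_txn)
    conn.close()
```

The absolute timings vary by machine and filesystem, which is why the sketch prints them rather than asserting a ratio.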
Not sure what you're doing with them, but have you considered perl? I recently re-wrote a vb script that was doing something similar - processing thousands of records - and the time went from about an hour for the vb script to about 15 seconds for perl.
After reading all the records from the file (I would read the entire file in one pass, or in blocks), use the SqlBulkCopy class to import your records into the DB. SqlBulkCopy is, as far as I know, the fastest approach to importing a block of records. There are a number of tutorials online.
As others have suggested, profile the app first.
That said, you will probably gain from doing batch inserts. That was the case for one app I worked on, and the impact was high.
Consider that 5 million round trips are a lot, especially if each of them is for a simple insert.
In a similar situation we saw considerable performance improvement by switching from one-row-at-time inserts to using the SqlBulkCopy API.
There is a good article here.
You need to bulk load the data into your database, assuming it has that facility. In SQL Server you'd be looking at BCP, DTS, or SSIS; BCP is the oldest but maybe the fastest. OTOH, if that's not possible in your DB, turn off all indexes before doing the run. I'm guessing it's the DB that's causing problems, not the .NET code.