SparkSQL: intra-SparkSQL-application table registration - apache-spark-sql

Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through each of those query files and passes it to a standalone Spark application. The latter first parses the query, extracts the tables it uses, registers them (using registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and then executes the query (spark.sql()).
Challenge. Since registering the tables can be time consuming, I would like to register them lazily, i.e. only once, when they are first used, and keep that registration as metadata that can readily be reused by subsequent queries without re-registering the tables for each query. It's a sort of intra-job caching, but not any of the caching options Spark offers (table caching), as far as I know.
Is that possible? If not, can anyone suggest another approach to accomplish the same goal (iterating through separate query files and running them in a Spark application without re-registering tables that have already been registered before)?
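To make the intent concrete, here is a minimal sketch of the kind of lazy registration I have in mind, assuming a single long-lived SparkSession shared by all queries; extract_tables() and load_table() are hypothetical placeholders for my own parsing and loading logic:

registered = set()

def run_query(spark, sql_text):
    # Register each referenced table only the first time it is seen.
    for table in extract_tables(sql_text):        # hypothetical query parser
        if table not in registered:
            load_table(spark, table).createOrReplaceTempView(table)
            registered.add(table)
    return spark.sql(sql_text)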

In general, registering a table should not take time (except that, if you have lots of files, it might take time to generate the list of file sources). It is basically just giving the dataframe a name. What would take time is reading the dataframe from disk.
So the basic question is: how are the dataframes (tables) written to disk? If they are written as a large number of small files or in a file format which is slow to read (e.g. CSV), this can take some time (having lots of files takes time to generate the file list, and having a "slow" file format means the actual reading is slow).
So the first thing you can try to do is read your data and resave it.
Let's say, for the sake of example, that you have a large number of CSV files in some path. You can do something like:
df = spark.read.csv("path/*.csv")
Now that you have a dataframe, you can rewrite it with fewer files and a better format, such as:
df.coalesce(100).write.parquet("newPath")
If the above is not enough, and your cluster is large enough to cache everything, you might put everything in a single job: go over all tables in all queries, register all of them and cache them, then run your SQL queries one after the other (and time each one separately).
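A rough sketch of that single-job approach, assuming the table names and the paths of their (re-saved) Parquet data are known up front; the names and paths below are illustrative only:

# Register and cache every table once, up front.
for name, path in [("orders", "newPath/orders"), ("customers", "newPath/customers")]:
    spark.read.parquet(path).createOrReplaceTempView(name)
    spark.sql("CACHE TABLE " + name)       # materialize once, reuse across queries

# Then run the benchmark queries one after the other.
for query_file in query_files:             # query_files: your list of SQL query files
    with open(query_file) as f:
        spark.sql(f.read()).collect()      # time each query separately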
If all of this fails, you can try using something like Alluxio (http://www.alluxio.org/) to create an in-memory file system and read from that.

Related

Possible Reasons for SSIS Package to Process Every Second Row of Dataset

Firstly, sorry that I can't get really specific, as SSIS packages can get very complicated and are awkward to describe.
My scenario requires a CSV to be processed after validation against other tables and stored in several tables in a SQL database. The result is that only every second row of my CSV gets processed on the target system. The test and dev environments (which have an equivalent setup) behave as expected and store all rows.
What are possible reasons for my target system to behave differently? What are the most likely mistakes or oversights?
In the Flat File Connection Manager, check whether "Header rows to skip" is set to zero.

U-SQL Files vs. Managed Tables - how is data stored physically?

I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers.
Simplifying a bit, I have a large set of data (with daily increments), but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility of doing a cross-client analysis (a much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client's data together makes the most sense). I was looking into two scenarios:
I will partition the data myself by splitting the files into folder-file structure where I can have full control over how it is done, how big the files are etc.
I will use managed tables and set up table partitioning
I am now looking into pros and cons of both scenarios. Some things that come to my mind are:
The ability to compress data in scenario #1 (at the cost of performance of course)
The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)
On the other hand, using the tables is much more comfortable, as I will be dealing with just one data source and will not have to worry about extracting the correct files, only about the correct filters in a query - in theory U-SQL should do the rest
I would expect that the tables will offer better performance
One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://learn.microsoft.com/en-us/u-sql/ddl/tables):
First we can read that:
"U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?
Later we can read that:
"..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."
Now this makes things more complicated. If I read it correctly, this means that if I never rebuild a table, my data will be stored physically in exactly the same way as the original raw files, and I will thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions, as the GUIDs were different (the .ss files in the store had different GUIDs than the partitions in the U-SQL views), so this is just my guess.
Therefore I have several questions:
Is there some documentation explaining in more detail how TABLE REBUILD works?
What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?
How can I monitor the size of my partitions? In my case (running locally, I have not checked it online yet) the GUIDs of the files and the partitions in the store do not match even after REBUILD (they do for the DB, schema and table)
Is there any documentation explaining in more detail how .ss files are created?
Which of the scenarios would you take and why?
Many thanks for your help,
Jakub
EDIT: I did some more tests and it only made things more intriguing.
I took a sample of 7 days of data
I‌ created a table partitioned by date
I created 8 partitions - one for each day + one default
I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions
I imported the same file once again - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)
To be triple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)
I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense
I imported the file once again - ended up having 16 files (sizes don't match, so I guess I got 8 files for the partitions and 8 files for the import - 1 per partition)
I did the TABLE REBUILD - ended up again with 8 files - sizes still growing - still makes sense, but... this is where it gets funny
I then imported another file containing only 2 days of data
I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files with the new import for the 2 days, and 6 very small files
Being even more intrigued I ran the TABLE REBUILD
I ended up with 8 files (one for each partition) but... they were all just recently modified
Conclusion? If I am not mistaken, it looks like the rebuild will actually touch all my files, no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Could anyone please tell me whether I am wrong here?
Microsoft has recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing vs. round-robin, and partitioning.

Many small data table I/O for pandas?

I have many tables (about 200K of them), each small (typically less than 1K rows and 10 columns), that I need to read as fast as possible in pandas. The use case is fairly typical: a function loads these tables one at a time, computes something on them and stores the final result (not keeping the content of the table in memory).
This is done many times over and I can choose the storage format for these tables for best (speed) performance.
What natively supported storage format would be the quickest?
IMO there are a few options in this case:
use an HDF store (AKA PyTables, .h5), as @jezrael has already suggested. You can decide whether you want to group some/all of your tables and store them in the same .h5 file using different identifiers (or keys, in pandas terminology) - see the sketch below
use the new and extremely fast Feather format (part of the Apache Arrow project). NOTE: it's still a fairly new format, so it might change in the future, which could lead to incompatibilities between different versions of the feather-format module. You also can't put multiple DFs in one feather file, so you can't group them.
use a database for storing/reading tables. PS it might be slower for your use case.
PS you may also want to check this comparison, especially if you want to store your data in a compressed format
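A small sketch of the first two options, assuming pandas with PyTables and pyarrow/feather installed; the file names and keys below are illustrative only:

import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": range(1000)})   # stand-in for one small table

# Option 1: many small tables grouped in one HDF5 file, one key per table
with pd.HDFStore("tables.h5") as store:
    store.put("region/table_00001", df)          # the key identifies the table
    small = store.get("region/table_00001")      # fast keyed read

# Option 2: one Feather file per table (no grouping possible)
df.to_feather("table_00001.feather")
small = pd.read_feather("table_00001.feather")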

What is more efficient, INSERT commands or SQL*Loader, for bulk upload - Oracle 11g R2

As part of a new process requirement, we will be creating a table which will contain approximately 3000-4000 records. We have a copy of these records in plain text in a txt file.
Loading these records into the table leaves me with two choices:
Use a shell script to generate a SQL file containing INSERT statements for these records.
Using awk, shell variables, and loops to create the SQL file, and then scripting its execution, can be done with ease.
Use SQL*Loader.
Realignment of the record list and generation of the ctl file are the only dependencies.
Which of the above two options would be more efficient in terms of DB resource usage and utilisation on the client server on which this is to be performed?
I do realise the number of records is rather small, but we may have to repeat this activity with a higher number of records (close to 60,000), in which case I would like to have the best possible option configured from the start.
SQL*Loader is the more efficient method, and it gives you more control. You have the option to do a DIRECT load with NOLOGGING, which will reduce redo log generation, and when indexes have been disabled (as part of direct loading), the loading goes faster. The downside is that if the load is interrupted, the indexes are left unusable.
But, considering the advantages, SQL*Loader is the best approach. You will feel the difference when you have millions of records and many loading jobs running in parallel. I have heard DBAs complaining about the redo log size when we did conventional INSERT-statement loading with 200+ such jobs running in parallel. The larger the data volume, the larger the difference you'll see in performance.
SQL*Loader will be more efficient than thousands of individual INSERT statements. Even with 60,000 rows, though, both approaches should complete in a matter of seconds.
Of the two options you mentioned, SQL*Loader is definitely the way to go - much faster and more efficient.
However, I'd choose another approach - external tables. They have all the benefits of SQL*Loader and allow you to treat your external CSV file like an ordinary database table.

Microsoft Access: Import CSV file from a list of multiple files

I have a huge file, around 10 GB, in *.csv format. It is data from 1960 to the present date for different regions. I was able to break down the file by region. There are about 8000 regions, and after splitting the file by region I have 8000 files of about 2 MB each.
I was wondering what would be the most efficient way to create an Access database system to look up data for each region. Is it by:
Separating the file into small files by region name (i.e. 8000 files, one per region) and importing them into Access each time, OR
Splitting them into files of a constant size, about 1 GB each, and querying them?
In either case, how do I import the files to Access?
As you may be aware, an Access database file is limited to 2GB in size, so you almost certainly won't be able to keep all of the information in one file. Even if it did just barely fit, keeping that much information in a single Access database file would likely be rather slow to work with.
Depending on the "shape" of your data there may be other limits in Access that could cause difficulties in your particular situation. For example, a table (or query) is limited to 255 columns. If you haven't done so already, take a look at the Access specifications before proceeding too much further with this.
And in any case, consider using another database as the back-end. Your data may even be too large for a single database in SQL Server Express Edition (maximum of 10 GB total size per database, as I recall), but even if you had to split the data into two SQL Express databases, it would be easier to deal with than a dozen (or more?) Access databases.
Bear in mind that if you use a different database back-end you may still be able to use Access as a query and reporting tool (via ODBC linked tables).
Edit re: comment
Based on your description, if you will never need to query across regions (and remember that "never" is a very long time™) then your 8000-file approach would be feasible. However, I wouldn't necessarily recommend importing the corresponding CSV data every time you want to run a query. Instead, I would borrow ideas from both Tom's and HansUp's answers:
Plan "A": Start by running queries directly against the CSV files themselves to see if that is fast enough for your needs. You could test that by creating a linked table to the CSV file and running some typical queries. As Tom mentioned, a CSV linked table cannot be indexed, so if you find that the queries are too slow then you may have to go to Plan "B".
Plan "B": If you do need to import the CSV data then you'll probably want to use HansUp's suggestion of using DoCmd.TransferText to help automate the process. It seems wasteful to import the specific CSV file for every query, so you might consider creating ~8000 .accdb files and then using a query like...
strSQL = _
    "SELECT * FROM TableName " & _
    "IN ""C:\__tmp\region12345.accdb"" " & _
    "WHERE StartDate BETWEEN #2013-05-10# AND #2013-05-15#"
...where your code could substitute
the name of the appropriate .accdb file based on the region of interest, and
the required date range.
If you will be doing this with VBA, you can use the DoCmd.TransferText Method to import CSV data into Access.
I wouldn't want to do that in your situation, though. 10 GB is too much data to reasonably manage in Access. And if you partition that into separate db files, querying data pulled from multiple db files is challenging and slow. Furthermore, if the query's combined result set hits the 2 GB Access limit, you will get a confusing error about insufficient disk space.
This is not a reasonable job for data storage in MS Access.
#Gords & #HansUps are very good answers. Use a better backend for your data. Free ones would include SQL Express & MySQL. If you're in a corporate environment, then you may have a license for MS SQL Server.
However, if you insist on doing this strictly in Access, here are two related ideas. Both ideas require that you link and de-link (using VBA) to the data you need, as you need it.
You don't have to import a CSV file to be able to see it as a table. You can link to it just as you would a table in another database.
Positives: You don't have to change your existing data format.
Drawbacks: You can't edit your existing data, nor can you index it, so queries may be slow.
Or, you can convert each CSV file into its own Access DB (you can use VBA to automate this). Then, as in the above suggestion, link and de-link the tables as needed.
Positives: You can edit your existing data, and also index it, so queries may be quick.
Drawbacks: It's an awful amount of work just to avoid using a different back-end DB.