U-SQL Files vs. Managed Tables - how is data stored physically? - azure-data-lake

I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers.
Simplifying a bit, I have a large set of data (with daily increments), but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility of doing a cross-client analysis (a much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client's data together makes the most sense). I was looking into two scenarios:
I will partition the data myself by splitting the files into a folder-file structure where I have full control over how it is done, how big the files are, etc.
I will use managed tables and set up table partitioning
I am now looking into pros and cons of both scenarios. Some things that come to my mind are:
The ability to compress data in scenario #1 (at the cost of performance of course)
The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)
On the other hand, using the tables is much more comfortable as I will be dealing with just one data source and will not have to worry about extracting the correct files, only about the correct filters in a query - in theory U-SQL should do the rest
I would expect that the tables will offer better performance
One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://learn.microsoft.com/en-us/u-sql/ddl/tables):
First we can read that:
"U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?
Later we can read that:
"..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."
Now this makes things more complicated. If I read it correctly, this means that if I never rebuild a table, my data will be stored physically in exactly the same way as the original raw files, and I will thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions, as the GUIDs were different (the .ss files in the store had different GUIDs than the partitions in the U-SQL views), so this is just my guess.
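To make things concrete, the kind of DDL I have in mind looks roughly like this (the names are invented, and the exact placement of the PARTITIONED BY / DISTRIBUTED BY clauses and the ADD PARTITION / REBUILD syntax should be checked against the CREATE TABLE and ALTER TABLE documentation):

// A sketch of a client-partitioned managed table - not verified syntax.
CREATE TABLE IF NOT EXISTS dbo.ClientData
(
    ClientId int,
    EventDate DateTime,
    Payload string,
    INDEX cix_ClientData CLUSTERED(EventDate ASC)
    PARTITIONED BY (ClientId)
    DISTRIBUTED BY HASH(EventDate)
);

// Partitions have to exist before rows can land in them.
ALTER TABLE dbo.ClientData ADD IF NOT EXISTS PARTITION (1), PARTITION (2);

// Each INSERT is supposed to add one extra file per affected partition...
INSERT INTO dbo.ClientData
SELECT ClientId, EventDate, Payload
FROM @dailyExtract;   // hypothetical rowset produced by an EXTRACT

// ...until the table is compacted again.
ALTER TABLE dbo.ClientData REBUILD;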
Therefore I have several questions:
Is there some documentation explaining in more detail how TABLE REBUILD works?
What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?
How can I monitor the size of my partitions? In my case (running locally; I have not checked it online yet), the GUIDs of the files and partitions in the store do not match even after a REBUILD (they do for the DB, schema and table)
Is there any documentation explaining in more detail how .ss files are created?
Which of the scenarios would you take and why?
Many thanks for your help,
Jakub
EDIT: I did some more tests and they only made things more intriguing.
I took a sample of 7 days of data
I created a table partitioned by date
I created 8 partitions - one for each day + one default
I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions
I imported the same file once again - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)
To be triple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)
I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense
I imported the file once again - ended up having 16 files (sizes don't match, so I guess I got 8 files for the partitions and 8 files for the import - 1 per partition)
I did the TABLE REBUILD - ended up again with 8 files - sizes still growing - still makes sense, but... this is where it gets funny
I then imported another file containing only 2 days of data
I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files with the new import for the 2 days, and 6 very small files
Being even more intrigued I ran the TABLE REBUILD
I ended up with 8 files (one for each partition) but... they had all just been modified
Conclusion? If I am not mistaken, it looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Is there anyone who could please explain whether I am wrong?
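One thing I still want to test is whether the rebuild can be scoped to a single partition, which would keep the daily maintenance cost proportional to the partitions actually touched rather than to the whole table - something along these lines (using the hypothetical table from above; I have not verified this syntax against the ALTER TABLE documentation):

// Compact only the partitions that received new files today,
// instead of rewriting every partition with a full table rebuild.
ALTER TABLE dbo.ClientData REBUILD PARTITION (1);
ALTER TABLE dbo.ClientData REBUILD PARTITION (2);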

Microsoft have recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing vs. round-robin, and partitioning.

Related

Keeping archived data schema up to date with running data warehouse

Recently our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full, and we need to come up with a way to archive old data which is not frequently accessed to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
However, we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
Is there a 'canon' way to maintain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.
Should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?
There are tons of materials on the internet if you search for the term "schema evolution" in the big data space.
The Athena documentation has a chapter on handling schema updates, with case-by-case examples, here.
If you are re-processing the whole archived dataset to handle a schema change, you are probably doing a bit too much.
Since you have Parquet files, and by default Athena resolves Parquet columns by name rather than by index, you are safe in almost all cases (adding new columns, dropping columns, etc.), except for column renames. To handle renamed columns (and also added/dropped columns), the fastest way is to use a view: in the view definition you can alias the renamed column. Also, if column renames make up most of your schema evolution and you do them a lot, you can consider Avro, which handles renames gracefully.
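For example, a view along these lines (the table and column names are invented for illustration) can present the archived data under the warehouse's current schema:

-- Hypothetical case: the live warehouse renamed customer_id to client_id
-- and added a signup_channel column after this data was archived.
CREATE OR REPLACE VIEW archive_orders_current AS
SELECT
    customer_id AS client_id,                -- alias the renamed column
    order_date,
    order_total,
    CAST(NULL AS varchar) AS signup_channel  -- column that did not exist yet
FROM archive_db.orders_parquet;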
Plan A:
It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
Every period, "Export tablespace" to remove the oldest from the partition scheme.
That tablespace will then be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
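A rough sketch of what that looks like (table and partition names are invented; remember the partitioning column has to be part of every unique key, and the initial ALTER needs room for a full copy of the table):

-- One-time restructuring: partition the big table by month.
ALTER TABLE fact_sales
    PARTITION BY RANGE (TO_DAYS(sale_date)) (
        PARTITION p2018_01 VALUES LESS THAN (TO_DAYS('2018-02-01')),
        PARTITION p2018_02 VALUES LESS THAN (TO_DAYS('2018-03-01')),
        PARTITION p_future VALUES LESS THAN MAXVALUE
    );

-- Each period: carve next month's partition out of the catch-all...
ALTER TABLE fact_sales REORGANIZE PARTITION p_future INTO (
    PARTITION p2018_03 VALUES LESS THAN (TO_DAYS('2018-04-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- ...and once the oldest month has been exported/archived, drop it;
-- with innodb_file_per_table=ON the space goes back to the OS.
ALTER TABLE fact_sales DROP PARTITION p2018_01;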
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
Don't archive the old data.
Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
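A sketch of the summarization idea, with invented table and column names:

-- Roll the rows that are about to be removed up into a compact summary table.
INSERT INTO sales_monthly_summary (sale_month, product_id, total_qty, total_revenue)
SELECT DATE_FORMAT(sale_date, '%Y-%m-01'), product_id, SUM(qty), SUM(revenue)
FROM fact_sales
WHERE sale_date < '2018-01-01'
GROUP BY DATE_FORMAT(sale_date, '%Y-%m-01'), product_id;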

SparkSQL: intra-SparkSQL-application table registration

Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through each of those query files and passes it to a standalone Spark application. The latter first parses the query, extracts the used tables, registers them (using registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and then actually executes the query (spark.sql()).
Challenge. Since registering the tables can be time consuming, I would like to register them lazily, i.e. only once, when they are first used, and keep that in the form of metadata that can readily be used in subsequent queries without the need to re-register the tables with each query. It's a sort of intra-job caching, but not any of the caching options Spark offers (table caching), as far as I know.
Is that possible? If not, can anyone suggest another approach to accomplish the same goal (iterating through separate query files and running a querying Spark application without re-registering tables that have already been registered)?
In general, registering a table should not take time (except that, if you have lots of files, it might take time to generate the list of file sources). It is basically just giving the dataframe a name. What does take time is reading the dataframe from disk.
So the basic question is how the dataframe (table) is written to disk. If it is written as a large number of small files or in a file format which is slow (e.g. CSV), this can take some time (having lots of files takes time to generate the file list, and a "slow" file format means the actual reading is slow).
So the first thing you can try to do is read your data and resave it.
Let's say, for the sake of example, that you have a large number of CSV files in some path. You can do something like:
df = spark.read.csv("path/*.csv")
Now that you have a dataframe, you can change it to have fewer files and use a better format, such as:
df.coalesce(100).write.parquet("newPath")
If the above is not enough, and your cluster is large enough to cache everything, you might put everything in a single job, go over all the tables in all the queries, register all of them and cache them. Then run your SQL queries one after the other (and time each one separately).
If all of this fails, you can try to use something like Alluxio (http://www.alluxio.org/) to create an in-memory file system and read from that.
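If the whole benchmark is driven through Spark SQL, the register-once-then-cache step can itself be written as SQL (the path and view name below are assumptions):

-- Register each source once as a named view over its files...
CREATE TEMPORARY VIEW customers
USING parquet
OPTIONS (path 'newPath/customers');

-- ...pin it in memory so later queries skip the disk read...
CACHE TABLE customers;

-- ...then run the benchmark queries against the cached view.
SELECT COUNT(*) FROM customers;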

Improving query performance of a database table with a large number of columns and rows (50 columns, 5mm rows)

We are building a caching solution for our user data. The data is currently stored in Sybase and is distributed across 5-6 tables, with a query service built on top of it using Hibernate, and we are getting very poor performance. Loading the data into the cache takes in the range of 10-15 hours.
So we have decided to create a denormalized table of 50-60 columns and 5mm rows in another relational database (UDB), populate that table first, and then populate the cache from the new denormalized table using JDBC so the time to build the cache is lower. This gives us much better performance, and now we can build the cache in around an hour, but this still does not meet our requirement of building the cache within 5 minutes. The denormalized table is queried using the following query:
select * from users where user_id in (...)
Here user_id is the primary key. We also tried the query
select * from users where user_location in (...)
and created a non-unique index on user_location as well, but that did not help either.
So, is there a way we can make the queries faster? If not, we are also open to considering NoSQL solutions.
Which NoSQL solution would be suited to our needs? Apart from the large table, we would be making around 1mm updates to the table on a daily basis.
I have read about MongoDB, and it seems that it might work, but no one has posted any experience with MongoDB with so many rows and so many daily updates.
Please let us know your thoughts.
The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write-heavy scaling, and similar issues, there are numerous approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations, and based on the information given, there is no reason it should not work.

SQL Data load - Suggestions required

Gurus,
We are in the process of setting up an SSIS package to load a formatted text file into SQL Server. It will have around 100 million rows, and the total file size will be 100 GB (multiple files of around 15 GB each). The file format is aligned with an XML schema, as shown below... it takes nearly 72 hrs to load this data into SQL Server tables...
File format
EM|123|XYZ|30|Sales mgr|20000|AD|1 Street 1| State1|City1|US|AD|12Street 2|state 2|City2|UK|CON|2012689648|CON|42343435
EM|113|WYZ|31|Sales grade|200|AD|12 Street 1| State2|City2|US|AD|1Street 22|state 3|City 3|UK|CON|201689648|CON|423435
EM|143|rYZ|32|Sales Egr|2000|AD|113Street 1| State3|City3|US|AD|12Street 21|state 4|City 5|UK|CON|201269648|CON|443435
Data will come in the above format. "EM" up to "AD" is employee details (Code, Name, Age, Designation, Salary), and "AD" is address details (Street, State, City, Country). There can be multiple addresses for the same employee... similarly, "CON" is contact details with a phone number, which may also occur multiple times.
So, we need to load the employee details, address details and contact details into separate tables, with Code as the primary key in the Employee Details table and a reference key in the other two tables.
We designed the package with a Script Component as the source, parsed the file line by line using .NET code, created multiple output buffers (one per table) and added the rows in the script. We mapped the Script Component output to 3 OLE DB Destinations (SQL Server tables).
Our server is a virtualized quad core with 48 GB RAM, and we have 2 cores with 24 GB dedicated to the DB. Our SQL Server DB (simple recovery model) has its data files on a network share location backed by SAN storage. To improve performance we created each table in a different data file (primary and secondary)... but it still takes around 72 hrs.
Need guidance on following points.
Is it possible to use BCP? If yes, any pointers? (I hope BCP will perform better.)
Any suggestions on the solution described above?
Any alternatives...
There are no indexes defined on the tables and no triggers... We have even set DefaultBufferSize to 100 MB.
Looking forward to your responses... Any help is much appreciated.
1.) If necessary, simplify/flatten your XML file via XSLT as shown here:
http://blogs.msdn.com/b/mattm/archive/2007/12/15/xml-source-making-things-easier-with-xslt.aspx
2.) Use XML Source as shown here:
http://blogs.msdn.com/b/mattm/archive/2007/12/11/using-xml-source.aspx
3.) Drop any indexes on the destination tables
4.) If your source data is bulletproof, disable constraints on the tables via:
ALTER TABLE [MyTable] NOCHECK CONSTRAINT ALL
5.) Load the data via OLEDB Destination
6.) Re-enable constraints
7.) Re-create indexes
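For step 6, the usual counterpart of the NOCHECK statement in step 4 is the following, which re-enables the constraints and re-validates the loaded data:

-- Re-enable and re-check all constraints after the load.
ALTER TABLE [MyTable] WITH CHECK CHECK CONSTRAINT ALL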
You say the data files are on a network share. One improvement would be to add a hard drive and run the job on the SQL server as you'd eliminate latency. I think even hooking up a USB drive to read the files from would be better than using a network location. Certainly worth a little test in my opinion.
SSIS is pretty quick when doing bulk loads, so I suspect that the bottleneck isn't with SSIS itself, but something about the way your database/server is configured. Some suggestions:
When you're running the import, how many rows are you importing every second? (You can do a "SELECT COUNT(*) FROM yourtable WITH (READUNCOMMITTED)" during the import to see.) Does this rate stay constant, or does it slow down towards the end of your import?
As others have said, do you have any indexes or triggers on your destination tables?
While you're running the import, what do your disks look like? In perfmon, is the disk queue spiking like crazy, indicating that your disks are the bottleneck? What's the throughput on these disks during normal performance testing? I've had experience where improperly configured iSCSI or improperly aligned SAN storage can drop my disks from 400MB/s down to 15MB/s - still fine during regular usage, but way too slow to do anything in bulk.
You're also talking about loading 100GB of data, which is no small amount - it shouldn't take 72 hours to load, but it won't load it in 20 minutes either, so have reasonable expectations. Please weigh in on these and the other bottlenecks people have asked about and we may be able to help you isolate your issue.
If you have any control over the way the files are created to begin with, I would move away from the one-to-many relationship you have with the |EM| and |AD|,|CON| and do something like this:
|EM|EmpID|data|data|
|AD|EmpID|data|data|
|CON|EmpID|data|data|
Further, if you can split the records into three different files, you'd be able to use a Flat File Source component with a fixed specification for each source to process the data in bulk.

Efficiently storing 7.300.000.000 rows

How would you tackle the following storage and retrieval problem?
Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:
id (unique row identifier)
entity_id (takes on values between 1 and 2.000.000 inclusive)
date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 1*365*10))
value_1 (takes on values between 1 and 1.000.000 inclusive)
value_2 (takes on values between 1 and 1.000.000 inclusive)
entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years worth of daily data (7.300.000.000 rows (3.650*2.000.000)).
What is described above is the write patterns. The read pattern is simple: all queries will be made on a specific entity_id. I.e. retrieve all rows describing entity_id = 12345.
Transactional support is not needed, but the storage solution must be open-sourced. Ideally I'd like to use MySQL, but I'm open for suggestions.
Now - how would you tackle the described problem?
Update: I was asked to elaborate regarding the read and write patterns. Writes to the table will be done in one batch per day where the new 2M entries will be added in one go. Reads will be done continuously with one read every second.
"Now - how would you tackle the described problem?"
With simple flat files.
Here's why
"all queries will be made on a
specific entity_id. I.e. retrieve all
rows describing entity_id = 12345."
You have 2.000.000 entities. Partition based on entity number:
level1= entity/10000
level2= (entity/100)%100
level3= entity%100
Then each file of data is level1/level2/level3/batch_of_data
You can then read all of the files in a given part of the directory to return samples for processing.
If someone wants a relational database, then load files for a given entity_id into a database for their use.
Edit On day numbers.
The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.
The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.
Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.
To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.
Edit on open/close
For writing, you have to leave the file(s) open. You do periodic flushes (or close/reopen) to assure that stuff really is going to disk.
You have two choices for the architecture of your writer.
Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.
Have several files open concurrently for writing. When querying, merge these files into a single result. This is helpful if queries are relatively rare. You pay for merging the data at query time.
Use partitioning. With your read pattern you'd want to partition by entity_id hash.
You might want to look at these questions:
Large primary key: 1+ billion rows MySQL + InnoDB?
Large MySQL tables
Personally, I'd also think about calculating your row width to give you an idea of how big your table will be (as per the partitioning note in the first link).
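As a rough illustration of both points, a hedged MySQL sketch (column types are guesses from the spec above, and it drops the surrogate id since (entity_id, date_id) is already unique):

-- Roughly 4 + 2 + 4 + 4 = 14 bytes of data per row plus InnoDB row
-- overhead, i.e. on the order of 20-30 bytes/row, or very roughly
-- 150-220 GB of data for 7.3 billion rows, before the index.
CREATE TABLE entity_values (
    entity_id INT UNSIGNED NOT NULL,
    date_id   SMALLINT UNSIGNED NOT NULL,
    value_1   INT UNSIGNED NOT NULL,
    value_2   INT UNSIGNED NOT NULL,
    PRIMARY KEY (entity_id, date_id)   -- partitioning column must be in every unique key
) ENGINE=InnoDB
  PARTITION BY HASH (entity_id)
  PARTITIONS 64;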
HTH.,
S
Your application appears to have the same characteristics as mine. I wrote a MySQL custom storage engine to efficiently solve the problem. It is described here
Imagine your data is laid out on disk as an array of 2M fixed length entries (one per entity) each containing 3650 rows (one per day) of 20 bytes (the row for one entity per day).
Your read pattern reads one entity. It is contiguous on disk, so it takes 1 seek (about 8 milliseconds) and reads 3650 x 20 = about 80K at maybe 100 MB/sec... so it is done in a fraction of a second, easily meeting your 1-query-per-second read pattern.
The update has to write 20 bytes in 2M different places on disk. In the simplest case this would take 2M seeks, each of which takes about 8 milliseconds, so it would take 2M * 8 ms = about 4.5 hours. If you spread the data across 4 "raid0" disks it could take 1.125 hours.
However, the places are only 80K apart, which means there are 200 such places within a 16 MB block (typical disk cache size), so it could operate at anything up to 200 times faster (about 1 minute). Reality is somewhere between the two.
My storage engine operates on that kind of philosophy, although it is a little more general purpose than a fixed length array.
You could code exactly what I have described. Putting the code into a MySQL pluggable storage engine means that you can use MySQL to query the data with various report generators etc.
By the way, you could eliminate the date and entity id from the stored row (because they are the array indexes), and maybe the unique id - if you don't really need it, since (entity id, date) is unique - and store the 2 values as 3-byte ints. Then your stored row is 6 bytes, you have about 700 update locations per 16M block, and therefore faster inserts and a smaller file.
Edit Compare to Flat Files
I notice that comments general favor flat files. Don't forget that directories are just indexes implemented by the file system and they are generally optimized for relatively small numbers of relatively large items. Access to files is generally optimized so that it expects a relatively small number of files to be open, and has a relatively high overhead for open and close, and for each file that is open. All of those "relatively" are relative to the typical use of a database.
Using file system names as an index for an entity id, which I take to be a non-sparse integer from 1 to 2 million, is counter-intuitive. In a program you would use an array, not a hash table, for example, and you are inevitably going to incur a great deal of overhead for an expensive access path that could simply be an array indexing operation.
Therefore if you use flat files, why not use just one flat file and index it?
Edit on performance
The performance of this application is going to be dominated by disk seek times. The calculations I did above determine the best you can do (although you can make INSERT quicker by slowing down SELECT - you can't make them both better). It doesn't matter whether you use a database, flat files, or one flat file, except that you can add more seeks that you don't really need and slow it down further. For example, indexing (whether it's the file system index or a database index) causes extra I/Os compared to "an array lookup", and these will slow you down.
Edit on benchmark measurements
I have a table that looks very much like yours (or almost exactly like one of your partitions). It was 64K entities not 2M (1/32 of yours), and 2788 'days'. The table was created in the same INSERT order that yours will be, and has the same index (entity_id,day). A SELECT on one entity takes 20.3 seconds to inspect the 2788 days, which is about 130 seeks per second as expected (on 8 millisec average seek time disks). The SELECT time is going to be proportional to the number of days, and not much dependent on the number of entities. (It will be faster on disks with faster seek times. I'm using a pair of SATA2s in RAID0 but that isn't making much difference).
If you re-order the table into entity order
ALTER TABLE x ORDER BY ENTITY, DAY
Then the same SELECT takes 198 millisecs (because it is reading the order entity in a single disk access).
However the ALTER TABLE operation took 13.98 DAYS to complete (for 182M rows).
There are a few other things the measurements tell you:
1. Your index file is going to be as big as your data file. It is 3 GB for this sample table. That means (on my system) the index operates at disk speeds, not memory speeds.
2. Your INSERT rate will decline logarithmically. The INSERT into the data file is linear, but the insert of the key into the index is logarithmic. At 180M records I was getting 153 INSERTs per second, which is also very close to the seek rate. It shows that MySQL is updating a leaf index block for almost every INSERT (as you would expect, because it is indexed on entity but inserted in day order). So you are looking at 2M / 153 per second = about 3.6 hrs to do your daily insert of 2M rows (divided by whatever effect you can get by partitioning across systems or disks).
I had a similar problem (although at a much bigger scale - about your yearly volume every day).
Using one big table got me screeching to a halt - you can pull a few months' worth, but I guess you'll eventually partition it.
Don't forget to index the table, or else you'll be messing with a tiny trickle of data on every query; oh, and if you want to do mass queries, use flat files.
Your description of the read patterns is not sufficient. You'll need to describe what amounts of data will be retrieved, how often and how much deviation there will be in the queries.
This will allow you to consider doing compression on some of the columns.
Also consider archiving and partitioning.
If you want to handle huge data volumes with millions of rows, the problem can be treated like a time-series database, which logs a timestamp and saves the data to the database. Some options for storing the data are InfluxDB and MongoDB.