Gurus,
We are in the process of setting up an SSIS package to load formatted text files into SQL Server. There will be around 100 million rows, totalling about 100 GB (multiple files of around 15 GB each). The file format is aligned with an XML schema, as shown below. It currently takes nearly 72 hours to load these files into the SQL Server tables...
File format
EM|123|XYZ|30|Sales mgr|20000|AD|1 Street 1| State1|City1|US|AD|12Street 2|state 2|City2|UK|CON|2012689648|CON|42343435
EM|113|WYZ|31|Sales grade|200|AD|12 Street 1| State2|City2|US|AD|1Street 22|state 3|City 3|UK|CON|201689648|CON|423435
EM|143|rYZ|32|Sales Egr|2000|AD|113Street 1| State3|City3|US|AD|12Street 21|state 4|City 5|UK|CON|201269648|CON|443435
Data will come in the above format. Everything from "EM" up to the first "AD" is the employee details (Code, Name, Age, Designation, Salary); each "AD" segment holds address details (Street, State, City, Country), and there can be multiple addresses for the same employee. Similarly, each "CON" segment holds a contact detail (phone number), which can also occur multiple times.
So we need to load the employee details, address details, and contact details into three separate tables, with Code as the primary key in the Employee table and a foreign key in the other two.
We designed the package with a Script Component as the source: it parses the file line by line in .NET, creates one output buffer per table, and adds each parsed row to the appropriate buffer. The Script Component outputs are mapped to three OLE DB Destinations (SQL Server tables).
Our server is a virtualized quad core with 48 GB RAM; 2 cores and 24 GB are dedicated to the database. The SQL Server database (simple recovery model) has its data files on a network share backed by SAN storage. To improve performance we placed each table in a different data file (primary and secondary), but it still takes around 72 hours.
We need guidance on the following points:
Is it possible to use BCP? If yes, any pointers? (We hope BCP will perform better.)
Any suggestions on the solution described above?
Any alternatives?
There are no indexes or triggers defined on the tables. We have even set DefaultBufferSize to 100 MB.
Looking forward to your responses. Any help is much appreciated.
1.) If necessary, simplify/flatten your XML file via XSLT as shown here:
http://blogs.msdn.com/b/mattm/archive/2007/12/15/xml-source-making-things-easier-with-xslt.aspx
2.) Use XML Source as shown here:
http://blogs.msdn.com/b/mattm/archive/2007/12/11/using-xml-source.aspx
3.) Drop any indexes on the destination tables
4.) If your source data is bulletproof, disable constraints on the tables via:
ALTER TABLE [MyTable] NOCHECK CONSTRAINT ALL
5.) Load the data via OLEDB Destination
6.) Re-enable constraints
7.) Re-create indexes (a rough T-SQL sketch of steps 3, 4, 6, and 7 follows below)
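For steps 3, 4, 6, and 7, something along these lines should do it; the table and index names below are hypothetical stand-ins for your actual Employee/Address/Contact objects:

-- Before the load (steps 3 and 4); table and index names are hypothetical.
DROP INDEX IX_AddressDetails_Code ON dbo.AddressDetails;
ALTER TABLE dbo.AddressDetails NOCHECK CONSTRAINT ALL;

-- ... run the load (step 5) ...

-- After the load (steps 6 and 7); WITH CHECK re-validates the data and keeps the constraints trusted.
ALTER TABLE dbo.AddressDetails WITH CHECK CHECK CONSTRAINT ALL;
CREATE NONCLUSTERED INDEX IX_AddressDetails_Code ON dbo.AddressDetails (Code);

Repeat the same pattern for the other two tables.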
You say the data files are on a network share. One improvement would be to add a hard drive and run the job on the SQL server as you'd eliminate latency. I think even hooking up a USB drive to read the files from would be better than using a network location. Certainly worth a little test in my opinion.
SSIS is pretty quick when doing bulk loads, so I suspect that the bottleneck isn't with SSIS itself, but something about the way your database/server is configured. Some suggestions:
When you're running the import, how many rows are you importing per second? (You can run "SELECT COUNT(*) FROM yourtable WITH (READUNCOMMITTED)" during the import to check; a small sketch follows at the end of this answer.) Does this rate stay constant, or does it slow down towards the end of your import?
As others have said, do you have any indexes or triggers on your destination tables?
While you're running the import, what do your disks look like? In perfmon, is the disk queue spiking like crazy, indicating that your disks are the bottleneck? What's the throughput on these disks during normal performance testing? I've had experience where improperly configured iSCSI or improperly aligned SAN storage can drop my disks from 400MB/s down to 15MB/s - still fine during regular usage, but way too slow to do anything in bulk.
You're also talking about loading 100GB of data, which is no small amount - it shouldn't take 72 hours to load, but it won't load it in 20 minutes either, so have reasonable expectations. Please weigh in on these and the other bottlenecks people have asked about and we may be able to help you isolate your issue.
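Here's a rough way to sample the load rate while the import is running; the table name is a placeholder and the one-minute delay is arbitrary:

-- Dirty-read row counts taken a minute apart give an approximate rows/second figure.
DECLARE @before BIGINT, @after BIGINT;
SELECT @before = COUNT_BIG(*) FROM dbo.EmployeeDetails WITH (READUNCOMMITTED);
WAITFOR DELAY '00:01:00';
SELECT @after = COUNT_BIG(*) FROM dbo.EmployeeDetails WITH (READUNCOMMITTED);
SELECT (@after - @before) / 60 AS RowsPerSecond;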
If you have any control over the way the files are created in the first place, I would move away from the one-to-many relationship you have between the |EM| and the |AD|/|CON| records and do something like this:
|EM|EmpID|data|data|
|AD|EmpID|data|data|
|CON|EmpID|data|data|
Further, if you can split the records into three different files, you'd be able to use a Flat File Source component with a fixed specification for each source to process the data in bulk.
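If you do split the files that way, they could even be bulk loaded straight into staging tables with BULK INSERT, which is roughly what bcp would give you from the command line. A minimal sketch; the file paths, table names, and options are all hypothetical:

BULK INSERT dbo.EmployeeStage
FROM '\\fileshare\load\employees.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK, BATCHSIZE = 100000);

BULK INSERT dbo.AddressStage
FROM '\\fileshare\load\addresses.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK, BATCHSIZE = 100000);

BULK INSERT dbo.ContactStage
FROM '\\fileshare\load\contacts.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK, BATCHSIZE = 100000);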
I am quite new to ADL and U-SQL. I went through quite a lot of documentation and presentations, but I am afraid I am still missing many answers.
Simplifying a bit, I have a large set of data (with daily increments) but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility to do a cross-client analysis (much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client data together makes most sense). I was looking into two scenarios:
I will partition the data myself by splitting the files into folder-file structure where I can have full control over how it is done, how big the files are etc.
I will use managed tables and set up table partitioning
I am now looking into pros and cons of both scenarios. Some things that come to my mind are:
The ability to compress data in scenario #1 (at the cost of performance of course)
The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)
On the other hand, using the tables is much more comfortable, as I will be dealing with just one data source and will not have to worry about extracting the correct files, only about the correct filters in a query - in theory U-SQL should do the rest
I would expect that the tables will offer better performance
One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://learn.microsoft.com/en-us/u-sql/ddl/tables):
First we can read that:
"U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?
Later we can read that:
"..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."
Now this makes things more complicated. If I read it correctly, this means that if I never rebuild a table, my data will be stored physically in exactly the same way as the original raw files, and I will thus get poor performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions, as the GUIDs were different (the .ss files in the store had different GUIDs than the partitions in the U-SQL views), so this is just my guess.
Therefore I have several questions:
Is there some documentation explaining in more detail how TABLE REBUILD works?
What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?
How can I monitor the size of my partitions? In my case (running locally, I have not checked it online yet) the GUIDs of the files and partitions in the store do not match even after a REBUILD (they do for the DB, schema and table)
Is there any documentation explaining in more details how .ss files are created?
Which of the scenarios would you take and why?
Many thanks for your help,
Jakub
EDIT: I did some more tests and it only made it more intriguing.
I took a sample of 7 days of data
I created a table partitioned by date
I created 8 partitions - one for each day + one default
I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions
I imported the same file once again - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)
To be triple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)
I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense
I imported the file once again - ended up having 16 files (sizes don't match, so I guess I got 8 files for the partitions and 8 files for the import - 1 per partition)
I did the TABLE REBUILD - ended up again with 8 files - sizes still growing - still makes sense, but... this is where it gets funny
I then imported another file containing only 2 days of data
I ended up with... nope, you didn't guess! - 16 files. So I got the 8 files with the large partitions, 2 larger files from the new import for the 2 days, and 6 very small files
Being even more intrigued I ran the TABLE REBUILD
I ended up with 8 files (one for each partition), but... they had all just been modified
Conclusion? If I am not mistaken, this looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Is there anyone who could please explain where I am wrong?
Microsoft has recently released a whitepaper called "U-SQL Performance Optimization" which you should read. It includes detailed notes on distribution, hashing vs. round-robin, and partitioning.
I have created an ETL process with Pentaho that selects data from a table in one database and loads it into another database.
The main problem I am facing is that it takes 6 hours for 1,500,000 rows. The full table is 15,000,000 rows, and I have to load 5 tables like that.
Can anyone explain how one is supposed to load a large volume of data with Pentaho?
Thank you.
I have never had volume problems with Pentaho PDI. Check the following, in order.
Can you check whether the problem is really coming from Pentaho: what happens if you run the query in SQL Developer, Toad, or your favourite JDBC-compliant SQL IDE?
In principle, PDI is meant to import data with a SELECT * FROM ... WHERE ... and do all the rest in the transformation. I have a set of transformations here which took hours to execute because they ran complex queries. The problem was not PDI but the complexity of the query. The solution is to move the GROUP BY and the SELECT FROM (SELECT...) out of the query and into PDI steps, which can start working before the query result is finished (see the sketch after this list). The result was something like 4 hours down to 56 seconds. No joke.
What is your memory size? It is defined in the spoon.bat / spoon.sh.
Near the end you have a line which looks like PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m" "-XX:MaxPermSize=256m". The important parameter is -Xmx.... If it is -Xmx256K, your JVM has only 256 KB of RAM to work with.
Change it to 1/2 or 3/4 of the available memory, in order to leave room for the other processes.
Is the output step the bottleneck? Check by disabling it and watching your clock during the run.
If it is, increase the commit size and allow batch inserts.
Disable all the indexes and constraints and restore them once loaded. You have nice SQL script executor steps to automate that, but check it manually first, then in a job; otherwise the index rebuild may trigger before the load begins.
You also have to check that you do not lock yourself: as PDI launches the steps all together, you may have truncates which are waiting on another truncate to release its lock. Even if you are not in a never-ending block, it may take quite a while before the DB is able to cascade everything.
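To illustrate the point about moving work out of the source query, here is a sketch with a hypothetical sales table. Rather than making the source database finish the whole aggregation before PDI sees a single row, stream the raw rows and aggregate inside the transformation with a "Memory Group by" (or sort plus "Group by") step:

-- Slow pattern: the source DB must complete the aggregation before returning rows.
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- Faster pattern: stream the raw rows into PDI and do the grouping in a PDI step.
SELECT region, amount
FROM sales;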
There's no fixed answer covering all possible performance issues. You'll need to identify the bottlenecks and solve them in your environment.
If you look at the Metrics tab while running the job in Spoon, you can often see at which step the rows/s rate drops. It will be the one with the full input buffer and empty output buffer.
To get some idea of the maximum performance of the job, you can test each component individually.
Connect the Table Input to a dummy step only and see how many rows/s it reaches.
Define a Generate Rows step with all the fields that go to your destination and some representative data and connect it to the Table Output step. Again, check the rows/s to see the destination database's throughput.
Start connecting more steps/transformations to your Table Input and see where performance goes down.
Once you know your bottlenecks, you'll need to figure out the solutions. Bulk load steps often help the output rate. If network lag is holding you back, you might want to dump data to compressed files first and copy those locally. If your Table input has joins or where clauses, make sure the source database has the correct indexes to use, or change your query.
I have a huge file, around 10 GB, in *.csv format. It is data from 1960 to the present date for different regions. I was able to break the file down by region. There are about 8000 regions, and I split the file by region, so I have 8000 files of about 2 MB each.
I was wondering what would be the most efficient way to create an Access database system to look up data for each region. Is it by:
Keeping the file separated into small files by region name (i.e. 8000 files, one per region) and importing the relevant one into Access each time, OR
Splitting the data into files of a constant size, about 1 GB each, and querying them.
In either case, how do I import the files to Access?
As you may be aware, an Access database file is limited to 2GB in size, so you almost certainly won't be able to keep all of the information in one file. Even if it did just barely fit, keeping that much information in a single Access database file would likely be rather slow to work with.
Depending on the "shape" of your data there may be other limits in Access that could cause difficulties in your particular situation. For example, a table (or query) is limited to 255 columns. If you haven't done so already, take a look at the Access specifications before proceeding too much further with this.
And in any case, consider using another database as the back-end. Your data may even be too large for a single database in SQL Server Express Edition (maximum of 10GB total size per database, as I recall), but even if you had to split the data into two SQL Express databases it would be easier to deal with than a dozen (or more?) Access databases.
Bear in mind that if you use a different database back-end you may still be able to use Access as a query and reporting tool (via ODBC linked tables).
Edit re: comment
Based on your description, if you will never need to query across regions (and remember that "never" is a very long time™) then your 8000-file approach would be feasible. However, I wouldn't necessarily recommend importing the corresponding CSV data every time you want to run a query. Instead, I would borrow ideas from both Tom's and HansUp's answers:
Plan "A": Start by running queries directly against the CSV files themselves to see if that is fast enough for your needs. You could test that by creating a linked table to the CSV file and running some typical queries. As Tom mentioned, a CSV linked table cannot be indexed, so if you find that the queries are too slow then you may have to go to Plan "B".
Plan "B": If you do need to import the CSV data then you'll probably want to use HansUp's suggestion of using DoCmd.TransferText to help automate the process. It seems wasteful to import the specific CSV file for every query, so you might consider creating ~8000 .accdb files and then using a query like...
strSQL = _
"SELECT * FROM TableName " & _
"IN ""C:\__tmp\region12345.accdb"" " & _
"WHERE StartDate BETWEEN #2013-05-10# AND #2013-05-15#"
...where your code could substitute
the name of the appropriate .accdb file based on the region of interest, and
the required date range.
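Going back to Plan "A": a query against the CSV file itself, with no import at all, would look something like the following. The Text connect-string details are from memory and may need adjusting, the folder and file names are placeholders, and you may need a schema.ini file alongside the CSVs to get the column types right.

SELECT * FROM [Text;FMT=Delimited;HDR=Yes;Database=C:\__tmp].[region12345#csv]
WHERE StartDate BETWEEN #2013-05-10# AND #2013-05-15#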
If you will be doing this with VBA, you can use the DoCmd.TransferText Method to import CSV data into Access.
I wouldn't want to do that in your situation, though. 10 GB is too much data to reasonably manage in Access. And if you partition that into separate db files, querying data pulled from multiple db files is challenging and slow. Furthermore, if the query's combined result set hits the 2 GB Access limit, you will get a confusing error about insufficient disk space.
This is not a reasonable job for data storage in MS Access.
#Gords & #HansUps are very good answers. Use a better backend for your data. Free ones would include SQL Express & MySQL. If you're in a corporate environment, then you may have a license for MS SQL Server.
However, if you insist on doing this doing this in strictly Access, here are two related ideas. Both ideas require that you link and de-link (using VBA) to the data you need as you need it.
You don't have to import a CSV file to be able to see it as a table. You can link to it just as you would a table in another database.
Positives: You don't have to change your existing data format.
Drawbacks: You can't edit your existing data, nor can you index it, so queries may be slow.
Or, you can convert each CSV file into its own Access DB (you can use VBA to automate this). Then, as in the above suggestion, link and de-link the tables as needed.
Positives: You can edit your existing data, and also index it, so queries may be quick.
Drawbacks: It's an awful lot of work just to avoid using a different back-end DB.
I'm trying to set up an automated process to regularly transform and export a large MS SQL 2008 database to MongoDB.
There is not a 1-1 correspondence between tables in SQL and collections in MongoDB -- for example the Address table in SQL is translated into an array embedded in each customer's record in Mongo and so on.
Right now I have a 3 step process:
Export all the relevant portions of the database to XML using a FOR XML query (roughly the shape sketched after this list).
Translate XML to mongoimport-friendly JSON using XSLT
Import to mongo using mongoimport
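For reference, the step 1 query is shaped roughly like this; the table and column names are simplified stand-ins for the real schema:

-- Simplified illustration of the nested FOR XML PATH export; real names differ.
SELECT  c.CustomerID,
        c.FirstName,
        (SELECT a.Street, a.City
         FROM dbo.Address AS a
         WHERE a.CustomerID = c.CustomerID
         FOR XML PATH('address'), TYPE) AS addresses,
        (SELECT o.OrderID, o.OrderDate
         FROM dbo.[Order] AS o
         WHERE o.CustomerID = c.CustomerID
         FOR XML PATH('order'), TYPE) AS orders
FROM    dbo.Customer AS c
FOR XML PATH('customer'), ROOT('customers');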
The bottleneck right now seems to be #2. XML->JSON conversion for 3 million customer records (each with demographic info and embedded address and order arrays) takes hours with libxslt.
It seems hard to believe that there's not already some pre-built way to do this, but I can't seem to find one anywhere.
Questions:
A) Are there any pre-existing utilities I could use to do this?
B) If no, is there a way I could speed up my process?
C) Am I approaching the whole problem the wrong way?
Another approach is to go through each table and add the information to Mongo on a record-by-record basis and let Mongo do the denormalizing! For instance, to add the phone numbers, just go through the phone number table and do a '$addToSet' of each phone number onto the matching record.
You can also do this in parallel and do tables separately. This may speed things up but may 'fragment' the mongo database more.
You may want to add any required indexes before you start, otherwise adding the indexes at the end may be a large delay.
I have a table containing 110GB in BLOBs in one schema and I want to copy it to another schema to a different table.
I only want to copy one column of the source table, so I am using an UPDATE statement, but it takes 2.5 hours to copy 3 GB of data.
Is there a faster way to do this?
Update:
The code I am using is very simple:
update schema1.A a set blobA = (select blobB from schema2.B b where b.IDB = a.IDA);
IDA and IDB are indexed.
Check whether there are indexes on the destination table that are causing the performance issue. If so, temporarily disable them, then recreate them after the data has been copied from the column in the source table to the column in the destination table.
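For example (the index name is hypothetical, and note this only works for non-unique indexes, since making a unique index unusable blocks DML on the table):

ALTER INDEX schema1.ix_a_somecolumn UNUSABLE;

-- ... run the UPDATE that copies the BLOB column ...

ALTER INDEX schema1.ix_a_somecolumn REBUILD;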
If you are on Oracle 10 or 11, check ADDM to see what is causing the problem. It is probably an I/O or redo log problem.
What kind of disk storage is this? Did you try copying a 110 GB file from one place to another on that disk system? How long does it take?
I don't know whether Oracle automatically grows the database size or not. If it does, then increase the amount of space allocated to the database so that it already exceeds the amount you are about to add, prior to running your query.
I know that in SQL Server, under the default setup, it will automatically allocate an additional 10% of the database size as you start filling it up. When that fills up, it stops everything and allocates another 10%. When running queries that bulk load data, this can seriously slow the query down.
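In Oracle terms, pre-allocating the space could look like the following; the datafile paths, tablespace name, and sizes are all hypothetical:

-- Pre-grow an existing datafile before the bulk copy...
ALTER DATABASE DATAFILE '/u01/oradata/ORCL/users01.dbf' RESIZE 200G;

-- ...or add another datafile with a generous initial size.
ALTER TABLESPACE users ADD DATAFILE '/u01/oradata/ORCL/users02.dbf' SIZE 100G AUTOEXTEND ON NEXT 1G;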
Also, as zendar pointed out, check the disk IO. If it has a high queue length then you may be constrained by how fast the drives work.