It's my first post ever... and I really need help with this one, so anyone who has some knowledge of the subject - please help!
What I need to do is read an XML file into SQL Server data tables. I've been looking all over for solutions to this and have actually found a few. The problem is the size of the XML being loaded: it weighs in at 2GB (and there will be 10GB ones). I have managed to do this, but I saw one particular solution which seems to me to be a great one, and I cannot figure it out.
OK, let's get to the point. Currently I do it this way:
I read the entire XML into a variable using OPENROWSET (this eats up all the RAM...).
Next I use the .nodes() method to pull out the data and fill the tables with it.
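Roughly, my current two-step approach looks like this (just a sketch - the file path, element names and target table here are made up):

    -- Step 1: read the whole file into an XML variable (this is what eats the RAM)
    DECLARE @xml XML;
    SELECT @xml = BulkColumn
    FROM OPENROWSET(BULK 'C:\data\huge_file.xml', SINGLE_BLOB) AS x;

    -- Step 2: shred the XML into a relational table with .nodes()/.value()
    INSERT INTO dbo.TargetTable (Id, Name)
    SELECT r.value('(Id)[1]',   'INT')           AS Id,
           r.value('(Name)[1]', 'NVARCHAR(100)') AS Name
    FROM @xml.nodes('/Root/Record') AS t(r);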
That's a two-step process. I was wondering if I could do it in just one step. I saw that there are things like format files, and there are numerous examples of how to use them to pull data out of flat files or even Excel documents in a record-based manner (instead of sucking the whole thing into a variable), but I CANNOT find any example showing how to read a huge XML file into a table, parsing the data on the fly (based on the format file). Is it even possible? I would really appreciate some help, or guidance on where to find a good example.
Pardon my English - it's been a while since I had to write so much in that language :-)
Thanks in advance!
For very large files, you could use SSIS: Loading XML data into SQL Server 2008
It gives you the flexibility of transforming the XML data, as well as reducing your memory footprint for very large files. Of course, it might be slower compared to using OPENROWSET in BULK mode.
I have a 25GB text file with this structure (headers):
Sample Name Allele1 Allele2 Code metaInfo...
So it's just one table with a few million records. I need to put it into a database because sometimes I need to search the file - for example, for a specific sample - and then get back the whole row exactly as it appears in the file. This would be a basic application. What is important? The file is constant; I don't need any insert functionality because all samples are already finished.
My question is:
Which database would be better in this case, and why? Should I put the file into a SQL database, or would MongoDB be a better idea? I need to learn one of them and I want to pick the best option. Could someone give me some advice? I haven't found anything specific on the internet.
Your question is a bit broad, but assuming your 25GB text file really does have a regular structure, with each line having the same number (and data type) of columns, then you might want to host this data in a SQL relational database. The reason for choosing SQL over a NoSQL solution is that SQL is well suited to data with a well-defined structure. In addition, if you ever need to relate your 25GB table to other tables, SQL has a bunch of tools at its disposal to make that fast, such as indices.
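For illustration, a minimal sketch of what that could look like (SQL Server syntax here; the column names/types are guessed from your header line, and the file path and delimiters are hypothetical):

    CREATE TABLE dbo.Samples (
        Sample   VARCHAR(50),
        Name     VARCHAR(100),
        Allele1  VARCHAR(20),
        Allele2  VARCHAR(20),
        Code     VARCHAR(20),
        MetaInfo VARCHAR(500)
    );

    -- One-off load of the (constant) file
    BULK INSERT dbo.Samples
    FROM 'C:\data\samples.txt'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2);

    -- An index on the column you search by keeps lookups fast even at 25GB
    CREATE INDEX IX_Samples_Sample ON dbo.Samples (Sample);

    SELECT * FROM dbo.Samples WHERE Sample = 'SAMPLE_123';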
Both MySQL and MongoDB are equally good for your use-case, as you only want read-only operations on a single collection/table.
For a comparison, refer to MySQL vs MongoDB 1000 reads.
But I would suggest going with MongoDB because of its aggregation pipeline. Though your current use case is very straightforward, in the future you may need to perform more complex operations, and in that case MongoDB's aggregation pipeline will come in very handy.
OK, so the background to the story: I am largely self-taught in the bits of SQL I do know, and it tends to be just enough to make the things work that need to work - albeit with a fair bit of research for even the most basic jobs!
I am using a piece of software which grabs a string of data and passes it straight to a SQL stored procedure, which moves the data around, performs a few tasks on the string to get it into the format I need, and then takes chunks of this data and places them in various SQL tables as outlined by the SP. I get maybe half a million lines of data each day, and this process works perfectly well and quickly. However, should data be lost, or not make it through to the SQL database correctly, I do still have a log of the 500,000 lines of raw data in CSV format.
I can't seem to find a way to simply bulk import this data into the various tables in the various formats it needs to be in. Assuming there is no way to re-pass this data through the 3rd-party software (I have tried and failed), what is the best (read: easiest for a relative layman) way to send each line of this CSV file through my existing stored procedure, which can then process and import the data as normal? I have looked at the bcp utility, but that didn't seem to be viable (or I am not well enough informed to make it do what I need). I have also done the usual trawling of the web (and these forums) to see if anything jumped out at me as the obvious way forward, but have come up a bit dry.
Apologies if I am asking something a bit 101, but I would certainly be grateful if anyone could help me out with this - if I missed out any salient bits of information, let me know! :)
Cheers.
The SQL Server Import/Export Wizard is a point-and-click solution that can be used to import CSV files into SQL Server.
The wizard builds an SSIS package behind the scenes, which can be saved and scheduled to run as needed. The wizard doesn't give you much in the way of data transformation, but the data could be loaded into a staging table and then processed by your existing stored procedure.
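If you'd rather script it than use the wizard, the same idea works in plain T-SQL - a rough sketch, where the staging table, the CSV path and the stored procedure name are placeholders for whatever yours are actually called:

    -- Load the raw CSV into a one-column staging table
    CREATE TABLE dbo.RawImport (RawLine NVARCHAR(MAX));

    BULK INSERT dbo.RawImport
    FROM 'C:\logs\raw_data.csv'
    WITH (ROWTERMINATOR = '\n', FIELDTERMINATOR = '\0');  -- '\0' keeps each whole line in one column

    -- Feed each line through the existing stored procedure, one row at a time
    DECLARE @line NVARCHAR(MAX);
    DECLARE lines CURSOR LOCAL FAST_FORWARD FOR SELECT RawLine FROM dbo.RawImport;
    OPEN lines;
    FETCH NEXT FROM lines INTO @line;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC dbo.usp_ProcessRawLine @InputString = @line;  -- your existing SP (name is a placeholder)
        FETCH NEXT FROM lines INTO @line;
    END
    CLOSE lines;
    DEALLOCATE lines;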
(postgre/my)sql/php/html/css/javascript vs xml/xsl/xsd/php/css/javascript
Trying to decide whether to go with an XML-document-based app or with SQL. Each XML document would be about 30k; say 2000 files. Essentially a choice between serving up HTML/JavaScript, or serving a 30k XML file (plus XSL/XSD/JavaScript). It involves some financial (i.e. non-floating-point) math, plus substantial data entry one day per week.
The SQL solution would involve fragmenting/reassembling the data using, say, ten separate cross-referenced tables, and tying users into SQL access control systems.
Assuming the XML-based solution really is more straightforward to install/maintain, that handling money as cents-as-integers is okay, and that "other things are equal", my questions are:
1) Is it really a good plan to have the server read/update/save a 30k XML file, say 2000 times over 8 hours once a week, every time data is updated? Or is that just a trivial load?
(That depends on what else the server is doing, I guess, and how fast the internet connection is.)
2) How would that scale compared to an SQL-based solution? What would be the limiting factor?
3) Most importantly: what am I overlooking?
1) Not a good plan. Even if the load is not a problem, you are basically building yourself a database when that is already a solved problem.
2) SQL is going to scale better, based on what you've told us.
3) Consider NoSQL, or XML-based DB solutions like BaseX.
You want to look at your solution architecture: where are the XML files coming from, and how do you get hold of them? You also need to look at the navigation you want. How do users navigate to one specific XML file? That navigational data needs to be available. So, to answer your questions:
It is not a plan at all :-) - it is a tiny fragment of your solution. The load doesn't look big, but you do need to have a look at your metadata.
It might not be an either/or question. All major SQL systems support an XML column data type today: PostgreSQL, MS SQL Server, Oracle, IBM DB2 (including the free community edition). I like DB2 (probably because I work for IBM :-) ).
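For example, in MS SQL Server the whole document can live in an XML column next to ordinary relational metadata (a sketch only - the table, column and element names are invented):

    CREATE TABLE dbo.Documents (
        DocId     INT IDENTITY PRIMARY KEY,
        UpdatedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
        Payload   XML NOT NULL          -- the whole 30k document lives here
    );

    -- The document stays intact but is still queryable via XQuery
    SELECT DocId,
           Payload.value('(/account/owner/name)[1]', 'NVARCHAR(100)') AS OwnerName
    FROM dbo.Documents
    WHERE DocId = 42;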
CouchDB and MongoDB are JSON stores; there are also XML databases, as Karl suggested. Most important: caching, caching, caching! If you build in Java, use the Guava libraries for a cache - once a file has been transformed into the stuff you send down to the browser (using XSLT), cache that with a generous expiry and have your load routine invalidate the cache.
Hope that helps!
I'm receiving big (around 120MB each), nested XML files. The parsing itself is very fast; currently I'm using Nokogiri's SAX parser, which is way faster than a DOM-based one. But I need to check a lot of values back against the database (should a record be updated or not?). Even though I keep database queries as low as possible (eager loading, pure SQL selects), the performance loss is about 40x compared to parsing only. I can't use mass inserts due to the need for validation, checking against existing records, and the many associations involved. The whole process runs in a transaction, which sped things up by around 1.5x. What approach would you take? I'm looking forward to any help! I'm not very skilled in the whole XML thing. Would XSLT help me? I also have an XSD file for the files that arrive.
Thanks in advance!
I ended up rebuilding the associations so that they fit the third-party data better, and now I can use MASS INSERTS (watch out for the max_allowed_packet value!!!). I'm using the sax-machine gem. When most of the basic data is already in the database, I can now process (including the DB work) a 120MB file in about 10 seconds, which is totally fine. Feel free to ask.
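For anyone wondering what the mass inserts look like in practice: instead of one INSERT per record, rows go over in batches - a generic SQL sketch (table and column names are made up), with the batch size kept below max_allowed_packet:

    -- One round trip for many rows instead of one INSERT per row
    INSERT INTO records (external_id, name, value)
    VALUES ('R1', 'foo', 10),
           ('R2', 'bar', 20),
           ('R3', 'baz', 30);
    -- ...a few hundred or thousand rows per statement, keeping the total
    -- statement size below max_allowed_packet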
Our current setup takes XML data and splits it across multiple tables in SQL. The only benefit of this is that reporting is good. However, whenever we want to retrieve data, we have to re-bind the data from hundreds of tables to re-export the XML. Each XML file can be several MB to several GB.
Ironically, we hardly ever run reports, but we retrieve/save the data very often. Because the data is split across/compiled from several tables, both saving and retrieval are not very efficient.
Since the data comes in as XML, I'm considering updating our method and saving the XML as a large BLOB in the table. That would be so simple.
The issue then becomes reporting - without the ability to index BLOBs, I'm wondering what options I would have for running reports as efficiently as possible.
The database is in the hundreds of GBs.
I'm not a DBA (I'm a C# person) - I've just landed in this position at work, so the only way I could think of approaching this would be in C#: build each BLOB as XML and then query the XML data in C#. That, however, seems like it would be very inefficient. Maybe XQuery in SQL is better?! Despite not being a DBA, I'm more than happy with any programming (C#/VB) or SQL suggestions.
You can save the data in a single XML-type column in your database and then access the data via XQuery.
XQuery is, for me personally, a bit fiddly, but I found this list of tips to be a great help:
http://www.jackdonnell.com/?p=266
The advantage is that you only persist one version of the data, so updates and reads are quick, apart from the XML parsing bit (but that may depend on your data volume). Getting the data into the database from C# is straightforward, as you can map your XML to the corresponding SqlDbType.
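As a rough illustration of that pattern in SQL Server (table, column and element names here are invented): the XML lands in one column, a primary XML index helps query performance, and reports shred only the nodes they need:

    CREATE TABLE dbo.Orders (
        OrderId INT PRIMARY KEY,
        Payload XML NOT NULL
    );

    -- A primary XML index speeds up XQuery over large documents
    CREATE PRIMARY XML INDEX PXML_Orders_Payload ON dbo.Orders (Payload);

    -- Reporting query: shred just the nodes the report needs
    SELECT o.OrderId,
           i.value('(qty)[1]',   'INT')           AS Quantity,
           i.value('(price)[1]', 'DECIMAL(10,2)') AS Price
    FROM dbo.Orders AS o
    CROSS APPLY o.Payload.nodes('/order/items/item') AS x(i);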