Get a valid schema of large (1 GB) XML files for SQL Server

I need to bulk load huge XML files into SQL Server 2005. I decided to use SQLXMLBULKLOAD in my C# app, but it needs a valid XSD schema for each XML file it loads. What is the best way to generate the XSD file?
I tried xsd.exe from MS Visual Studio, but it tries to load the whole file into memory, which causes an OutOfMemory exception.
Thanks!

Strip the file down to create a smaller one that is representative of the whole, then generate an XSD from that. You can then tailor the result if necessary.
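If you want to script that trimming, a streaming parser can copy just the first few records without ever holding the whole document in memory. Here is a minimal Java/StAX sketch of the idea; it assumes the file is one root element containing many repeating record elements, and the class name, record limit, and command-line arguments are purely illustrative:

import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;
import java.io.FileInputStream;
import java.io.FileOutputStream;

// Copies only the first N top-level record elements of a huge XML file into a
// smaller sample file, without ever holding the whole document in memory.
public class XmlSampler {
    public static void main(String[] args) throws Exception {
        int maxRecords = 1000;  // how many repeating records to keep (illustrative)
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream(args[0]));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream(args[1]), "UTF-8");

        int depth = 0;           // 1 = inside the root element, 2 = inside a record
        int records = 0;
        boolean skipping = false;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            if (event.isStartElement()) {
                depth++;
                if (depth == 2) {                 // a new top-level record begins
                    if (records < maxRecords) records++;
                    else skipping = true;         // drop this record entirely
                }
            }

            if (!skipping) writer.add(event);     // copy kept events verbatim

            if (event.isEndElement()) {
                if (depth == 2) skipping = false; // record finished
                depth--;
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}

You can then run the small sample through xsd.exe (or any other schema generator) and tailor the output as described above.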

There are quite a few tools to generate schemas from instances, but I don't know how many of them are able to operate in pure streaming mode. One tool which will work regardless of the file size is the DTDGenerator that was originally part of Saxon; you can find it here:
http://saxon.sourceforge.net/dtdgen.html
It produces a DTD rather than a schema, but there are plenty of tools available to convert a DTD to a schema.


How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
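For what it's worth, that plan is essentially how simple container formats work. Here is a minimal, hypothetical Java sketch of such a "wrapper" file: for each input file it writes the name, the size, and then the raw bytes; the packed.bin file name and the helper method are made up for illustration:

import java.io.*;
import java.nio.file.*;

// Packs the files named on the command line into one container file
// ("packed.bin"): for each file it stores the name, the size, and the bytes.
public class SimplePacker {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("packed.bin")))) {
            for (String name : args) {
                byte[] data = Files.readAllBytes(Paths.get(name)); // assumes each file fits in memory
                out.writeUTF(name);        // header: file name
                out.writeInt(data.length); // header: payload size in bytes
                out.write(data);           // payload: the file's raw bytes
            }
        }
    }

    // Reading it back is the mirror image: readUTF(), readInt(), then readFully().
    public static void unpack(String container, String targetDir) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(container)))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                Files.write(Paths.get(targetDir, name), data);
            }
        }
    }
}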
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are essentially ZIP files with a different extension. If you change a file with the .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in SQLite, which gives you a library for handling a single file (e.g. a *.sqlite file) as an SQL database.
You could also use GDBM, a library that gives you an indexed-file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.

Loading sql tables to an xml file using Talend with schema defined at runtime

I'm a beginner at Talend, and I'm trying to load a database into an XML file; this must be done automatically, so I don't have to specify any schema for the XML file; everything must be generated, because I'll have to use that XML file in other jobs. Is that possible using Talend, and how can I do it?
Thank you for your answers.
This is not possible, by the very design of Talend: every schema (db, XML, delimited files...) must be defined at compile time; it cannot be detected at runtime. You could try a complete Java solution using a user routine and some custom code (a rough sketch follows), but this moves you to an entirely Java-based solution outside Talend's scope, and a very inelegant and time-consuming one, in my opinion. If that is your situation, you should probably redesign your process.
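To illustrate that pure-Java route: the sketch below discovers the column layout from the JDBC result-set metadata at runtime and writes one XML element per column, so nothing is declared up front. The connection URL, credentials, query, and output path are placeholder assumptions, and column names are assumed to be valid XML element names:

import java.io.FileOutputStream;
import java.sql.*;
import javax.xml.stream.*;

// Dumps an arbitrary table to XML, using the ResultSet metadata discovered at
// runtime as the element names, so no schema has to be declared up front.
public class TableToXml {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "password");   // placeholder connection
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM my_table")) { // placeholder query

            ResultSetMetaData md = rs.getMetaData();
            XMLStreamWriter xml = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(new FileOutputStream("out.xml"), "UTF-8");

            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("rows");
            while (rs.next()) {
                xml.writeStartElement("row");
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    xml.writeStartElement(md.getColumnLabel(i)); // column name becomes the tag
                    String value = rs.getString(i);
                    xml.writeCharacters(value == null ? "" : value);
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        }
    }
}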

How to compare and find the differences between two XML files in cocoa?

This is a bit of a two-part question about working with 40 MB XML files.
• What’s a reasonable size to store in memory for a program running continually in the background?
• How to find what has changed in an XML file.
So on the first read the XML is loaded into NSData, then uploaded to the server.
Now instead of uploading a 40 MB XML file every time it changes, I would prefer to upload a "delta" file containing only what has changed. The program would monitor the file for changes and activate when it has been modified. From what I can see, I would need to parse an old version of the XML file and the modified XML file, then compare them? Is it unreasonable to store 80 MB in memory like this every time the file is modified? I'm assuming this has to be done with a DOM parser, because I can't see how you could compare two files like that with a SAX parser, since it only ever has part of the file in memory.
I'm a newbie at this so any help would be appreciated!
To compare two files:
There are several ways to do this (since whole files have to be considered, I may not be correct):
sdiff file1.xml file2.xml (a Unix command)
You can run this command from AppleScript.
-[NSFileManager contentsEqualAtPath:andPath:]
This method checks whether the two files at the given paths are the same file, then compares their sizes, and finally compares their contents.
As for the other part:
Is there a size that's considered acceptable for a background process? I don't think there is a fixed one; it matters more at the application level. You can save the data into temporary files. Even Safari uses 130+ MB, as you can easily check in Activity Monitor.
NSXMLParser ended up being the most useful for this
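As a language-neutral illustration of the streaming idea raised in the question (shown here in Java/StAX rather than Cocoa), two documents can be walked in parallel and compared event by event without building a DOM for either. This only reports where they first diverge, it does not produce a delta, and the file names are placeholders:

import java.io.FileInputStream;
import javax.xml.stream.*;

// Walks two XML files in parallel with streaming parsers and reports the first
// point at which their element structure or text content differs.
public class XmlStreamCompare {
    public static void main(String[] args) throws Exception {
        XMLInputFactory f = XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE); // merge adjacent text nodes
        XMLStreamReader a = f.createXMLStreamReader(new FileInputStream("old.xml"));
        XMLStreamReader b = f.createXMLStreamReader(new FileInputStream("new.xml"));

        while (a.hasNext() && b.hasNext()) {
            int ea = a.next();
            int eb = b.next();
            if (ea != eb) {
                System.out.println("Different event types at " + a.getLocation());
                return;
            }
            if (ea == XMLStreamConstants.START_ELEMENT
                    && !a.getLocalName().equals(b.getLocalName())) {
                System.out.println("Different elements at " + a.getLocation());
                return;
            }
            if (ea == XMLStreamConstants.CHARACTERS
                    && !a.getText().equals(b.getText())) {
                System.out.println("Different text at " + a.getLocation());
                return;
            }
        }
        System.out.println(a.hasNext() == b.hasNext()
                ? "No structural differences found"
                : "One document has extra content at the end");
    }
}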

Exporting hundreds of thousands of records with ColdFusion

Using ColdFusion 9.0.1, I need to export hundreds of thousands of database records to Excel XLSX or CSV (XLSX is preferred). This must be done on demand. So far I've tried using cfspreadsheet but it chokes when exporting more than a couple thousand rows in the XLSX format. However, exporting to XLS works fine (of course there is a ~65,000 row limit).
What are my options to export so many records? Theoretically the users could need to export as many as one million records. I'm also using SQL Server 2008 R2 -- is there a way to somehow export the records to a file there and then send the file through CF to the user? What options do I have? Thanks.
Since you are using SQL Server 2008, you could take advantage of SQL Server Reporting Services (SSRS) and create a report that can be called via web service (or HTTP GET/POST) by ColdFusion. SSRS has the capability to export reports as Excel as well. You'll need to read up on SSRS to make this work, but it's fairly easy to do.
As you've discovered, doing this with ColdFusion's <cfspreadsheet/> tag fails because it builds the entire document in memory, which leads to JVM OutOfMemory errors. What you need is something that buffers output to disk so you don't run out of memory. This suggests CSV, which is far easier to buffer. I imagine there are ways to do it with Excel as well, but I don't know them.
So two options for you:
use a Java library
use ColdFusion's fileOpen(), fileWrite(), fileClose() methods
I'll cover each in turn.
Java Library
opencsv is my preference. This assumes, of course, that you know how to set up a .jar on the ColdFusion classpath. If you do, then it's a matter of using its APIs to open a file and write the data for each line. It's really quite simple. Check its docs for examples.
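A rough Java-side sketch of that option with opencsv (the JDBC URL, query, and output path are illustrative; the CSVWriter package is au.com.bytecode.opencsv in older releases and com.opencsv in current ones):

import au.com.bytecode.opencsv.CSVWriter; // com.opencsv.CSVWriter in newer opencsv releases
import java.io.FileWriter;
import java.sql.*;

// Streams a large result set to a CSV file one row at a time, so the whole
// export never has to fit in memory; opencsv handles quoting and escaping.
public class CsvExport {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=mydb", "user", "password"); // placeholder
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM big_table")) {             // placeholder

            CSVWriter csv = new CSVWriter(new FileWriter("export.csv"));
            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();

            String[] header = new String[cols];
            for (int i = 1; i <= cols; i++) header[i - 1] = md.getColumnLabel(i);
            csv.writeNext(header);                 // header row

            String[] row = new String[cols];
            while (rs.next()) {
                for (int i = 1; i <= cols; i++) row[i - 1] = rs.getString(i);
                csv.writeNext(row);                // one data row at a time
            }
            csv.close();
        }
    }
}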
ColdFusion Methods
Be forewarned there be dragons here.
If you are exporting numbers or strings that do not contain any double quotes or commas you can probably do this. If not, figuring out what to escape and how is why you use a library in the first place. Code is roughly as such:
<!--- query to get whatever data you're working with --->
<cfset csvFile = fileOpen(filePath, 'write')><!--- open the file for writing, not reading --->
<cfloop query="yourQuery">
<cfset csvRow = ""><!--- construct a csv row here from the current query row --->
<cfset fileWrite(csvFile, csvRow & chr(10))><!--- append a newline after each row --->
</cfloop>
<cfset fileClose(csvFile)>
If the query data you're working with is also large you may be dealing with a nested loop to chunk it out.
Dustin, I had to investigate this myself, and as of this writing (Summer 2011), POI does a fine job of generating large files, but you have to use XLSX. The 3.8 beta source ships with an example named "BigGridDemo" which generates a 100K-row, 4-column workbook very quickly. I modified it to generate a 300K-row, 125-column sheet, and it handled that in about 2 minutes. It created a 1.6 GB, 3.6-million-row workbook in a little over half an hour.
Granted, the code isn't the prettiest to look at, but it works. I suspect it'll pretty up a bit when ported to ColdFusion.
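For reference, the BigGridDemo technique was later folded into POI proper as the SXSSF streaming API. A minimal sketch, assuming POI 3.8 or later on the classpath, with illustrative row/column counts and output path:

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

// Writes a large .xlsx file while keeping only a small window of rows in memory;
// completed rows are flushed to temporary files on disk.
public class BigXlsxExport {
    public static void main(String[] args) throws Exception {
        SXSSFWorkbook wb = new SXSSFWorkbook(100);   // keep at most 100 rows in memory
        Sheet sheet = wb.createSheet("export");

        for (int r = 0; r < 300000; r++) {           // illustrative row count
            Row row = sheet.createRow(r);
            for (int c = 0; c < 10; c++) {           // illustrative column count
                Cell cell = row.createCell(c);
                cell.setCellValue("row " + r + " col " + c);
            }
        }

        try (FileOutputStream out = new FileOutputStream("big-export.xlsx")) {
            wb.write(out);
        }
        wb.dispose();                                 // delete the temporary files
    }
}

Because SXSSF flushes completed rows to disk as it goes, memory use stays bounded no matter how many rows you write.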

How can I create a PDF file in classic ASP?

Is there any way to generate PDF files from classic ASP? I have a bunch of user-entered data that needs to be turned into a PDF that the user can download. How can I do this? OpenOffice allows exporting documents to PDF, so could this somehow be leveraged?
I played around a bit with this (Persits ASPPDF): http://www.asppdf.com/
Maybe run an external application that uses Crystal Reports... and you just pass the data to it as XML?
That's how I would do it... (lazy mode)
See a full list of PDF components here: http://www.aspin.com/home/components/document/pdf Many of them are free.
It is also possible to use XSLT to output PDF, but I am not sure whether this is supported by the Microsoft XML parser. I remember there was something stopping me when I tried to do this 3-4 years ago. It might be worth checking out now, depending on the type of data you have as the source.
However, if these are static files or a one-time job, consider using a PDF converter on your own computer and just uploading the files to the server. There are heaps of tools for this, including Adobe Acrobat.