I have an XML file of around 2 MB (yes, a 2 MB small file). I want to sort the file into a predetermined format and show the formatted result. As of now the whole process takes 2-3 seconds, and we want to cut down on that time.
My questions are:
(a) Is there any way to push XML directly into BigQuery instead of CSV?
(b) I want to do this in real time, so how do I push data from my website and get the results back on my website? (Do you think the command line would do the trick?)
(c) I am working in .NET.
I don't think you can push XML directly into BigQuery. The documentation doesn't explicitly say "You cannot import XML," but the fact that it only explains how to use CSV makes that pretty clear.
This doesn't sound like a perfect use case for BigQuery. BigQuery is great for huge data volumes, but you have small data (as you noted). Wouldn't it be quicker to just sort your XML in memory without pushing it anywhere else?
I have about thirty thousand binary records, all compressed using GZip, and I need to search the contents of each document for a specified keyword. Currently, I download and extract all documents at startup. This works well enough, but I expect to add another ten thousand each year. Ideally, I would like to perform a SELECT statement against the binary column itself, but I have no idea how to go about it, or whether it is even possible. I would like to do this with the least possible amount of data leaving the server. Any help would be appreciated.
EDIT: The SQL records themselves are not compressed. What I mean is that I'm compressing the data locally and uploading the compressed files into a SQL Server column of the binary data type. I'm looking for a way to query that compressed data without downloading and decompressing each and every document. The data was stored this way to minimize overhead and reduce transfer cost, but it must also be queryable. It looks like I may have to store two versions of the data on the server: one compressed, to be downloaded by the user, and one decompressed, to allow search operations. Is there a more efficient approach?
SQL Server has a Full-Text Search feature. It will not work on data that you compressed in your application, of course; you would have to store it as plain text in the database. But it is designed specifically for this kind of search, so performance should be good.
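For example, here is a minimal sketch; the table, column, catalog, and index names are made up for illustration:

    CREATE FULLTEXT CATALOG DocumentCatalog;

    -- dbo.Documents(DocumentId, DocumentText) is a hypothetical table holding the plain-text copy;
    -- PK_Documents is the name of its primary-key index.
    CREATE FULLTEXT INDEX ON dbo.Documents (DocumentText)
        KEY INDEX PK_Documents
        ON DocumentCatalog;

    -- The keyword search then runs entirely on the server:
    SELECT DocumentId
    FROM dbo.Documents
    WHERE CONTAINS(DocumentText, 'keyword');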
SQL Server can also compress the data in rows or in pages, but this feature is not available in every edition of SQL Server. For more information, see Features Supported by the Editions of SQL Server. You have to measure the impact of compression on your queries.
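If your edition supports it, enabling compression is a short change per table (the table name below is hypothetical), and sp_estimate_data_compression_savings can estimate the benefit before you commit to it:

    -- Estimate first, then enable PAGE (or ROW) compression by rebuilding the table
    EXEC sp_estimate_data_compression_savings
        @schema_name = 'dbo', @object_name = 'Documents',
        @index_id = NULL, @partition_number = NULL, @data_compression = 'PAGE';

    ALTER TABLE dbo.Documents REBUILD WITH (DATA_COMPRESSION = PAGE);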
Another possibility is to write your own CLR functions that run on the server: load the compressed binary column, decompress it, and do the search. Most likely the performance would be worse than with the built-in features.
Taking your updated question into account:
I think your idea to store two versions of the data is good.
Store the compressed binary data for efficient transfer to and from the server.
Store a secondary copy of the data in an uncompressed format with proper indexes (consider full-text indexes) for efficient search by keyword.
Consider using a CLR function to help during inserts: you can transfer only the compressed data to the server, then call a CLR function that decompresses it on the server and populates the secondary table with uncompressed data and indexes.
Thus you'll have both efficient storage/retrieval and efficient searches, at the expense of extra storage on the server. You can think of that extra storage as an index-like structure that helps with searches.
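A rough sketch of that layout (all names are made up; a full-text index like the one shown earlier would go on the PlainText column):

    -- Compressed payload: this is what clients upload and download
    CREATE TABLE dbo.DocumentBlob (
        DocumentId     INT            NOT NULL PRIMARY KEY,
        CompressedData VARBINARY(MAX) NOT NULL
    );

    -- Uncompressed shadow copy: populated at insert time (e.g. by a CLR procedure that
    -- decompresses on the server) and used only for keyword searches
    CREATE TABLE dbo.DocumentText (
        DocumentId INT           NOT NULL PRIMARY KEY
            REFERENCES dbo.DocumentBlob (DocumentId),
        PlainText  NVARCHAR(MAX) NOT NULL
    );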
Why compress 30,000 or 40,000 records? That does not sound like a whole lot of data, though of course it depends on the average size of a record.
For keyword searching, you should not compress the database records. To save disk space, most operating systems can compress data at the file level without SQL Server even noticing.
Update:
As Vladimir pointed out, SQL Server does not run on a compressed file system. In that case you could store the data in TWO columns: one uncompressed, for keyword searching, and one compressed, for improved data transfer.
Storing data in a separate searchable column is not uncommon. For example, if you want to search on a combination of fields, you might as well store that combination in a search column and index that column to speed up searching. In your case, you might store the data in the search column all lower-cased and with accented characters converted to ASCII, and add an index, to speed up case-insensitive searching on ASCII keywords.
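As a hedged sketch of such a search column (all names invented; this only works for reasonably short fields, because index keys cannot be arbitrarily long):

    -- Persisted, normalised copy of a couple of short fields, with its own index
    ALTER TABLE dbo.Documents
        ADD SearchName AS LOWER(Title + N' ' + Author) PERSISTED;

    CREATE INDEX IX_Documents_SearchName ON dbo.Documents (SearchName);

    -- Case-insensitive prefix search against the normalised column
    -- (an accent-insensitive collation such as Latin1_General_CI_AI is another option)
    DECLARE @keyword NVARCHAR(100) = N'Resume';
    SELECT DocumentId
    FROM dbo.Documents
    WHERE SearchName LIKE LOWER(@keyword) + N'%';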
In fact, Vladimir already suggested this.
My question is with respect to a LabVIEW VI (2013) that I am trying to modify. (I am only just learning to use this language; I have searched the NI site and Stack Overflow for help without success, so I suspect I am using the wrong keywords.)
My VI consists of a flat sequence, one pane of which contains a while loop where integer data is collected from a device and displayed on a graph.
I would like to be able to buffer this data and then send it to disk once a preset number of samples has been collected. My attempts so far result in only the last record being saved.
Specifically, I need to know how to save the data in a buffer (an array) and then, when the correct number of samples has been captured, save it all to disk (saving as it is captured slows the process down too much).
Hope the question is clear and thanks very much in advance for any suggestions.
Tom
Below is a simple circular buffer that holds the most recent 100 readings. Each time the buffer is refilled, its contents are written to a text file. Drag the image onto a VI's block diagram to try it out.
As you learn more about LabVIEW and as your performance and multi-threaded needs increase, consider reading about some of the LabVIEW design patterns mentioned in the other answers:
State machine: http://www.ni.com/tutorial/7595/en/
Producer-consumer: http://www.ni.com/white-paper/3023/en/
I'd suggest splitting the data acquisition and the data saving into two different loops using a producer/consumer design pattern.
Moreover, if you need very high throughput, consider using the TDMS file format.
Have a look here for an overview: http://www.ni.com/white-paper/3727/en/
A screenshot would definitely help. However, some things are clear:
Unless you are dealing with a very high volume of data, very slow hard drives, or other unusual requirements, open the file before your while loop, write to it every time you acquire a sample (leaving buffering to the OS), and close it afterwards.
If you decide you need to manage buffering on your own, you can use queues. See this example: https://decibel.ni.com/content/docs/DOC-14804 for reference (they stream data from disk, buffering it in the queue, but it is the same idea)
My VI consists of a flat sequence one pane of which
Substitute the flat sequence with a finite state machine (e.g. http://forums.ni.com/t5/LabVIEW/Ending-a-Flat-Sequence-Inside-a-case-structure/td-p/3170025).
I was hoping for some advice on best practice here. I am working in Objective-C and Xcode.
I made a "FileConverter" class which has a method to reads a cvs file with 7 columns of float values into a SQLite database (after verifying the data and parsing it).
The way I have done this is to load the whole file into an NSString, then split into row components, then split each row into column components (saving the result as a 2x2 NSArray.
I then open database and copy the array into the sqlite database. I'm using the TEXT datatype for storage at the moment. Once there, I plan to graph the data.
It seems to check and convert the CSV OK. However, if the CSV is quite long (say 10,000 rows), I get the spinning wheel for several seconds while it does its work. For shorter files it converts almost instantly.
Ultimately, at the point where the user clicks "Convert CSV", I will also be running another method which will graph the data, and I expect this will result in a huge delay while it reads the SQL database, assembles the data into CGPoints, and then draws into the graph view.
My question is about how best to optimise the process, so it can handle the larger files without spinning wheels appearing. Is this possible?
a) Using NSStrings and NSArrays certainly makes the job of reading and splitting up the data super simple and makes verifying the data easy. Is this the best way? Should I malloc a float array instead?
b) I'm working on the basis that by saving the data as TEXT values in the database, converting them to CGFloat values will be straightforward, but I realise this will add processing time.
c) I'm imagining that an sqlite3 database would be a faster way of getting the data when I come to graph it, but I could also simply copy the CSV file and parse it at the point of graphing the data. (A rough sketch of the table I have in mind is below.)
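To make (b) and (c) concrete, the column names below are just placeholders; at the moment every value column is TEXT, though presumably REAL columns would avoid the string-to-float conversion later:

    CREATE TABLE readings (
        row_num INTEGER PRIMARY KEY,   -- row order from the CSV
        c1 REAL, c2 REAL, c3 REAL, c4 REAL,
        c5 REAL, c6 REAL, c7 REAL
    );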
Really appreciate advice on this
Your main thread runs the UI, and you're doing your CSV computation on that same thread, which makes the UI hang (stop updating) and causes the OS to show the spinning wheel. To avoid this, you need to move your compute-intensive operations to a secondary thread.
When the CSV file is small, a synchronous operation can perform the CSV computation and update the UI almost immediately. With big files, the UI (app window) has to wait for the computation to finish before it can be updated. In such cases you need to perform the CSV computation asynchronously on a background thread. There are many ways to achieve this, the most popular being Apple's GCD.
The following link, Apple's guide to a non-blocking UI, explains this in detail:
Link to Apple's non blocking guide
Link to GCD
I have some large flat files of data (200 GB is normal) that I would like to store in some kind of database so that they can be accessed quickly and in the intuitive way the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (in samples) and can be thought of as a row. One of these files normally has about 100,000 recordings, each 2,000,000 samples long.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be to store the files on disk but create a DB-based index. Most filesystems get cranky or slow when you have more than ~10k files in a single folder/directory. Your application can generate the file name and store the metadata in the DB, then organize the files by the generated name on disk. The downside is that the file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins and a sophisticated partitioning/incremental backup scheme. Also, seeks within a file become much simpler operations (skip ahead, rewind, etc.); there is generally better support for these operations in a file system than in a DB.
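A minimal sketch of such a DB-based index (all names invented; the sample bytes themselves stay on disk under the generated file name):

    CREATE TABLE recording (
        recording_id    BIGINT       NOT NULL PRIMARY KEY,
        file_name       VARCHAR(64)  NOT NULL,  -- generated name, e.g. a GUID, mapping to the file on disk
        collection_date DATE         NULL,
        sample_rate     INT          NULL,
        owner_name      VARCHAR(128) NULL
    );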
I wonder what makes you think that an RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle, for example) allow direct access to parts of LOB data without loading the full LOB, if you just know the offset and length you want. So you could have a table with some searchable metadata and then the LOB column and, if needed, an additional metadata table describing the LOB contents, so that you'd have some kind of keyword -> (offset, length) relation available for partial loading of LOBs.
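For example, a hedged Oracle sketch (table and column names are made up; with 1-byte samples the sample number maps directly to a byte offset):

    -- Read a 1000-byte slice of one recording's BLOB: amount first, then the 1-based offset
    SELECT DBMS_LOB.SUBSTR(r.samples, 1000, 1001)
    FROM   recordings r
    WHERE  r.recording_id = :id;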
Somewhat echoing another post here, incremental backups (which you might wish to have here) are not quite feasible with databases (OK, they can be possible, but in my experience they tend to have a nasty price tag attached).
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If it is 2 million samples that make up 200 GB, then each sample is ~100 KB and each recording is ~2 MB (to have 100,000 recordings per file, that is 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading only a certain range into memory: if you have indexed the sample ids, then you can very quickly query for only the subset you want, loading only that range into memory from the DB query result.
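A hedged sketch of that kind of query, assuming each sample (or fixed-size chunk of samples) is stored as its own row rather than inside one large BLOB, with a composite index on (recording_id, sample_id):

    SELECT sample_id, sample_value
    FROM   samples
    WHERE  recording_id = :recording_id
      AND  sample_id BETWEEN 1000 AND 2000
    ORDER BY sample_id;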
I think Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.
Have a read on TechNet for more depth: (http://technet.microsoft.com/en-us/library/bb933993.aspx).
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
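A minimal sketch of such a table (assumes FILESTREAM is enabled on the instance and the database has a FILESTREAM filegroup; names are made up):

    CREATE TABLE dbo.Recording (
        RecordingId    UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        CollectionDate DATETIME       NULL,
        SampleRate     INT            NULL,
        Samples        VARBINARY(MAX) FILESTREAM NULL  -- bytes live in NTFS, managed by the SQL engine
    );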
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)
I was wondering if anyone has experience with what I am about to embark on. I have several CSV files, each around a GB in size, and I need to load them into an Oracle database. While most of my work after loading will be read-only, I will need to load updates from time to time. Basically I just need a good tool for loading several rows of data at a time into my database.
Here is what I have found so far:
I could use SQL*Loader to do a lot of the work.
I could use Bulk-Insert commands
Some sort of batch insert.
Using prepared statements somehow might be a good idea. I guess I was wondering what everyone thinks is the fastest way to get this insert done. Any tips?
I would be very surprised if you could roll your own utility that will outperform SQL*Loader Direct Path Loads. Oracle built this utility for exactly this purpose - the likelihood of building something more efficient is practically nil. There is also the Parallel Direct Path Load, which allows you to have multiple direct path load processes running concurrently.
From the manual:
Instead of filling a bind array buffer and passing it to the Oracle database with a SQL INSERT statement, a direct path load uses the direct path API to pass the data to be loaded to the load engine in the server. The load engine builds a column array structure from the data passed to it.
The direct path load engine uses the column array structure to format Oracle data blocks and build index keys. The newly formatted database blocks are written directly to the database (multiple blocks per I/O request using asynchronous writes if the host platform supports asynchronous I/O).
Internally, multiple buffers are used for the formatted blocks. While one buffer is being filled, one or more buffers are being written if asynchronous I/O is available on the host platform. Overlapping computation with I/O increases load performance.
There are cases where Direct Path Load cannot be used.
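For reference, a minimal sketch of what a direct path load looks like in practice (the control file contents and all names are made up); several such sessions can be run against the same table for a parallel direct path load:

    -- load.ctl: a SQL*Loader control file (APPEND is required for parallel direct path loads)
    LOAD DATA
    INFILE 'chunk1.csv'
    APPEND
    INTO TABLE staging_rows
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    (col1, col2, col3)
    -- invoked once per chunk with something like:
    --   sqlldr userid=scott/tiger control=load.ctl direct=true parallel=true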
With that amount of data, you'd better be sure of your backing store, i.e. the free space on the disks holding the .dbf files.
sqlldr is script-driven and very efficient, generally more efficient than a SQL script.
The only thing I wonder about is the magnitude of the data. I personally would consider running several to many sqlldr processes, assigning each one a subset of the data and letting them run in parallel.
You said you wanted to load a few records at a time? That may take a lot longer than you think. Did you mean a few files at a time?
You may be able to create an external table on the CSV files and load them by SELECTing from the external table into another table. I am not sure whether this method will be quicker, but it may well be less hassle than getting SQL*Loader to work, especially when you have criteria for UPDATEs.
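A hedged sketch of that approach (the directory object, file, table, and column names are all made up):

    CREATE OR REPLACE DIRECTORY csv_dir AS '/data/csv';

    CREATE TABLE staging_ext (
        col1 NUMBER,
        col2 VARCHAR2(100),
        col3 VARCHAR2(100)
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY csv_dir
        ACCESS PARAMETERS (
            RECORDS DELIMITED BY NEWLINE
            FIELDS TERMINATED BY ','
        )
        LOCATION ('file1.csv')
    );

    -- Initial load; for the periodic updates, a MERGE from the external table works the same way
    INSERT /*+ APPEND */ INTO target_table
    SELECT col1, col2, col3 FROM staging_ext;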