Using a ResultSet to select 10M records and dump onto HDFS - sql

In one program I did something like:
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sql);
List<Object> l = new ArrayList<>();
while (rs.next()) {
    l.add(rs.getObject(xxx));
}
fileSys.write(l);
The SQL returns 10M records, and this function takes 2 hours to finish; the loop takes most of the time. I'm wondering, is there a better way to do this? Is it possible to use multithreading?

If you have to move a huge amount of data (gigabytes) over the network from one system (SQL) to another (HDFS), then threading in the process driving the copy (your code) will not help much.
1 You can try to run your code on a server with better network throughput.
2 You can try to copy only the records that have changed since the last run (the ones HDFS does not have yet). This would improve performance in the long run: the first run would be slow, but the second and third runs would be faster because you only move what has changed. UTC timestamps and a "changed since" filter are the basic concepts; a minimal sketch follows below.
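For illustration, a minimal JDBC sketch of that incremental approach, assuming a hypothetical last_modified column stored in UTC and a checkpoint persisted between runs (source_table, the column names and writeToHdfs are assumptions, not from the original post):

import java.sql.*;

// Sketch only: source_table, last_modified and writeToHdfs are illustrative assumptions.
public class IncrementalCopy {
    public static void copyChangedRows(Connection con, Timestamp lastRunUtc) throws SQLException {
        String sql = "SELECT id, payload, last_modified FROM source_table WHERE last_modified > ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setFetchSize(10_000);        // hint the driver to stream rows instead of buffering 10M objects
            ps.setTimestamp(1, lastRunUtc); // only rows changed since the last successful copy
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    writeToHdfs(rs.getLong("id"), rs.getString("payload")); // append as you go
                }
            }
        }
        // After a successful run, persist the new checkpoint (e.g. the max last_modified seen).
    }

    private static void writeToHdfs(long id, String payload) {
        // Placeholder for the HDFS writer used in the original program.
    }
}

Writing each row straight to the HDFS stream instead of collecting 10M objects in a List also keeps the Java heap flat.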

Related

out of memory sql execution

I have the following script:
SELECT
DEPT.F03 AS F03, DEPT.F238 AS F238, SDP.F04 AS F04, SDP.F1022 AS F1022,
CAT.F17 AS F17, CAT.F1023 AS F1023, CAT.F1946 AS F1946
FROM
DEPT_TAB DEPT
LEFT OUTER JOIN
SDP_TAB SDP ON SDP.F03 = DEPT.F03,
CAT_TAB CAT
ORDER BY
DEPT.F03
The tables are huge. When I execute the script in SQL Server directly it takes around 4 minutes, but when I run it in the third-party program (SMS LOC, based on Delphi) it gives me the error
<msg> out of memory</msg> <sql> the code </sql>
Is there any way I can lighten the script so it can be executed? Or has anyone had the same problem and solved it somehow?
I remember having had to resort to the ROBUST PLAN query hint once on a query where the query-optimizer kind of lost track and tried to work it out in a way that the hardware couldn't handle.
=> http://technet.microsoft.com/en-us/library/ms181714.aspx
But I'm not sure I understand why it would work for one 'technology' and not another.
Then again, the error message might not be from SQL but rather from the 3rd-party program that gathers the output and does so in a 'less than ideal' way.
Consider adding paging to the user edit screen and the underlying data call. The point is that you don't need to see all the rows at one time, but they remain available to the user on request; a sketch of a paged query follows below.
This will alleviate much of your performance problem.
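For illustration only, here is what a paged version of the original query could look like using OFFSET/FETCH (SQL Server 2012 or later), sketched via JDBC; the page size, the wrapper code, and the use of JDBC rather than the asker's Delphi front end are all assumptions:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Illustration only: fetch one page of the original query at a time.
public class PagedQuery {
    public static List<Object[]> fetchPage(Connection con, int pageNumber, int pageSize) throws SQLException {
        String paged =
            "SELECT DEPT.F03, DEPT.F238, SDP.F04, SDP.F1022, CAT.F17, CAT.F1023, CAT.F1946 " +
            "FROM DEPT_TAB DEPT " +
            "LEFT OUTER JOIN SDP_TAB SDP ON SDP.F03 = DEPT.F03, CAT_TAB CAT " +
            "ORDER BY DEPT.F03 " +
            "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        List<Object[]> rows = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(paged)) {
            ps.setInt(1, pageNumber * pageSize); // rows to skip
            ps.setInt(2, pageSize);              // rows in this page
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new Object[] {
                        rs.getObject("F03"), rs.getObject("F238"), rs.getObject("F04"),
                        rs.getObject("F1022"), rs.getObject("F17"), rs.getObject("F1023"),
                        rs.getObject("F1946")
                    });
                }
            }
        }
        return rows;
    }
}

Only one page of rows reaches the client at a time, which is the part that matters for the out-of-memory error; the server still has to evaluate the full ORDER BY for each page.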
I had a project where I had to add over 7 million individual lines of T-SQL code via batch (I couldn't figure out how to programmatically leverage the new SEQUENCE command). The problem was that there was a limited amount of memory available on my VM (I was allocated the maximum amount of memory for that VM). Because of the large number of lines of T-SQL code, I first had to test how many lines it could take before the server crashed. For whatever reason, SQL Server (2012) doesn't release the memory it uses for large batch jobs such as mine (we're talking around 12 GB of memory), so I had to reboot the server every million or so lines. This is what you may have to do if resources are limited for your project.

How much time is a process going to take - by WCF Service

I’m here with another question this time.
I have an application built to move data from one database to another. It also deals with validation and comparison between the databases. When we start moving the data from source to destination it takes a while, as it always deals with thousands of records. We use a WCF service and SQL Server on the server side and WPF on the client side to handle this.
Now I have a requirement to notify the user of the time the move is going to take, based on the number of records in the source database (eventually that is what I'm going to create in the destination database), right before the user starts the movement process.
Now my real question: what is the best way to do this and get an estimated time out of it?
Thanks, and your help is appreciated.
If your estimates are going to be updated during the upload process, you can take the time already spent, divide it by the number of processed records, and multiply by the number of remaining records. This gives you a continuously updated estimate of the remaining time:
TimeSpan spent = DateTime.Now - startTime;
TimeSpan remaining = TimeSpan.FromTicks(spent.Ticks / numberOfProcessedRecords * numberOfRemainingRecords);

Need for long and dynamic select query/view sqlite

I have a need to generate a long SELECT query with potentially thousands of WHERE conditions like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND ...
I initially started building a class to make this more bearable, but have since stopped to wonder if this will work well. This query is going to be hammering a table of potentially 10s of millions of rows joined with 2 more tables with thousands of rows.
A number of concerns stem from this:
1.) I wanted to use these statements to generate a temp view so I could easily carry over the existing code base. The point is that I want to filter the data I have down for analysis based on parameters selected in a GUI, so how poorly will a view do in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier, other than string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory and then just write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally as the user scrolls, so waiting to draw while calling query.next() 100,000 times might cause an annoying delay. Ideally I don't want to wait more than 30 ms at a time for the next row that satisfies everything.
edit:
New issue: it has come to my attention that sqlite3 is limited to 999 bind values (host parameters), a limit set at compile time.
So it seems as if the only way to accomplish what I had originally intended is to
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slow parsing all of that inside sqlite3 will be)
or
2.) Do the blanket query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (including updating the index variable and re-querying repeatedly)
I did end up writing a wrapper around the QSqlQuery object that walks the table using index > variable and LIMIT; a sketch of the idea is below.
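For illustration, the same "walk the table" idea sketched with JDBC against SQLite (the asker used QSqlQuery in Qt; the table name, column names and chunk size here are assumptions):

import java.sql.*;

// Keyset-style walk: fetch one chunk at a time, remembering the last index seen.
public class TableWalker {
    public static void walk(Connection con) throws SQLException {
        final int chunk = 10_000;   // rows per query, safely under any bind limit
        long lastIndex = -1;
        String sql = "SELECT idx, value FROM big_table WHERE idx > ? ORDER BY idx LIMIT ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            while (true) {
                ps.setLong(1, lastIndex);
                ps.setInt(2, chunk);
                int seen = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastIndex = rs.getLong("idx");   // remember where this chunk ended
                        // apply the in-memory hash-set filters to the row here
                        seen++;
                    }
                }
                if (seen < chunk) break;                 // a short chunk means the table is exhausted
            }
        }
    }
}

Because the filtering happens in application code against the in-memory hash sets, each individual query stays tiny and never approaches the 999-parameter limit.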
Consider dumping the joined results without filters (denormalized) into a flat file and index it with Fastbit, a bitmap index engine.

Sending huge vector to a Database in R

Good afternoon,
After computing a rather large vector (a bit shorter than 2^20 elements), I have to store the result in a database.
The script takes about 4 hours to execute with a simple code such as :
#Do the processing
myVector<-processData(myData)
#Sends every thing to the database
lapply(myVector,sendToDB)
What do you think is the most efficient way to do this?
I thought about using the same query to insert multiple records (multiple inserts), but it simply comes back to "chunking" the data.
Is there any vectorized function to send that into a database?
Interestingly, the code takes a huge amount of time before starting to process the first element of the vector. That is, if I place a browser() call inside sendToDB, it takes 20 minutes before it is reached for the first time (and I mean 20 minutes without taking into account the previous line that processes the data). So I was wondering what R is doing during this time?
Is there another way to do such an operation in R that I might have missed (parallel processing, maybe)?
Thanks!
PS: here is a skeleton of the sendToDB function:
sendToDB <- function(id, data) {
  channel <- odbcChannel(...)
  query <- paste("INSERT INTO history VALUE(", id, ",\"", data, "\")", sep = "")
  sqlQuery(channel, query)
  odbcClose(channel)
}
That's the idea.
UPDATE
I am at the moment trying out the LOAD DATA INFILE command.
I still have no idea why it takes so long to reach the internal function of the lapply for the first time.
SOLUTION
LOAD DATA INFILE is indeed much quicker. Writing into a file line by line using write is affordable and write.table is even quicker.
The overhead I was experiencing for lapply was coming from the fact that I was looping over POSIXct objects. It is much quicker to use seq(along.with=myVector) and then process the data from within the loop.
What about writing it to some file and calling LOAD DATA INFILE? This should at least give a benchmark. By the way: what kind of DBMS do you use?
Instead of your sendToDB function, you could use sqlSave. Internally it uses a prepared INSERT statement, which should be faster than individual inserts.
However, on a Windows platform using MS SQL, I use a separate function which first writes my data frame to a csv file and then calls the bcp bulk loader. In my case this is a lot faster than sqlSave.
There's a HUGE, relatively speaking, overhead in your sendToDB() function. That function has to negotiate an ODBC connection, send a single row of data, and then close the connection for each and every item in your list. If you are using rodbc it's more efficient to use sqlSave() to copy an entire data frame over as a table. In my experience I've found some databases (SQL Server, for example) to still be pretty slow with sqlSave() over latent networks. In those cases I export from R into a CSV and use a bulk loader to load the files into the DB. I have an external script set up that I call with a system() call to run the bulk loader. That way the load is happening outside of R but my R script is running the show.

How to change slow parametrized inserts into fast bulk copy (even from memory)

I had something like this in my code (.NET 2.0, MS SQL):
SqlConnection connection = new SqlConnection(@"Data Source=localhost;Initial Catalog=DataBase;Integrated Security=True");
connection.Open();
SqlCommand cmdInsert = connection.CreateCommand();
SqlTransaction sqlTran = connection.BeginTransaction();
cmdInsert.Transaction = sqlTran;
cmdInsert.CommandText =
@"INSERT INTO MyDestinationTable " +
"(Year, Month, Day, Hour, ...) " +
"VALUES " +
"(@Year, @Month, @Day, @Hour, ...) ";
cmdInsert.Parameters.Add("@Year", SqlDbType.SmallInt);
cmdInsert.Parameters.Add("@Month", SqlDbType.TinyInt);
cmdInsert.Parameters.Add("@Day", SqlDbType.TinyInt);
// more fields here
cmdInsert.Prepare();
Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read);
StreamReader reader = new StreamReader(stream);
char[] delimeter = new char[] { ' ' };
String[] records;
while (!reader.EndOfStream)
{
    records = reader.ReadLine().Split(delimeter, StringSplitOptions.None);
    cmdInsert.Parameters["@Year"].Value = Int32.Parse(records[0].Substring(0, 4));
    cmdInsert.Parameters["@Month"].Value = Int32.Parse(records[0].Substring(5, 2));
    cmdInsert.Parameters["@Day"].Value = Int32.Parse(records[0].Substring(8, 2));
    // more complicated stuff here
    cmdInsert.ExecuteNonQuery();
}
sqlTran.Commit();
connection.Close();
With cmdInsert.ExecuteNonQuery() commented out, this code executes in less than 2 seconds. With the SQL execution it takes 1 minute 20 seconds. There are around 0.5 million records. The table is emptied beforehand. An SSIS data flow task of similar functionality takes around 20 seconds.
Bulk Insert was not an option (see below). I did some fancy stuff during this import.
My test machine is a Core 2 Duo with 2 GB RAM.
When looking in Task Manager, the CPU was not fully utilized. IO also seemed not to be fully utilized.
The schema is simple as hell: one table with an AutoInt as primary index and fewer than 10 ints, tiny ints and chars(10).
After some answers here I found that it is possible to execute a bulk copy from memory! I was refusing to use bulk copy because I thought it had to be done from a file...
Now I use this, and it takes around 20 seconds (like the SSIS task):
DataTable dataTable = new DataTable();
dataTable.Columns.Add(new DataColumn("ixMyIndex", System.Type.GetType("System.Int32")));
dataTable.Columns.Add(new DataColumn("Year", System.Type.GetType("System.Int32")));
dataTable.Columns.Add(new DataColumn("Month", System.Type.GetType("System.Int32")));
dataTable.Columns.Add(new DataColumn("Day", System.Type.GetType("System.Int32")));
// ... and more to go
DataRow dataRow;
object[] objectRow = new object[dataTable.Columns.Count];
Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read);
StreamReader reader = new StreamReader(stream);
char[] delimeter = new char[] { ' ' };
String[] records;
int recordCount = 0;
while (!reader.EndOfStream)
{
    records = reader.ReadLine().Split(delimeter, StringSplitOptions.None);
    dataRow = dataTable.NewRow();
    objectRow[0] = null;
    objectRow[1] = Int32.Parse(records[0].Substring(0, 4));
    objectRow[2] = Int32.Parse(records[0].Substring(5, 2));
    objectRow[3] = Int32.Parse(records[0].Substring(8, 2));
    // my fancy stuff goes here
    dataRow.ItemArray = objectRow;
    dataTable.Rows.Add(dataRow);
    recordCount++;
}
SqlBulkCopy bulkTask = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
bulkTask.DestinationTableName = "MyDestinationTable";
bulkTask.BatchSize = dataTable.Rows.Count;
bulkTask.WriteToServer(dataTable);
bulkTask.Close();
Instead of inserting each record individually, try using the SqlBulkCopy class to bulk insert all the records at once.
Create a DataTable, add all your records to it, and then use SqlBulkCopy.WriteToServer to bulk insert all the data in one go.
Is the transaction required? Using a transaction needs much more resources than simple commands.
Also, if you are sure that the inserted values are correct, you can use a bulk insert.
1 minute sounds pretty reasonable for 0.5 million records. That's a record every 0.00012 seconds.
Does the table have any indexes? Removing these and reapplying them after the bulk insert would improve performance of the inserts, if that is an option.
It doesn't seem unreasonable to me to process 8,333 records per second...what kind of throughput are you expecting?
If you need better speed, you might consider implementing bulk insert:
http://msdn.microsoft.com/en-us/library/ms188365.aspx
If some form of bulk insert isn't an option, the other way would be multiple threads, each with their own connection to the database.
The issue with the current system is that you have 500,000 round trips to the database, and you are waiting for the first round trip to complete before starting the next - any sort of latency (i.e., a network between the machines) will mean that most of your time is spent waiting.
If you can split the job up, perhaps using some form of producer/consumer setup, you might find that you get much better utilisation of all the resources.
However, to do this you will have to give up the one big transaction - otherwise the first writer thread will block all the others until its transaction is completed. You can still use transactions, but you'll have to use a lot of small ones rather than one large one; a sketch of the pattern follows below.
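Purely for illustration, a minimal producer/consumer-style sketch of that idea, written here with JDBC batches rather than the asker's ADO.NET code; the connection URL, table shape, worker count, batch size, and the way the queue signals completion are all assumptions:

import java.sql.*;
import java.util.concurrent.*;

// Each worker drains a shared queue with its own connection and commits many small transactions.
public class ParallelLoader {
    private static final int BATCH = 5_000;

    public static void load(BlockingQueue<int[]> queue, String jdbcUrl, int workers)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try (Connection con = DriverManager.getConnection(jdbcUrl);
                     PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO MyDestinationTable (Year, Month, Day, Hour) VALUES (?, ?, ?, ?)")) {
                    con.setAutoCommit(false);
                    int pending = 0;
                    int[] row;
                    // Simplified termination: stop once the queue has been empty for a second.
                    // A real version would use a sentinel ("poison pill") from the producer.
                    while ((row = queue.poll(1, TimeUnit.SECONDS)) != null) {
                        ps.setInt(1, row[0]); ps.setInt(2, row[1]);
                        ps.setInt(3, row[2]); ps.setInt(4, row[3]);
                        ps.addBatch();
                        if (++pending == BATCH) {
                            ps.executeBatch();
                            con.commit();          // one small transaction per batch
                            pending = 0;
                        }
                    }
                    if (pending > 0) { ps.executeBatch(); con.commit(); }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

The producer side (reading the file and filling the queue) is left out; the point is simply that each worker holds its own connection and its own small transactions, so no single transaction blocks the rest.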
SSIS will be fast because it uses the bulk-insert method - do all the complicated processing first, generate the final list of data to insert, and hand it all at once to the bulk insert.
I assume that what is taking the approximately 58 seconds is the physical inserting of 500,000 records - so you are getting around 10,000 inserts a second. Without knowing the specs of your database server machine (I see you are using localhost, so network delays shouldn't be an issue), it is hard to say if this is good, bad, or abysmal.
I would look at your database schema - are there a bunch of indices on the table that have to be updated after each insert? This could be from other tables with foreign keys referencing the table you are working on. There are SQL profiling tools and performance monitoring facilities built into SQL Server, but I've never used them. But they may show up problems like locks, and things like that.
Do the fancy stuff on the data, on all records, first. Then bulk insert them.
(Since you're not doing selects after an insert, I don't see a problem with applying all the operations on the data before the BulkInsert.)
If I had to guess, the first thing I would look for is too many indexes, or the wrong kind of indexes, on the tbTrafficLogTTL table. Without looking at the schema definition for the table I can't really say, but I have experienced similar performance problems when:
The primary key is a GUID and the primary index is CLUSTERED.
There's some sort of UNIQUE index on a set of fields.
There are too many indexes on the table.
When you start indexing half a million rows of data, the time spent to create and maintain indexes adds up.
I will also note that if you have any option to convert the Year, Month, Day, Hour, Minute, Second fields into a single datetime2 or timestamp field, you should. You're adding a lot of complexity to your data architecture, for no gain. The only reason I would even contemplate using a split-field structure like that is if you're dealing with a pre-existing database schema that cannot be changed for any reason. In which case, it sucks to be you.
I had a similar problem in my last contract. You're making 500,000 trips to SQL to insert your data. For a dramatic increase in performance, you want to investigate the BulkInsert method in the SQL namespace. I had "reload" processes that went from 2+ hours to restore a couple of dozen tables down to 31 seconds once I implemented Bulk Import.
This could best be accomplished using something like the bcp command. If that isn't available, the suggestions above about using BULK INSERT are your best bet. You're making 500,000 round trips to the database and writing 500,000 entries to the log files, not to mention any space that needs to be allocated to the log file, the table, and the indexes.
If you're inserting in an order that is different from your clustered index, you also have to deal with the time required to reorganize the physical data on disk. There are a lot of variables here that could be making your query run slower than you would like it to.
~10,000 transactions per second isn't terrible for individual inserts round-tripping from code.
BULK INSERT = bcp from a permission
You could batch the INSERTs to reduce roundtrips
SqlDataAdapter.UpdateBatchSize = 10000 gives 50 round trips
You still have 500k inserts though...
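As a point of comparison only, the same batching idea sketched with JDBC (SqlDataAdapter.UpdateBatchSize plays this role in the asker's ADO.NET code; the column list shown and the batch size of 10,000 are assumptions):

import java.sql.*;
import java.util.List;

// Comparison sketch only: JDBC batching in place of SqlDataAdapter.UpdateBatchSize.
public class BatchedInsert {
    public static void insertAll(Connection con, List<int[]> rows) throws SQLException {
        String sql = "INSERT INTO MyDestinationTable (Year, Month, Day, Hour) VALUES (?, ?, ?, ?)";
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int pending = 0;
            for (int[] r : rows) {
                ps.setInt(1, r[0]);
                ps.setInt(2, r[1]);
                ps.setInt(3, r[2]);
                ps.setInt(4, r[3]);
                ps.addBatch();
                if (++pending == 10_000) {   // one round trip per 10,000 rows
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) ps.executeBatch(); // flush the remainder
        }
        con.commit();
    }
}

500,000 rows at 10,000 per batch is 50 round trips, which is the same arithmetic as the UpdateBatchSize suggestion above.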