Scenario:
I load some data into a local MySQL database each day, about 2 million rows;
I have to (have to - it's an audit/regulatory thing) move to a "properly" administered server, which currently looks to be Oracle 10g;
The server is in a different country: a network round-trip currently takes 60-70 ms;
Input is a CSV file in a denormalised form: I normalise the data before loading, each line typically results in 3-8 INSERTs across up to 4 tables;
The load script is currently implemented in Ruby, using ActiveRecord and fastercsv. I've tried the ar-extensions gem, but it assumes that the MySQL-style multi-row VALUES clause will work. It doesn't.
EDIT: Extremely useful answers already - thank you! More about that pesky input file. The number of fields is variable and the positions have changed a few times - my current script determines content by analysing the header row (well, fastercsv and a cunning converter do it). So a straight upload followed by post-processing SQL wouldn't work without several versions of the load file, which is horrible. Also it's a German CSV file: semicolon-delimited (no big deal) and decimals indicated by commas (a rather bigger deal, unless we load as VARCHAR and text-process afterwards - ugh).
The problem:
Loading 2 million rows at about 7 rows/sec works out at roughly 80 hours, which is something of a drawback for a daily process, not to mention that the users would rather like to be able to access the data about 5 hours after it becomes available in CSV form!
I looked at applying multiple inserts per network trip: the rather ungainly INSERT ALL... syntax would be fine, except that at present I'm applying a unique id to each row using a sequence. It transpires that
INSERT ALL
INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2)
INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4)
INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6)
SELECT 1 FROM dual;
(did I say it was ungainly?) tries to use the same id for all three rows. The Oracle docs appear to confirm this.
Latest attempt is to send multiple INSERTs in one execution, e.g.:
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6);
I haven't found a way to persuade Oracle to accept that.
The Question(s)
Have I missed something obvious? (I'd be so pleased if that turned out to be the case!)
If I can't send multiple inserts, what else could I try?
Why Accept That One?
For whatever reason, I prefer to keep my code as free from platform-specific constructs as possible: one reason this problem arose is that I'm migrating from MySQL to Oracle; it's possible another move could occur one day for geographical reasons, and I can't be certain about the platform. So getting my database library to the point where it can use a text SQL command to achieve reasonable scaling was attractive, and the PL/SQL block accomplishes that. Now if another platform does appear, the change will be limited to changing the adapter in code: a one-liner, in all probability.
How about shipping the CSV file to the Oracle DB server, using SQL*Loader to load it into a staging table, and then running a stored procedure to transform the data and INSERT it into the final tables?
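As a rough sketch of what that could look like (the staging table stg_docs and its columns are hypothetical; tablea and tablea_seq are borrowed from the question), the stored procedure boils down to one set-based INSERT ... SELECT per target table:

CREATE OR REPLACE PROCEDURE load_from_staging AS
BEGIN
  -- one set-based statement per target table, instead of one
  -- network round trip per row
  INSERT INTO tablea (id, b, c)
  SELECT tablea_seq.NEXTVAL, s.b, s.c
  FROM   stg_docs s;

  -- ...repeat for the other target tables, then empty the staging table
  COMMIT;
END load_from_staging;
/

Because the procedure runs on the database server, the only things crossing the network are the file transfer and a single call to the procedure.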
You could use:
insert into tablea (id, b, c)
select tablea_seq.nextval, b, c
from ( select 1 as b, 2 as c from dual union all
       select 3, 4 from dual union all
       select 5, 6 from dual union all
       ...
     )
This works up to about 1,000 lines, if I remember correctly. (Note that the sequence call has to sit in the outer query: Oracle does not allow NEXTVAL in a SELECT that is combined with UNION or UNION ALL.)
You could also send it as a PL/SQL batch instruction:
BEGIN
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6);
...
COMMIT;
END;
I would be loading the raw CSV file to a dedicated table in the database using SQL*Loader without normalising it, then running code against the data in the table to normalise it out to the various required tables. This would minimise the round trips, and is a pretty conventional approach in the Oracle community.
SQL*Loader can be a little challenging in terms of initial learning curve, but you can soon post again if you get stuck.
http://download.oracle.com/docs/cd/B19306_01/server.102/b14215/part_ldr.htm#i436326
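For the German-format file specifically, a fair amount of the cleanup can be pushed into the control file itself. The sketch below is only illustrative (file, table, and column names are made up, and the shifting column positions mentioned in the edit would still need handling, e.g. by regenerating the control file from the header row); it shows the semicolon delimiter and a decimal-comma conversion:

LOAD DATA
INFILE 'daily_extract.csv'
APPEND
INTO TABLE stg_docs
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  doc_name   CHAR,
  created_on DATE "DD.MM.YYYY",
  -- turn the German decimal comma into a decimal point before conversion
  amount     CHAR "TO_NUMBER(REPLACE(:amount, ',', '.'))"
)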
I have a quick suggestion. I'm from the MySQL world but was trained on Oracle, and I think this will work.
In MySQL you can insert multiple records with a single insert statement. It looks like this:
INSERT INTO table_name (column_one, column_two, column_three, column_four)
VALUES
('a', 'one', 'alpha', 'uno'),      -- row 1
('b', 'two', 'beta', 'dos'),       -- row 2
('c', 'three', 'gamma', 'tres'),   -- etc.
....
('z', 'twenty-six', 'omega', 'veintiséis');
Now obviously you can only insert into one table at a time, and you wouldn't want to do 2 million records in one statement, but you could easily do 10 or 20 or 100 at a time (if you are allowed packets that big). You may have to generate this by hand; I don't know if the frameworks you are using (or any, for that matter) support making this kind of code for you.
In the MySQL world this DRAMATICALLY speeds up inserts. I assume it does all the index updates and such at the same time, but it also prevents it from having to re-parse the SQL on each insert.
If you combine this with prepared statements (so the SQL is cached and doesn't have to be parsed each time) and transactions (to make sure things are always in a sane state when you have to do inserts across multiple tables), I think you will be doing pretty well.
NunoG is right that you can load the CSVs directly into Oracle. You may be best off reading in the input file, generating a normalized set of CSV files (one for each table), and then loading in each of those one at a time.
Instead of executing the SQL over the network, you could write the INSERTs to a text file, ship it over the network, and run it locally on the server.
SQL*Loader is an Oracle-supplied utility that allows you to load data from a flat file into one or more database tables. This will be 10-100 times faster than inserts using queries.
http://www.orafaq.com/wiki/SQL*Loader_FAQ
If SQL*Loader doesn't cut it, try a small preprocessor program that reformats the file into something SQL*Loader can read.
Related
As of now, the Hive terminal shows only column headers after a CREATE TABLE statement is run. What settings should I change to make it show a few rows as well, say the first 100?
Code I am using to create table t2 from table t1 which resides in the database (I don't know how t1 is created):
create table t2 as
select *
from t1
limit 100;
For now, during development, I am writing select * from t2 limit 100; after each CREATE TABLE section to get the rows with headers.
You cannot
The Hive Create Table documentation does not mention anything about showing records. This, combined with my experience with Hive, makes me quite confident that you cannot achieve this through regular config changes alone.
Of course you could tap into the code of Hive itself, but that is not something to be attempted lightly.
And you should not want to
Changing the create command could lead to all kinds of problems, especially because, unlike the select command, it is in fact an operation on metadata followed by an insert. Neither of those would normally show you anything.
If you were to create a huge table, it would be problematic to show everything, and always showing just the first 100 rows would be inconsistent.
There are ways
Now, there are some things you could do:
Change hive itself (not easy, probably not desirable)
Do it in 2 steps (what you currently do)
Write a wrapper:
If you want to automate things and don't like code duplication, you can look into writing a small wrapper function to call the create and select based on just the input of source (and limit) and destination.
This kind of wrapper could be written in bash, python, or whatever you choose.
However, note that if you like executing the commands ad hoc/manually this may not be suitable, as you will need to start a Hive JVM each time you run such a program, so response times will be slow.
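If the wrapper really is just "create, then preview", it can even stay inside Hive itself: a minimal sketch (the script name and variables are made up), saved as two_step.hql and run with hive --hivevar src=t1 --hivevar dst=t2 -f two_step.hql:

-- two_step.hql: create the table, then immediately preview it
CREATE TABLE ${hivevar:dst} AS
SELECT * FROM ${hivevar:src} LIMIT 100;

SELECT * FROM ${hivevar:dst} LIMIT 100;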
All in all you are probably best off just doing the create first and select second.
The command below is indeed the correct way to show the first 100 rows:
select * from <created_table> limit 100;
Pasting the code you wrote to create the table would help to diagnose the issue at hand.
Nevertheless, check that you have correctly specified the delimiters for the fields, key-value pairs, collection items, etc. when creating the table.
If they are not defined correctly you might end up with only the first row (the header) being shown.
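For illustration, a delimited table definition might look like the sketch below (column names and delimiters are purely illustrative); if, say, the field terminator does not match the actual data, Hive cannot split the lines and you can end up seeing only that single header-like row the answer describes:

CREATE TABLE my_table (
  id    INT,
  name  STRING,
  tags  ARRAY<STRING>,
  props MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;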
I'm currently playing with different ways of getting data into SQL Server, and yesterday I hit a problem using BCP which, although I solved it, reminded me of working with SSIS packages because of the not-very-useful error information. For the way I like to work, I would be much happier loading entire data rows (whether fixed-width or delimited) into a staging table using BCP or BULK INSERT, and then operating on the rows rather than trying to force them into typed columns on the way into SQL Server.
As such I would like to find an approach that allows me to split and validate (check datatype) the data before I insert it into its destination, and also write any bad data rows to another table so I can decide what to do with them later.
I've knocked together a script to simulate the scenario; the ImportedData table would be the output of my BCP or BULK INSERT. All the data from ImportedData needs to end up in either the Presenters or the RejectedData table.
I need an approach that can scale reasonably well; a real-life situation might be more like 40 columns across and 20 million rows of data, so I'm thinking I'll have to do something like process 10,000 rows at a time.
SQL Server 2012 has the new try_parse function which would probably help but I need to be able to do this on 2005 and 2008 machines.
IF OBJECT_ID (N'ImportedData', N'U') IS NOT NULL DROP TABLE dbo.ImportedData
CREATE TABLE dbo.ImportedData (RowID INT IDENTITY(1,1), DataRow VARCHAR(MAX))
IF OBJECT_ID (N'Presenters', N'U') IS NOT NULL DROP TABLE dbo.Presenters
CREATE TABLE dbo.Presenters (PresenterID INT, FirstName VARCHAR(10), LastName VARCHAR(10))
IF OBJECT_ID (N'RejectedData', N'U') IS NOT NULL DROP TABLE dbo.RejectedData
CREATE TABLE dbo.RejectedData (DataRow VARCHAR(MAX))
-- insert as fixed-width
INSERT INTO dbo.ImportedData(DataRow)
SELECT '1 Bruce Forsythe '
UNION ALL SELECT '2 David Dickinson '
UNION ALL SELECT 'X BAD DATA'
UNION ALL SELECT '3 Keith Chegwin '
-- insert as CSV
/*INSERT INTO dbo.ImportedData(DataRow)
SELECT '1,Bruce,Forsythe'
UNION ALL SELECT '2,David,Dickinson'
UNION ALL SELECT 'X,BAD,DATA'
UNION ALL SELECT '3,Keith,Chegwin'
*/
---------- DATA PROCESSING -------------------------------
SELECT
SUBSTRING(DataRow,1,3) AS ID,
SUBSTRING(DataRow,4,10) AS FirstName,
SUBSTRING(DataRow,14,10) AS LastName
FROM
ImportedData
---------- DATA PROCESSING -------------------------------
SELECT * FROM ImportedData
SELECT * FROM Presenters
SELECT * FROM RejectedData
For your 20M row scenario and concerns over performance, let's dive into that.
Step 1, load the big file into the database. The file system is going to go to disk and read all that data. Maybe you're sitting on banks of Fusion-io drives and IOPS is not a concern, but barring that unlikely scenario, you will spend X amount of time reading that data off disk via bcp/bulk insert/SSIS/.NET/etc. You then get to spend time writing all of that same data back to disk in the form of the table insert(s).
Step 2, parse that data. Before we spend any CPU time running those substring operations, we'll need to identify the data rows. If your machine is well provisioned on RAM, then the data pages for ImportedData might be in memory and it will be far less costly to access them. Odds are though, they aren't all in memory so a combination of logical and physical reads will occur to get that data. You've now effectively read that source file in twice for no gain.
Now it's time to start splitting your data. Out of the box, T-SQL gives you trim, left, right, and substring functions. With CLR, you can get wrapper methods around the .NET string library to help simplify the coding effort, but you'll trade some coding efficiency for instantiation costs. Last I read on the matter, the answer (T-SQL vs CLR) was "it depends." Shocking, if you know the community, but it really depends on your string lengths and a host of other factors.
Finally, we're ready to parse the values and see whether they're legit. As you say, with SQL Server 2012 we have try_parse as well as try_convert. Parse is completely new, but if you need to deal with locale-aware data (01-02-05: in GB it's Feb 1, 2005; in the US it's Jan 2, 2005; in JP it's Feb 5, 2001) it's invaluable. If you're not on 2012, you could roll your own versions with a CLR wrapper.
Step 3, errors! Someone slipped in a bad date or whatever and your cast fails. What happens? Since the query either succeeds or it doesn't, all of the rows fail, and you get ever-so-helpful error messages like "string or binary data would be truncated" or "conversion failed when converting datetime from character string." Which row out of your N-sized slice? You won't know until you go looking for it, and this is when folks usually devolve into an RBAR approach, further degrading performance. Or they try to stay set-based and run queries against the source set that filter out the rows that would fail conversion before attempting the insert.
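To make that last point concrete, here is a minimal set-based sketch against the tables from the question: each row is parsed and flagged exactly once, then routed to Presenters or RejectedData, so no cast is ever attempted on a row that would fail (the simple ISNUMERIC check stands in for whatever validation the real 40-column feed needs):

-- parse and flag each row once
;WITH Parsed AS (
    SELECT RowID,
           DataRow,
           LTRIM(RTRIM(SUBSTRING(DataRow, 1, 3)))   AS ID,
           LTRIM(RTRIM(SUBSTRING(DataRow, 4, 10)))  AS FirstName,
           LTRIM(RTRIM(SUBSTRING(DataRow, 14, 10))) AS LastName
    FROM dbo.ImportedData
)
SELECT *,
       CASE WHEN ISNUMERIC(ID) = 1 THEN 1 ELSE 0 END AS IsValid
INTO #Staged
FROM Parsed;

-- valid rows go to the real table...
INSERT INTO dbo.Presenters (PresenterID, FirstName, LastName)
SELECT CAST(ID AS INT), FirstName, LastName
FROM #Staged
WHERE IsValid = 1;

-- ...and the rest are kept for inspection
INSERT INTO dbo.RejectedData (DataRow)
SELECT DataRow
FROM #Staged
WHERE IsValid = 0;

DROP TABLE #Staged;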
You can see by my tags that I'm an SSIS kind of guy, but I am not such a bigot as to think it's the only thing that can work. If there's an approach to ETL, I like to think I've tried it. In your case, I think you will get much better performance and scalability by developing your own ETL code/framework or using an existing one (Rhino ETL or Reactive ETL).
Finally, be aware of the implications of varchar(max): it has a performance cost associated with it.
varchar(max) everywhere?
is there an advantage to varchar(500) over varchar(8000)?
Also, as described, your process would only allow a single instance of the ETL to run at once. Maybe that covers your use case, but in companies where we did lots of ETL, we couldn't force client B to wait for client A's ETL to finish before starting their work, or we'd be short of clients in no time at all.
There is no simple way of doing it in T-SQL. In this case you need ISDATE(), ISNUMERIC() and similar checks for each datatype you are trying to parse. You can then move the rejected rows to the rejected table, delete those rows from ImportedData, and continue with your load:
SELECT
    RowID,
    SUBSTRING(DataRow,1,3)   AS ID,
    SUBSTRING(DataRow,4,10)  AS FirstName,
    SUBSTRING(DataRow,14,10) AS LastName,
    SUBSTRING(DataRow,24,8)  AS DOB,
    SUBSTRING(DataRow,32,10) AS Amount
INTO RejectedData
FROM ImportedData
WHERE ISDATE(SUBSTRING(DataRow,24,8)) = 0
   OR ISNUMERIC(SUBSTRING(DataRow,32,10)) = 0
Then delete those rows from ImportedData:
DELETE FROM ImportedData WHERE RowID IN (SELECT RowID FROM RejectedData)
and then insert into Presenters:
INSERT INTO Presenters
SELECT
    RowID,
    SUBSTRING(DataRow,1,3)   AS ID,
    SUBSTRING(DataRow,4,10)  AS FirstName,
    SUBSTRING(DataRow,14,10) AS LastName,
    CONVERT(DATE, SUBSTRING(DataRow,24,8))           AS DOB,
    CONVERT(DECIMAL(18,2), SUBSTRING(DataRow,32,10)) AS Amount
FROM ImportedData
And for managing inserts in batches, this is a very good article:
http://sqlserverplanet.com/data-warehouse/transferring-large-amounts-of-data-using-batch-inserts
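For illustration, a minimal sketch of that batching idea against the question's ImportedData table (the batch size is arbitrary, and it assumes the rejected rows have already been removed as above):

DECLARE @BatchSize INT, @Start INT, @MaxRowID INT
SET @BatchSize = 10000
SET @Start = 1
SELECT @MaxRowID = MAX(RowID) FROM dbo.ImportedData

WHILE @Start <= @MaxRowID
BEGIN
    -- each pass moves one slice of rows, keeping each transaction small
    INSERT INTO dbo.Presenters (PresenterID, FirstName, LastName)
    SELECT CAST(SUBSTRING(DataRow, 1, 3) AS INT),
           SUBSTRING(DataRow, 4, 10),
           SUBSTRING(DataRow, 14, 10)
    FROM dbo.ImportedData
    WHERE RowID BETWEEN @Start AND @Start + @BatchSize - 1

    SET @Start = @Start + @BatchSize
END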
I am lazy, sometimes excruciatingly lazy, but hey, (ironically) that is how we get stuff done, right?
I had a simple idea that may or may not be out there already. If it is, I would like to know about it, and if not, perhaps I will build it.
When working with my MSSQL database I sometimes want to test the performance of various transactions over tables, views, procedures, etc. Does anyone know if there is a way to fill a table up with x rows of junk data merely to experiment with?
One could, simply enough, do...
INSERT INTO [Table]
SELECT [columns] FROM [SourceTable]
Or do some kind of...
DECLARE @count int
SET @count = 0
WHILE @count <= x   -- x = the number of rows you want
BEGIN
    INSERT INTO [Table]
    (...column list...)
    VALUES
    (...values (could include @count here as a primary key)...)
    SET @count = @count + 1
END
But it seems like there is or should already be something out there. Any ideas??
I use Redgate's SQL Data Generator.
Use a Data Generation Plan (a feature of Visual Studio database projects).
WinSQL seems to have a data generator (which I did not test) and has a free version, but the test data generation wizard seems to be reserved for the Pro version.
My personal favorite would be to generate a CSV file (using a 4-5 line script) and load it into your SQL DB using BULK INSERT. This will also allow better customization of the data, as is sometimes needed (e.g. when writing tests).
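For reference, loading such a generated file back in is a one-liner; a minimal sketch, assuming a hypothetical dbo.TestData table whose columns match the CSV layout and a made-up file path:

BULK INSERT dbo.TestData
FROM 'C:\temp\testdata.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
);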
I am asking a question that is related to Execute multiple SQL commands in one round trip but not exactly the same because I am having this problem on a much bigger scale:
I have an input file with many different SQL commands (ExecuteNonQuery) that I have to process with a .net application.
Example:
INSERT INTO USERS (name, password) VALUES (#name, #pw); #name="abc"; #pw="def";
DELETE FROM USERS WHERE name=#name; #name="ghi";
INSERT INTO USERS (name, password) VALUES (#name, #pw); #name="mno"; #pw="pqr";
All of the commands have parameters, so I would like to use the parameter mechanism that .NET provides. But my application has to read these statements and execute them within an acceptable time span. There might be several thousand statements in a single file.
My first thought was to use SqlCommand with parameters since that would really be the way to do it properly (parameters are escaped by .NET), but I can't afford to wait 50 ms for each command to complete (network communication with the DB server, ...). I need a way to chain the commands.
My second thought was to escape and insert the parameters myself so I could combine multiple commands in one SQLCommand:
INSERT INTO USERS (name, password) VALUES ('abc', 'def'); DELETE FROM USERS WHERE name='ghi'; INSERT INTO USERS (name, password) VALUES ('mno', 'pqr');
However, I feel uneasy about this solution because I don't like escaping the input myself when there are predefined functions to do it.
What would you do? Thanks for your answers, Chris
Assuming everything in the input is valid, what I would do is this:
Parse out the parameter names and values
Rewrite the parameter names so they are unique across all queries (i.e., so you would be able to execute two queries with a #name parameter in the same batch)
Group together a bunch of queries into a single batch and run the batches inside a transaction
The reason you (likely) won't be able to run this all in a single batch is that there is a limit of 2,100 parameters per batch (at least there was when I did this same thing with SQL Server); depending on the performance you get, you'll want to tweak the batch separation limit. 250-500 queries per batch worked best for my workload; YMMV.
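For example, one rewritten batch from the file above might end up looking like the sketch below (the suffix numbering is just an illustration of the renaming); the matching values (@name_1 = 'abc', @pw_1 = 'def', ...) are attached as SqlParameters and the whole string is executed as a single command inside the transaction:

INSERT INTO USERS (name, password) VALUES (@name_1, @pw_1);
DELETE FROM USERS WHERE name = @name_2;
INSERT INTO USERS (name, password) VALUES (@name_3, @pw_3);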
One thing I would not do is multi-thread this. If the input is arbitrary, the program has no idea if the order of the operations is important; therefore, you can't start splitting up the queries to run simultaneously.
Honestly, as long as you can get the queries to the server somehow, you're probably in good shape. With only "multiple thousands" of statements, the whole process won't take very long. (The application I wrote had to do this with several million statements.)
Interesting dilemma..
I would suggest any of these:
Do you have control over the SQL Server box? Create a stored procedure that loads the file and does the work.
Use SqlCommand, but cache the parameters and read only the command type (delete, insert, etc.) and the values from each line to execute. Parameter caching examples here, here, and here.
Use multiple threads: a parent thread to read the lines and hand them to other threads, one to do the inserts, another to do the deletions, or as many as needed. Look at Tasks.
I would really appreciate a bit of help/pointers on the following problem.
Background Info:
Database version: Oracle 9i
Java version: 1.4.2
The problem
I have a database table with multiple columns representing various meta data about a document.
E.g.:
CREATE TABLE mytable
(
document_id integer,
filename varchar(255),
added_date date,
created_by varchar(32),
....
)
Due to networking/latency issues between a webserver and database server, I would like to minimise the number of queries made to the database.
The documents are listed in a web page, but there are thousands of different documents.
To aid navigation, we provide filters on the web page to select just the documents matching a certain value, e.g. created by user 'joe bloggs' or created on '01-01-2011'. Paging is also provided, so a DB call is triggered to get the next 50 docs or whatever.
The web pages themselves are kept pretty dumb: they just present what's returned by a Java servlet. Currently, each filter is populated by a separate query for the distinct values of its column.
This is taking quite a long time due to the network latency and the fact that it means 5 extra queries.
My Question
I would like to know if there is a way to get this same information in just one query?
For example, is there a way to get distinct results from that table in a form like:
DistinctValue Type
01-01-2011 added_date
01-02-2011 added_date
01-03-2011 added_date
Joe Bloggs created_by
AN Other created_by
.... ...
I'm guessing one issue with the above is that the datatypes are different across the columns, so dates and varchars could not both be returned in a "DistinctValue" column.
Is there a better/standard approach to this problem?
Many thanks in advance.
Jay
Edit
As I mentioned in a comment below, I thought of a possibly more memory/load effective approach that removes the original requirement to join the queries up -
I imagine another way it could work is: instead of populating the drop-downs initially, have them react to the user typing, and then have a "suggester"-style drop-down appear with just those distinct values that match the entered text. I think this would mean a) keeping the separate queries for distinct values, but b) only running each query individually as needed, and c) reducing the result set by filtering the unique values on the user's text.
This query will return an output as you describe above:
SELECT DocumentID, 'FileName' AS AttributeType, FileName AS DistinctValue
FROM TableName
UNION
SELECT DocumentID, 'Added Date', TO_CHAR(Added_Date, 'DD-MM-YYYY') FROM TableName
UNION
SELECT DocumentID, 'Created By', Created_By FROM TableName
UNION
....
(The date column is wrapped in TO_CHAR because every branch of a UNION has to return the same datatypes.)
If you have the privilege, you could create a view using this SQL and use it for your queries.
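If only the distinct values themselves are needed (without DocumentID), a variant that also sidesteps the datatype mismatch mentioned in the question is to convert everything to strings; a sketch against the question's mytable, with an assumed date format:

SELECT DISTINCT 'added_date' AS attribute_type,
       TO_CHAR(added_date, 'DD-MM-YYYY') AS distinct_value
  FROM mytable
UNION ALL
SELECT DISTINCT 'created_by', created_by FROM mytable
UNION ALL
SELECT DISTINCT 'filename', filename FROM mytable;

This still scans the table once per column, but everything comes back in a single round trip, which is what the latency problem calls for.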
Due to networking/latency issues between a webserver and database server, I would like to minimise the number of queries made to the database. The documents are listed in a web page, but there are thousands of different documents.
You may want to look into Lucene. Whenever I see "minimise queries to the DB" combined with "searching documents", this is what I think of. I've used it with very good success, and it can be used in read-only or updating environments. Oracle's answer is Oracle Text, but (to me anyway) it's a bit of a bear to set up and use. It depends on your company's technical resources and strengths.
Anyway, sure beats the heck out of multiple queries to the db for each connection.