How can I merge my insert scripts with IntelliJ? - intellij-idea

I have a (spring boot) project in IntelliJ Ultimate. There are two tables Main and Extension where every entry in Main has one corresponding entry in Extension, e.g.
Main
main_id | main_col_a
0       | lorem
1       | ipsum
Extension
main_id | extension_col_a | extension_col_b
0       | b               | irrelevant
1       | c               | data
Now I have merged the tables, so that Main consists of main_id, main_col_a, and extension_col_a (and Extension is dropped). But for my many tests I have ~100 SQL files with INSERT statements that need to be merged as well, so I need to turn
INSERT INTO MAIN(MAIN_ID, MAIN_COL_A) VALUES
(0, 'lorem'),
(1, 'ipsum');
INSERT INTO EXTENSION(MAIN_ID, EXTENSION_COL_A, EXTENSION_COL_B) VALUES
(0, 'b', 'irrelevant'),
(1, 'c', 'data');
into
INSERT INTO MAIN(MAIN_ID, MAIN_COL_A, EXTENSION_COL_A) VALUES
(0, 'lorem', 'b'),
(1, 'ipsum', 'c');
in an automated way.
There is some variation such as alignment, but the inserts for Extension always follow those for MAIN and the ids are always in the same order.
I'm not worried about dropping the Extension table but about moving the column from Extension to Main. I'm currently considering writing a Python script, but I'm wondering if it can maybe be done easily with IntelliJ features. I know about multiple cursors, but there are too many files for that, and I don't think macros can easily cope with the varying number of lines in the insert statements.
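For what it's worth, a rough sketch of the Python script I have in mind (it assumes each file contains the two statements exactly once, that the ids line up as described, and that the values never contain commas or closing parentheses; the resource path at the bottom is a placeholder):
import re
from pathlib import Path

ROW = re.compile(r"\(\s*(\d+)\s*,\s*(.*?)\)\s*[,;]", re.S)

def merge_file(path):
    sql = path.read_text()
    main = re.search(r"INSERT INTO MAIN\s*\(MAIN_ID,\s*MAIN_COL_A\)\s*VALUES(.*?);",
                     sql, re.I | re.S)
    ext = re.search(r"INSERT INTO EXTENSION\s*\(MAIN_ID,\s*EXTENSION_COL_A,\s*EXTENSION_COL_B\)\s*VALUES(.*?);",
                    sql, re.I | re.S)
    if not (main and ext):
        return
    # main_id -> extension_col_a (the first value after the id)
    ext_a = {m.group(1): m.group(2).split(",")[0].strip()
             for m in ROW.finditer(ext.group(1) + ";")}
    rows = ["(%s, %s, %s)" % (m.group(1), m.group(2).strip(), ext_a[m.group(1)])
            for m in ROW.finditer(main.group(1) + ";")]
    merged = ("INSERT INTO MAIN(MAIN_ID, MAIN_COL_A, EXTENSION_COL_A) VALUES\n"
              + ",\n".join(rows) + ";")
    # replace the old MAIN insert and drop the EXTENSION insert entirely
    path.write_text(sql.replace(main.group(0), merged).replace(ext.group(0), ""))

for f in Path("src/test/resources").glob("**/*.sql"):  # placeholder test-data path
    merge_file(f)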

You can use "Replace in Files" with regex expressions.
Open Edit -> Find -> Replace in Files...
Create expressions based on this guide: https://www.jetbrains.com/help/idea/regular-expression-syntax-reference.html#regex-syntax-reference
Here is something to get you started:
Find: INSERT INTO MAIN\(MAIN_ID\, MAIN_COL_A\) VALUES[\s]*(\([\w'"]*, [\w'"]*, [\w'"]*\),)*[\s]*(\([\w'"]*\, [\w'"]*)\)([\w\W]*INSERT INTO EXTENSION\(main_id, extension_col_a, extension_col_b\) values)[\s]*\([\w'"]\, ([\w'"]*)\, [\w'"]*\)[,;]
Replace:
INSERT INTO MAIN\(MAIN_ID\, MAIN_COL_A\) VALUES\n$1\n$2, $4\)$3
Use find & replace as many times as needed to merge the rows, then use it again to edit the first line of the INSERT INTO MAIN and to delete the INSERT INTO EXTENSION.

Related

postgresql-pgadmin 4 insert data

I want to insert data using, for example (it works in the tutorial on YouTube):
INSERT INTO cars (name ,price)
VALUES ('renault' , '10000')
but it doesn't work in my database and I have no idea why. Instead, I have to use:
INSERT INTO public."cars" VALUES ('renault','10000')
So my question is: What's the difference between public."cars" and just cars?
The difference between quoted and unquoted identifiers is that the former allow arbitrary characters and SQL keywords and are not folded to lower case. None of this applies in your case.
The only difference is that in one case you qualify the name with a schema, so perhaps your problem is that there is another table cars on your search_path.
It is impossible to say more, because “does not work” is too unspecific.
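To see which table an unqualified name actually resolves to, you can check the search_path and name resolution directly; a quick sketch in Python with psycopg2 (the connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
cur = conn.cursor()
cur.execute("SHOW search_path")
print(cur.fetchone())                   # e.g. ('"$user", public',)
# to_regclass() shows which table, if any, each spelling resolves to
cur.execute("SELECT to_regclass('cars'), to_regclass('public.cars')")
print(cur.fetchone())
conn.close()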

SQL Server : fix misspelled business names

I'm looking for advice on how to tackle the issue of different spelling for the same name.
I have a SQL Server database with company names, and there are some companies that are the same but the spelling is different.
For example:
Building Supplies pty
Buidings Supplies pty
Building Supplied l/d
The problem is that there are no clear consistencies in the variation. Sometimes it's an extra 's', other times it's an extra space.
Unfortunately I don't have a lookup list, so I can't use Fuzzy LookUp. I need to create the clean list.
Is there a method that people use to deal with this problem?
P.S. I tried searching for this problem but can't seem to find a similar thread.
Thanks
You can use SOUNDEX()/DIFFERENCE() for this purpose.
DECLARE @SampleData TABLE(ID INT, BLD VARCHAR(50), SUP VARCHAR(50))
INSERT INTO @SampleData
SELECT 1, 'Building','Supplies'
UNION
SELECT 2, 'Buidings','Supplies'
UNION
SELECT 3, 'Biulding','Supplied'
UNION
SELECT 4, 'Road','Contractor'
UNION
SELECT 5, 'Raod','Consractor'
UNION
SELECT 6, 'Highway','Supplies'
SELECT *, DIFFERENCE('Building', BLD) AS DIF
FROM @SampleData
WHERE DIFFERENCE('Building', BLD) >= 3
Result
ID BLD SUP DIF
1 Building Supplies 4
2 Buidings Supplies 3
3 Biulding Supplied 4
If this serves your purpose, you can write an update query to update the selected records accordingly.
Aside from the SOUNDEX()/DIFFERENCE() option (which is a very good one, by the way!) you could look into SSIS more.
Provided your data is in English and not exclusively names of people, there is a lot you can do with these components:
Term extraction
Term lookup
Fuzzy grouping
Fuzzy lookup
The main flow would be a tiered structure where you try to find duplicates in increasingly less certain ways. Instead of applying the changes automatically, you send all the names and keys you would need to apply them to a staging area where they can be reviewed and, if needed, applied.
If you go about it really smartly, you can use the reviewed data as a repository for making the package "learn": for example, "iu" is hardly ever valid in English, so if that is found and changing it to "ui" makes a valid English word, you might want to start applying those changes automatically at some point.
One other thing to consider is keeping a list of all validated names and using this to check for duplicates of those names and to prevent unnecessary recursion/load on checking the source data.
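If SSIS turns out not to be available, the same tiered idea can be prototyped outside the database. A minimal sketch in Python using difflib; the similarity threshold is just a starting guess and the names would of course come from your company table:
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means identical after lower-casing and stripping.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def group_names(names, threshold=0.8):
    # Greedy grouping: each name joins the first group whose representative
    # is similar enough, otherwise it starts a new group for later review.
    groups = []
    for name in names:
        for rep, members in groups:
            if similarity(rep, name) >= threshold:
                members.append(name)
                break
        else:
            groups.append((name, [name]))
    return groups

names = ["Building Supplies pty", "Buidings Supplies pty", "Building Supplied l/d"]
for rep, members in group_names(names):
    print(rep, "->", members)
The resulting groups would then go to the staging area for review rather than being applied automatically, as described above.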

Using table valued parameters with wildcards?

The scenario...
I'm developing a web site in ASP.net and Visual Basic that displays product listings (hereafter referred to as "Posts") generated by the users. In this site, I have a search box that allows the user to find his post more easily. The search box is designed to allow the user to input specific keywords which will hopefully provide a more customized search. For example:
For example, the user has specified that he would like to search for books with an author matching the name "John Smith" and also with the tags "Crime" and "Suspense".
Steps:
My SearchBoxManager class retrieves these key names (Author, Tags) and the key values (John Smith, Crime, Suspense).
I run a SQL query using those parameters looking for exact matching.
I extract each word from the key values so I can search them separately.
This is where I am having issues. I run a SQL query using those parameters looking for non-exact matching (e.g., '%John%', '%Smith%'). This step is to create a system of results based on relevancy. This is my first relevancy algorithm so I may be going about it the wrong way.
The problem...
In order for the third step to work properly, I would like to place each set of separated words from the key values into a table-valued parameter, and then use that table-valued parameter in the SqlCommand's CommandText, surrounded by the wildcard '%'.
In other words, because the number of words in each key value will probably change each time, I need to place them in a table-valued parameter of some kind. I also think I could string them together somehow and just use them directly in the query string, but I stopped mid-way because it was getting a little messy.
The question...
Can I use wildcards with table-valued parameters, and if so how?
Also, if any of you have developed a relevancy ranking algorithm for local search boxes or know of one, I would be ever-grateful if you could point me in the right direction.
Thank you in advance.
So, some sample data:
create table authors ( id int, name varchar(200) )
insert into authors values (1, 'John Smith')
insert into authors values (2, 'Jack Jones')
insert into authors values (3, 'Charles Johnston')
If you want to work with a table variable, we can make one and populate it with a couple of search words:
declare @t table(word varchar(10) )
insert into @t select 'smith' union select 'jones'
Now we can select using wildcards:
select authors.* from authors, @t words
where authors.name like '%'+words.word+'%'
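On the client side, the key point is that the '%' wildcards go into the parameter values, not into the SQL text. A small illustration, with Python/pyodbc standing in for the ASP.net code and a plain OR list instead of a table-valued parameter:
import pyodbc

def search_authors(cursor, words):
    # One "name LIKE ?" clause per search word; the wildcards live in the values.
    clauses = " OR ".join("name LIKE ?" for _ in words)
    params = ["%" + w + "%" for w in words]
    cursor.execute("SELECT * FROM authors WHERE " + clauses, params)
    return cursor.fetchall()

conn = pyodbc.connect("DSN=mydb")  # placeholder connection string
print(search_authors(conn.cursor(), ["smith", "jones"]))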

Send multiple commands to one SQLConnection

I am asking a question that is related to Execute multiple SQL commands in one round trip but not exactly the same because I am having this problem on a much bigger scale:
I have an input file with many different SQL commands (ExecuteNonQuery) that I have to process with a .net application.
Example:
INSERT INTO USERS (name, password) VALUES (@name, @pw); @name="abc"; @pw="def";
DELETE FROM USERS WHERE name=@name; @name="ghi";
INSERT INTO USERS (name, password) VALUES (@name, @pw); @name="mno"; @pw="pqr";
All of the commands have parameters so I would like the parameter mechanism that .net provides. But my application has to read these statements and execute them within an acceptable time span. There might be multiple thousand statements in one single file.
My first thought was to use SQLCommand with parameters since that would really be the way to do it properly (parameters are escaped by .net) but I can't afford to wait 50msec for each command to complete (network communication with DB server, ...). I need a way to chain the commands.
My second thought was to escape and insert the parameters myself so I could combine multiple commands in one SQLCommand:
INSERT INTO USERS (name, password) VALUES ('abc', 'def'); DELETE FROM USERS WHERE name=@name; @name='ghi'; INSERT INTO USERS (name, password) VALUES ('mno', 'pqr');
However I do feel uneasy with this solution because I don't like to escape the input myself if there are predefined functions to do it.
What would you do? Thanks for your answers, Chris
Assuming everything in the input is valid, what I would do is this:
Parse out the parameter names and values
Rewrite the parameter names so they are unique across all queries (i.e., so you would be able to execute two queries with a @name parameter in the same batch)
Group together a bunch of queries into a single batch and run the batches inside a transaction
The reason why you (likely) won't be able to run this all in a single batch is because there is a parameter limit of 2100 in a single batch (at least there was when I did this same thing with SQL Server); depending on the performance you get, you'll want to tweak the batch separation limit. 250-500 worked best for my workload; YMMV.
One thing I would not do is multi-thread this. If the input is arbitrary, the program has no idea if the order of the operations is important; therefore, you can't start splitting up the queries to run simultaneously.
Honestly, as long as you can get the queries to the server somehow, you're probably in good shape. With only "multiple thousands" of statements, the whole process won't take very long. (The application I wrote had to do this with several million statements.)
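To make the renaming and batching step concrete, here is a rough sketch in Python of just that step; the input format is the parsed form of your file, and the batch size is a guess to stay under the parameter limit:
import re

BATCH_SIZE = 250  # stay well under the ~2100-parameter limit per batch

def build_batch(statements):
    # statements: list of (sql, values) pairs, e.g.
    #   ("INSERT INTO USERS (name, password) VALUES (@name, @pw)",
    #    {"name": "abc", "pw": "def"})
    sql_parts, params = [], {}
    for i, (sql, values) in enumerate(statements):
        # Suffix every parameter with the statement index so names are
        # unique across the whole batch.
        sql_parts.append(re.sub(r"@(\w+)", lambda m: "@%s_%d" % (m.group(1), i), sql) + ";")
        for name, value in values.items():
            params["%s_%d" % (name, i)] = value
    return "\n".join(sql_parts), params

def batches(statements):
    for start in range(0, len(statements), BATCH_SIZE):
        yield build_batch(statements[start:start + BATCH_SIZE])
Each resulting (sql, params) pair then becomes one command: attach the parameters, execute it, and commit the surrounding transaction once the batch succeeds.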
Interesting dilemma.
I would suggest any of these:
Do you have control over the SQL Server? Create a stored procedure that loads the file and does the work.
Use SqlCommand, but cache the parameters, and read only the command type (DELETE, INSERT, etc.) and the values from each line to execute.
Use multiple threads: a parent thread to read the lines and send them over to other threads, one to do the inserts, another to do the deletions, or as many as needed. Look at Tasks.

Can I send "batched" INSERTs to Oracle?

Scenario:
I load some data into a local MySQL database each day, about 2 million rows;
I have to (have to - it's an audit/regulatory thing) move to a "properly" administered server, which currently looks to be Oracle 10g;
The server is in a different country: a network round trip currently takes 60-70 ms;
Input is a CSV file in a denormalised form: I normalise the data before loading, each line typically results in 3-8 INSERTs across up to 4 tables;
The load script is currently implemented in Ruby, using ActiveRecord and fastercsv. I've tried the ar-extensions gem, but it assumes that the MySQL style multiple values clause idea will work. It doesn't.
EDIT: Extremely useful answers already - thank-you! More about that pesky input file. The number of fields is variable and positions have changed a few times - my current script determines content by analysing the header row (well, fastercsv and a cunning converter do it). So a straight upload and post-process SQL wouldn't work without several versions of the load file, which is horrible. Also it's a German CSV file: semi-colon delimited (no big deal) and decimals indicated by commas (rather bigger deal unless we load as VARCHAR and text-process afterwards - ugh).
The problem:
Loading 2 million rows at about 7/sec is going to take rather more than 24 hours! That's likely to be a drawback with a daily process, not to mention that the users would rather like to be able to access the data about 5 hours after it becomes available in CSV form!
I looked at applying multiple inserts per network trip: the rather ungainly INSERT ALL... syntax would be fine, except that at present I'm applying a unique id to each row using a sequence. It transpires that
INSERT ALL
INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2)
INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4)
INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6)
SELECT 1 FROM dual;
(did I say it was ungainly?) tries to use the same id for all three rows. The Oracle docs appear to confirm this.
Latest attempt is to send multiple INSERTs in one execution, e.g.:
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6);
I haven't found a way to persuade Oracle to accept that.
The Question(s)
Have I missed something obvious? (I'd be so pleased if that turned out to be the case!)
If I can't send multiple inserts, what else could I try?
Why Accept That One?
For whatever reason, I prefer to keep my code as free from platform-specific constructs as possible: one reason this problem arose is that I'm migrating from MySQL to Oracle; it's possible another move could occur one day for geographical reasons, and I can't be certain about the platform. So getting my database library to the point where it can use a text SQL command to achieve reasonable scaling was attractive, and the PL/SQL block accomplishes that. Now if another platform does appear, the change will be limited to changing the adapter in code: a one-liner, in all probability.
How about shipping the CSV file to the Oracle DB server, using SQL*Loader to load it into a staging table, and then running a stored procedure to transform and INSERT it into the final tables?
You could use:
insert into tablea (id,b,c)
( select tablea_seq.nextval,1,2 from dual union all
select tablea_seq.nextval,3,4 from dual union all
select tablea_seq.nextval,3,4 from dual union all
select tablea_seq.nextval,3,4 from dual union all
...
)
This works up to about 1024 rows, if I remember correctly.
You could also send it as a PL/SQL batch instruction:
BEGIN
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,1,2);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,3,4);
INSERT INTO tablea (id,b,c) VALUES (tablea_seq.nextval,5,6);
...
COMMIT;
END;
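The loader here is Ruby/ActiveRecord, but to illustrate the round-trip saving, a minimal sketch of sending such a block from Python with cx_Oracle (the connection string is a placeholder and the rows would come from the parsed CSV):
import cx_Oracle

rows = [(1, 2), (3, 4), (5, 6)]  # (b, c) pairs parsed from the CSV

# One anonymous PL/SQL block so all INSERTs travel in a single round trip;
# each row still gets its own sequence value.
stmts = "\n".join(
    "  INSERT INTO tablea (id, b, c) VALUES (tablea_seq.nextval, :b%d, :c%d);" % (i, i)
    for i in range(len(rows))
)
block = "BEGIN\n%s\n  COMMIT;\nEND;" % stmts

binds = {}
for i, (b, c) in enumerate(rows):
    binds["b%d" % i] = b
    binds["c%d" % i] = c

conn = cx_Oracle.connect("user/password@host/service")  # placeholder DSN
cur = conn.cursor()
cur.execute(block, binds)
conn.close()
cx_Oracle's executemany() with a bind array would achieve a similar effect with less string building, if keeping plain SQL text is not a requirement.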
I would be loading the raw CSV file to a dedicated table in the database using SQL*Loader without normalising it, then running code against the data in the table to normalise it out to the various required tables. This would minimise the round trips, and is a pretty conventional approach in the Oracle community.
SQL*Loader can be a little challenging in terms of initial learning curve, but you can soon post again if you get stuck.
http://download.oracle.com/docs/cd/B19306_01/server.102/b14215/part_ldr.htm#i436326
I have a quick suggestion. I'm from the MySQL world but was trained on Oracle; I think this will work.
In MySQL you can insert multiple records with a single insert statement. It looks like this:
INSERT INTO table_name (column_one, column_two, column_three, column_four)
VALUES
('a', 'one', 'alpha', 'uno'),       -- Row 1
('b', 'two', 'beta', 'dos'),        -- Row 2
('c', 'three', 'gamma', 'tres'),    -- etc.
....
('z', 'twenty-six', 'omega', 'veintiséis');
Now obviously you can only insert into one table at once, and you wouldn't want to do 2 million records, but you could easily do 10 or 20 or 100 at a time (if you are allowed packets that big). You may have to generate this by hand; I don't know if the frameworks you are using (or any, for that matter) support making this kind of code for you.
In the MySQL world this DRAMATICALLY speeds up inserts. I assume it does all the index updates and such at the same time, but it also prevents it from having to re-parse the SQL on each insert.
If you combine this with prepared statements (so the SQL is cached and it doesn't have to be parsed out each time) and transactions (to make sure things are always in a sane state when you have to do inserts across multiple tables), I think you will be doing pretty well.
NunoG is right that you can load the CSVs directly into Oracle. You may be best off reading in the input file, generating a normalized set of CSV files (one for each table), and then loading in each of those one at a time.
Instead of executing the SQL over the network you could write the Inserts to a text file, move it over the network and run it locally there.
SQL*Loader is an Oracle-supplied utility that allows you to load data from a flat file into one or more database tables. This will be 10-100 times faster than inserts using queries.
http://www.orafaq.com/wiki/SQL*Loader_FAQ
If SQL*Loader doesn't cut it, try a small preprocessor program that formats the file into a SQL*Loader-readable format.
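Given the German CSV details from the question (semicolon delimiters, decimal commas), such a preprocessor can be very small. A sketch in Python; the decimal columns and file names are placeholders, and thousands separators are assumed absent:
import csv

def preprocess(src_path, dst_path, decimal_columns):
    # Convert the semicolon-delimited file into a plain comma-delimited one
    # and turn decimal commas into dots, keyed off the header row just like
    # the existing loader.
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter=";")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in decimal_columns:
                if row.get(col):
                    row[col] = row[col].replace(",", ".")  # assumes no thousands separators
            writer.writerow(row)

preprocess("input.csv", "clean.csv", ["price", "amount"])  # placeholder file/column names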