I use a SELECT statement to fill an internal table with a large number of records. I am new to ABAP and Open SQL. I know how cursors work and why I need them in this case, but I can't seem to find any good examples that show a correct implementation. This is the code I am working with:
TYPES: BEGIN OF lty_it_ids,
         iteration_id TYPE dat_itr_id,
       END OF lty_it_ids.

DATA: lt_it_ids            TYPE STANDARD TABLE OF lty_it_ids,
      lt_records_to_delete TYPE STANDARD TABLE OF tab_01p.

SELECT 01r~iteration_id
  INTO TABLE lt_it_ids
  FROM tab_01r AS 01r INNER JOIN tab_01a AS 01a
    ON 01r~iteration_id = 01a~iteration_id
  WHERE 01a~collection_id = i_collection_id.

IF lt_it_ids IS NOT INITIAL.
  SELECT * FROM tab_01p INTO CORRESPONDING FIELDS OF TABLE lt_records_to_delete
    FOR ALL ENTRIES IN lt_it_ids
    WHERE iteration_id = lt_it_ids-iteration_id AND collection_id = i_collection_id.

  IF lt_records_to_delete IS NOT INITIAL.
    DELETE tab_01p FROM TABLE lt_records_to_delete.
  ENDIF.
ENDIF.
In the first SELECT statement I fill a small internal table with values that correspond to the index of a larger table. With these indexes I can search through the larger table faster to find all the entries I want to DELETE. It is the second SELECT statement that fills a large (a few million rows) internal table. I want to delete all the records in this internal table (lt_records_to_delete) from the database table.
In what way can I introduce a cursor to this code so it selects and deletes records in smaller batches?
There is a good example in the documentation. I am not entirely sure why you need to read the entries before deleting them, but there might be a good reason that you neglected to mention (for example logging the values). For the process you are implementing, be aware of the following warning in the documentation:
"If write accesses are made on a database table for which a database cursor is open, the results set is database-specific and undefined. Avoid this kind of parallel access if possible."
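For reference, the pattern from the documentation, adapted to your tables, looks roughly like the sketch below. The subquery stands in for the FOR ALL ENTRIES step (assuming, as your inner join implies, that the iteration IDs you need are exactly those in tab_01a for the given collection), and the package size is only an example. Also note that the DELETE inside the loop is precisely the kind of parallel write access the warning refers to; if that causes trouble, collect only the keys inside the loop and run the DELETE after CLOSE CURSOR.

DATA: lv_cursor TYPE cursor,
      lt_batch  TYPE STANDARD TABLE OF tab_01p.

" Open a cursor over the rows that are to be deleted.
OPEN CURSOR WITH HOLD lv_cursor FOR
  SELECT * FROM tab_01p
    WHERE collection_id = i_collection_id
      AND iteration_id IN ( SELECT iteration_id FROM tab_01a
                              WHERE collection_id = i_collection_id ).

DO.
  " Fetch the next package of rows; 10000 is only an example batch size.
  FETCH NEXT CURSOR lv_cursor INTO TABLE lt_batch PACKAGE SIZE 10000.
  IF sy-subrc <> 0.
    EXIT.
  ENDIF.

  DELETE tab_01p FROM TABLE lt_batch.
  " WITH HOLD keeps the cursor open across the database commit.
  COMMIT WORK.
ENDDO.

CLOSE CURSOR lv_cursor.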
Does it take more time to create a table using a CREATE TABLE ... AS SELECT statement than to just run the SELECT statement? Is the time difference significant, or can it be neglected?
For example between
create table a
as select * from b
where c = d;
and
select * from b
where c = d;
Which one should run faster, and can the time difference be neglected?
Creating the table will take more time. There is more overhead. If you look at the metadata for your database, you will find lots of tables or views that contain information about tables. In particular, the table names and column names need to be stored.
That said, the data processing effort is pretty similar. However, there is extra overhead in writing the result set to permanent storage rather than just into the structures needed to return a result set. In fact, a plain SELECT's result set may never need to be stored "on disk" (i.e. permanently), whereas with a table creation that is required.
Depending on the database, the two queries might also be optimized differently. The SELECT query might be optimized to return the first row as fast as possible, while the CREATE query might be optimized to return all rows as fast as possible. Also, the SELECT query might just look faster if your database and interface start returning rows as soon as they appear.
I should point out that under most circumstances, the overhead might not really be noticeable. But, you can get other errors with the create table statement that you would not get with just the select. For instance, the table might already exist. Or duplicate column names might pose a problem (although some databases don't allow duplicate column names in result sets either).
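For example, a guarded version of the CREATE TABLE form from the question might look like the sketch below (DROP TABLE IF EXISTS is supported by many, but not all, DBMSs):

-- avoid the "table already exists" error before recreating the table
DROP TABLE IF EXISTS a;

CREATE TABLE a AS
SELECT *
FROM b
WHERE c = d;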
I'm trying to copy data from one table to another. The schemas are identical except that the source table has fields as nullable when they were meant to be required. BigQuery is complaining that the fields are null. I'm 99% certain the issue is that in many entries the repeated fields are absent, which causes no issues when inserting into the table using our normal process.
The table I'm copying from used to have the exact same schema, but accidentally lost the required fields when recreating the table with a different partitioning scheme.
From what I can tell, there is no way to change the fields from nullable to required in an existing table. It looks to me like you must create a new table then use a select query to copy data.
I tried enabling "Allow large results" and unchecking "flatten results", but I'm seeing the same issue. The write preference is "append to table".
(Note: see edit below as I am incorrect here - it is a data issue)
I tried building a query to better confirm my theory (that the records are absent, not that they exist with null values), but I'm struggling to build it. I can definitely see in the preview that having some of the repeated fields be null is a real use case, so I would presume that translates to the nested required fields also being null. We have a backup of the table from before it was converted to the new partitioning, and it has the same REQUIRED schema as the table I'm trying to copy into. A simple SELECT COUNT(*) WHERE this.nested.required.field IS NULL in legacy SQL on the backup indicates that quite a few records fit this criterion.
SQL used to select for insert:
select * from my_table
Edit:
The query that made the partition change on the table was also setting certain fields to a null value. It appears that somehow the SELECT query created objects with all fields null rather than simply a null object. I used a conditional to set a nested object to either null or its existing value. Still investigating, but at this point I think what I'm attempting to do is normally supported, based on playing with some toy tables/queries.
When trying to copy from one table to another, and using SELECT AS STRUCT, run a null check like this:
IF(foo.bar is null, null, (SELECT AS STRUCT foo.bar.* REPLACE(...)))
This prevents null nested structures from turning into structures full of null values.
To repair it via a SELECT statement, use a conditional check against a value that is required, like this:
IF (bar.req is null, null, bar)
Of course, a real query is more complicated than that. The good news is that the repair query should look similar to the original query that messed up the format.
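To make that concrete, a repair query in standard SQL might look like the following sketch. The table and field names are placeholders, and it assumes nested is a non-repeated STRUCT column containing a REQUIRED field called required_field:

SELECT
  t.* REPLACE (
    -- keep a truly absent struct as NULL instead of a struct full of NULLs
    IF(t.nested.required_field IS NULL,
       NULL,
       (SELECT AS STRUCT t.nested.*)) AS nested
  )
FROM `project.dataset.my_table` AS t;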
I am currently writing an application that needs to be able to select a subset of IDs from millions of users...
I am currently writing software to select a group of 100.000 IDs from a table that contains the whole Brazilian population, 200.000.000 (200M) rows. I need to be able to do this in a reasonable amount of time... ID on table = ID on XML.
I am thinking of parsing the XML file and starting a thread that performs a SELECT statement on the database. I would need a connection for each thread, but this still seems like a brute-force approach; perhaps there is a more elegant way?
1) What is the best database to do this?
2) What is a reasonable limit on the number of DB connections?
Making 100.000 queries would take a long time, and splitting up the work on separate threads won't help you much as you are reading from the same table.
Don't fetch a single record at a time; rather, divide the 100.000 items up into reasonably small batches, for example 1000 items each, which you can send to the database. Create a temporary table in the database with those ID values, and make a join against the database table to get those records.
Using MS SQL Server, for example, you can send a batch of items as XML to a stored procedure, which can create the temporary table from it and query the database table.
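A rough SQL Server sketch of that idea (the procedure, table and XML element names are assumptions, not your actual schema):

CREATE PROCEDURE dbo.GetPeopleByIds (@ids XML)
AS
BEGIN
    -- stage the batch of IDs coming in as XML
    CREATE TABLE #ids (id BIGINT PRIMARY KEY);

    INSERT INTO #ids (id)
    SELECT n.value('.', 'BIGINT')
    FROM @ids.nodes('/ids/id') AS t(n);   -- assumes <ids><id>123</id>...</ids>

    -- one set-based join instead of 100.000 single-row queries
    SELECT p.*
    FROM dbo.People AS p
    JOIN #ids AS i ON i.id = p.id;
END;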
Any modern DBMS that can handle an existing 200M row table should have no problem comparing against a 100K row table (assuming your hardware is up to scratch).
Ideal solution: import your XML (at least the IDs) into a new table, ensure the columns you're comparing are indexed correctly, and then query.
What language? If you're using .NET, you could load your XML and SQL as data sources, and then I believe there are some enumerable functions that could be used to compare the data.
Do this:
Parse the XML and store the extracted IDs into a temporary table [1].
From the main table, select only the rows whose ID is also present in the temporary table:
SELECT * FROM MAIN_TABLE WHERE ID IN (SELECT ID FROM TEMPORARY_TABLE)
A decent DBMS will typically do the job quicker than you can, even if you employ batching/chunking and parallelization on your end.
[1] Temporary tables are typically created using CREATE [GLOBAL|LOCAL] TEMPORARY TABLE ... syntax, and you'll probably want it private to the session (check your DBMS's interpretation of GLOBAL vs. LOCAL for this). If your DBMS of choice doesn't support temporary tables, you can use "normal" tables instead, but be careful not to let concurrent sessions mess with that table while you are still using it.
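As a minimal illustration of the footnote (the exact syntax and the GLOBAL/LOCAL semantics vary by DBMS):

CREATE LOCAL TEMPORARY TABLE TEMPORARY_TABLE (ID BIGINT PRIMARY KEY);
-- bulk-insert the IDs parsed from the XML here, then run the SELECT above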
I'm converting data from one schema to another. Each table in the source schema has a 'status' column (default NULL). When a record has been converted, I update the status column to 1. Afterwards, I can report on the # of records that are (not) converted.
While the conversion routines are still under development, I'd like to be able to quickly reset all values for status to NULL again.
An UPDATE statement on the tables is too slow (there are too many records). Does anyone know a fast alternative way to accomplish this?
The fastest way to reset a column would be to SET UNUSED the column, then add a column with the same name and datatype.
This will be the fastest way, since neither operation touches the actual table data (it is only a dictionary update).
As in Nivas' answer, the actual ordering of the columns will be changed (the reset column becomes the last column). If your code relies on the ordering of the columns (it should not!), you can create a view that exposes the columns in the right order (rename the table, create a view with the same name as the old table, revoke the grants from the base table, add grants to the view).
The SET UNUSED method will not reclaim the space used by the column (whereas dropping the column will free space in each block).
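In Oracle, that comes down to two statements along these lines (the table name and datatype are assumptions):

ALTER TABLE my_table SET UNUSED (status);
ALTER TABLE my_table ADD (status NUMBER(1) DEFAULT NULL);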
If the column is nullable (since default is NULL, I think this is the case), drop and add the column again?
While the conversion routines are still under development, I'd like to be able to quickly reset all values for status to NULL again.
If you are in development why do you need 70 million records? Why not develop against a subset of the data?
Have you tried using flashback table?
For example:
select current_scn from v$database;
-- 5607722
-- do a bunch of work
flashback table TABLE_NAME to scn 5607722;
What this does is ensure that the table you are working on is IDENTICAL each time you run your tests. Of course, you need to ensure you have sufficient UNDO to hold your changes.
Hm, maybe add an index to the status column.
Or alternately, add a new table with only the primary key in it. Then insert into that table when the record is converted, and TRUNCATE that table to reset...
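Something along these lines, with assumed names:

CREATE TABLE converted_ids (id NUMBER PRIMARY KEY);

-- when a record has been converted:
INSERT INTO converted_ids (id) VALUES (:converted_id);

-- to reset everything between test runs:
TRUNCATE TABLE converted_ids;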
I like some of the other answers, but I just read in a tuning book that for several reasons it's often quicker to recreate the table than to do massive updates on the table. In this case, it seems ideal, since you would be writing the CREATE TABLE X AS SELECT with hopefully very few columns.
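A sketch of that approach, with placeholder column names (remember that indexes, constraints, grants and triggers have to be recreated on the new table):

CREATE TABLE my_table_new AS
SELECT col1, col2, CAST(NULL AS NUMBER(1)) AS status
FROM my_table;

DROP TABLE my_table;
ALTER TABLE my_table_new RENAME TO my_table;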
My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicate data can exist from one upload to another. At the moment I'm using hideously inefficient code to check for the existence of duplicates by loading all rows within the date range from the DB into memory, and iterating over them and comparing each with the uploaded file data.
Needless to say, this can become very slow as the data set size increases.
So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.
SELECT count(*) FROM transactions WHERE desc = ? AND dated_on = ? AND amount = ?
This works fine, but my real-world case is a little bit more complicated. The description of a transaction in the input data can sometimes contain erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can often be represented as "BANK.12323.DESCRIPTION") so our existing (in memory) matching logic performs a little cleaning on this description before we do a comparison.
Whilst this works in memory, my question is: can this cleaning be done in a SQL statement, so I can move this matching logic to the database? Something like:
SELECT count(*) FROM transactions WHERE CLEAN_ME(desc) = ? AND dated_on = ? AND amount = ?
Where CLEAN_ME is a function that strips the erroneous data from the field.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
Thanks a lot
can this cleaning be done in a SQL statement
Yes, you can write a stored function to do it in the database layer:
mysql> CREATE FUNCTION clean_me (s VARCHAR(255))
-> RETURNS VARCHAR(255) DETERMINISTIC
-> RETURN REPLACE(s, '.', ' ');
mysql> SELECT clean_me('BANK.12323.DESCRIPTION');
BANK 12323 DESCRIPTION
This will perform very poorly across a large table though.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
No, as far as databases are concerned the cleanest way is always the cleverest way (as long as performance isn't awful).
Do that, and add indexes to the columns you're doing bulk compares on, to improve performance. If it's actually intrinsic to the type of data that desc/dated_on/amount are always unique, then express that in the schema by making it a UNIQUE index constraint.
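For example (note that DESC is a reserved word in MySQL, so a column literally named desc has to be quoted with backticks, and a long TEXT description would need an index prefix length):

ALTER TABLE transactions
  ADD UNIQUE INDEX uniq_transaction (`desc`, dated_on, amount);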
The easiest way to do that is to add a unique index on the appropriate columns and to use ON DUPLICATE KEY UPDATE. I would further recommend transforming the file into a CSV and loading it into a temporary table to get the most out of MySQL's built-in functions, which are surely faster than anything you could write yourself - if you consider that you would otherwise have to pull the data into your own application, while MySQL does everything in place.
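A sketch of that workflow, assuming a unique index already exists on the compared columns; the staging column types, file path, CSV layout and cleaning rule are placeholders:

CREATE TEMPORARY TABLE transactions_staging (
  `desc`   VARCHAR(255),
  dated_on DATE,
  amount   DECIMAL(10,2)
);

LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE transactions_staging
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
IGNORE 1 LINES
(`desc`, dated_on, amount);

INSERT INTO transactions (`desc`, dated_on, amount)
SELECT REPLACE(`desc`, '.', ' '), dated_on, amount
FROM transactions_staging
ON DUPLICATE KEY UPDATE amount = VALUES(amount);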
The cleanest way is indeed to make sure only correct data is in the database.
In this example the "BANK.12323.DESCRIPTION" would be returned by:
SELECT count(*) FROM transactions
WHERE desc LIKE 'BANK%12323%DESCRIPTION' AND dated_on = ? AND amount = ?
But this might impose performance issues when you have a lot of data in the table.
Another way that you could do it is as follows:
Clean the description before inserting.
Create a primary key for the table that is a combination of the columns that uniquely identify the entry. It sounds like that might be the cleaned description, date and amount.
Use either the 'replace' or the 'on duplicate key' syntax, whichever is more appropriate. 'replace' actually replaces the existing row in the db with the updated one when a conflict with an existing unique key occurs, e.g.:
REPLACE INTO transactions (desc, dated_on, amount) values (?,?,?)
'on duplicate key' allows you to specify which columns to update on a duplicate key error:
INSERT INTO transactions (desc, dated_on, amount) values (?,?,?)
ON DUPLICATE KEY UPDATE amount = amount
By using the multi-column primary key, you will gain a lot of performance since primary key lookups are usually quite fast.
If you prefer to keep your existing primary key, you could also create a unique index on those three columns.
Whichever way you choose, I would recommend cleaning the description before going into the db, even if you also store the original description and just use the cleaned one for indexing.
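If you go that route, the schema change could be as simple as the sketch below (the column name, length and cleaning rule are assumptions, and the final ALTER will fail if historic rows already collide once cleaned):

ALTER TABLE transactions ADD COLUMN desc_clean VARCHAR(255);

UPDATE transactions SET desc_clean = REPLACE(`desc`, '.', ' ');

ALTER TABLE transactions
  ADD UNIQUE KEY uniq_transaction_clean (desc_clean, dated_on, amount);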