Postgres concurrent copy without ID value?

I am performing concurrent COPY commands but am not specifying a value for a serial ID field. As far as I know this is fine with just one COPY command, since Postgres will generate the IDs.
But would this cause conflicts with more than one COPY command running, given that (as far as I can tell) the sequence is never updated by a COPY command?

COPY advances the serial column's sequence automatically (via the column's nextval() default), so it works fine without id conflicts.
I tested concurrent COPY commands on PostgreSQL 9.2.4.
I created a table like this:
create table tbl_test (id serial primary key, name varchar(16), age integer);
I also made two CSV files, each containing 1,000,000 rows.
file1.csv
"1", 1
"2", 2
...
"1000000", 1000000
file2.csv
"n1", 1
"n2", 2
...
"n1000000", 1000000
When I copied from both files simultaneously, I got results like this:
...
1000245 | n453649 | 453649
1000246 | 546595 | 546595
1000247 | n453650 | 453650
1000248 | 546596 | 546596
1000249 | n453651 | 453651
1000250 | 546597 | 546597
...
All the data copied well.
postgres=# select count(*) from tbl_test;
count
---------
2000000
(1 row)

As long as the column has a sequence as its default (or is a SERIAL/BIGSERIAL column) and you are not referencing that column directly in the COPY command, you will never have conflicts on that id.
Sequence assignment is atomic even across concurrent transactions: a value handed out by nextval() is never handed out again, and it is not returned on rollback, which is also the source of another common question: "How do I get gapless sequences?"
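A quick way to see this behaviour, assuming the default sequence name tbl_test_id_seq that SERIAL creates for tbl_test.id:
BEGIN;
SELECT nextval('tbl_test_id_seq');  -- e.g. 2000001
ROLLBACK;
SELECT nextval('tbl_test_id_seq');  -- 2000002: the rolled-back value is never reused
Concurrent sessions can never draw the same value, but rolled-back or discarded values leave gaps.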

Related

Open sums in SQL / dynamic selection of tables

Much ink has been spilled on the topic of sum types in SQL. The standard solutions are called absorption, separation, and partition; see, e.g.: https://www.inf.unibz.it/~montali/teaching/1415/dpm/slides/4.relational-mapping.pdf .
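For concreteness, here is a minimal sketch of two of these encodings for a closed sum of two trigger types, under my reading of that terminology (all names hypothetical):
-- "Absorption": one wide table; unused variant columns stay NULL.
CREATE TABLE trigger_absorbed (
    id        INT PRIMARY KEY,
    kind      TEXT NOT NULL,   -- which variant this row is
    regex     TEXT,            -- NULL unless kind = 'regex'
    custom_js TEXT             -- NULL unless kind = 'js'
);
-- "Separation": a common table plus one table per variant referencing it.
CREATE TABLE trigger_base  (id INT PRIMARY KEY);
CREATE TABLE trigger_regex (id INT PRIMARY KEY REFERENCES trigger_base(id), regex TEXT);
CREATE TABLE trigger_js    (id INT PRIMARY KEY REFERENCES trigger_base(id), custom_js TEXT);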
I want to ask about how to encode open sums. Normal sums allow a field to be one of a fixed set of several different types; with open sums, this set is not fixed.
The basic setup in our program: There is a list of "triggers," where each trigger can be one of many different things. Plugins can be written defining new trigger types, although the set of trigger types can be assumed to be known at compile time.
We want a table of all triggers.
Our current best idea comes in two parts.
Part 1: Dynamically create a materialized view of the following form:
id | id_in_plugin_table | thing_in_main_program_it_refs | plugin_name
---------------------------------------------------------------------
1 | 27 | 8 | RegexTrigger
2 | 27 | 12 | RidiculouslyUnsafeCustomJSTrigger
This relation is automatically generated from the various plugin tables, each of which has its own id and a thing_in_main_program_it_refs column.
For illustration, here's what the referenced tables may look like.
RegexTrigger table:
id | thing_in_main_program_it_refs | regex
---------------------------------------------------------------------
27 | 8 | hel*o
RidiculouslyUnsafeCustomJSTrigger table:
id | thing_in_main_program_it_refs | custom_js
---------------------------------------------------------------------
27 | 12 | (x) => isPrime(x.length())
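For example, in Postgres the generated view could look roughly like this (table names hypothetical, derived from the trigger type names above):
CREATE MATERIALIZED VIEW all_triggers AS
SELECT row_number() OVER ()            AS id,
       t.id                            AS id_in_plugin_table,
       t.thing_in_main_program_it_refs,
       t.plugin_name
FROM (
    SELECT id, thing_in_main_program_it_refs, 'RegexTrigger' AS plugin_name
    FROM regex_trigger
    UNION ALL
    SELECT id, thing_in_main_program_it_refs, 'RidiculouslyUnsafeCustomJSTrigger'
    FROM ridiculously_unsafe_custom_js_trigger
) AS t;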
Part 2: Either use two round trips (look up the plugin table from the view, then query that table), or combine them into a single SQL program that uses EXEC.
I'm happy with part 1, but not with part 2. Neither option sounds efficient, and the latter uses EXEC.
So we're looking for either (a) a better way to dynamically select a table in a query, or (b) a different approach to open sums.

Incremental integer ID in Impala

I am using Impala to query Parquet tables and cannot find a way to generate an integer column ranging from 1..n. The column is supposed to be used as an ID reference. Currently I am aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me, since I have to pass the ID to another system that requires an ID in the style 1..n. I also already know that Impala has no auto-increment implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id`-column.
| my_id | example_uuid | some_content |
|-------|--------------|--------------|
| 1 | 50d53ca4-b...| "a" |
| 2 | 6ba8dd54-1...| "b" |
| 3 | 515362df-f...| "c" |
| 4 | a52db5e9-e...| "d" |
|-------|--------------|--------------|
How can I achieve the desired result (integer-ID ranging from 1..n)?
Note: This question differs from this one, which specifically handles Kudu tables. However, answers should be applicable to this question as well.
Since other Q&As like this one only came up with uuid()-like answers, I put some thought into it and finally came up with this solution:
SELECT
    row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") AS my_id,
    some_content
FROM some_table
row_number() generates a continuous integer number over a given partition. Unlike rank(), row_number() always provides an incremented number within its partition, even if duplicates occur.
PARTITION BY "dummy" puts the entire table into a single partition. This works because "dummy" is interpreted in the execution graph as a temporary column yielding only the string value "dummy"; anything analogous to "dummy" works as well.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example, the same "dummy" workaround is used (otherwise, order by your respective column).
The command creates the desired incremental ID without any nested SQL statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "d" |
|-------|--------------|
I used Markus's answer on a large partitioned table and found that I was getting duplicate ids. I think the ids were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by ordering by an actual column and dropping the PARTITION BY:
SELECT
    row_number() OVER (ORDER BY actual_column) AS my_id,
    some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.

SELECT MAX values for duplicate values in another column

I am having some trouble finding an answer for this one, so I apologize if it has been covered elsewhere.
I have a table dbo.MileageImport with the following layout, which I pulled to find duplicate entries:
|KEY | DATA |
---------------------
|V9864653 | 180288 |
|V9864653 | 22189 |
|V9864811 | 11464 |
|V9864811 | 12688 |
What I am having trouble with is when I run the following SQL in a DB2 environment:
SELECT KEY, MIN(DATA)
FROM dbo.MileageImport
GROUP BY KEY
HAVING (COUNT(KEY)>1);
It ends up pulling the following data:
|KEY | DATA |
---------------------
|V9864811 | 11464 |
|V9864653 | 180288 |
For some reason it's pulling the MIN value for V9864811, but not for V9864653. If I invert that and put MAX instead of MIN, it pulls the opposite values.
Is there something I am missing here so I can pull the MIN DATA value for only duplicate KEY records, or is there another way to do this? The report where this data comes from changes from month to month, so there could be different keys that end up being duplicated that I need to correct. Ultimately I am turning this into a DELETE statement to delete the lower of the two (or more) duplicated mileage entries.
Is your DATA column numeric, or a VARCHAR?
If it is a VARCHAR, the values are compared as strings, character by character, so '180288' sorts before '22189' because '1' < '2'. That is exactly why MIN picked 180288 for V9864653. It's better to change the column to a number if you can, perhaps an integer if the values have no fractions and are just round numbers.
If not, you can cast the values to an integer in the query, but if there are lots of transactions or it's a big table, that will be slow and not ideal. It's bad practice to keep casting when you could just change the datatype!
SELECT KEY, MIN(CAST(DATA as Int))
FROM dbo.MileageImport
GROUP BY KEY
HAVING (COUNT(KEY)>1)
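Since the goal is ultimately a DELETE, here is a sketch of that step, assuming every DATA value casts cleanly to an integer; it keeps the highest mileage per KEY and removes the lower duplicates (rows tied for the maximum would all survive):
DELETE FROM dbo.MileageImport AS m
WHERE CAST(m.DATA AS INT) < (SELECT MAX(CAST(DATA AS INT))
                             FROM dbo.MileageImport
                             WHERE KEY = m.KEY);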

What is the shortest notation for updating a column in an internal table?

I have an ABAP internal table. Structured, with several columns (e.g. 25). Names and types are irrelevant. The table can get pretty large (e.g. 5,000 records).
| A | B | ... |
| --- | --- | --- |
| 7 | X | ... |
| 2 | CCC | ... |
| 42 | DD | ... |
Now I'd like to set one of the columns (e.g. B) to a specific constant value (e.g. 'Z').
What is the shortest, fastest, and most memory-efficient way to do this?
My best guess is a LOOP REFERENCE INTO. This is pretty efficient as it changes the table in-place, without wasting new memory. But it takes up three statements, which makes me wonder whether it's possible to get shorter:
LOOP AT lt_table REFERENCE INTO DATA(ls_row).
  ls_row->b = 'Z'.
ENDLOOP.
Then there is the VALUE operator which reduces this to one statement but is not very efficient because it creates new memory areas. It also gets longish for a large number of columns, because they have to be listed one by one:
lt_table = VALUE #( FOR ls_row IN lt_table ( a = ls_row-a
                                             b = 'Z' ) ).
Are there better ways?
The following code sets PRICE = 0 on all lines at once. Theoretically, it should be the fastest way to update all lines of one column, because it's a single statement. Note that it's impossible to omit the WHERE, so I use a simple trick to match all lines.
DATA flights TYPE TABLE OF sflight.
DATA flight  TYPE sflight.
SELECT * FROM sflight INTO TABLE flights.
flight-price = 0.
" WHERE is mandatory; "price <> flight-price" matches every line whose
" price differs from the new value, i.e. every line that needs updating.
MODIFY flights FROM flight TRANSPORTING price WHERE price <> flight-price.
Reference: MODIFY itab - itab_lines
If you have a work area declared...
workarea-field = 'Z'.
MODIFY lt_table FROM workarea TRANSPORTING field WHERE field <> workarea-field. " any WHERE condition will do
Edit: after checking that syntax on my current system, I could prove (to myself) that the WHERE clause must be present or a dump is raised.
Thanks Sandra Rossi.

Hive line break issue

I have a Hive table over an Accumulo table (because we need cell-level security):
CREATE TABLE testtable(rowid string, value string)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES('accumulo.columns.mapping' = ':rowid,c:value')
TBLPROPERTIES ('accumulo.table.name' = 'testtable');
If a value contains "\n", it conflicts with the default Hive line-break character, which is also "\n".
For example:
Accumulo insert: insert 1 c value line\x0Abreak
Hive select: select rowid, value, row_number() over (order by null) as rank from testtable;
You get back two rows instead of one:
| rowid   | value  | rank  |
+---------+--------+-------+
| 2       | line   | NULL  |
| break   | 1      | NULL  |
Does anyone have an idea how I can avoid this? Thank you for all the help!
That seems very unexpected (speaking as the author of the AccumuloStorageHandler), but maybe I just don't know something about what Hive is trying to do.
I'd file a JIRA issue for Hive over at https://issues.apache.org/jira/secure/CreateIssue!default.jspa. Feel free to mention me, and I can try to help write a test and get to the bottom of it.