Incremental integer ID in Impala - sql

I am using Impala for querying parquet-tables and cannot find a solution to increment an integer-column ranging from 1..n. The column is supposed to be used as ID-reference. Currently I am aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me since I have to pass the ID to another system which requests an ID in style of 1..n. I also already know that Impala has no auto-increment-implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id`-column.
| my_id | example_uuid | some_content |
|-------|--------------|--------------|
| 1 | 50d53ca4-b...| "a" |
| 2 | 6ba8dd54-1...| "b" |
| 3 | 515362df-f...| "c" |
| 4 | a52db5e9-e...| "d" |
|-------|--------------|--------------|
How can I achieve the desired result (integer-ID ranging from 1..n)?
Note: This question differs from this one which specifically handles Kudu-tables. However, answers should be applicable for this question as well.

Since other Q&A's like this one only came up with uuid()-alike answers, I put some thought in it and finally came up with this solution:
SELECT
row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id
, some_content
FROM some_table
row_number() generates a continuous integer-number over a provided partition. Unlike rank(), row_number() always provides an incremented number on its partition (even if duplicates occur)
PARTITION BY "dummy" partitions the entire table into one partition. This works since "dummy" is interpreted in the execution graph as temporary column yielding only the String-value "dummy". Thus, also something analog to "dummy" works.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just set your respective column), also use the "dummy"-workaround.
The command creates the desired incremental ID without any nested SQL-statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "d" |
|-------|--------------|

I used Markus's answer for a large partitioned table and found that I was getting duplicate ids. I think the ids were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by specifying an actual column to order by and no partition by:
SELECT
row_number() OVER (ORDER BY actual_column) as my_id
, some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.

Related

What is the maximum value for STRING ordering in SQL (SQLite)?

I have a SQLite database and I want to order my results by ascending order of a String column (name). I want the null-valued rows to be last in ascending order.
Moreover, I am doing some filtering on the same column (WHERE name>"previously obtained value"), which filters out the NULL-valued rows, which I do not want. Plus, the version of SQLite I'm using (I don't have control over this) does not support NULLS LAST. Therefore, to keep it simple I want to use IFNULL(name,"Something") in my ORDER BY and my comparison.
I want this "Something" to be as large as possible, so that my null-valued rows are always last. I have texts in Japanese and Korean, so I can't just use "ZZZ".
Therefore, I see two possible solutions. First, use the "maximum" character used by SQLite in the default ordering of strings, do you know what this value is or how to obtain it? Second, as the cells can contain any type in SQLite, is there a value of any other type that will always be considered larger than any string?
Example:
+----+-----------------+---------------+
| id | name | othercol |
+----+-----------------+---------------+
| 1 | English name | hello |
| 2 | NULL | hi |
| 3 | NULL | hi hello |
| 4 | 暴鬼 | hola |
| 5 | NULL | bonjour hello |
| 6 | 아바키 | hello bye |
+----+-----------------+---------------+
Current request:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (name,id)>("English name",1) ORDER BY (name,id)
Result (by ids): 6
Problems: NULL names are filtered out because of the comparison, and when I have no comparison they are shown first.
What I think would solve these problems:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,"Something"),id)>("English name",1) ORDER BY (IFNULL(name,"Something"),id)
But I need "Something" to be larger than any string I might encounter.
Expected result: 6, 3, 5
I think a simpler way is to use nulls last:
order by column nulls last
This works with both ascending and descending sorts. And it has the advantage that it can make use of an index on the column, which coalesce() would probably prevent.
Change your WHERE clause to:
WHERE SOMECOL > "previously obtained value" OR SOMECOL IS NULL
so the NULLs are not filtered out (since you want them).
You can sort the NULLs last, like this:
ORDER BY SOMECOL IS NULL, SOMECOL
The expresssion:
SOMECOL IS NULL
evaluates to 1 (True) or 0 (False), so the values that are not NULL will be sorted first.
Edit
If you want a string that is greater than any name in the table, then you can get it by:
select max(name) || ' ' from mytable
so in your code use:
ifnull(name, (select max(name) || ' ' from mytable))
Finally found a solution, for anyone looking for a character larger than any other (when I'm posting this, the unicode table might get expanded), here's your guy:
CAST(x'f48083bf' AS TEXT).
Example in my case:
SELECT * FROM mytable WHERE othercol LIKE "hello" AND (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)>("English name",1) ORDER BY (IFNULL(name,CAST(x'f48083bf' AS TEXT)),id)

SELECT MAX values for duplicate values in another column

I am having some trouble finding an answer for this one, so I apologize if it was somewhere else.
I have a table 'dbo.MileageImport' that has the following layout which I pulled to find duplicate entries:
|KEY | DATA |
---------------------
|V9864653 | 180288 |
|V9864653 | 22189 |
|V9864811 | 11464 |
|V9864811 | 12688 |
What I am having troubles with is when I run the following SQL in a DB2 environment:
SELECT KEY, MIN(DATA)
FROM dbo.MileageImport
GROUP BY KEY
HAVING (COUNT(KEY)>1);
It ends up pulling the following data:
|KEY | DATA |
---------------------
|V9864811 | 11464 |
|V9864653 | 180288 |
For some reason it's pulling the MIN value for V9864811, but not V9864653. If I inverse that and put MAX instead of MIN, it pulls the opposite values.
Is there something I am missing here so I can pull the MIN DATA value for only duplicate KEY records, or is there another way to do this? The report where this data comes from changes from month to month, so there could be different keys that end up being duplicated that I need to correct. Ultimately I am turning this into a DELETE statement to delete the lower of the two (or more) duplicated mileage entries.
Is your DATA column numerical? or a VARCHAR?
If you find its better to change it to a number if you can, maybe an integer if you aren't having any fractions and its just round numbers.
if not, then you could cast them to an integer value, but if there are lots of transactions or its a big table it will be slow and not ideal. Its bad practise to do that if you could just change the datatype!
SELECT KEY, MIN(CAST(DATA as Int))
FROM dbo.MileageImport
GROUP BY KEY
HAVING (COUNT(KEY)>1)

I want a hierachical order in my output but dont know how?

I want in the output a hierarchical order like so:
My Data :
Name | Cost | Level
----------------+--------+------
Car1 | 2000 | 1
Component1.1 | 3000 | 2
Component1.2 | 2300 | 3
Computer2 | 5000 | 1
Component2.1 | 2000 | 2
Component2.2 | Null | 3
Output: Show all those data, which has money in it and order it by the level, something like first 1, then 2, then,3 and after that start with 1 again.
Name | Level
------------------+------
Car1 | 1
Component1.1 | 2
Component1.2 | 3
Computer | 1
Component2.1 | 2
What ORDER BY does is:
Name | Level
----------------+------
Car1 | 1
Computer1 | 1
Component1.1 | 2
Component2.1 | 2
Component1.2 | 3
Component2.2 | 3
I tried the CONNECT BY PRIOR function and it didn't work well
SELECT Name, Level
FROM Product
CONNECT BY PRIOR Level;
In MySQL you would normaly use 'order by'. So if you want to order on table row "level" your synntax would be something like this:
SELECT * FROM items ORDER BY level ASC
You can make use of ASC (Ascending) or DESC (descending).
Hope this will help you.
You can use below one
SELECT Name, Level
FROM Auction
ORDER BY Name, Level
WHERE Money!='Null' ;
Doing order by Name will print the result in hierarchical order, but if it has a column called parent id, then it would have been easier to show.
i suggest this for you :
SELECT Name, Level FROM Product ORDER BY Name, Level WHERE Money!='Null' ASC;
i wish this help you brother
It is still not clear whether this is what you are really expecting. It seems to me from your data set, you want to numerically order the components based on some kind of a version number at the end of the component. If that's truly what you want then you may ignore the non-numeric characters in the name and order by pure numbers towards the end of string ( with the required where clause ).
ORDER BY REPLACE ( name, TRANSLATE(name,' .0123456789',' '),'');
If that's the case, then the adding level too to the ORDER BY shouldn't make any difference unless your numeric order of versions and level are in sync.
A problem may appear if you have components like component2_name1.2 etc which could break this logic, for which you may require REGEXP to identify the required pattern. But, it doesn't appear so from your data and I assumed that to be the case and you may want to clarify if that's not what you may always have in your dataset.
Here's a demo of the result obtained for your sample data.
Demo
This will work of the numeric character is always a valid decimal and has only one decimal point. If you have complex versioning system like say 1.1.8, 2.1.1 etc, it needs far sophisticated ordering on top of REPLACE ( name, TRANSLATE(name,' .0123456789',' '),'').
You will find such examples in posts such as this one Here
Note: I would request you to also please read the instructions here to know how to ask a good question. This would avoid all confusion to people who try to understand and answer your question.

Why postgres returns unordered data in select query, after updation of row?

I am bit confused over default ordering of the rows returned by postgres.
postgres=# select * from check_user;
id | name
----+------
1 | x
2 | y
3 | z
4 | a
5 | c1\
6 | c2
7 | c3
(7 rows)
postgres=# update check_user set name = 'c1' where name = 'c1\';
UPDATE 1
postgres=# select * from check_user;
id | name
----+------
1 | x
2 | y
3 | z
4 | a
6 | c2
7 | c3
5 | c1
(7 rows)
Before any updation, it was returning rows ordered by id, but after updation, the order has changed. So my question is that if order by is not specified, what default ordering does postgres uses ?
Thanks in advance.
Put very simply the "default order" is whatever it happens to read from the disk. Updating a row will not change the row in place... Usually it marks the old row as deleted and writes a new one.
When postgres reads rows from pages of memory, it will (probably) read them in the order they are stored on the page. It will read pages in whatever order it thinks is quickest (that may or may not be how they appear on disk). It can change based on whether or not it decides to use an index. So it can suddenly change without your app asking for anything different.
If you don't specify an order by it will not take any action to re-order them.
NEVER rely on the default order. It is undefined behaviour.
SQL tables represent unordered sets.
SQL results sets are unordered unless you explicitly include an order by.
Your select has no order by. Hence, the rows can come back in any order. Even running the same query twice can produce different orders.

PostgreSQL calculate the top places per group and other statistics

I have a table with the following structure
|user_id | place | type_of_place | money_earned| time |
|--------+-------+---------------+-------------+------|
| | | | | |
The table is very large, several millions of rows. The data is in a PostgreSQL 9.1 database.
I want to calculate, per user_id and type_of_place: the mean, the standard deviation, and the top 5 of places (ordered by counts), and the most used hour of time (mode).
The resulting data must be in this form:
| user_id | type_of_place | avg | stddev | top5_places | mode |
+---------+---------------+-----+--------+------------------+------+
| 1 | tp1 | 10 | 1 | {p1,p2,p3,p4,p5} | 8 |
| 2 | tp1 | 3 | 2 | {p3,p4} | 23 |
| 1 | tp3 | 1 | 1 | {p1} | 4 |
etc.
Is there a for of doing this with window functions efficiently?
What if I want to grouping by week? (i.e. another column that represents the number of week)
Thank you!
A standard GROUP BY query will get you most of the way:
SELECT
user_id,
type_of_place,
avg(money_earned) AS avg,
stddev(money_earned) AS stddev
FROM
earnings -- I'm not sure what your data table is called...
GROUP BY
user_id,
type_of_place
This leaves the top5_places and mode columns. These are both also aggregates, but not ones which are defined in the standard PostgreSQL installation. Luckily, you can add them.
Here's a page discussing how to define a mode aggregate function: http://wiki.postgresql.org/wiki/Aggregate_Mode
Once you have a mode aggregate function, assuming time is a timestamp of some kind, the expression you will add to the select list will be:
SELECT
...
mode(extract(hour FROM time)) AS mode -- Add this expression
FROM
...
Assuming order by money
For top5_places, there are several approaches, but the quickest is probably to use PostgreSQL's builtin array_agg function, and take the first 5 elements:
SELECT
...
(array_agg(place ORDER BY money_earned DESC))[1:5] AS top5_places -- Add this expression
FROM
...
One alternative is to define another aggregate called (for instance) top5, which performs the same function. This could be more efficient if there are many distinct places for each user/type of place combination, since it can stop accumulating after the first 5, whereas the above expression will generally build a complete array of all places, and then truncate to the first 5.
This assumes that a place has a unique earnings entry for each user/type combination. If a place can occur more than once, and you want to sort by sum(money_earned) for each place, then you need to use a subquery like in the examples below...
Order by counts
Ok, so the places should be ordered by how often they occur. Here's a quick way, which uses a couple of subqueries -- add this as an expression to the select-clause of the above query:
(SELECT
(array_agg(place ORDER BY cnt DESC))[1:5]
FROM
(SELECT place, count(*) FROM earnings AS t2
WHERE t2.user_id = earnings.user_id AND t2.type_of_place = earnings.type_of_place
GROUP BY place) AS s (place, cnt)
) AS top5_places
The inner subquery called s evaluates to a table of each place for that user/type combination, and the number of times it occurs (which I've called cnt). These are then fed to array_agg in descending order of that count.
I suspect there could be much neater (and probably more efficient) ways of writing it. If not, then I would recommend trying to move this complicated expression into a function or aggregate, if you can...
Histrogram of places in each hour
We'll use a similar expression, which will return the array of counts, ordered by hour:
(SELECT
array_agg(cnt ORDER BY hour DESC)
FROM
(SELECT extract(hour FROM time), count(*) FROM earnings AS t2
WHERE t2.user_id = earnings.user_id AND t2.type_of_place = earnings.type_of_place
GROUP BY 1) AS s (hour, cnt)
) AS hourly_histogram
(Add that to the select-clause of the original query.)