Sorting concatenated strings after grouping in Netezza - sql

I'm using the code on this page to create concatenated list of strings on a group by aggregation basis.
https://dwgeek.com/netezza-group_concat-alternative-working-example.html/
I'm trying to get the concatenated string in sorted order, so that, for example, for DB1 I'd get data1,data2,data5,data9
I tied modifying the original code to selecting from a pre-sorted table but it doesn't seem to make any difference.
select Col1
, count(*) as NUM_OF_ROWS
, trim(trailing ',' from SETNZ..replace(SETNZ..replace (SETNZ..XMLserialize(SETNZ..XMLagg(SETNZ..XMLElement('X',col2))), '<X>','' ),'</X>' ,',' )) AS NZ_CONCAT_STRING
from
(select * from tbl_concat_demo order by 1,2) AS A
group by Col1
order by 1;
Is there a way to sort the strings before they get aggregated?
BTW - I'm aware there is a GROUP_CONCAT UDF function for Netezza, but I won't have access to it.

This is notoriously difficult to accomplish in sql, since sorting is usually done while returning the data, and you want to do it in the ‘input’ set.
Try this:
1)
Create temp table X as select * from tbl_concat_demo Order by col2
Partition by (col1)
In your original code above: select from X instead of tbl_concat_demo
Let me know if it works ?

Related

find duplicate records in the table with all columns are the same

Say If I have a table with hundreds of columns. The task is that I want to find out duplicate records with all the columns are the same, basically find out identical records.
I tried group by as the following
select *
from some_table
group by *
having count(*) > 1
but it seems like group by * is not allowed in sql. Anyone has some idea as to what kind of command I could run to find out identical records? Thanks in advance.
Just put comma separated list of columns instead of * in both places - select and group by. Buy not count - the count(*) should remain as is.
I verified it on SQL Server, but I am pretty sure it is ANSI SQL and should work on most (any?) ANSI SQL compatible RDBMS.
Postgresql solution, I think.
SELECT all rows, and use EXCEPT ALL to remove one of each (the SELECT DISTINCT). Now we will have the duplicates only.
select * from table
except all
select distinct * from table
You have to list out all the columns:
select col1, col2, col3, . . .
from t
group by col1, col2, col3, . . .
having count(*) > 1;
MSSQL 2016+
Add a new column in the table to hash all the columns, MSSQL HashBytes
notes to consider:
you need to convert all the columns to Varchar or Varbinary.
is you comparison case sensitive, if yes use upper() or lower()
Null values, use column sperator.
the hashing algorithm Performance on the server.
for me usualy go for something like
select col1 , col2, col3 , col4
,HASHBYTES ( 'MD5',
concat(
Convert (varbinary ,col1),'|'
,Convert (varbinary ,col2),'|'
,Convert (varbinary ,col3),'|'
,Convert (varbinary ,col4),'|'
)
) as Row_Hash
from table1
the the row_hash can be use as a singl column in the table/CTE to present the content of all the other columns
you can count by it and Order by it to find the duplicates

Ordering distinct column values by (first value of) other column in aggregate function

I'm trying to order the output order of some distinct aggregated text based on the value of another column with something like:
string_agg(DISTINCT sometext, ' ' ORDER BY numval)
However, that results in the error:
ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list
I do understand why this is, since the ordering would be "ill-defined" if the numval of two repeated values differs, with that of another lying in-between.
Ideally, I would like to order them by first appearance / lowest order-by value, but the ill-defined cases are actually rare enough in my data (it's mostly sequentially repeated values that I want to get rid of with the DISTINCT) that I ultimately don't particularly care about their ordering and would be happy with something like MySQL's GROUP_CONCAT(DISTINCT sometext ORDER BY numval SEPARATOR ' ') that simply works despite its sloppiness.
I expect some Postgres contortionism will be necessary, but I don't really know what the most efficient/concise way of going about this would be.
Building on DISTINCT ON
SELECT string_agg(sometext, ' ' ORDER BY numval) AS no_dupe
FROM (
SELECT DISTINCT ON (1,2) <whatever>, sometext, numval
FROM tbl
ORDER BY 1,2,3
) sub;
This is the simpler equivalent of #Gordon's query.
From your description alone I would have suggested #Clodoaldo's simpler variant.
uniq() for integer
For integer values instead of text, the additional module intarray has just the thing for you:
uniq(int[]) int[] remove adjacent duplicates
Install it once per database with:
CREATE EXTENSION intarray;
Then the query is simply:
SELECT uniq(array_agg(some_int ORDER BY <whatever>, numval)) AS no_dupe
FROM tbl;
Result is an array, wrap it in array_to_string() if you need a string.
Related:
How to create an index for elements of an array in PostgreSQL?
Compare arrays for equality, ignoring order of elements
In fact, it wouldn't be hard to create a custom aggregate function to do the same with text ...
Custom aggregate function for any data type
Function that only adds next element to array if it is different from the previous. (NULL values are removed!):
CREATE OR REPLACE FUNCTION f_array_append_uniq (anyarray, anyelement)
RETURNS anyarray
LANGUAGE sql STRICT IMMUTABLE AS
'SELECT CASE WHEN $1[array_upper($1, 1)] = $2 THEN $1 ELSE $1 || $2 END';
Using polymorphic types to make it work for any scalar data-type.
Custom aggregate function:
CREATE AGGREGATE array_agg_uniq(anyelement) (
SFUNC = f_array_append_uniq
, STYPE = anyarray
, INITCOND = '{}'
);
Call:
SELECT array_to_string(
array_agg_uniq(sometext ORDER BY <whatever>, numval)
, ' ') AS no_dupe
FROM tbl;
Note that the aggregate is PARALLEL UNSAFE (default) by nature, even though the transition function could be marked PARALLEL SAFE.
Related answer:
Custom PostgreSQL aggregate for circular average
Eliminate the need to do a distinct by pre aggregating
select string_agg(sometext, ' ' order by numval)
from (
select sometext, min(numval) as numval
from t
group by sometext
) s
#Gordon's answer brought a good point. That is if there are other needed columns. In this case a distinct on is recommended
select x, string_agg(sometext, ' ' order by numval)
from (
select distinct on (sometext) *
from t
order by sometext, numval
) s
group by x
What I've ended up doing is to avoid using DISTINCT altogether and instead opted to use regular expression substitution to remove sequentially repeated entries (which was my main goal) as follows:
regexp_replace(string_agg(sometext, ' ' ORDER BY numval),
'(\y\w+\y)(?:\s+\1)+', '\1', 'g')
This doesn't remove repeats if the external ordering leads to another entry coming in between them, but this works for me, probably even better. It may be a bit slower than other options, but I find it speedy enough for my purposes.
If this is part of a larger expression, it might be inconvenient to do a select distinct in a subquery. In this case, you can take advantage of the fact that string_agg() ignores NULL input values and do something like:
select string_agg( (case when seqnum = 1 then sometext end) order by numval)
from (select sometext, row_number() over (partition by <whatever>, sometext order by numval) as seqnum
from t
) t
group by <whatever>
The subquery adds a column but does not require aggregating the data.

MS Access SQL: select rows with the same order as IN clause

I know that this question has been asked several times and I've read all the answer but none of them seem to completely solve my problem.
I'm switching from a mySQL database to a MS Access database. In both of the case I use a php script to connect to the database and perform SQL queries.
I need to find a suitable replacement for a query I used to perform on mySQL.
I want to:
perform a first query and order records alphabetically based on one of the columns
construct a list of IDs which reflects the previous alphabetical order
perform a second query with the IN clause applied with the IDs' list and ordered by this list.
In mySQL I used to perform the last query this way:
SELECT name FROM users WHERE id IN ($name_ids) ORDER BY FIND_IN_SET(id,'$name_ids')
Since FIND_IN_SET is available only in mySQL and CHARINDEX and PATINDEX are not available from my php script, how can I achieve this?
I know that I could write something like:
SELECT name
FROM users
WHERE id IN ($name_ids)
ORDER BY CASE id
WHEN ... THEN 1
WHEN ... THEN 2
WHEN ... THEN 3
WHEN ... THEN 4
END
but you have to consider that:
IDs' list has variable length and elements because it depends on the first query
that list can easily contains thousands of elements
Have you got any hint on this?
Is there a way to programmatically construct the ORDER BY CASE ... WHEN ... statement?
Is there a better approach since my list of IDs can be big?
UPDATE: I perform two separated query because I need to access two different tables.
The databse it's not very simple so I try to make an example:
Suppose I have a table which contains a list of users and a table which contains all the books that every user have in their bookshelf.
Since the dabase was designed in mySQL, for every book record I store the user_id in the books table in order to have a relationship between the user and the book.
Suppose now that I want to obtain a list of all the user that have books with a title starting with letter 'a' and I want to order the user based on the alphabetical oder of the books.
This is what I do:
perform a first query to find all the books which start with letter 'a' and sort the alphabetically
create a list of user_id which should reflect the alphabetical order of the book
perform a query in the users table to find out the users names and sort them with the user_id list to have the required sorting by book
Hope this clarify what I need.
If I understand correctly, you're trying to get a set of information in the same order that you specify the ID values. There is a hack that can convert a list into a table using XML and CROSS APPLY. This can be combined with the ROW_NUMBER function to generate your sort order. See the code below:
CREATE FUNCTION [dbo].[GetNvarcharsFromXmlArray]
(
#Strings xml = N'<ArrayOfStrings/>'
)
RETURNS TABLE
AS
RETURN
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS RowNumber, Strings.String.value('.', 'nvarchar(MAX)') AS String
FROM #Strings.nodes('/ArrayOfStrings/string/text()') AS Strings(String)
)
Which functions with the following structure:
<ArrayOfStrings>
<string>myvalue1</string>
<string>myvalue2</string>
</ArrayOfStrings>
It's also the same format .NET xml serializes string arrays.
If you want to pass a comma separated list, you can simply use:
CREATE FUNCTION [dbo].[GetNvarcharsCSV]
(
#CommaSeparatedStrings nvarchar(MAX) = N''
)
RETURNS TABLE
AS
RETURN
(
DECLARE #Strings xml
SET #Strings = CONVERT(xml, N'<ArrayOfStrings><string>' + REPLACE(#CommaSeperatedStrings, ',', N'</string><string>') + N'</string></ArrayOfStrings>')
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS RowNumber, Strings.String.value('.', 'nvarchar(MAX)') AS String
FROM #Strings.nodes('/ArrayOfStrings/string/text()') AS Strings(String)
)
This makes your query:
SELECT name
FROM users
INNER JOIN dbo.GetNvarcharsCSV(#name_ids) AS IDList ON users.ID = IDList.String
ORDER BY RowNumber
Note that it's a pretty simple rewrite to make the function return a table of integers if that's what you need.
You can see xml Data Type Methods to get a better understanding of what you can do with XML in SQL queries. Also, see ROW_NUMBER (Transact-SQL).
It sounds like you need a JOIN...
This should work, although it may need to be translated to Access syntax (which is apparently subtly different):
SELECT b.name, a.title
FROM book as a
JOIN user as b
ON b.id = a.userId
WHERE SUBSTRING(LOWER(a.title), 1, 1) = 'a'
ORDER by a.title
I don't know why you're switching to Access, although I have heard it's been improving in recent years. I think I'd prefer almost any other RDBMS, though. And your schema could probably stand some tweaking, from the sound of things.
You would have to use a user-defined function that maintains the order, and then order by that column. For example:
CREATE FUNCTION dbo.SplitList
(
#List VARCHAR(8000)
)
RETURNS TABLE
AS
RETURN
(
SELECT DISTINCT
[Rank],
[Value] = CONVERT(INT, LTRIM(RTRIM(SUBSTRING(#List, [Rank],
CHARINDEX(',', #List + ',', [Rank]) - [Rank]))))
FROM
(
SELECT TOP (8000) [Rank] = ROW_NUMBER()
OVER (ORDER BY s1.[object_id])
FROM sys.all_objects AS s1
CROSS JOIN sys.all_objects AS s2
) AS n
WHERE [Rank] <= LEN(#List)
AND SUBSTRING(',' + #List, [Rank], 1) = ','
);
GO
Now your query can look something like this:
SELECT u.name
FROM dbo.users AS u
INNER JOIN dbo.SplitList($name_ids) AS s
ON u.id = s.Value
ORDER BY s.[Rank];
You may have to surround $name_ids with single quotes (dbo.SplitList('$name_ids')) depending on how the SQL statement is constructed. You may want to consider using a stored procedure instead of building this query in PHP.
You might also consider skipping MS-Access as a hopping point altogether. Why not just have PHP communicate directly with SQL Server?

Fast way to eyeball possible duplicate rows in a table?

Similar: How can I delete duplicate rows in a table
I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.
I have a pretty big table, about 4 million rows, and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique - the logic behind this was that occasionally other fields in the row change, despite Episode being repeated. However, there is an actually unique column, Sequence.
I want to try and identify rows that have the same episode number, but something different between them (aside from sequence), so I can pick out how often this occurs, and whether it's worth allowing for or I should just nuke the rows and ignore possible mild discrepancies.
My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:
SELECT Episode,
CASE WHEN a.Value1<>b.Value1
THEN a.Value1 + ',' + b.Value1
ELSE '' END AS Value1,
CASE WHEN a.Value2<>b.Value2
THEN a.Value2 + ',' + b.Value2
ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1<>b.Value1
OR a.Value2<>b.Value2
(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)
Unfortunately, making a query like that for fifty columns is pretty painful. Obviously, it doesn't exactly have to be rock-solid if it will only be used the once, but at the same time, the more copy-pasta the code, the more likely something will be missed. As far as I know, I can't just do a search for DISTINCT, since Sequence is distinct and the same row will pop up as different.
Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.
Use:
SELECT DISTINCT t.*
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.
50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.
UPDATE
I didn't catch that the sequence column would be a unique value, but in that case you'd have to provide a list of all the columns you want to see. IE:
SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
There's no notation that will let you use t.* but not this one column. Once the sequence column is omitted from the output, the duplicates will become apparent.
Instead of typing out all 50 columns, you could do this:
select column_name from information_schema.columns where table_name = 'your table name'
then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:
select
count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
This should give you a list of all the rows that have the same episode number. (But just neither the sequence nor episode numbers themselves). Here's the rub: you will need to join this result set to YourTable on ALL the columns except sequence and episode since you don't have those columns here.
Here's where I like to use SQL to generate more SQL. This should get you started:
select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'
You'll plug in those join parameters to this query:
select * from YourTable t1
inner join (
select
count(episode) 'epcount'
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
) t2 on
...plug in all those join parameters here...
select count distinct ....
Should show you without having to guess. You can get your columns by viewing your table definition so you can copy/paste your non-sequence columns.
I think something like this is what you want:
select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode
This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.
Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straight-forward. (i.e. describe t and iterate over all the fields).
Also, your query should have some sort of ordering, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence, otherwise you'll get duplicate non-duplicates.
A relatively simple solution that Ponies sparked:
SELECT t.*
FROM Table t
INNER JOIN ( SELECT episode
FROM Table
GROUP BY Episode
HAVING COUNT(*) > 1
) AS x ON t.episode = x.episode
And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:
=AND($C2=$C1,A2<>A1)
Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
Generate and store a hash key for each row, designed so the hash values mirror your
definition of sameness. Depending on the complexity of your rows, updating the
hash might be a simple trigger on modifying the row.
Query for duplicates of the hash key, which are your "very probably" identical rows.

SQLServer SQL query with a row counter

I have a SQL query, that returns a set of rows:
SELECT id, name FROM users where group = 2
I need to also include a column that has an incrementing integer value, so the first row needs to have a 1 in the counter column, the second a 2, the third a 3 etc
The query shown here is just a simplified example, in reality the query could be arbitrarily complex, with several joins and nested queries.
I know this could be achieved using a temporary table with an autonumber field, but is there a way of doing it within the query itself ?
For starters, something along the lines of:
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
However, it's important to note that the ROW_NUMBER() OVER (ORDER BY ...) construct only determines the values of Row_Counter, it doesn't guarantee the ordering of the results.
Unless the SELECT itself has an explicit ORDER BY clause, the results could be returned in any order, dependent on how SQL Server decides to optimise the query. (See this article for more info.)
The only way to guarantee that the results will always be returned in Row_Counter order is to apply exactly the same ordering to both the SELECT and the ROW_NUMBER():
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
ORDER BY my_order_column -- exact copy of the ordering used for Row_Counter
The above pattern will always return results in the correct order and works well for simple queries, but what about an "arbitrarily complex" query with perhaps dozens of expressions in the ORDER BY clause? In those situations I prefer something like this instead:
SELECT t.*
FROM
(
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY ...) AS Row_Counter -- complex ordering
FROM my_table
) AS t
ORDER BY t.Row_Counter
Using a nested query means that there's no need to duplicate the complicated ORDER BY clause, which means less clutter and easier maintenance. The outer ORDER BY t.Row_Counter also makes the intent of the query much clearer to your fellow developers.
In SQL Server 2005 and up, you can use the ROW_NUMBER() function, which has options for the sort order and the groups over which the counts are done (and reset).
The simplest way is to use a variable row counter. However it would be two actual SQL commands. One to set the variable, and then the query as follows:
SET #n=0;
SELECT #n:=#n+1, a.* FROM tablename a
Your query can be as complex as you like with joins etc. I usually make this a stored procedure. You can have all kinds of fun with the variable, even use it to calculate against field values. The key is the :=
Heres a different approach.
If you have several tables of data that are not joinable, or you for some reason dont want to count all the rows at the same time but you still want them to be part off the same rowcount, you can create a table that does the job for you.
Example:
create table #test (
rowcounter int identity,
invoicenumber varchar(30)
)
insert into #test(invoicenumber) select [column] from [Table1]
insert into #test(invoicenumber) select [column] from [Table2]
insert into #test(invoicenumber) select [column] from [Table3]
select * from #test
drop table #test