I am trying to summarize some massive tables in a way that can help me investigate further for some data issues. There are hundreds of 1000s of rows, and roughly 80+ columns of data in most of these tables.
From each table, I have already performed queries to throw out any columns that have all nulls or 1 value only. I've done this for 2 reasons.... for my purposes, a single value or nulls in the columns are not interesting to me and provide little information about the data; additionally, the next step is that I want to query each of the remaining columns and return up to 30 distinct values in each column (if the column has more than 30 distinct values, we show the 1st 30 distinct values).
Here is the general format of output I wish to create:
Column_name(total_num_distinct): distinct_val1(val1_count), distinct_val2(val2_couunt), ... distinct_val30(val30_count)
Assuming my data fields are integers, floats, and varchar2 data types, this is the SQL I was trying to use to generate that output:
declare
begin
for rw in (
select column_name colnm, num_distinct numd
from all_tab_columns
where
owner = 'scz'
and table_name like 'tbl1'
and num_distinct > 1
order by num_distinct desc
) loop
dbms_output.put(rw.colnm || '(' || numd || '): ');
for rw2 in (
select dis_val, cnt from (
select rw.colnm dis_val, count(*) cnt
from tbl1
group by rw.colnm
order by 2 desc
) where rownum <= 30
) loop
dbms_output.put(rw2.dis_val || '(' || rw2.cnt || '), ');
end loop;
dbms_output.put_line(' ');
end loop;
end;
I get the output I expect from the 1st loop, but the 2nd loop that is supposed to output examples of the unique values in each column, coupled with the frequency of their occurrence for the 30 values with the highest frequency of occurrence, appears to not be working as I intended. Instead of seeing unique values along with the number of times that value occurs in the field, I get the column names and count of total records in that table.
If the 1st loop suggests the first 4 columns in 'tbl1' with more than 1 distinct value are the following:
| colnm | numd |
|----------------|
| Col1 | 2 |
| Col3 | 4 |
| Col7 | 17 |
| Col12 | 30 |
... then the full output of 1st and 2nd loop together looks something like the following from my SQL:
Col1(2): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count)
Col3(4): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count), Col7(tbl1_tot_rec_count), Col12(tbl1_tot_rec_count)
Col7(17): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count), Col7(tbl1_tot_rec_count), Col12(tbl1_tot_rec_count), .... , ColX(tbl1_tot_rec_count)
Col12(30): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count), Col7(tbl1_tot_rec_count), Col12(tbl1_tot_rec_count), .... , ColX(tbl1_tot_rec_count)
The output looks cleaner when real data is output, each table outputting somewhere between 20-50 lines of output (i.e. columns with more than 2 values), and listing 30 unique values for each field (with their counts) only requires a little bit of scrolling, but isn't impractical. Just to give you an idea with fake values, the output would look more like this with real data if it was working correctly (but fake in my example):
Col1(2): DisVal1(874,283), DisVal2(34,578),
Col3(4): DisVal1(534,223), DisVal2(74,283), DisVal3(13,923), null(2348)
Col7(17): DisVal1(54,223), DisVal2(14,633), DisVal3(13,083), DisVal4(12,534), DisVal5(9,876), DisVal6(8,765), DisVal7(7654), DisVal8(6543), DisVal9(5432), ...., ...., ...., ...., ...., ...., ...., DisVal17(431)
I am not an Oracle or SQL guru, so I might not be approaching this problem in the easiest, most efficient way. While I do appreciate any better ways to approach this problem, I also want to learn why the SQL code above is not giving me the output I expect. My goal is trying to quickly run a single query that can tell me which columns have interesting data I might what to examine further in that table. I have probably 20 tables I need to examine that are all of similar dimensions and so very difficult to examine comprehensively. Being able to reduce these tables in this way to know what possible combinations of values may exist across the various fields in each of these tables would be very helpful in further queries to deep dive into the intricacies of the data.
It's because the select rw.colnm dis_val, count(*) cnt from tbl1 group by rw.colnm order by 2 desc is not doing at all what you think, and what you think to be done can't be done without dynamic SQL. What it does is in fact select 'a_column_of_tabl1' dis_val, count(*) cnt from tbl1 group by 'a_column_of_tabl1' order by 2 desc and what you need to do is execute dynamically the SQL: 'select ' || rw.colnm || ' dis_val, count(*) cnt from tbl1 group by ' || rw.colnm || ' order by 2 desc'.
Here is the (beta) query you can use to get the SQL to execute:
(I tested with user_* VIEWs instead of ALL_* to avoid getting to many results here...)
select utc.table_name, utc.column_name,
REPLACE( REPLACE(q'~select col_name, col_value, col_counter
from (
SELECT col_name, col_value, col_counter,
ROW_NUMBER() OVER(ORDER BY col_counter DESC) AS rn
FROM (
select '{col_name}' AS col_name,
{col_name} AS col_value, COUNT(*) as col_counter
from {table_name}
group by {col_name}
)
)
WHERE rn <= 30~', '{col_name}', utc.column_name), '{table_name}', utc.table_name) AS sql
from user_tab_columns utc
where not exists(
select 1
from user_indexes ui
join user_ind_columns uic on uic.table_name = ui.table_name
and uic.index_name = ui.index_name
where
ui.table_name = utc.table_name
and exists (
select 1 from user_ind_columns t where t.table_name = ui.table_name
and t.index_name = uic.index_name and t.column_name = utc.column_name
)
group by ui.table_name, ui.index_name having( count(*) = 1 )
)
and not exists(
SELECT 1
FROM user_constraints uc
JOIN user_cons_columns ucc ON ucc.constraint_name = uc.constraint_name
WHERE constraint_type IN (
'P', -- primary key
'U' -- unique
)
AND uc.table_name = utc.table_name AND ucc.column_name = utc.column_name
)
;
Related
I have a long and wide list, the following table is just an example. Table structure might look a bit horrible using SQL, but I was wondering whether there's a way to extract IDs' price using CASE expression without typing column names in order to match in the expression
IDs
A_Price
B_Price
C_Price
...
A
23
...
B
65
82
...
C
...
A
10
...
..
...
...
...
...
Table I want to achieve:
IDs
price
A
23;10
B
65
C
82
..
...
I tried:
SELECT IDs, string_agg(CASE IDs WHEN 'A' THEN A_Price
WHEN 'B' THEN B_Price
WHEN 'C' THEN C_Price
end::text, ';') as price
FROM table
GROUP BY IDs
ORDER BY IDs
To avoid typing A, B, A_Price, B_Price etc, I tried to format their names and call them from a subquery, but it seems that SQL cannot recognise them as columns and cannot call the corresponding values.
WITH CTE AS (
SELECT IDs, IDs||'_Price' as t FROM ID_list
)
SELECT IDs, string_agg(CASE IDs WHEN CTE.IDs THEN CTE.t
end::text, ';') as price
FROM table
LEFT JOIN CTE cte.IDs=table.IDs
GROUP BY IDs
ORDER BY IDs
You can use a document type like json or hstore as stepping stone:
Basic query:
SELECT t.ids
, to_json(t.*) ->> (t.ids || '_price') AS price
FROM tbl t;
to_json() converts the whole row to a JSON object, which you can then pick a (dynamically concatenated) key from.
Your aggregation:
SELECT t.ids
, string_agg(to_json(t.*) ->> (t.ids || '_price'), ';') AS prices
FROM tbl t
GROUP BY 1
ORDER BY 1;
Converting the whole (big?) row adds some overhead, but you have to read the whole table for your query anyway.
A union would be one approach here:
SELECT IDs, A_Price FROM yourTable WHERE A_Price IS NOT NULL
UNION ALL
SELECT IDs, B_Price FROM yourTable WHERE B_Price IS NOT NULL
UNION ALL
SELECT IDs, C_Price FROM yourTable WHERE C_Price IS NOT NULL;
Suppose I have a table with text primary key called "name". Given a name (that may contain any arbitrary characters including %), I need all of the rows from that table that start with that name, are longer than that name, and that don't start with anything else in the table that is longer than the given name.
For example, suppose my table contains names ad, add, adder, and adage. If I query for "children of ad", I want to get back add, adage. (adder is a child of add). Can this be done efficiently, as I have several million rows? Recursive queries are certainly available.
I have a different approach at present where I maintain a "parent" column. The code to maintain this column is quite painful, and it would be unnecessary if this other approach were reasonable.
I can't tell about its efficiency but I think it works:
with cte as (
select name
from tablename
where name like 'ad' || '_%'
)
select c.name
from cte c
where not exists (
select 1 from cte
where c.name like name || '_%'
);
See the demo.
Equivalent to the above query with a self LEFT JOIN:
with cte as (
select name
from tablename
where name like 'ad' || '_%'
)
select c.name
from cte c left join cte cc
on c.name like cc.name || '_%'
where cc.name is null
See the demo.
Results:
| name |
| ----- |
| add |
| adage |
You could use left on the column like this:
select *
from sometable
where lower(left(name,2)) = 'ad' and length(name) > 2
I have a column that has several items in which I need to count the times it is called, my column table looks something like this:
Table Example
Id_TR Triggered
-------------- ------------------
A1_6547 R1:23;R2:0;R4:9000
A2_1235 R2:0;R2:100;R3:-100
A3_5436 R1:23;R2:100;R4:9000
A4_1245 R2:0;R5:150
And I would like the result to be like this:
Expected Results
Triggered Count(1)
--------------- --------
R1:23 2
R2:0 3
R2:100 2
R3:-100 1
R4:9000 2
R5:150 1
I've tried to do some substring, but cant seem to find how to solve this problem. Can anyone help?
This solution is X3 times faster than the CONNECT BY solution
performance: 15K records per second
with cte (token,suffix)
as
(
select substr(triggered||';',1,instr(triggered,';')-1) as token
,substr(triggered||';',instr(triggered,';')+1) as suffix
from t
union all
select substr(suffix,1,instr(suffix,';')-1) as token
,substr(suffix,instr(suffix,';')+1) as suffix
from cte
where suffix is not null
)
select token,count(*)
from cte
group by token
;
with x as (
select listagg(Triggered, ';') within group (order by Id_TR) str from table
)
select regexp_substr(str,'[^;]+',1,level) element, count(*)
from x
connect by level <= length(regexp_replace(str,'[^;]+')) + 1
group by regexp_substr(str,'[^;]+',1,level);
First concatenate all values of triggered into one list using listagg then parse it and do group by.
Another methods of parsing list you can find here or here
This is a fair solution.
performance: 5K records per second
select triggered
,count(*) as cnt
from (select id_tr
,regexp_substr(triggered,'[^;]+',1,level) as triggered
from t
connect by id_tr = prior id_tr
and level <= regexp_count(triggered,';')+1
and prior sys_guid() is not null
) t
group by triggered
;
This is just for learning purposes.
Check my other solutions.
performance: 1K records per second
select x.triggered
,count(*)
from t
,xmltable
(
'/r/x'
passing xmltype('<r><x>' || replace(triggered,';', '</x><x>') || '</x></r>')
columns triggered varchar(100) path '.'
) x
group by x.triggered
;
With the following MySQL table:
+-----------------------------+
+ id INT UNSIGNED +
+ name VARCHAR(100) +
+-----------------------------+
How can I select a single row AND its position amongst the other rows in the table, when sorted by name ASC. So if the table data looks like this, when sorted by name:
+-----------------------------+
+ id | name +
+-----------------------------+
+ 5 | Alpha +
+ 7 | Beta +
+ 3 | Delta +
+ ..... +
+ 1 | Zed +
+-----------------------------+
How could I select the Beta row getting the current position of that row? The result set I'm looking for would be something like this:
+-----------------------------+
+ id | position | name +
+-----------------------------+
+ 7 | 2 | Beta +
+-----------------------------+
I can do a simple SELECT * FROM tbl ORDER BY name ASC then enumerate the rows in PHP, but it seems wasteful to load a potentially large resultset just for a single row.
Use this:
SELECT x.id,
x.position,
x.name
FROM (SELECT t.id,
t.name,
#rownum := #rownum + 1 AS position
FROM TABLE t
JOIN (SELECT #rownum := 0) r
ORDER BY t.name) x
WHERE x.name = 'Beta'
...to get a unique position value. This:
SELECT t.id,
(SELECT COUNT(*)
FROM TABLE x
WHERE x.name <= t.name) AS position,
t.name
FROM TABLE t
WHERE t.name = 'Beta'
...will give ties the same value. IE: If there are two values at second place, they'll both have a position of 2 when the first query will give a position of 2 to one of them, and 3 to the other...
This is the only way that I can think of:
SELECT `id`,
(SELECT COUNT(*) FROM `table` WHERE `name` <= 'Beta') AS `position`,
`name`
FROM `table`
WHERE `name` = 'Beta'
If the query is simple and the size of returned result set is potentially large, then you may try to split it into two queries.
The first query with a narrow-down filtering criteria just to retrieve data of that row, and the second query uses COUNT with WHERE clause to calculate the position.
For example in your case
Query 1:
SELECT * FROM tbl WHERE name = 'Beta'
Query 2:
SELECT COUNT(1) FROM tbl WHERE name >= 'Beta'
We use this approach in a table with 2M record and this is way more scalable than OMG Ponies's approach.
The other answers seem too complicated for me.
Here comes an easy example, let's say you have a table with columns:
userid | points
and you want to sort the userids by points and get the row position (the "ranking" of the user), then you use:
SET #row_number = 0;
SELECT
(#row_number:=#row_number + 1) AS num, userid, points
FROM
ourtable
ORDER BY points DESC
num gives you the row postion (ranking).
If you have MySQL 8.0+ then you might want to use ROW_NUMBER()
The position of a row in the table represents how many rows are "better" than the targeted row.
So, you must count those rows.
SELECT COUNT(*)+1 FROM table WHERE name<'Beta'
In case of a tie, the highest position is returned.
If you add another row with same name of "Beta" after the existing "Beta" row, then the position returned would be still 2, as they would share same place in the classification.
Hope this helps people that will search for something similar in the future, as I believe that the question owner already solved his issue.
I've got a very very similar issue, that's why I won't ask the same question, but I will share here what did I do, I had to use also a group by, and order by AVG.
There are students, with signatures and socore, and I had to rank them (in other words, I first calc the AVG, then order them in DESC, and then finally I needed to add the position (rank for me), So I did something Very similar as the best answer here, with a little changes that adjust to my problem):
I put finally the position (rank for me) column in the external SELECT
SET #rank=0;
SELECT #rank := #rank + 1 AS ranking, t.avg, t.name
FROM(SELECT avg(students_signatures.score) as avg, students.name as name
FROM alumnos_materia
JOIN (SELECT #rownum := 0) r
left JOIN students ON students.id=students_signatures.id_student
GROUP BY students.name order by avg DESC) t
I was going through the accepted answer and it seemed bit complicated so here is the simplified version of it.
SELECT t,COUNT(*) AS position FROM t
WHERE name <= 'search string' ORDER BY name
I have similar types of problem where I require rank(Index) of table order by votes desc. The following works fine with for me.
Select *, ROW_NUMBER() OVER(ORDER BY votes DESC) as "rank"
From "category_model"
where ("model_type" = ? and "category_id" = ?)
may be what you need is with add syntax
LIMIT
so use
SELECT * FROM tbl ORDER BY name ASC LIMIT 1
if you just need one row..
I added the columns in the select list to the order by list, but it is still giving me the error:
ORDER BY items must appear in the select list if SELECT DISTINCT is specified.
Here is the stored proc:
CREATE PROCEDURE [dbo].[GetRadioServiceCodesINGroup]
#RadioServiceGroup nvarchar(1000) = NULL
AS
BEGIN
SET NOCOUNT ON;
SELECT DISTINCT rsc.RadioServiceCodeId,
rsc.RadioServiceCode + ' - ' + rsc.RadioService as RadioService
FROM sbi_l_radioservicecodes rsc
INNER JOIN sbi_l_radioservicecodegroups rscg
ON rsc.radioservicecodeid = rscg.radioservicecodeid
WHERE rscg.radioservicegroupid IN
(select val from dbo.fnParseArray(#RadioServiceGroup,','))
OR #RadioServiceGroup IS NULL
ORDER BY rsc.RadioServiceCode,rsc.RadioServiceCodeId,rsc.RadioService
END
Try this:
ORDER BY 1, 2
OR
ORDER BY rsc.RadioServiceCodeId, rsc.RadioServiceCode + ' - ' + rsc.RadioService
While they are not the same thing, in one sense DISTINCT implies a GROUP BY, because every DISTINCT could be re-written using GROUP BY instead. With that in mind, it doesn't make sense to order by something that's not in the aggregate group.
For example, if you have a table like this:
col1 col2
---- ----
1 1
1 2
2 1
2 2
2 3
3 1
and then try to query it like this:
SELECT DISTINCT col1 FROM [table] WHERE col2 > 2 ORDER BY col1, col2
That would make no sense, because there could end up being multiple col2 values per row. Which one should it use for the order? Of course, in this query you know the results wouldn't be that way, but the database server can't know that in advance.
Now, your case is a little different. You included all the columns from the order by clause in the select clause, and therefore it would seem at first glance that they were all grouped. However, some of those columns were included in a calculated field. When you do that in combination with distinct, the distinct directive can only be applied to the final results of the calculation: it doesn't know anything about the source of the calculation any more.
This means the server doesn't really know it can count on those columns any more. It knows that they were used, but it doesn't know if the calculation operation might cause an effect similar to my first simple example above.
So now you need to do something else to tell the server that the columns are okay to use for ordering. There are several ways to do that, but this approach should work okay:
SELECT rsc.RadioServiceCodeId,
rsc.RadioServiceCode + ' - ' + rsc.RadioService as RadioService
FROM sbi_l_radioservicecodes rsc
INNER JOIN sbi_l_radioservicecodegroups rscg
ON rsc.radioservicecodeid = rscg.radioservicecodeid
WHERE rscg.radioservicegroupid IN
(SELECT val FROM dbo.fnParseArray(#RadioServiceGroup,','))
OR #RadioServiceGroup IS NULL
GROUP BY rsc.RadioServiceCode,rsc.RadioServiceCodeId,rsc.RadioService
ORDER BY rsc.RadioServiceCode,rsc.RadioServiceCodeId,rsc.RadioService
Try one of these:
Use column alias:
ORDER BY RadioServiceCodeId,RadioService
Use column position:
ORDER BY 1,2
You can only order by columns that actually appear in the result of the DISTINCT query - the underlying data isn't available for ordering on.
Distinct and Group By generally do the same kind of thing, for different purposes... They both create a 'working" table in memory based on the columns being Grouped on, (or selected in the Select Distinct clause) - and then populate that working table as the query reads data, adding a new "row" only when the values indicate the need to do so...
The only difference is that in the Group By there are additional "columns" in the working table for any calculated aggregate fields, like Sum(), Count(), Avg(), etc. that need to updated for each original row read. Distinct doesn't have to do this... In the special case where you Group By only to get distinct values, (And there are no aggregate columns in output), then it is probably exactly the same query plan.... It would be interesting to review the query execution plan for the two options and see what it did...
Certainly Distinct is the way to go for readability if that is what you are doing (When your purpose is to eliminate duplicate rows, and you are not calculating any aggregate columns)
When you define concatenation you need to use an ALIAS for the new column if you want to order on it combined with DISTINCT
Some Ex with sql 2008
--this works
SELECT DISTINCT (c.FirstName + ' ' + c.LastName) as FullName
from SalesLT.Customer c
order by FullName
--this works too
SELECT DISTINCT (c.FirstName + ' ' + c.LastName)
from SalesLT.Customer c
order by 1
-- this doesn't
SELECT DISTINCT (c.FirstName + ' ' + c.LastName) as FullName
from SalesLT.Customer c
order by c.FirstName, c.LastName
-- the problem the DISTINCT needs an order on the new concatenated column, here I order on the singular column
-- this works
SELECT DISTINCT (c.FirstName + ' ' + c.LastName)
as FullName, CustomerID
from SalesLT.Customer c
order by 1, CustomerID
-- this doesn't
SELECT DISTINCT (c.FirstName + ' ' + c.LastName) as FullName
from SalesLT.Customer c
order by 1, CustomerID
You could try a subquery:
SELECT DISTINCT TEST.* FROM (
SELECT rsc.RadioServiceCodeId,
rsc.RadioServiceCode + ' - ' + rsc.RadioService as RadioService
FROM sbi_l_radioservicecodes rsc
INNER JOIN sbi_l_radioservicecodegroups rscg ON rsc.radioservicecodeid = rscg.radioservicecodeid
WHERE rscg.radioservicegroupid IN
(select val from dbo.fnParseArray(#RadioServiceGroup,','))
OR #RadioServiceGroup IS NULL
ORDER BY rsc.RadioServiceCode,rsc.RadioServiceCodeId,rsc.RadioService
) as TEST