Grouping strings from a single column in an Oracle table with a million rows and removing duplicates - sql

We have a huge table, and one of its columns contains query strings. For example, row 1 contains:
1. (((firstname:Adam OR firstname:Neil ) AND lastname:Lee) ) AND category:"Legal" AND type:Individual
and row 2 of the same column contains:
2. (((firstname:Adam* OR firstname:Neil ) AND lastname:Lee) ) AND category:"Legal" AND type:Organization
Similarly, there are a few other types of query strings, which are eventually used to query external services.
The issue is that I have to group rows and remove duplicates from this table based on certain criteria.
There are a few rules that determine how the strings in different rows are grouped. One of them is that if the firstname and lastname are the same, the category and type values are ignored, so the two rows above would be grouped into one. There are around a million rows, and comparing strings to do the grouping does not look like an elegant solution. What would be the best possible approach using SQL?
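One way to avoid pairwise string comparisons is to normalize each query string so that the parts a grouping rule ignores are stripped out, then group on the normalized value. Below is a minimal sketch covering only the one rule quoted above, assuming a hypothetical table query_strings with the queries in a query_text column; the other rules would need their own normalization steps:

SELECT MIN(rid) AS keep_rowid,
       COUNT(*) AS group_size,
       normalized_query
FROM (
    -- strip the category/type clauses the rule ignores; also drop the
    -- wildcard '*' so firstname:Adam and firstname:Adam* compare equal,
    -- as the example implies
    SELECT ROWID AS rid,
           REGEXP_REPLACE(REPLACE(query_text, '*'),
                          ' AND (category:"[^"]*"|type:[A-Za-z]+)', '') AS normalized_query
    FROM query_strings
)
GROUP BY normalized_query;

A DELETE that keeps only MIN(ROWID) per normalized group would then remove the duplicates in place.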

Related

combine multiple rows into one row based on column value

Is there a way in SQL to make multiple rows that share a common column value show up as one?
For example, I would like output where one key value appears across several rows to show up as a single row with the other values concatenated.
I believe this article contains solutions to your question:
SQL Query to concatenate column values from multiple rows in Oracle
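For reference, on Oracle 11g Release 2 and later the standard tool for this is LISTAGG. A minimal sketch, using a hypothetical orders(customer_id, product) table:

-- one row per customer, with all products concatenated into one value
SELECT customer_id,
       LISTAGG(product, ', ') WITHIN GROUP (ORDER BY product) AS products
FROM orders
GROUP BY customer_id;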

Remove duplicates from table which doesn't have any key column

I have 2 tables, TabA and TabB. Neither has any key columns.
Column-wise, both are replicas and have more than 80 columns.
TabA has 30 million records; TabB has only 2,000 records.
Now I need to compare all the columns between the two tables, since there is no key column, and remove the duplicate records from TabA.
I would like to find the best approach to comparing the two tables other than placing all 80 columns in either the JOIN or the WHERE clause.
If all 80 columns are needed to identify the record, you'd have a hard time not using all of them in your query, one way or another.
You could calculate a hash code using HASHBYTES() over all the columns, and then compare only the resulting hash codes.
There is also the CHECKSUM(*) function, which calculates a hash over all the columns without the need to list them explicitly, but it returns an int, which is too weak if occasional false positives are not acceptable.
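A minimal sketch of the HASHBYTES() approach (SQL Server 2017+ for CONCAT_WS; table and column names are placeholders, and with 80 real columns you would list them all once inside CONCAT_WS rather than repeating them in a join predicate):

WITH a_hashed AS (
    -- note: CONCAT_WS skips NULLs, so wrap columns in ISNULL(...) if
    -- NULL and empty string must be distinguished
    SELECT *,
           HASHBYTES('SHA2_256', CONCAT_WS('|', col1, col2, col3)) AS row_hash
    FROM TabA
)
DELETE FROM a_hashed
WHERE row_hash IN (
    SELECT HASHBYTES('SHA2_256', CONCAT_WS('|', col1, col2, col3))
    FROM TabB
);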

Selecting a large number of rows by index using SQL

I am trying to select a number of rows by the value of a column called ID. I know you can do this pretty easily by:
SELECT col1, col2, col3 FROM mytable WHERE id IN (1,2,3,4,5...)
However, what if there are a few million IDs I want to select, and the IDs don't always follow a pattern (which means I can't use something like BETWEEN x AND y)? Does this SELECT statement still work, or are there better ways of doing it?
The actual application is this: users specify filters, which are compared against attributes of the records, and from those filters we create the subset of the data that is of interest to a particular user. There are about 30 million records, each with roughly 3,000 attributes (stored across roughly 30 tables, every one of which has ID as its primary key), so every time someone queries their desired subset we would have to join many tables and apply the filters to work out what that subset looks like. To avoid joining many tables every time, I thought it might be better to join the tables once, work out the IDs of the selected subset, and then, for each new query, simply select the relevant columns of the rows matching the filtered IDs.
This depends on the database and the interface you are using. For a few hundred or thousand values, no problem. But your question specifies millions. And that could start to get into limits on the length of the query -- either specified by the database, the tool you are using, or intermediate libraries.
If you have so many ids, I would strongly recommend that you load them into a table in the database with the id as the primary key. Then use join or exists to identify the rows in your table that match.
Often, such a list would be generated in the database anyway. In that case, you can use a subquery or CTE and just include that code in your final query.
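A minimal sketch of that approach (hypothetical names; the bulk-load step depends on your database and client library):

-- stage the ids once, with a primary key so the join has an index
CREATE TABLE wanted_ids (id INT PRIMARY KEY);
-- ... bulk-load the millions of ids here (bulk insert utility, array binds, etc.) ...

-- then join instead of building a giant IN list
SELECT t.col1, t.col2, t.col3
FROM mytable t
JOIN wanted_ids w ON w.id = t.id;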

SQL query returns different number of rows depending on the columns selected

Thanks for the contributions so far. After more digging, I am restating the question (and indeed its title) as follows:
I am selecting just 2 columns from a view that contains several columns. The view returns 50,497 rows if I select all columns, but only 50,496 (i.e. 1 fewer) when I select just 2 columns, these being [Patient_ID] (which is a bigint column) and [Condition_Code] (a varchar(6) column).
Version 1:
SELECT * FROM [vw_Query1]
returns 50,497 rows.
But:
SELECT [Patient_ID], [Condition_Code] FROM [vw_Query1]
returns 50,496 rows.
I can post the code for [vw_Query1] if required, but the key question for me is understanding, at a fundamental level, how this can happen when no GROUP BY clause has been used.
UPDATE:
It turns out that if I exclude one particular column, I get the lower row count of 50,496. This column is unique in having a case-sensitive collation. I still don't understand why one particular row is being dropped, but at least I am getting closer to an understanding.
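Although no full answer is recorded here, the collation clue can be illustrated in isolation: under a case-insensitive collation, two values that differ only in case count as duplicates, while under a case-sensitive collation they do not, which is the kind of effect that can change a row count by exactly one. A minimal SQL Server sketch with hypothetical data:

SELECT COUNT(DISTINCT code COLLATE Latin1_General_CS_AS) AS cs_count, -- 2: 'ABC' <> 'abc'
       COUNT(DISTINCT code COLLATE Latin1_General_CI_AS) AS ci_count  -- 1: 'ABC' = 'abc'
FROM (VALUES ('ABC'), ('abc')) AS t(code);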

SQL Server: selecting rows where a column contains multiple comma-delimited values

I have a table (a) that contains imported data, and one of the values in that table needs to be joined to another table (b) on that value. In table b, that value is sometimes stored in a comma-separated list, held as a varchar. This is the first time I have dealt with a database column that contains multiple pieces of data. I didn't design it, and I don't believe it can be changed, although I believe it should be.
For example:
Table a:
column_1
12345
67890
24680
13579
Table b:
column_1
12345,24680
24680,67890
13579
13579,24680
So I am trying to join these tables together based on this number and two others, but when I run my query, I only get the rows that contain 13579, and none of the rest.
Any ideas on how to accomplish this?
Storing lists as comma-delimited strings is a sign of bad design, particularly when the values are ids, which are presumably integers in their native format.
Sometimes, though, you have to work with such data. Here is a method:
SELECT *
FROM a JOIN
     b
     ON ',' + b.column_1 + ',' LIKE '%,' + CAST(a.column_1 AS VARCHAR(255)) + ',%';
This will not perform particularly well, because the query will not take advantage of any indexes.
The idea is to put the delimiter (,) at the beginning and end of b.column_1. Every value in the column then has a comma before and after. Then, you can search for the match in a.column_1 with commas appended. The commas ensure that 10 does not match 100.
If possible, you should consider an alternative way to represent the data. If you know there are at most two values, you might consider having two columns in a. In general, though, you would have a "join" table, with a separate row for each pair.
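If the data cannot be restructured, one alternative on SQL Server 2016 and later is the built-in STRING_SPLIT function, which turns the list into rows so the join becomes an equality comparison. A minimal sketch, assuming (as the CAST above suggests) that a.column_1 is an integer:

-- split b.column_1 into one row per id, then join on equality;
-- the equality predicate lets an index on a.column_1 be used
SELECT a.*, b.*
FROM b
CROSS APPLY STRING_SPLIT(b.column_1, ',') AS s
JOIN a
  ON a.column_1 = CAST(s.value AS INT);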