How to get tables for distinct value of a variable that is multiple times in the data - sql

I didn't work with SQL much. I have a dataset with Variable A, Index, Fail and Country. Here A is unique, but we don't need that for the analysis. What I want is to find which Countries have most fail for distinct index number. So what I tried is
SELECT Index, Count(Fail), Country
FROM Data
GROUP BY Country
SORT BY Count(Fail) DESC
But the fact is for an index we might have multiple fails, but I want count only one fail for a single index number, so for instance the Index 1 has 2 fails, 2 has 1 fail, 3 has 4 fail, I want only The count(fail) to be 3, not (2+1+4=7). FYI in the table each row represent either one fail or not. So in the table the fail values are either 0 or 1. I think, I need to include sum/distinct clause, but not sure how to do it.

You can put DISTINCTin the COUNT() as follows:
SELECT Index, Count(DISTINCT Fail), Country
FROM Data
GROUP BY Index,Country
SORT BY Count(Fail) DESC
I've added Index in the GROUP BY to avoid it causing an error.

Related

select count(*)+count(*) is this sql statement produce any result or error?

I have faced this question on interview with option like error,1,2,3
Now got the result as : 2
select count(*)+COUNT(*)
result is 2
Normally all selects are of the form SELECT [columns, scalar computations on columns, grouped computations on columns, or scalar computations] FROM [table or joins of tables, etc]
Because this allows plain scalar computations we can do something like SELECT 1 + 1 FROM SomeTable and it will return a recordset with the value 2 for every row in the table SomeTable.
Now, if we didn't care about any table, but just wanted to do our scalar computed we might want to do something like SELECT 1 + 1. This isn't allowed by the standard, but it is useful and most databases allow it (Oracle doesn't unless it's changed recently, at least it used to not).
Hence such bare SELECTs are treated as if they had a from clause which specified a table with one row and no column (impossible of course, but it does the trick). Hence SELECT 1 + 1 becomes SELECT 1 + 1 FROM ImaginaryTableWithOneRow which returns a single row with a single column with the value 2.
Mostly we don't think about this, we just get used to the fact that bare SELECTs give results and don't even think about the fact that there must be some one-row thing selected to return one row.
In doing SELECT COUNT() you did the equivalent of SELECT COUNT() FROM ImaginaryTableWithOneRow which of course returns 1.
References : Why MySQL COUNT without table name gives 1

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 or will it get only specific records i.e 5

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 and then it will it get only specific records i.e 5? or do it only get 5 records from internal memory. suppose we have 1000 records in tbl2.
For future reference, SQL actually has a huge amount of amazing documentation/resources online. For instance: This google search.
As to your question, it will pull the top five results matching your criteria, so it depends. It'll go through, and find the first five results matching your criteria. If those are the last results it goes through, it'll still have to do the comparisons and filtering on all rows, the only difference will be that it will have to send less rows to your computer.
For example, let's say we have a table my_table with two columns: Customer_ID (which is unique) and First_Purchase_Date (which is a date value between 2015-01-01 and 2017-07-26). If we simply do SELECT TOP 5 * FROM my_table then it will go through and pull the first five rows it finds, without looking at the rest of the rows. On the other hand, if we do SELECT TOP 5 * FROM my_table WHERE First_Purchase_Date = '2017-05-17' then it will have to go through all the rows until it can find five rows with a First_Purchase_Date of 2017-05-17. If First_Purchase_Date is indexed, this should not be very expensive, as it'll more or less know where to look. If it's not, then it depends on your how SQL has decided to structure your table, and if it has created any useful statistics. Worst case, it could not have any statistics and the desired rows could be the last five in the database, in which case it will have to complete the comparison on all the rows in the database.
By the way, this is a somewhat poor idea, as the columns returned will not necessarily stay consistent over time. It may be a good idea to throw in an ORDER BY clause, to ensure you get the same records every time.
The SELECT TOP clause is used to specify the number of records to return.
It limits the rows returned in a query result set to a specified number of rows or percentage of rows. When TOP is used in conjunction with the ORDER BY clause, the result set is limited to the first N number of ordered rows; otherwise, it returns the first N number of rows in an undefined order.
SELECT TOP number|percent column_name(s)
FROM table_name
WHERE condition;

Does column of integers contains value

What is the fastest regarding performance way to check that integer column contains specific value?
I have a table with 10 million rows in postgresql 8.4. I need to do at least 10000 checks per sec.
Currently i am doing query SELECT id FROM table WHERE id = my_value and then checking does DataReader have rows. But it is quite slow. Is there any way to speed up without loading whole column into memory?
You can select COUNT instead:
SELECT COUNT(*) FROM table WHERE id = my_value
It will return just one integer value - number of rows matching your select condition.
You need two things,
As Marcin pointed out, you want to use the COUNT(*) if all you need is to know how many. You also need an index on that column. The index will have the answer pretty much right at hand. Without the index, Postgresql would still have to go through the entire table to count that one number.
CREATE INDEX id_idx ON table (id) ASC NULLS LAST;
Something of the sort should get you there. Whether it is enough to run the query 10,000/sec. will depend on your hardware...
If you use where id = X then all values matching X will be returned. Suppose 1000 values match X then 1000 values will be returned.
Now, if you only want to check if the value is at least once then after you matched the first value there is no need to process the other 999. Even if you count the values you are still going through all of them.
What I would do in this case is this:
SELECT 1 FROM table
WHERE id = my_value
LIMIT 1
Note that I'm not even returning the id itself. So if you get one record then the value is there.
Of course, in order to improve this query, make sure you have an index on the id column.

How to check if all fields are unique in Oracle?

How to check if all fields are unique in Oracle?
SELECT myColumn, COUNT(*)
FROM myTable
GROUP BY myColumn
HAVING COUNT(*) > 1
This will return to you all myColumn values along with the number of their occurence if their number of occurences is higher than one (i.e. they are not unique).
If the result of this query is empty, then you have unique values in this column.
An easy way to do this is to analyze the table using DBMS_STATS. After you do, you can look at dba_tables... Look at the num_rows column. The look at dab_tab_columns. Compare the num_distinct for each column to the number of rows. This is a round about way of doing what you want without performing a full table scan if you are worried about affecting a production system on a huge table. If you want direct results, do what the the others suggest by running the query against the table with a group by.
one way is to create a unique index.
if the index creation fails, you have existing duplicate info, if an insert fails, it would have produced a duplicate...

Optimizing "ORDER BY" when the result set is very large and it can't be ordered by an index

How can I make an ORDER BY clause with a small LIMIT (ie 20 rows at a time) return quickly, when I can't use an index to satisfy the ordering of rows?
Let's say I would like to retrieve a certain number of titles from a table 'node' (simplified below). I'm using MySQL by the way.
node_ID INT(11) NOT NULL auto_increment,
node_title VARCHAR(127) NOT NULL,
node_lastupdated INT(11) NOT NULL,
node_created INT(11) NOT NULL
But I need to limit the rows returned to only those a particular user has access to. Many users have access large numbers of nodes. I have this information pre-calculated in a big lookup table (an attempt to make things easier) where the primary key covers both columns and the presence of a row means that usergroup has access to that node:
viewpermission_nodeID INT(11) NOT NULL,
viewpermission_usergroupID INT(11) NOT NULL
My query therefore contains something like
FROM
node
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
... and I also use a GROUP BY or a DISTINCT so that a node is only returned once even if two of the user's 'usergroups' both have access to that node.
My problem is that there seems to be no way for an ORDER BY clause which sorts results by created or last updated date to use an index, because the rows being returned depend on values in the other viewpermission table.
Therefore MySQL would need to find all rows which match the criteria, then sort them all itself. If there are one million rows for a particular user, and we want to view, say, the latest 100 or rows 100-200 when ordered by last update, the DB would need to figure out which one million rows the user can see, sort this whole result set itself, before it can return those 100 rows, right?
Is there any creative way to get around this? I've been thinking along the lines of:
Somehow add dates into the viewpermission lookup table so that I can build an index containing the dates as well as the permissions. It's a possibility I guess.
Edit: Simplified question
Perhaps I can simplify the question by rewriting it like this:
Is there any way to rewrite this query or create an index for the following such that an index can be used to do the ordering (not just to select the rows)?
SELECT nodeid
FROM lookup
WHERE
usergroup IN (2, 3)
GROUP BY
nodeid
An index on (usergroup) allows the WHERE part to be satisfied by an index, but the GROUP BY forces a temporary table and filesort on those rows. An index on (nodeid) does nothing for me, because the WHERE clause needs an index with usergroup as its first column. An index on (usergroup, nodeid) forces a temporary table and filesort because the GROUP BY is not the first column of the index that can vary.
Any solutions?
Can I answer my own question?
I believe I have found that the only way to do what I describe is for my lookup table to have rows for every possible combination of usergroups a person may want to be a member of.
To pick a simplified example, instead of doing this:
SELECT id FROM ids WHERE groups IN(1,2) ORDER BY id
If you need to use the index both to select rows and to order them, you have to abstract that IN(1,2) so that it is constant rather than a range, ie:
SELECT id FROM ids WHERE grouplist='1,2' ORDER BY id
Of course instead of using the string '1,2' you could have a foreign key there, etc. The point being that you'd have to have a row not just for each group but for each combination of multiple groups.
So, there is my answer.
Anyway, for my application, I feel that maintaining a lookup for all possible combinations of usergroups for each node is not worth it. For my purposes, I predict that most nodes are visible to most users, so I feel that it is acceptable to simply to make the GROUP BY use the index, as the filtering doesn't need it so badly.
In other words, the approach I'll take for my original query may be something like:
SELECT
<fields>
FROM
node
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
FORCE INDEX(node_created_and_node_ID)
GROUP BY
node_created, node_ID
GROUP BY can use an index if it starts at the left most column of the index and it is in the first non-const non-system table to be processed. The join then deals with the entire list (which is already ordered), and only those not visible to the current user (which will be a small proportion) are removed by the INNER JOIN.
Copy the value you are going to order by into to viewpermission table and add it to your index.
You could use a trigger to maintain that value from the other table.
select * from
(
select *
FROM node
INNER JOIN viewpermission
ON viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
) a
order by a.node_lastupdated desc
The inner query gives you the filtered subset, which I understand is substantially smaller than the whole set. Only the smaller has to be sorted.
MySQL has problems when you use GROUP BY and ORDER BY in the same query. That causes a filesort, and that's probably the biggest penalty for performance.
You can eliminate the need for a DISTINCT (or GROUP BY) by using a non-correlated subquery instead of a JOIN.
SELECT * FROM node
WHERE node_id IN (
SELECT viewpermission_nodeID
FROM viewpermission
WHERE viewpermissiong_usergroupID IN ( <...usergroups...> )
)
ORDER BY node_lastupdated DESC
LIMIT 100;
There's no need to sort or do a DISTINCT on the subquery, since IN (1, 1, 2, 3) is the same as IN (1, 3, 2).
Note that MySQL can use only one index per table in a given query, so it'll try to make the best choice between an index on node_id and an index on node_lastupdated. It can't use both, and even if you made a compound index it wouldn't help in this case.
Remember to analyze different solutions with EXPLAIN.