Count the number of rows that contain a letter/number - sql

What I am trying to achieve is straightforward, however it is a little difficult to explain and I don't know if it is actually even possible in postgres. I am at a fairly basic level. SELECT, FROM, WHERE, LEFT JOIN ON, HAVING, e.t.c the basic stuff.
I am trying to count the number of rows that contain a particular letter/number and display that count against the letter/number.
i.e How many rows have entries that contain an "a/A" (Case insensitive)
The table I'm querying is a list of film names. All I want to do is group and count 'a-z' and '0-9' and output the totals. I could run 36 queries sequentially:
SELECT filmname FROM films WHERE filmname ilike '%a%'
SELECT filmname FROM films WHERE filmname ilike '%b%'
SELECT filmname FROM films WHERE filmname ilike '%c%'
And then run pg_num_rows on the result to find the number I require, and so on.
I know how intensive like is and ilike even more so I would prefer to avoid that. Although the data (below) has upper and lower case in the data, I want the result sets to be case insensitive. i.e "The Men Who Stare At Goats" the a/A,t/T and s/S wouldn't count twice for the resultset. I can duplicate the table to a secondary working table with the data all being strtolower and working on that set of data for the query if it makes the query simpler or easier to construct.
An alternative could be something like
SELECT sum(length(regexp_replace(filmname, '[^X|^x]', '', 'g'))) FROM films;
for each letter combination but again 36 queries, 36 datasets, I would prefer if I could get the data in a single query.
Here is a short data set of 14 films from my set (which actually contains 275 rows)
District 9
Surrogates
The Invention Of Lying
Pandorum
UP
The Soloist
Cloudy With A Chance Of Meatballs
The Imaginarium of Doctor Parnassus
Cirque du Freak: The Vampires Assistant
Zombieland
9
The Men Who Stare At Goats
A Christmas Carol
Paranormal Activity
If I manually lay out each letter and number in a column and then register if that letter appears in the film title by giving it an x in that column and then count them up to produce a total I would have something like this below. Each vertical column of x's is a list of the letters in that filmname regardless of how many times that letter appears or its case.
The result for the short set above is:
A x x xxxx xxx 9
B x x 2
C x xxx xx 6
D x x xxxx 6
E xx xxxxx x 8
F x xxx 4
G xx x x 4
H x xxxx xx 7
I x x xxxxx xx 9
J 0
K x 0
L x xx x xx 6
M x xxxx xxx 8
N xx xxxx x x 8
O xxx xxx x xxx 10
P xx xx x 5
Q x 1
R xx x xx xxx 7
S xx xxxx xx 8
T xxx xxxx xxx 10
U x xx xxx 6
V x x x 3
W x x 2
X 0
Y x x x 3
Z x 1
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 x x 1
In the example above, each column is a "filmname" As you can see, column 5 marks only a "u" and a "p" and column 11 marks only a "9". The final column is the tally for each letter.
I want to build a query somehow that gives me the result rows: A 9, B 2, C 6, D 6, E 8 e.t.c taking into account every row entry extracted from my films column. If that letter doesn't appear in any row I would like a zero.
I don't know if this is even possible or whether to do it systematically in php with 36 queries is the only possibility.
In the current dataset there are 275 entries and it grows by around 8.33 a month (100 a year). I predict it will reach around 1000 rows by 2019 by which time I will be no doubt using a completely different system so I don't need to worry about working with a huge dataset to trawl through.
The current longest title is "Percy Jackson & the Olympians: The Lightning Thief" at 50 chars (yes, poor film I know ;-) and the shortest is 1, "9".
I am running version 9.0.0 of Postgres.
Apologies if I've said the same thing multiple times in multiple ways, I am trying to get as much information out so you know what I am trying to achieve.
If you need any clarification or larger datasets to test with please just ask and I'll edit as needs be.
Suggestion are VERY welcome.
Edit 1
Erwin Thanks for the edits/tags/suggestions. Agree with them all.
Fixed the missing "9" typo as suggested by Erwin. Manual transcribe error on my part.
kgrittn, Thanks for the suggestion but I am not able to update the version from 9.0.0. I have asked my provider if they will try to update.
Response
Thanks for the excellent reply Erwin
Apologies for the delay in responding but I have been trying to get your query to work and learning the new keywords to understand the query you created.
I adjusted the query to adapt into my table structure but the result set was not as expected (all zeros) so I copied your lines directly and had the same result.
Whilst the result set in both cases lists all 36 rows with the appropriate letters/numbers however all the rows shows zero as the count (ct).
I have tried to deconstruct the query to see where it may be falling over.
The result of
SELECT DISTINCT id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM films
is "No rows found". Perhaps it ought to when extracted from the wider query, I'm not sure.
When I removed the unnest function the result was 14 rows all with "NULL"
If I adjust the function
COALESCE(y.ct, 0) to COALESCE(y.ct, 4)<br />
then my dataset responds all with 4's for every letter instead of zeros as explained previously.
Having briefly read up on COALESCE the "4" being the substitute value I am guessing that y.ct is NULL and being substituted with this second value (this is to cover rows where the letter in the sequence is not matched, i.e if no films contain a 'q' then the 'q' column will have a zero value rather than NULL?)
The database I tried this on was SQL_ASCII and I wondered if that was somehow a problem but I had the same result on one running version 8.4.0 with UTF-8.
Apologies if I've made an obvious mistake but I am unable to return the dataset I require.
Any thoughts?
Again, thanks for the detailed response and your explanations.

This query should do the job:
Test case:
CREATE TEMP TABLE films (id serial, film text);
INSERT INTO films (film) VALUES
('District 9')
,('Surrogates')
,('The Invention Of Lying')
,('Pandorum')
,('UP')
,('The Soloist')
,('Cloudy With A Chance Of Meatballs')
,('The Imaginarium of Doctor Parnassus')
,('Cirque du Freak: The Vampires Assistant')
,('Zombieland')
,('9')
,('The Men Who Stare At Goats')
,('A Christmas Carol')
,('Paranormal Activity');
Query:
SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM (
SELECT chr(generate_series(97, 122)) AS letter -- a-z in UTF8!
UNION ALL
SELECT generate_series(0, 9)::text -- 0-9
) l
LEFT JOIN (
SELECT letter, count(id) AS ct
FROM (
SELECT DISTINCT -- count film once per letter
id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM films
) x
GROUP BY 1
) y USING (letter)
ORDER BY 1;
This requires PostgreSQL 9.1! Consider the release notes:
Change string_to_array() so a NULL separator splits the string into
characters (Pavel Stehule)
Previously this returned a null value.
You can use regexp_split_to_table(lower(film), ''), instead of unnest(string_to_array(lower(film), NULL)) (works in versions pre-9.1!), but it is typically a bit slower and performance degrades with long strings.
I use generate_series() to produce the [a-z0-9] as individual rows. And LEFT JOIN to the query, so every letter is represented in the result.
Use DISTINCT to count every film once.
Never worry about 1000 rows. That is peanuts for modern day PostgreSQL on modern day hardware.

A fairly simple solution which only requires a single table scan would be the following.
SELECT
'a', SUM( (title ILIKE '%a%')::integer),
'b', SUM( (title ILIKE '%b%')::integer),
'c', SUM( (title ILIKE '%c%')::integer)
FROM film
I left the other 33 characters as a typing exercise for you :)
BTW 1000 rows is tiny for a postgresql database. It's beginning to get large when the DB is larger then the memory in your server.
edit: had a better idea
SELECT chars.c, COUNT(title)
FROM (VALUES ('a'), ('b'), ('c')) as chars(c)
LEFT JOIN film ON title ILIKE ('%' || chars.c || '%')
GROUP BY chars.c
ORDER BY chars.c
You could also replace the (VALUES ('a'), ('b'), ('c')) as chars(c) part with a reference to a table containing the list of characters you are interested in.

This will give you the result in a single row, with one column for each matching letter and digit.
SELECT
SUM(CASE WHEN POSITION('a' IN filmname) > 0 THEN 1 ELSE 0 END) AS "A",
SUM(CASE WHEN POSITION('b' IN filmname) > 0 THEN 1 ELSE 0 END) AS "B",
SUM(CASE WHEN POSITION('c' IN filmname) > 0 THEN 1 ELSE 0 END) AS "C",
...
SUM(CASE WHEN POSITION('z' IN filmname) > 0 THEN 1 ELSE 0 END) AS "Z",
SUM(CASE WHEN POSITION('0' IN filmname) > 0 THEN 1 ELSE 0 END) AS "0",
SUM(CASE WHEN POSITION('1' IN filmname) > 0 THEN 1 ELSE 0 END) AS "1",
...
SUM(CASE WHEN POSITION('9' IN filmname) > 0 THEN 1 ELSE 0 END) AS "9"
FROM films;

A similar approach like Erwins, but maybe more comfortable in the long run:
Create a table with each character you're interested in:
CREATE TABLE char (name char (1), id serial);
INSERT INTO char (name) VALUES ('a');
INSERT INTO char (name) VALUES ('b');
INSERT INTO char (name) VALUES ('c');
Then grouping over it's values is easy:
SELECT char.name, COUNT(*)
FROM char, film
WHERE film.name ILIKE '%' || char.name || '%'
GROUP BY char.name
ORDER BY char.name;
Don't worry about ILIKE.
I'm not 100% happy about using the keyword 'char' as table title, but hadn't had bad experiences so far. On the other hand it is the natural name. Maybe if you translate it to another language - like 'zeichen' in German, you avoid ambiguities.

Related

grouping in sql in Access

I have a table that looks like this with three columns From, To, and Symbol:
From To Symbol
0 2 dog
2 5 dog
5 9 cat
9 15 cat
15 20 dog
20 40 dog
40 45 dog
I was trying to write an SQL query that groups records in a way that produces the following result:
From To Symbol
0 5 dog
5 15 cat
15 45 dog
That is, if the From and To values are continuous for the same Symbol, one result record is created with the smallest From and the largest To values and the Symbol. In the above example table, since the second record has a value of 2 in the To column which is not the same as the From value in the next record with the same Symbol (15, 20, dog), two result records are created for the same Symbol (dog).
I have tried to join the table to itself, then group by. But I could not figure out how exactly that can be done. I have to do this in Microsoft Access. Any help would be greatly appreciated. Thanks!
Assuming the values have no overlaps and that gaps separate values, you can do this in MS Access with a trick. You need to identify the adjacent symbols that are the same. Well, you can identify them by counting the number of previous rows with different symbols (using a subquery). Once you have this information, the rest is aggregation:
select symbol, min(from) as from, max(to) as to
from (select t.*,
(select count(*)
from t as t2
where t2.from < t.from and t2.symbol <> t.symbol
) as grp
from t
) t
group by symbol, grp;
Gaps would make this problem much harder in MS Access.
Note: Don't use reserved words or keywords for column names. This code uses the names supplied in the question, but doesn't bother to escape them. I think that just makes it harder to understand the query.

Searching for a number in a database column where column contains series of numbers seperated by a delimeter '"&" in SQLite

My table structure is as follows :
id category
1 1&2&3
2 18&2&1
3 11
4 1&11
5 3&1
6 1
My Question: I need a sql query which generates the result set as follows when the user searched category is 1
id category
1 1&2&3
2 18&2&1
4 1&11
5 3&1
6 1
but i am getting all the results not the expected one
I have tried regexp and like operators but no success.
select * from mytable where category like '%1%'
select * from mytable where category regexp '([.]*)(1)(.*)'
I really dont know about regexp I just found it.
so please help me out.
For matching a list item separated by &, use:
SELECT * FROM mytable WHERE '&'||category||'&' LIKE '%&1&%';
this will match entire item (ie, only 1, not 11, ...), whether it is at list beginning, middle or end.

substring and trim in Teradata

I am working in Teradata with some descriptive data that needs to be transformed from a gerneric varchar(60) into the different field lengths based on the type of data element and the attribute value. So I need to take whatever is in the Varchar(60) and based on field 'ABCD' act on field 'XYZ'. In this case XYZ is a varchar(3). To do this I am using CASE logic within my select. What I want to do is
eliminate all occurances of non alphabet/numeric data. All I want left are upper case Alpha chars and numbers.
In this case "Where abcd = 'GROUP' then xyz should come out as a '000', '002', 'A', 'C'
eliminate extra padding
Shift everything Right
abcd xyz
1 GROUP NULL
2 GROUP $
3 GROUP 000000000000000000000000000000000000000000000000000000000000
4 GROUP 000000000000000000000000000000000000000000000000000000000002
5 GROUP A
6 GROUP C
7 GROUP r
To do this I have tried TRIM and SUBSTR amongst several other things that did not work. I have pasted what I have working now, but I am not reliably working through the data within the select. I am really looking for some options on how to better work with strings in Teradata. I have been working out of the "SQL Functions, Operators, Expressions and Predicates" online PDF. Is there a better reference. We are on TD 13
SELECT abcd
, CASE
-- xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WHEN abcd= 'GROUP'
THEN(
CASE
WHEN SUBSTR(tx.abcd,60, 4) = 0
THEN (
SUBSTR(tx.abcd,60, 3)
)
ELSE
TRIM (TRAILING FROM tx.abcd)
END
)
END AS abcd
FROM db.descr tx
WHERE tx.abcd IS IN ( 'GROUP')
The end result should look like this
abcd xyz
1 GROUP 000
2 GROUP 002
3 GROUP A
4 GROUP C
I will have to deal with approx 60 different "abcd" types, but they should all conform to the type of data I am currently seeing.. ie.. mixed case, non numeric, non alphabet, padded, etc..
I know there is a better way, but I have come in several circles trying to figure this out over the weekend and need a little push in the right direction.
Thanks in advance,
Pat
The SQL below uses the CHARACTER_LENGTH function to first determine if there is a need to perform what amounts to a RIGHT(tx.xyz, 3) using the native functions in Teradata 13.x. I think this may accomplish what you are looking to do. I hope I have not misinterpreted your explanation:
SELECT CASE WHEN tx.abcd = 'GROUP'
AND CHARACTER_LENGTH(TRIM(BOTH FROM tx.xyz) > 3
THEN SUBSTRING(tx.xyz FROM (CHARACTER_LENGTH(TRIM(BOTH FROM tx.xyz)) - 3))
ELSE tx.abcd
END
FROM db.descr tx;
EDIT: Fixed parenthesis in SUBSTRING

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then when we receive an input (say, "tac") we sort it's letters, look it up in our table of sorted letters joined to the table of the original words, and that gives us a list of the words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
If you use the solution you provide, you'll need to add an order column to the WordLetters table. Without that, there's no guarantee that you'll retrieve the rows that you retrieve are in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters
AS (SELECT Cast('a' AS VARCHAR) AS Letter,
1 AS LetterValue,
1 AS LetterNumber
UNION ALL
SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
Power(2, LetterNumber),
LetterNumber + 1
FROM letters
WHERE LetterNumber < 26),
words
AS (SELECT 1 AS wordid, 'act' AS word
UNION ALL SELECT 2, 'cat'
UNION ALL SELECT 3, 'tom'
UNION ALL SELECT 4, 'moot'
UNION ALL SELECT 5, 'mote')
SELECT wordid,
word,
Sum(distinct LetterValue) as WordValue
FROM letters
JOIN words
ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
select w.word, (LEN(w.word) - SUM(LettersInWord))
from
(
select w.word, (LEN(w.word) - LEN(replace(w.word, wl.letter))) as LettersInWord
from word w
cross join wordletters wl
) wls
having (LEN(w.word) = SUM(LettersInWord))
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select w.word, (LEN(w.word) - SUM(LettersInWord))
from
(
select w.word,
(case when (LEN(w.word) - LEN(replace(w.word, wl.letter))) <= maxcount
then (LEN(w.word) - LEN(replace(w.word, wl.letter)))
else maxcount end) as LettersInWord
from word w
cross join
(
select letter, count(*) as maxcount
from wordletters wl
group by letter
) wl
) wls
having (LEN(w.word) = SUM(LettersInWord))
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
In my experience, I have actually seen decent performance with small cross joins. This might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transfered across the network.

Display more than one row with the same result from a field

I need to show more than one result from each field in a table. I need to do this with only one SQL sentence, I donĀ“t want to use a Cursor.
This seems silly, but the number of rows may vary for each item. I need this to print afterwards this information as a Crystal Report detail.
Suppose I have this table:
idItem Cantidad <more fields>
-------- -----------
1000 3
2000 2
3000 5
4000 1
I need this result, using one only SQL Sentence:
1000
1000
1000
2000
2000
3000
3000
3000
3000
3000
4000
where each idItem has Cantidad rows.
Any ideas?
It seems like something that should be handled in the UI (or the report). I don't know Crystal Reports well enough to make a suggestion there. If you really, truly need to do it in SQL, then you can use a Numbers table (or something similar):
SELECT
idItem
FROM
Some_Table ST
INNER JOIN Numbers N ON
N.number > 0 AND
N.number <= ST.cantidad
You can replace the Numbers table with a subquery or function or whatever other method you want to generate a result set of numbers that is at least large enough to cover your largest cantidad.
Check out UNPIVOT (MSDN)
Another example
If you use a "numbers" table that is useful for this and many similar purposes, you can use the following SQL:
select t.idItem
from myTable t
join numbers n on n.num between 1 and t.Cantidad
order by t.idTtem
The numbers table should just contain all integer numbers from 0 or 1 up to a number big enough so that Cantidad never exceeds it.
As others have said, you need a Numbers or Tally table which is just a sequential list of integers. However, if you knew that Cantidad was never going to be larger than five for example, you can do something like:
Select idItem
From Table
Join (
Select 1 As Value
Union All Select 2
Union All Select 3
Union All Select 4
Union All Select 5
) As Numbers
On Numbers.Value <= Table.Cantidad
If you are using SQL Server 2005, you can use a CTE to do:
With Numbers As
(
Select 1 As Value
Union All
Select N.Value + 1
From Numbers As N
)
Select idItem
From Table
Join Numbers As N
On N.Value <= Table.Cantidad
Option (MaxRecursion 0);