Efficient classification of records by common first letters in Impala SQL

I have a table in Impala (TBL1) that contains different names, where groups of names share a varying number of first letters. The table contains about 3M records. I would like to add a new attribute to the table, where each group of common first letters gets a class. It works the same way as DENSE_RANK, but with a dynamic number of first letters. The number of shared first letters should not be less than p = 3 (p is a parameter).
Here is an example for the table and the required results:
| ID | Attr1   | New_Attr1 | Some more attributes...
+----+---------+-----------+------------------------
| 1  | ZXA-12  | 1         |
| 2  | YL3300  | 2         |
| 3  | ZXA-123 | 1         |
| 4  | YL3400  | 2         |
| 5  | YL3-aaa | 2         |
| 6  | TSA 789 | 3         |
...

Does this do what you want?
select t.*,
       dense_rank() over (order by strleft(attr1, 3)) as newcol
from . . .;
The "3" is your parameter.
As a note: in your example, you seem to have assigned the new value in reverse alphabetical order, so you would want desc in the order by.
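For instance, a sketch applied to the question's table (TBL1, Attr1, and p = 3 come from the question; the alias new_attr1 is illustrative):
select t.*,
       dense_rank() over (order by strleft(attr1, 3) desc) as new_attr1
from TBL1 t;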

Related

How to find two consecutive rows sorted by date, containing a specific value?

I have a table with the following structure and data in it:
| ID | Date       | Result |
|----|------------|--------|
| 1  | 30/04/2020 | +      |
| 1  | 01/05/2020 | -      |
| 1  | 05/05/2020 | -      |
| 2  | 03/05/2020 | -      |
| 2  | 04/05/2020 | +      |
| 2  | 05/05/2020 | -      |
| 2  | 06/05/2020 | -      |
| 3  | 01/05/2020 | -      |
| 3  | 02/05/2020 | -      |
| 3  | 03/05/2020 | -      |
| 3  | 04/05/2020 | -      |
I'm trying to write an SQL query (I'm using SQL Server) which returns the date of the first two consecutive negative results for a given ID.
For example, for ID no. 1, the first two consecutive negative results are on 01/05 and 05/05.
The first two consecutive negative results for ID No. 2 are on 05/05 and 06/05.
The first two consecutive negative results for ID No. 3 are on 01/05 and 02/05.
So the query should produce the following result:
| ID | FirstNegativeDate |
|----|-------------------|
| 1  | 01/05             |
| 2  | 05/05             |
| 3  | 01/05             |
Please note that the dates aren't necessarily one day apart. Sometimes, two consecutive negative tests may be several days apart. But they should still be considered as "consecutive negative tests". In other words, two negative tests are not 'consecutive' only if there is a positive test result in between them.
How can this be done in SQL? I've done some reading, and it looks like the PARTITION BY clause may be required, but I'm not sure how it works.
This is a gaps-and-islands problem, where you want the start of the first island of '-'s that contains at least two rows.
I would recommend lead() and aggregation:
select id, min(date) as first_negative_date
from (
    select t.*, lead(result) over (partition by id order by date) as lead_result
    from mytable t
) t
where result = '-' and lead_result = '-'
group by id;
Use the LEAD or LAG function over an ID partition ordered by your Date column.
Then simply check where the LEAD/LAG column equals Result.
You'll also need to filter down to the first match per ID.
(The original answer included an image illustrating what LEAD/LAG returns.)
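A sketch of that approach in SQL Server (mytable is an assumed table name; row_number() does the "filter down to the first match" step):
with flagged as (
    select t.*,
           lead(result) over (partition by id order by [date]) as lead_result
    from mytable t
), ranked as (
    select id, [date],
           row_number() over (partition by id order by [date]) as rn
    from flagged
    where result = '-' and lead_result = '-'
)
select id, [date] as FirstNegativeDate
from ranked
where rn = 1;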

How to count the unique rows after aggregating to array

Trying to solve the problem in a read-only manner.
My table (answers) looks like the one below:
| user_id | value |
+---------+-------+
| 6       | pizza |
| 6       | tosti |
| 9       | fries |
| 9       | tosti |
| 10      | pizza |
| 10      | tosti |
| 12      | pizza |
| 12      | tosti |
| 13      | sushi |  -> did not finish the quiz.
NOTE: the actual table has 15+ different possible values. (Answers to questions).
I've been able to create the table below:
| value arr    | count | user_id |
+--------------+-------+---------+
| pizza, tosti | 2     | 6       |
| fries, tosti | 2     | 9       |
| pizza, tosti | 2     | 10      | *
| pizza, tosti | 2     | 12      | *
| sushi        | 1     | 13      |
I'm not sure whether the starred rows show up correctly in my current query (the DB has 30k rows and 15+ value options). The problem here is that "count" is counting the number of answers rather than the number of unique outcomes.
Current query looks a bit like:
select string_agg(DISTINCT value, ',' order by value) as value,
       user_id,
       count(DISTINCT value)
from answers
group by user_id;
Looking for the unique answer combinations like the table shown below:
| value arr    | count unique |
+--------------+--------------+
| pizza, tosti | 3            |
| fries, tosti | 1            |
| sushi        | 1            |  --> Hidden in perfect situation.
I've tried a bunch of queries, both hand-written and tool-generated, from super simplified to quite complex, but I keep ending up counting answers instead of counting each unique combination across users.
If this is a duplicate question, please re-direct me to it. Learned a lot these last few days, but haven't been able to find the answer yet.
Any help would be highly appreciated.
Here's what you need; you're almost there:
select t1.value, count(1)
from (
    select string_agg(DISTINCT value, ',' order by value) as value, user_id
    from answers
    group by user_id
) t1
group by t1.value;
You can try the equivalent in SQL Server. Note that SQL Server won't let you group by an aggregate directly (and string literals take single quotes), so aggregate per user in a derived table first:
select combo as value_arr, count(*) as count_unique
from (
    select string_agg(value, ',') within group (order by value) as combo
    from answers
    group by user_id
) t
group by combo;
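If, as the desired output suggests, combinations from users who gave only a single answer (the sushi row) should be hidden, a having clause in the inner query handles that; a sketch building on the first answer (Postgres syntax):
select t1.value, count(1) as count_unique
from (
    select string_agg(DISTINCT value, ',' order by value) as value, user_id
    from answers
    group by user_id
    having count(DISTINCT value) > 1  -- hide single-answer users
) t1
group by t1.value;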

Find sequence of choice in a column

There is a table where user_id identifies each test taker and choice holds the answer to each of the three questions. I would like to get all the distinct sequences of choices that test takers made and count each sequence. Is there a way to write an SQL query to achieve this? Thanks
| user_id | Choice |
+---------+--------+
| 1       | a      |
| 1       | b      |
| 1       | c      |
| 2       | b      |
| 2       | c      |
| 2       | a      |
+---------+--------+
Desired answer:
| choice | count |
+--------+-------+
| a,b,c  | 1     |
| b,c,a  | 1     |
+--------+-------+
In BigQuery, you can use aggregation functions:
select choices, count(*)
from (
    select string_agg(choice order by ?) as choices, user_id
    from t
    group by user_id
) t
group by choices;
The ? is for the column that specifies the ordering of the table. Remember: tables represent unordered sets, so without such a column the choices can be in any order.
You can do something similar in SQL Server 2017+ using string_agg(). In earlier versions, you have to use an XML method, which is rather unpleasant.
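For example, a sketch of the SQL Server 2017+ variant; the seq column is an assumption standing in for whatever column orders the answers, since string_agg needs an explicit within group ordering:
select choices, count(*) as cnt
from (
    select user_id,
           string_agg(choice, ',') within group (order by seq) as choices
    from t
    group by user_id
) t
group by choices;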

How to get the last non-empty column of a series from a data set in SAS

I'm working on a SAS application and I have a data set like this:
dim_point
rk | ID    | name  | value_0 | value_1 | value_2 | value_3 | value_4
1  | one   | one   | val_0   | val_1   | val_2   | val_3   | .
2  | two   | two   | val_0   | val_1   | val_2   | .       | .
3  | three | three | val_0   | .       | .       | .       | .
4  | four  | four  | val_0   | val_1   | .       | .       | .
I want to keep the other columns and get the last non-empty column of the value_ series, like this:
want
rk | ID    | name  | value
1  | one   | one   | val_3
2  | two   | two   | val_2
3  | three | three | val_0
4  | four  | four  | val_1
The code I'm trying is:
proc sql noprint;
    create table want as
    select rk, ID, name, name as value
    from dim_point;

    update want
    set value = "";
quit;
I don't know how to update the value column with the last non-empty value of the value_ series.
Use coalesce in reverse order:
set value = coalesce(value_4,value_3,value_2,value_1,value_0);
You might need to use coalescec instead for character variables.
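Plugged into the question's PROC SQL, a minimal sketch (coalescec is used on the assumption that the value_ columns are character):
proc sql noprint;
    create table want as
    select rk, ID, name,
           coalescec(value_4, value_3, value_2, value_1, value_0) as value
    from dim_point;
quit;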
Whilst the coalesce solution is fine when you know all the columns, there may be situations where they vary, or where there are significantly more of them.
In that case, you can use an array :
data want ;
    set dim_point ;
    array _v{*} value_: ; /* colon operator is a wildcard: every value_ variable, in dataset order */
    /* Loop backwards over the array; stop once we find a non-missing value */
    do i = dim(_v) to 1 by -1 until (not missing(value)) ;
        if not missing(_v{i}) then value = _v{i} ;
    end ;
    drop i ;
run ;

How to get top 3 frequencies in MySQL?

In MySQL I have a table called "meanings" with three columns:
"person" (int),
"word" (byte, 16 possible values)
"meaning" (byte, 26 possible values).
A person assigns one or more meanings to each word:
person  word  meaning
---------------------
1       1     4
1       2     19
1       2     7    <-- Note: second meaning for word 2
1       3     5
...
1       16    2
Then another person, and so on. There will be thousands of persons.
I need to find for each of the 16 words the top three meanings (with their frequencies). Something like:
+--------+-----------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+-----------------+------------------+-----------------+
| 1 | meaning 5 (35%) | meaning 19 (22%) | meaning 2 (13%) |
| 2 | meaning 8 (57%) | meaning 1 (18%) | meaning 22 (7%) |
+--------+-----------------+------------------+-----------------+
...
Is it possible to solve this with a single MySQL query?
Well, if you group by word and meaning, you can easily get the % of people who use each word/meaning combination out of the dataset.
In order to limit the number of meanings returned for each word, you will need to create some sort of filter per word/meaning combination.
It seems like you just want the answer to your homework, so I won't post more than this, but it should be enough to get you on the right track.
Of course you can do
SELECT * FROM meanings WHERE word = 2 ORDER BY meaning DESC LIMIT 3
But this is cheating, since you would need to loop over all 16 words.
I'm working on a better solution.
I believe a problem I had a while ago was similar; I ended up using the #counter approach.
Note about the problem
Let's suppose there is only one person, who says:
+--------+----------------+
| Person | Word | Meaning |
+--------+----------------+
| 1 | 1 | 7 |
| 1 | 1 | 3 |
| 1 | 2 | 8 |
+--------+----------------+
The report should read:
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (100%) | meaning 3 (100%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The following is not OK (50% frequency is absurd in a population of one person):
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (50%) | meaning 3 (50%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The intended meaning of the frequencies is: "How many people think this meaning corresponds to that word?"
So it's not merely about counting "cases", but about counting persons in the table.
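For reference, on MySQL 8.0+ this person-counting semantics can be sketched with window functions (the meanings table and its columns come from the question; the final pivot of three rows per word into three columns is left out):
select word, meaning, rnk,
       concat(round(100 * persons / total), '%') as freq
from (
    select word, meaning,
           count(distinct person) as persons,
           (select count(distinct person) from meanings) as total,
           row_number() over (partition by word
                              order by count(distinct person) desc) as rnk
    from meanings
    group by word, meaning
) ranked
where rnk <= 3;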