Adding padding to the result of a DISTINCT SQLite query - sql

I searched and figured out that I could use either substr with || or a printf statement with format specifiers to add padding to the results, but that doesn't seem to work when the SQLite query uses DISTINCT.
I have a table called timeLapse that looks like this:
+----+-------+-----------+
| ID | Time | Status |
+----+-------+-----------+
| 1 | 0.001 | Initiated |
| 1 | 0.002 | Cranked |
| 3 | 0.002 | Initiated |
| 2 | 0.002 | Initiated |
| 2 | 0.003 | Cranked |
+----+-------+-----------+
I can query the distinct IDs with something like SELECT DISTINCT ID AS IDs FROM timeLapse, which returns this:
+-----+
| IDs |
+-----+
| 1 |
| 2 |
| 3 |
+-----+
However, I would like to pad the resultant distinct rows like so:
+----------+
| IDs |
+----------+
| Object-1 |
| Object-2 |
| Object-3 |
+----------+
My query SELECT substr('Object-' || DISTINCT(ID), 10, 10) as IDs FROM timeLapse results in an error:
"[17:22:47] Error while executing SQL query on database 'machining': near "distinct": syntax error"
Could someone please help me understand what I am doing wrong here? I am enormously thankful for your time and help.

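Apply DISTINCT first, in a subquery, and then use substr() on its result. DISTINCT is a keyword of the SELECT clause, not a function, so it cannot appear inside an expression.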
select substr('Object-' || t1.ID, 1, 10) as IDs
from (SELECT DISTINCT(ID) ID FROM timeLapse) t1

All credits to the user named ϻᴇᴛᴀʟ, as I only understood from their answer that I should have a sub-query within this query where the DISTINCT should go into.
This resolves my problem:
select printf('Object-%s', t1.ID) as IDs
FROM (SELECT DISTINCT(id) ID FROM timeLapse) t1
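The subquery approach above can be checked with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory database filled with the rows from the question; the variable names are illustrative. It also shows a second form that appears to give the same result here, where DISTINCT is applied to the already-prefixed value and no subquery is needed.

```python
import sqlite3

# In-memory database with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timeLapse (ID INTEGER, Time REAL, Status TEXT)")
conn.executemany(
    "INSERT INTO timeLapse VALUES (?, ?, ?)",
    [
        (1, 0.001, "Initiated"),
        (1, 0.002, "Cranked"),
        (3, 0.002, "Initiated"),
        (2, 0.002, "Initiated"),
        (2, 0.003, "Cranked"),
    ],
)

# DISTINCT runs inside the subquery; the prefix is added to its result.
subquery_sql = """
    SELECT 'Object-' || t1.ID AS IDs
    FROM (SELECT DISTINCT ID FROM timeLapse) AS t1
    ORDER BY t1.ID
"""

# DISTINCT applied to the whole output row, so no subquery is needed.
flat_sql = "SELECT DISTINCT 'Object-' || ID AS IDs FROM timeLapse ORDER BY IDs"

padded_ids = [row[0] for row in conn.execute(subquery_sql)]
padded_ids_flat = [row[0] for row in conn.execute(flat_sql)]
print(padded_ids)
```

Both queries place DISTINCT directly after SELECT, which is the only position where the keyword is accepted.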

Related

SELECTing Related Rows Based on a Single Row Match

I have the following table running on PostgreSQL 9.5:
+---+------------+-------------+
|ID | trans_id | message |
+---+------------+-------------+
| 1 | 1234567 | abc123-ef |
| 2 | 1234567 | def234-gh |
| 3 | 1234567 | ghi567-ij |
| 4 | 8902345 | ced123-ef |
| 5 | 8902345 | def234-bz |
| 6 | 8902345 | ghi567-ij |
| 7 | 6789012 | abc123-ab |
| 8 | 6789012 | def234-cd |
| 9 | 6789012 | ghi567-ef |
|10 | 4567890 | abc123-ab |
|11 | 4567890 | gex890-aj |
|12 | 4567890 | ghi567-ef |
+---+------------+-------------+
I am looking for the rows for each trans_id based on a LIKE query, like this:
SELECT * FROM table
WHERE message LIKE '%def234%'
This, of course, returns just three rows, the three that match my pattern in the message column. What I am looking for, instead, is all the rows matching that trans_id in groups of messages that match. That is, if a single row matches the pattern, get all the rows with the trans_id of that matching row.
That is, the results would be:
+---+------------+-------------+
|ID | trans_id | message |
+---+------------+-------------+
| 1 | 1234567 | abc123-ef |
| 2 | 1234567 | def234-gh |
| 3 | 1234567 | ghi567-ij |
| 4 | 8902345 | ced123-ef |
| 5 | 8902345 | def234-bz |
| 6 | 8902345 | ghi567-ij |
| 7 | 6789012 | abc123-ab |
| 8 | 6789012 | def234-cd |
| 9 | 6789012 | ghi567-ef |
+---+------------+-------------+
Notice that rows 10, 11, and 12 were not selected because none of them matched the %def234% pattern.
I have tried (and failed) to write a sub-query to get all the related rows when a single message matches a pattern:
SELECT sub.*
FROM (
SELECT DISTINCT trans_id FROM table WHERE message LIKE '%def234%'
) sub
WHERE table.trans_id = sub.trans_id
I could easily do this with two queries, but the list of matching trans_ids returned by the first query, to be included in a WHERE trans_id IN (<huge list of trans_ids>) clause, would be very large. That would be a very inefficient way of doing this, and I believe there is a way to do it with a single query.
Thank you!
I think this will do the job:
WITH sub AS (
-- DISTINCT avoids duplicate output rows when several messages of one trans_id match
SELECT DISTINCT trans_id
FROM table
WHERE message LIKE '%def234%'
)
SELECT *
FROM table JOIN sub USING (trans_id);
Hope this helps.
Try this:
SELECT ID, trans_id, message
FROM (
SELECT ID, trans_id, message,
COUNT(*) FILTER (WHERE message LIKE '%def234%')
OVER (PARTITION BY trans_id) AS pattern_cnt
FROM mytable) AS t
WHERE pattern_cnt >= 1
Using a FILTER clause in the windowed version of COUNT function we can get the number of records matching the predefined pattern within each trans_id slice. The outer query uses this count to filter out irrelevant slices.
You can do this.
WITH trans
AS
(SELECT DISTINCT trans_id
FROM t1
WHERE message LIKE '%def234%')
SELECT t1.*
FROM t1,
trans
WHERE t1.trans_id = trans.trans_id;
I think this will perform better. If you have enough data, you can run EXPLAIN on both the subquery and the CTE versions and compare the output.
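The same "match one row, return the whole group" idea can also be written with an IN subquery, with no list of IDs passed between queries. The following is a minimal sketch using Python's sqlite3 module as a stand-in for PostgreSQL; the table name messages is an assumption, since the question only calls it "table".

```python
import sqlite3

# SQLite is used here only as a convenient stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER, trans_id INTEGER, message TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, 1234567, "abc123-ef"), (2, 1234567, "def234-gh"), (3, 1234567, "ghi567-ij"),
        (4, 8902345, "ced123-ef"), (5, 8902345, "def234-bz"), (6, 8902345, "ghi567-ij"),
        (7, 6789012, "abc123-ab"), (8, 6789012, "def234-cd"), (9, 6789012, "ghi567-ef"),
        (10, 4567890, "abc123-ab"), (11, 4567890, "gex890-aj"), (12, 4567890, "ghi567-ef"),
    ],
)

# The inner query finds the trans_ids with at least one matching message;
# the outer query returns every row that belongs to those trans_ids.
related_sql = """
    SELECT id, trans_id, message
    FROM messages
    WHERE trans_id IN (
        SELECT trans_id FROM messages WHERE message LIKE '%def234%'
    )
    ORDER BY id
"""

matched_ids = [row[0] for row in conn.execute(related_sql)]
print(matched_ids)
```

Because IN only tests membership, a trans_id with several matching messages still yields each of its rows once.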

Implicit Flattening in BigQuery

When does BigQuery flatten an intermediate result set? I was under the impression that it was only when FLATTEN was invoked, but I've encountered an example where the result is flattened without a FLATTEN.
This is the case - this base query returns one record:
select count(*) from publicdata:samples.trigrams
where ngram = 'der Griindung im'
+-----+
| f0_ |
+-----+
| 1 |
+-----+
When queried, you can see that the record has a repeated field that is repeated twice.
select * from publicdata:samples.trigrams
where ngram = 'der Griindung im'
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
| ngram | first | second | third | fourth | fifth | cell_value | cell_volume_count | cell_volume_fraction | cell_page_count | cell_match_count | cell_sample_id | cell_sample_text | cell_sample_title | cell_sample_subtitle | cell_sample_authors | cell_sample_url |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
| der Griindung im | der | Griindung | im | NULL | NULL | 2007 | 54 | 0.008746355685131196 | 54 | 54 | NULL | NULL | NULL | NULL | NULL | NULL |
| der Griindung im | der | Griindung | im | NULL | NULL | 2008 | 47 | 0.007612568837058633 | 47 | 47 | NULL | NULL | NULL | NULL | NULL | NULL |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
When I add a filter on cell.value, I get two records instead of one - but I never flattened so I'm not sure about the behavior here. My expectation is that this would return the same output as the previous COUNT above. It doesn't:
select count(*) from publicdata:samples.trigrams
where ngram = 'der Griindung im' and cell.value in ('2007', '2008')
+-----+
| f0_ |
+-----+
| 2 |
+-----+
What this means is that while I expect select count(*) from publicdata:samples.trigrams where ngram = 'der Griindung im' and select count(*) from publicdata:samples.trigrams where ngram = 'der Griindung im' and cell.value in ('2007', '2008') to return the same output, they don't, because one is implicitly flattened and the other is not. While this may not seem like a huge issue, it could matter significantly if it were part of a nested query that expected an intermediate result to be flattened or repeated.
Under what conditions does BigQuery flatten results without an explicit FLATTEN?
First, here is how to get the correct count in this case.
So instead of
SELECT COUNT(*)
FROM [publicdata:samples.trigrams]
WHERE ngram = 'der Griindung im'
AND cell.value IN ('2007', '2008')
with result of
+-----+
| f0_ |
+-----+
| 2 |
+-----+
you should do
SELECT COUNT(*)
FROM [publicdata:samples.trigrams]
WHERE ngram = 'der Griindung im'
OMIT RECORD IF EVERY(cell.value NOT IN ('2007', '2008'))
with result of
+-----+
| f0_ |
+-----+
| 1 |
+-----+
which I think is what you expected.
Secondly: under what conditions does BigQuery flatten results without an explicit FLATTEN?
I think (this is just my guess, based on observing BigQuery's behavior) that every time you explicitly reference a record's field within a clause like SELECT or WHERE, it gets automatically flattened for you. Using the FLATTEN operator helps "control" this process.
Short story: use count(0) instead of count(*). (You get 1 instead of 2.)
count(*) behaves strangely with repeated fields. It looks like the results are flattened, but if that were really the case, this should also affect count(0). I've asked about this here, but I haven't so far received a full explanation.
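The two counts discussed above can be pictured with a toy model in plain Python. This is only a sketch of the two ways of counting, not BigQuery itself and not a description of its internals; the record layout and variable names are assumptions made for illustration.

```python
# Toy model only: plain Python data standing in for one trigrams record.
record = {
    "ngram": "der Griindung im",
    "cell": [{"value": "2007"}, {"value": "2008"}],  # the repeated field
}
table = [record]
wanted = {"2007", "2008"}

# Flattened view: one row per (record, cell) pair, filtered row by row.
flattened_count = sum(
    1 for rec in table for cell in rec["cell"] if cell["value"] in wanted
)

# Record-level view: keep a record if any of its cells matches, which is
# the complement of OMIT RECORD IF EVERY(cell.value NOT IN (...)).
record_count = sum(
    1 for rec in table if any(cell["value"] in wanted for cell in rec["cell"])
)

print(flattened_count, record_count)
```

The first count corresponds to the result of 2 in the question, the second to the result of 1 from the OMIT RECORD IF query.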

Get row with max value in Hive/SQL?

I'm new to Hive/SQL, and I'm stuck on a fairly simple problem. My data looks like:
+------------+--------------------+-----------------------+
| carrier_iD | meandelay | meancanceled |
+------------+--------------------+-----------------------+
| EV | 13.795802119653473 | 0.028584251044292006 |
| VX | 0.450591016548463 | 2.364066193853424E-4 |
| F9 | 10.898001378359766 | 0.00206753962784287 |
| AS | 0.5071547420965062 | 0.0057404326123128135 |
| HA | 1.2031093279839498 | 5.015045135406214E-4 |
| 9E | 8.147899230704216 | 0.03876067292247866 |
| B6 | 9.45383857757506 | 0.003162096314343487 |
| UA | 8.101511665305816 | 0.005467725574605967 |
| FL | 0.7265068895709532 | 0.0041141513746490044 |
| WN | 7.156119279121648 | 0.0057419058192869415 |
| DL | 4.206288692245839 | 0.005123990066804269 |
| YV | 6.316802855264404 | 0.029304029304029346 |
| US | 3.2221527095063736 | 0.007984031936127766 |
| OO | 6.954715814690328 | 0.02596499362466706 |
| MQ | 9.74568222216328 | 0.025628100708354324 |
| AA | 8.720522654298968 | 0.019242775597574157 |
+------------+--------------------+-----------------------+
I want Hive to return the row with the maximum meandelay value. I have:
SELECT CAST(MAX(meandelay) as FLOAT) FROM flightinfo;
which indeed returns the max (I use cast because my values are saved as STRING). So then:
SELECT * FROM flightinfo WHERE meandelay = (SELECT CAST(MAX(meandelay) AS FLOAT) FROM flightinfo);
I get the following error:
FAILED: ParseException line 1:44 cannot recognize input near 'select' 'cast' '(' in expression specification
Use the windowing and analytics functions
SELECT carrier_id, meandelay, meancanceled
FROM
(SELECT carrier_id, meandelay, meancanceled,
rank() over (order by cast(meandelay as float) desc) as r
FROM flightinfo) S
WHERE S.r = 1;
This also handles ties: if more than one row has the same max value, you'll get all of those rows in the result. If you want just a single row, change rank() to row_number() or add another term to the ORDER BY.
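Use a join instead. Because meandelay is stored as STRING, cast each value before taking MAX() so that the maximum is numeric rather than lexical, and compare on the same cast:
SELECT a.* FROM flightinfo a LEFT SEMI JOIN
(SELECT MAX(CAST(meandelay AS DOUBLE)) AS maxdelay
FROM flightinfo) b ON (CAST(a.meandelay AS DOUBLE) = b.maxdelay)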
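Where the cast is placed matters when numbers are stored as text. The sketch below uses Python's sqlite3 module as a stand-in for Hive, with three rows taken from the question's table; an INNER JOIN replaces LEFT SEMI JOIN, which SQLite does not have. It illustrates the idea under those assumptions and does not reproduce Hive's exact behavior.

```python
import sqlite3

# SQLite stands in for Hive; the values are stored as text, as in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flightinfo (carrier_id TEXT, meandelay TEXT, meancanceled TEXT)"
)
conn.executemany(
    "INSERT INTO flightinfo VALUES (?, ?, ?)",
    [
        ("EV", "13.795802119653473", "0.028584251044292006"),
        ("MQ", "9.74568222216328", "0.025628100708354324"),
        ("VX", "0.450591016548463", "2.364066193853424E-4"),
    ],
)

# MAX over text compares character by character, so '9...' beats '13...'.
lexical_max = conn.execute("SELECT MAX(meandelay) FROM flightinfo").fetchone()[0]

# Casting each value first gives the numeric maximum.
numeric_max = conn.execute(
    "SELECT MAX(CAST(meandelay AS REAL)) FROM flightinfo"
).fetchone()[0]

# Join each row against the single-row subquery that holds the numeric maximum.
top_rows = conn.execute(
    """
    SELECT a.carrier_id, a.meandelay
    FROM flightinfo AS a
    JOIN (SELECT MAX(CAST(meandelay AS REAL)) AS maxdelay FROM flightinfo) AS b
      ON CAST(a.meandelay AS REAL) = b.maxdelay
    """
).fetchall()

print(lexical_max, numeric_max, top_rows)
```

Both sides of the join condition go through the same cast, so the equality test compares identical floating-point values.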
You can use the collect_max UDF from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem, passing in a value of 1, meaning that you only want the single max value.
select array_index( map_keys( collect_max( carrier_id, meandelay, 1) ), 0 ) from flightinfo;
Also, I've read somewhere that the Hive max UDF does allow you to access other fields on the row, but I think it's easier just to use collect_max.
I don't think your sub-query is allowed ...
A quick look here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
states:
As of Hive 0.13 some types of subqueries are supported in the WHERE
clause. Those are queries where the result of the query can be treated
as a constant for IN and NOT IN statements (called uncorrelated
subqueries because the subquery does not reference columns from the
parent query):

difference between select * and select table name

This is the basic question about sql statements.
What is the difference between
SELECT * FROM "Users"
and
SELECT "Users".* FROM "Users"
[TableName].[column] is usually used to pinpoint the table you wish to use when two tables are present in a join or a complex statement and you want to define which column to use out of two with the same name.
Its most common use is in a join, though. For a basic statement such as the one above, there is no difference and the output will be the same.
In your case there is no difference. The difference emerges when you select from multiple tables: * takes the columns from all the tables, while TABLE_NAME.* takes only the columns from that table. Suppose we have a database with two tables:
mysql> SELECT * FROM report;
+----+------------+
| id | date |
+----+------------+
| 1 | 2013-05-01 |
| 2 | 2013-06-02 |
+----+------------+
mysql> SELECT * FROM sites_to_report;
+---------+-----------+---------------------+------+
| site_id | report_id | last_run | rows |
+---------+-----------+---------------------+------+
| 1 | 1 | 2013-05-01 16:20:21 | 1 |
| 1 | 2 | 2013-05-03 16:20:21 | 1 |
| 2 | 2 | 2013-05-03 14:21:47 | 1 |
+---------+-----------+---------------------+------+
mysql> SELECT
-> *
-> FROM
-> report
-> INNER JOIN
-> sites_to_report
-> ON
-> sites_to_report.report_id=report.id;
+----+------------+---------+-----------+---------------------+------+
| id | date | site_id | report_id | last_run | rows |
+----+------------+---------+-----------+---------------------+------+
| 1 | 2013-05-01 | 1 | 1 | 2013-05-01 16:20:21 | 1 |
| 2 | 2013-06-02 | 1 | 2 | 2013-05-03 16:20:21 | 1 |
| 2 | 2013-06-02 | 2 | 2 | 2013-05-03 14:21:47 | 1 |
+----+------------+---------+-----------+---------------------+------+
mysql> SELECT
-> report.*
-> FROM
-> report
-> INNER JOIN
-> sites_to_report
-> ON
-> sites_to_report.report_id=report.id;
+----+------------+
| id | date |
+----+------------+
| 1 | 2013-05-01 |
| 2 | 2013-06-02 |
| 2 | 2013-06-02 |
+----+------------+
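The difference in returned columns can be reproduced by inspecting only the column names of each result. This is a minimal sketch with Python's sqlite3 module, assuming empty tables with the same layout as the MySQL example; column_names is a hypothetical helper written for this illustration.

```python
import sqlite3

# Empty tables with the same columns as in the example above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE report (id INTEGER, date TEXT);
    CREATE TABLE sites_to_report (
        site_id INTEGER, report_id INTEGER, last_run TEXT, "rows" INTEGER
    );
    """
)

JOIN_SQL = (
    " FROM report INNER JOIN sites_to_report"
    " ON sites_to_report.report_id = report.id"
)

def column_names(select_list):
    """Return the result column names for the given select list (hypothetical helper)."""
    cursor = conn.execute("SELECT " + select_list + JOIN_SQL)
    return [col[0] for col in cursor.description]

all_columns = column_names("*")            # columns of both joined tables
report_columns = column_names("report.*")  # columns of report only
print(all_columns, report_columns)
```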
For the example you gave, there is no semantic difference between the two. Any performance difference would be negligible: the parser just reads two strings of different lengths.
That only holds for your example, though. In queries that involve multiple tables, tableName.* specifies which table's columns you want to select.
Example:
Suppose you have two tables, TableA and TableB, that both have a column called Name. The table-name qualifier lets you specify which table's Name column you want:
select TableA.Name, TableB.Name from TableA, TableB where TableA.age = TableB.age
That's all I can say.
The particular examples specified would return the same result and have the same performance. There would be no difference in that respect, therefore.
However, in some SQL products the difference between how * and alias.* are interpreted affects what else you can add to the query. More specifically, in Oracle, you can mix alias.* with other expressions being returned as columns, i.e. this
SELECT "Users".*, SomeColumn * 2 AS DoubleValue FROM "Users"
would work. At the same time, * must stand on its own, meaning that the following
SELECT *, SomeColumn * 2 AS DoubleValue FROM "Users"
would be illegal.
For the examples you provided, the only difference is in syntax. What both of the queries share is that they are really bad. Select * is evil no matter how you write it and can get you into all kinds of trouble. Get into the habit of listing the columns you want to have included in your result set.

SQL LIKE (Reverse)

I'm kinda confused on how this thing can be done.
I have a database table having these values:
| id | file_type | code | position |
-------------------------------------
| 1 | Order | SO | 1 |
| 2 | Order | 1-SO | 7 |
| 3 | Order | 1_SO | 7 |
Now, I want to get the position and the file type of my filename so I come up with this query:
SET @FileName = '1-SO1234567890.pdf'
SELECT *
FROM tbl_FileTypes
WHERE CHARINDEX(code, @FileName) > 0
Sadly, I'm getting two results here, I only need the data with the "1-SO" result.
| id | file_type | code | position |
-------------------------------------
| 1 | Order | SO | 1 |
| 2 | Order | 1-SO | 7 |
I believe that my WHERE Clause causes this to happen.
Is there any better way for me to get my desired results?
Thank you very much in advance.
You might want to use SUBSTRING instead (assuming T-SQL):
WHERE code = SUBSTRING(@FileName, 1, LEN(code));
This checks whether the first LEN(code) characters of @FileName equal the given code.
Well,
SET @FileName = '1-SO1234567890.pdf'
SELECT *
FROM tbl_FileTypes
WHERE CHARINDEX(code, @FileName) = 1
would work in this case, but if you also had a code such as '1-SOS', a filename starting with '1-SOS' would match both codes and it would fail again.
Since you are checking the leftmost characters, you could also use the LEFT() function, as below. Use TOP (1) to get a single record, and consider adding an ORDER BY as needed.
SELECT TOP (1) *
FROM tbl_FileTypes
WHERE LEFT(@FileName, LEN(code)) = code
--ORDER BY yourCol
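The prefix comparison can be tried out with Python's sqlite3 module. This is a sketch under assumptions: SQLite's substr() and length() stand in for the T-SQL functions, LIMIT 1 stands in for TOP (1), and ordering by code length (longest first) is one possible choice for the ORDER BY left open above. match_file_type is a hypothetical helper.

```python
import sqlite3

# SQLite stands in for SQL Server; the rows are taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_FileTypes (id INTEGER, file_type TEXT, code TEXT, position INTEGER)"
)
conn.executemany(
    "INSERT INTO tbl_FileTypes VALUES (?, ?, ?, ?)",
    [(1, "Order", "SO", 1), (2, "Order", "1-SO", 7), (3, "Order", "1_SO", 7)],
)

def match_file_type(file_name):
    """Return the row whose code is the longest prefix of file_name, or None."""
    return conn.execute(
        """
        SELECT id, file_type, code, position
        FROM tbl_FileTypes
        WHERE substr(:name, 1, length(code)) = code  -- exact prefix comparison
        ORDER BY length(code) DESC                   -- prefer the longest code
        LIMIT 1
        """,
        {"name": file_name},
    ).fetchone()

best = match_file_type("1-SO1234567890.pdf")
print(best)
```

An exact comparison is used rather than LIKE because the underscore in the code 1_SO would act as a single-character wildcard in a LIKE pattern and would also match 1-SO.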