How to divide a string with regexp_extract by blank space (SQL - Athena)

I'm currently working on splitting a message from our web server log into columns.
For example, my message (datatype string) looks like this:
at=info method=GET path="/v1/..." host=web.com request_id=a3d71fa9-9501-4bfe-8462-54301a976d74 fwd="xxx.xx" dyno=web.1 connect=1ms service=167ms status=200 bytes=1114
and I want to split it into columns:
path    | service | connect | method | status | fwd     | dyno  |
------- | ------- | ------- | ------ | ------ | ------- | ----- |
/v1/... | 167     | 1       | GET    | 200    | xxx.xxx | web.1 |
I played around with the regexp_extract function (for the first time) on Amazon Athena in Standard SQL and already got a few columns out of the string, but I'm struggling with some of the others.
When I try, for example, to cut the dyno out of the string, I get more than I need:
REGEXP_EXTRACT (message,'dyno=[^,]+[a-z]')AS dyno
-> dyno=web.2 connect=0ms service=192ms status=200 bytes
I want dyno=web.1 as the result, which I would then extract from again.
It would be nice to cut the string from the start ("dyno=") up to the blank space before "connect=", but I couldn't find the right option on the sites I read.
How do I write the pattern to get the right piece of the string?

Piggybacking on Sebastian's comment, I agree that \S+ is the way to go: it matches a run of non-whitespace characters, so the match stops at the blank space before "connect=". The query would look like this:
select REGEXP_EXTRACT(message, 'dyno=(\S+)', 1) AS dyno
from (
    select
        'at=info method=GET path="/v1/..." host=web.com request_id=a3d71fa9-9501-4bfe-8462-54301a976d74 fwd="xxx.xx" dyno=web.1 connect=1ms service=167ms status=200 bytes=1114' message
)
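To see why \S+ behaves differently from the original [^,]+[a-z] pattern, here is a minimal sketch using Python's re module as a stand-in for Athena's regex engine. The two engines are not identical, but greedy character classes and \S work the same way on this input. The helper name extract is hypothetical and not part of any Athena API.

```python
import re

message = ('at=info method=GET path="/v1/..." host=web.com '
           'request_id=a3d71fa9-9501-4bfe-8462-54301a976d74 fwd="xxx.xx" '
           'dyno=web.1 connect=1ms service=167ms status=200 bytes=1114')


def extract(key, text):
    """Return the value after `key=` up to the next whitespace, or None."""
    match = re.search(rf'{re.escape(key)}=(\S+)', text)
    return match.group(1) if match else None


# The original pattern: [^,]+ is greedy and the message contains no comma,
# so it runs to the end of the string and then backtracks to the last
# lowercase letter, which is the "s" of "bytes".
greedy = re.search(r'dyno=[^,]+[a-z]', message).group(0)

# \S+ stops at the first blank space after the value.
dyno = extract('dyno', message)

# A capture group can also drop a unit suffix such as "ms" in one step.
service_ms = re.search(r'service=(\d+)ms', message).group(1)

print(greedy)
print(dyno, service_ms)
```

The same capture-group idea should carry over to the other columns in Athena, for example a pattern like 'service=(\d+)ms' with group 1, though that exact call is untested here.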

If you don't have spaces within your values (as in these key-value pairs), then there is an easy solution: split_to_map.
select msg['at']         as "at"
      ,msg['method']     as "method"
      ,msg['path']       as "path"
      ,msg['host']       as "host"
      ,msg['request_id'] as "request_id"
      ,msg['fwd']        as "fwd"
      ,msg['dyno']       as "dyno"
      ,msg['connect']    as "connect"
      ,msg['service']    as "service"
      ,msg['status']     as "status"
      ,msg['bytes']      as "bytes"
from (select split_to_map(message, ' ', '=') as msg
      from mytable
     )
;
 at   | method | path      | host    | request_id                           | fwd      | dyno  | connect | service | status | bytes
------+--------+-----------+---------+--------------------------------------+----------+-------+---------+---------+--------+-------
 info | GET    | "/v1/..." | web.com | a3d71fa9-9501-4bfe-8462-54301a976d74 | "xxx.xx" | web.1 | 1ms     | 167ms   | 200    | 1114
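For readers who want to check the splitting logic outside Athena, the following is a rough Python analogue of split_to_map(message, ' ', '='). It is only a sketch: the function below is a hypothetical re-implementation, and it does not model how the real function treats duplicate keys or malformed entries. It also shows that the quotes and the "ms" suffix remain part of the values and need a second clean-up step.

```python
message = ('at=info method=GET path="/v1/..." host=web.com '
           'request_id=a3d71fa9-9501-4bfe-8462-54301a976d74 fwd="xxx.xx" '
           'dyno=web.1 connect=1ms service=167ms status=200 bytes=1114')


def split_to_map(text, entry_delim=' ', kv_delim='='):
    """Split text into entries, then split each entry on the first kv_delim."""
    result = {}
    for entry in text.split(entry_delim):
        key, _, value = entry.partition(kv_delim)
        result[key] = value
    return result


msg = split_to_map(message)

# Values come back exactly as they appear in the log line,
# so quotes and unit suffixes have to be stripped separately.
path = msg['path'].strip('"')
connect_ms = int(msg['connect'][:-2])  # drop the trailing "ms"

print(msg['dyno'], path, connect_ms)
```

This only works because no value in the sample contains a blank space; a quoted path with a space in it would be cut in two.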

Related

Tracking Growth of a Metric over Time In TimescaleDB

I'm currently running TimescaleDB. I have a table that looks similar to the following:
 one_day                | name | metric_value
------------------------+------+--------------
 2022-05-30 00:00:00+00 | foo  | 400
 2022-05-30 00:00:00+00 | bar  | 200
 2022-06-01 00:00:00+00 | foo  | 800
 2022-06-01 00:00:00+00 | bar  | 1000
I'd like a query that returns the % growth and raw growth of metric, so something like the following.
 name | % growth | growth
------+----------+--------
 foo  | 200%     | 400
 bar  | 500%     | 800
I'm fairly new to TimescaleDB and not sure what the most efficient way to do this is. I've tried using LAG, but the main problem is that OVER (GROUP BY time, url) doesn't restrict the window to rows with the same name, and I can't seem to get around it. The query works fine for a single name.
Thanks!
Use LAG to get the previous value for the same name using the PARTITION option:
lag(metric_value,1,0) over (partition by name order by one_day)
This says, when ordered by 'one_day', within each 'name', give me the previous (the second parameter to LAG says 1 row) value of 'metric_value'; if there is no previous row, give me '0'.
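Here is a minimal, runnable sketch of that window function applied to the sample data. It uses Python's built-in sqlite3 module as a stand-in for TimescaleDB (window functions need SQLite 3.25 or newer), and the table name metrics is an assumption. Two deliberate deviations, both assumptions on my part: LAG is left with its default of NULL instead of 0 so that the first row per name can be filtered out without dividing by zero, and "% growth" is computed as new / old * 100 because that is what reproduces the numbers in the question.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE metrics (one_day TEXT, name TEXT, metric_value INTEGER)')
conn.executemany('INSERT INTO metrics VALUES (?, ?, ?)', [
    ('2022-05-30', 'foo', 400),
    ('2022-05-30', 'bar', 200),
    ('2022-06-01', 'foo', 800),
    ('2022-06-01', 'bar', 1000),
])

query = """
SELECT name,
       100.0 * metric_value / prev_value AS pct_growth,
       metric_value - prev_value         AS growth
FROM (
    SELECT name, one_day, metric_value,
           -- previous value within the same name, ordered by day
           LAG(metric_value) OVER (PARTITION BY name ORDER BY one_day) AS prev_value
    FROM metrics
) AS with_prev
WHERE prev_value IS NOT NULL   -- drop the first row of each name
ORDER BY name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With more than two dates per name this returns one row per consecutive pair of days, so a further filter or aggregation would be needed to compare only the first and last value.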

SQL syntax for removing a specific row [time] from a specific group [symbol]

I'm running up against the edge of my SQL Query knowledge and could use a point in the right direction. (I am using Presto, but ideally that shouldn't matter because Presto uses common SQL syntax.)
What I would like to do is exclude the 9:31:00 [QuoteTime] row ONLY for the 'VIX' Symbol. If possible, I would like to exclude that 9:31:00 VIX row only IF the [Bid] AND [Ask] are both 0.
I have looked into the HAVING clause as an alternative/addition to the WHERE clause. I was not successful in integrating this to my query.
I have looked into the LIKE clause in an effort to find the 9:31:00 time. I'm also concerned about my ability to write this efficiently.
My struggle is combining them to create the correct and most efficient query.
Here is my data:
+---------------+--------+---------+---------+
| QuoteTime     | Symbol | Bid     | Ask     |
+---------------+--------+---------+---------+
| 09:31:00      | VIX    | 0       | 0       |
| 09:32:00      | VIX    | 13.24   | 13.24   |
| 09:33:00      | VIX    | 13.21   | 13.21   |
| 09:31:00      | SPX    | 2889.36 | 2894.18 |
| 09:32:00      | SPX    | 2889.15 | 2892.99 |
| 09:33:00      | SPX    | 2889.89 | 2892.71 |
| 09:31:00      | NDX    | 7616.64 | 7616.64 |
| 09:32:00      | NDX    | 7612.13 | 7612.13 |
| 09:33:00      | NDX    | 7613.32 | 7613.32 |
+---------------+--------+---------+---------+
Here is my current query:
SELECT QuoteTime, Symbol, ((Bid+Ask)/2) as MidPoint
FROM schema.tablename
WHERE (Symbol IN ('SPX', 'VIX'))
Below is my nuclear option. I don't like it because it may (unbeknownst to me) remove other rows which contain 0s on other symbols at other times:
SELECT QuoteTime, Symbol, ((Bid+Ask)/2) as MidPoint
FROM schema.tablename
WHERE (Symbol IN ('SPX', 'VIX')) AND Bid != 0 AND Ask != 0
You want to select all rows for symbols 'SPX' and 'VIX', but exclude 9:31:00 - VIX - 0 - 0. There are several ways to express this. One way has been shown in fa06's answer.
SELECT quotetime, symbol, ((bid+ask)/2) as midpoint
FROM schema.tablename
WHERE symbol in ('SPX', 'VIX')
AND (quotetime, symbol, bid, ask) NOT IN ((time '09:31', 'VIX', 0, 0));
(EDIT: You say that this doesn't work for you. It may be that Presto doesn't yet support the IN clause with tuples.)
Another is:
SELECT quotetime, symbol, ((bid+ask)/2) as midpoint
FROM schema.tablename
WHERE symbol IN ('SPX', 'VIX')
AND (quotetime <> time '09:31' OR symbol <> 'VIX' OR bid <> 0 OR ask <> 0);
Another is:
SELECT quotetime, symbol, ((bid+ask)/2) as midpoint
FROM schema.tablename
WHERE symbol IN ('SPX', 'VIX')
AND NOT (quotetime = time '09:31' AND symbol = 'VIX' AND bid = 0 AND ask = 0);
(If one of the columns can be null then you must consider this in the query, too. E.g. (ask <> 0 OR ask IS NULL).)
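The NOT form and the OR form can be checked against the sample data with a small sketch. It runs on Python's built-in sqlite3 rather than Presto, so the table name quotes is an assumption and times are stored as plain text instead of using a time literal. The last statement illustrates the NULL remark: a comparison with NULL makes the whole predicate NULL, which WHERE treats as not true.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE quotes (quotetime TEXT, symbol TEXT, bid REAL, ask REAL)')
conn.executemany('INSERT INTO quotes VALUES (?, ?, ?, ?)', [
    ('09:31:00', 'VIX', 0, 0),
    ('09:32:00', 'VIX', 13.24, 13.24),
    ('09:33:00', 'VIX', 13.21, 13.21),
    ('09:31:00', 'SPX', 2889.36, 2894.18),
    ('09:32:00', 'SPX', 2889.15, 2892.99),
    ('09:33:00', 'SPX', 2889.89, 2892.71),
    ('09:31:00', 'NDX', 7616.64, 7616.64),
    ('09:32:00', 'NDX', 7612.13, 7612.13),
    ('09:33:00', 'NDX', 7613.32, 7613.32),
])

base = "SELECT quotetime, symbol FROM quotes WHERE symbol IN ('SPX', 'VIX') AND "
not_form = base + "NOT (quotetime = '09:31:00' AND symbol = 'VIX' AND bid = 0 AND ask = 0)"
or_form = base + "(quotetime <> '09:31:00' OR symbol <> 'VIX' OR bid <> 0 OR ask <> 0)"
order = " ORDER BY symbol, quotetime"

kept_not = conn.execute(not_form + order).fetchall()
kept_or = conn.execute(or_form + order).fetchall()

# Three-valued logic: if ask were NULL on an otherwise matching row,
# the predicate would evaluate to NULL (not true) and the row would be dropped.
null_case = conn.execute("SELECT NOT (0 = 0 AND NULL = 0)").fetchone()[0]

print(kept_not)
print(null_case)
```

Both forms keep the three SPX rows and the two non-zero VIX rows; only the 09:31:00 VIX row with zero bid and ask is removed.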
You can try the following:
SELECT QuoteTime, Symbol, ((Bid+Ask)/2) as MidPoint
FROM schema.tablename
WHERE (QuoteTime, Symbol,Bid,Ask) not in (('09:31:00','VIX',0,0))

Querying FIX protocol message files using Apache Drill

I was able to query FIX messages as a csv using delimiter as '\u0001', but the results had tag=value in each of the columns, like so:
Expected:
---------
8 |
---------
FIX.4.4|
FIX.4.4|
FIX.4.4|
FIX.4.4|
FIX.4.4|
---------
Actual:
-----------
EXPR$1 |
-----------
8=FIX.4.4|
8=FIX.4.4|
8=FIX.4.4|
8=FIX.4.4|
8=FIX.4.4|
-----------
How do I query FIX protocol message files using Apache Drill to achieve the above expected result?
Will this need a custom storage format implementation?
You could contribute a "FIX Protocol" storage format plugin to Apache Drill.
Alternatively, you can parse the string values and extract the result in SQL:
0: jdbc:drill:> SELECT split(a, '=')[0] as `key`, split(a, '=')[1] as `value` FROM (VALUES('8=FIX.4.4')) t(a);
+------+----------+
| key  | value    |
+------+----------+
| 8    | FIX.4.4  |
+------+----------+
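If pre-processing the files outside Drill is an option, the same key/value split is easy to sketch in Python. This is not Drill code, and the function name parse_fix and the sample message are made up for illustration. One difference from split(a, '=')[1]: splitting on the first '=' only keeps values that themselves contain an '=' intact.

```python
SOH = '\x01'  # FIX field delimiter, the '\u0001' from the question


def parse_fix(message):
    """Split a raw FIX message into {tag: value}, splitting on the first '='."""
    fields = {}
    for field in message.split(SOH):
        if not field:
            continue  # skip the empty piece after a trailing delimiter
        tag, _, value = field.partition('=')
        fields[tag] = value
    return fields


# Hypothetical sample message, made up for illustration only.
sample = SOH.join(['8=FIX.4.4', '35=0', '49=SENDER', '56=TARGET']) + SOH
parsed = parse_fix(sample)

print(parsed['8'])
```

A plain dict keeps only the last value for a repeated tag, so messages with repeating groups would need a list of pairs instead.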

How to join between table DurationDetails and Table cost per program

How should I design a database for a tourism company to calculate the cost of flights and hotels for each tour program, based on date?
This is what I have so far:
Table - program
+-----------+-------------+
| ProgramID | ProgramName |
+-----------+-------------+
| 1         | Alexia      |
| 2         | Amon        |
| 3         | Sfinx       |
+-----------+-------------+
Every program can have more than one duration, but only 8 days or 15 days.
Those are the only two periods.
So I created a ProgramDuration table that has a one-to-many relationship with program.
Table - ProgramDuration
+------------+-----------+---------------+
| DurationNo | programID | Duration      |
+------------+-----------+---------------+
| 1          | 1         | 8 for Alexia  |
| 2          | 1         | 15 for Alexia |
+------------+-----------+---------------+
The same applies to the Amon and Sfinx programs: 8 and 15 days.
Each program, 8 or 15 days, has fixed details for every day, as follows:
Table - DurationDetails
+------+--------+--------------------+-------------------+
| Days | Hotel  | Flight             | transfers         |
+------+--------+--------------------+-------------------+
| Day1 | Hilton | amsterdam to luxor | airport to hotel  |
| Day2 | Hilton |                    | AbuSimple museum  |
| Day3 | Hilton |                    |                   |
| Day4 | Hilton |                    |                   |
| Day5 | Hilton | Luxor to amsterdam |                   |
+------+--------+--------------------+-------------------+
Every program starts on its flight date, so
if the flight date is 25/06/2017 for the Alexia 8-day program, the costs will be as follows:
+------------+-------+--------+----------+
| Date       | Hotel | Flight | Transfer |
+------------+-------+--------+----------+
| 25/06/2017 | 25    | 500    | 20       |
| 26/06/2017 | 25    |        | 55       |
| 27/06/2017 | 25    |        |          |
| 28/06/2017 | 25    |        |          |
| 29/06/2017 | 25    | 500    |          |
+------------+-------+--------+----------+
What I actually need to know is how to set up the relationships to join costs with programs,
for flight and hotel costs as above.
For the 5 days shown, the total cost is 1200:
25 is the cost per day for the Hilton hotel
500 is the cost of each flight
20 and 55 are the costs of the transfers
The image shows what I need:
[Image: relation between duration and cost]
Truthfully, I don't fully understand exactly what you're trying to accomplish. Your description is not clear, your tables seem to be missing information / contain information that should not be in your tables, and the way that I'm understanding your description doesn't really make sense based on the UI screenshot that you shared.
It looks like you're working on an application for a travel agency which will allow agents to create an itinerary for a trip. They can give this trip a name (so if a particular package is a hit with customers, they can just offer the "Alexia" package), and the utility will calculate the total estimated cost of the trip. If I understand correctly, the trips will be either 8 or 15 days long.
Personally, I would delete the "ProgramDuration" table altogether. If there are two versions of the Alexia trip at index 1, then you're going to run into all manner of issues. I can get into the details of why this is a bad idea, but unless you're really hung up on having this ProgramDuration table, it's not worth the time. You should add a "duration" field to your "program" table, and assign a new ProgramID for each different duration version of the "Alexia" program.
Your table "Duration details" also misses the mark. Your fields in this table will make it harder to add new features to your application down the line. You should have a field "ProgramID," which we will use to join this table against the program table later. You should have a field "Day" which obviously indicates the day in the itinerary. You should have only one more field "ItemID." We're going to use the "ItemID" field to join your itinerary against a new items table we're going to create.
Your items table is where you define all of the items that can possibly appear in an itinerary. Your current itinerary table has three possible "types" of expenses, flights, hotels, and transfers. What if your travel agents want to start adding meal expenditures into their itineraries / budgets? What about activities that cost money? What about currency exchange fees? What about items that your clientele will need before their trip (wall adapters, luggage, etc.)? In your items table, you will have fields for an ItemID, ItemName, ItemUnitPrice, and ItemType. A possible item is as follows:
ItemID: 1, ItemName: Night At The Hilton, ItemUnitPrice: 300, ItemType: Lodging
Using the "SELECT [Column] AS [Alias]" syntax with some CTEs or subqueries and the JOIN operator, we can easily reconstitute a table that looks like your "Program Duration Details" table, but we will be afforded considerably more flexibility to add or remove things later down the line.
In the interests of security and programmability, I would also add a table called "ItemTypeTable" with a single field "TypeName." You can use this table to prevent unauthorized users from defining new item types, and you can use this table to create drop down menus, navigation, and all manner of other useful features. There might be cleaner implementations, but this shouldn't represent a serious performance or size hit.
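A rough sketch of the layout described above, using Python's built-in sqlite3 so it can be run as-is. The table names itinerary, item and item_type, the item names, and the ISO date format are my assumptions; the prices (25, 500, 20, 55) and the five-day itinerary are taken from the question. Treat it as one possible reading of the design, not a finished schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE program   (ProgramID INTEGER PRIMARY KEY, ProgramName TEXT, Duration INTEGER);
CREATE TABLE item_type (TypeName TEXT PRIMARY KEY);
CREATE TABLE item      (ItemID INTEGER PRIMARY KEY, ItemName TEXT,
                        ItemUnitPrice INTEGER,
                        ItemType TEXT REFERENCES item_type(TypeName));
CREATE TABLE itinerary (ProgramID INTEGER REFERENCES program(ProgramID),
                        Day INTEGER,
                        ItemID INTEGER REFERENCES item(ItemID));

INSERT INTO program VALUES (1, 'Alexia', 8);
INSERT INTO item_type VALUES ('Lodging'), ('Flight'), ('Transfer');
-- Prices are the ones quoted in the question.
INSERT INTO item VALUES
    (1, 'Night at the Hilton', 25, 'Lodging'),
    (2, 'Amsterdam to Luxor', 500, 'Flight'),
    (3, 'Luxor to Amsterdam', 500, 'Flight'),
    (4, 'Airport to hotel', 20, 'Transfer'),
    (5, 'AbuSimple museum', 55, 'Transfer');
-- One row per (program, day, item); only the five days shown in the question.
INSERT INTO itinerary VALUES
    (1, 1, 1), (1, 1, 2), (1, 1, 4),
    (1, 2, 1), (1, 2, 5),
    (1, 3, 1),
    (1, 4, 1),
    (1, 5, 1), (1, 5, 3);
""")

# Rebuild the per-day cost table for a given start date by joining
# the itinerary against the items and pivoting on ItemType.
daily_costs = """
SELECT date(:start, '+' || (it.Day - 1) || ' days') AS TripDate,
       SUM(CASE WHEN i.ItemType = 'Lodging'  THEN i.ItemUnitPrice ELSE 0 END) AS Hotel,
       SUM(CASE WHEN i.ItemType = 'Flight'   THEN i.ItemUnitPrice ELSE 0 END) AS Flight,
       SUM(CASE WHEN i.ItemType = 'Transfer' THEN i.ItemUnitPrice ELSE 0 END) AS Transfer
FROM itinerary AS it
JOIN item AS i ON i.ItemID = it.ItemID
WHERE it.ProgramID = :program
GROUP BY it.Day
ORDER BY it.Day
"""
rows = conn.execute(daily_costs, {'start': '2017-06-25', 'program': 1}).fetchall()
total = sum(hotel + flight + transfer for _, hotel, flight, transfer in rows)

for row in rows:
    print(row)
print('total:', total)
```

Adding a new kind of expense, such as meals, then only needs a new row in item_type and item plus itinerary rows, not a new column.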
All in all, at the risk of being somewhat rude, it seems like you're trying to take on a rather large, sophisticated task with a very rudimentary understanding of basic relational database design and implementation. If you are doing this in a professional context, I would strongly encourage you to consider consulting with another professional that may be more experienced in this area.

Select with multiple conditions

Hi guys,
I am quite new to SQL, and to this site as well.
I have the data below. I want to select all records except the failure "BTM dead" for Train "1101", which means only records 1 and 3 should not be selected.
*record* | *Train* | *Failure*     |
1        | 1101    | BTM dead      |
2        | 1101    | relay failure |
3        | 1101    | BTM dead      |
4        | 2101    | relay failure |
5        | 2101    | BTM dead      |
6        | 2101    | relay failure |
Here is what I tried..
SELECT failure_table.record, failure_table.Train, failure_table.Failure
FROM failure_table
WHERE failure_table.Train <> 1101 And failure_table.Failure <> "BTM dead";
but it turns out that only records 4 and 6 are selected.
Can I have a suggestion, please? What should the statement be?
Thank you!
SELECT failure_table.record, failure_table.Train, failure_table.Failure
FROM failure_table
WHERE NOT (failure_table.Train = 1101 And failure_table.Failure = 'BTM dead')
Sometimes the easiest way is to use NOT.
Because you know you want everything except the rows where train = 1101 and failure = 'BTM dead', simply state that condition and then tell SQL you want the opposite by saying NOT. Also note that string literals need single quotes, not double quotes; in standard SQL a double-quoted name is treated as an identifier, such as a column.
You used AND instead of OR. By De Morgan's law, NOT (A AND B) is the same as (NOT A) OR (NOT B):
SELECT failure_table.record, failure_table.Train, failure_table.Failure
FROM failure_table
WHERE failure_table.Train <> 1101 OR failure_table.Failure <> 'BTM dead';
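The three WHERE clauses can be compared side by side on the sample data. This sketch runs them through Python's built-in sqlite3, which is an assumption about the engine (the question does not say which database is used), and the helper name records is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE failure_table (record INTEGER, Train INTEGER, Failure TEXT)')
conn.executemany('INSERT INTO failure_table VALUES (?, ?, ?)', [
    (1, 1101, 'BTM dead'),
    (2, 1101, 'relay failure'),
    (3, 1101, 'BTM dead'),
    (4, 2101, 'relay failure'),
    (5, 2101, 'BTM dead'),
    (6, 2101, 'relay failure'),
])


def records(where):
    """Return the record numbers selected by the given WHERE condition."""
    sql = f'SELECT record FROM failure_table WHERE {where} ORDER BY record'
    return [row[0] for row in conn.execute(sql)]


# The attempt from the question: drops every 1101 row AND every 'BTM dead' row.
original = records("Train <> 1101 AND Failure <> 'BTM dead'")

# The two fixes: drop only rows that are both 1101 and 'BTM dead'.
with_not = records("NOT (Train = 1101 AND Failure = 'BTM dead')")
with_or = records("Train <> 1101 OR Failure <> 'BTM dead'")

print(original, with_not, with_or)
```

The original condition keeps only records 4 and 6, while both corrected forms keep 2, 4, 5 and 6.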