Regular expression parsing pseudo-columnized data in SQL

I'm working on a project where we're trying to parse information out of reroute advisories issued by the FAA. These advisories are issued as free text with a loosely-identified structure, with the goal being to allow viewers to print them out on a single sheet of paper.
The area of most interest is the final portion of the advisory, which contains specific information related to a given reroute: the origin and destination airports, as well as the specific required route that applies to any flight between them. Here's an example:
ORIG DEST            ROUTE
---- --------------- ---------------------------
FRG  MCO TPA PIE SRQ WAVEY EMJAY J174 SWL CEBEE
     FMY RSW APF     WETRO DIW AR22 JORAY HILEY4
What I'd like to do is be able to parse this into three entries like this:
ORIG: FRG
DEST: MCO TPA PIE SRQ FMY RSW APF
ROUTE: WAVEY EMJAY J174 SWL CEBEE WETRO DIW AR22 JORAY HILEY4
Here are the three code segments I'm currently using to parse this portion of the advisories:
Origin
regexp_substr(route_1,'^(([A-Z0-9]|\(|\)|\-)+\s)+')
Destination
regexp_substr(route_1,'(([A-Z0-9]|\(|\)|\-)+\s)+',1,2)
Route String
regexp_substr(route_1, '\s{2,}>?([A-Z0-9]|>|<|\s|:)+<?$')
While these expressions can deal with the majority of situations where the Origin and Destination portions are all on the first line, they cannot deal with the example provided earlier. Does anyone know how I might be able to successfully parse the text in my original example?

Proof of concept.
Input file is a plain text file (with no tabs, only spaces). Vertically it is divided into three fixed-width columns: 20 characters, 20 characters, and whatever is left (to the end of the line).
The first two rows are always populated, with the headers and the ---- underlines; these two rows can be ignored. The rest of the file (reading down the page) is "grouped" by the latest non-empty string in the ORIG column.
The input file looks like this:
ORIG                DEST                ROUTE
----                ---------------     ---------------------------
FRG                 MCO TPA PIE SRQ     WAVEY EMJAY J174 SWL CEBEE
                    FMY RSW APF         WETRO DIW AR22 JORAY HILEY4

ABC                 SFD RRE BAC         TRIO SBL CRT
POLDA               FARM OLE BID        ORDG BALL
                                        BINT LFV

YYT                 PSS TRI BABA        TEN NINE FIVE
                    COL DMV
SAL                 PRT DUW PALO        VR22 NOL3
Notice the empty lines between blocks, the empty DEST in one block (I handle that, although perhaps in the OP's problem that is not possible), and the different number of rows used by DEST and ROUTE in some cases.
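The grouping rule just described (20/20/rest columns, group on the latest non-empty ORIG) can be cross-checked outside the database; here is a minimal Python sketch of the same parse. The `col` helper and the sample lines exist only for this illustration.

```python
# Sketch of the fixed-width parse described above: ORIG in columns 1-20,
# DEST in 21-40, ROUTE from 41 on; rows group on the latest non-empty ORIG;
# the header and ---- rows are skipped.

def col(orig, dest, route):
    # build one fixed-width input line, 20 characters per column
    return f"{orig:<20}{dest:<20}{route}"

def parse_reroutes(lines):
    groups = []
    for line in lines[2:]:                 # skip the header and ---- rows
        orig = line[0:20].strip()
        dest = line[20:40].strip()
        route = line[40:].strip()
        if orig:                           # a non-empty ORIG starts a group
            groups.append((orig, [], []))
        if not groups:                     # blank lines before any group
            continue
        if dest:
            groups[-1][1].append(dest)
        if route:
            groups[-1][2].append(route)
    return [(o, " ".join(d), " ".join(r)) for o, d, r in groups]

lines = [
    col("ORIG", "DEST", "ROUTE"),
    col("----", "---------------", "---------------------------"),
    col("FRG", "MCO TPA PIE SRQ", "WAVEY EMJAY J174 SWL CEBEE"),
    col("", "FMY RSW APF", "WETRO DIW AR22 JORAY HILEY4"),
]
print(parse_reroutes(lines))
```

This mirrors what the SQL below does with substr/trim plus the running count(orig) grouping.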
The file name is inp.txt and it resides in a directory which I have made known to Oracle: create directory sandbox as 'c:\app\sandbox'. (First I had to grant create any directory to <myself>, while logged in as SYS.)
The output looks like this:
ORIG  DEST                        ROUTE
----- --------------------------- ------------------------------------------------------
FRG   MCO TPA PIE SRQ FMY RSW APF WAVEY EMJAY J174 SWL CEBEE WETRO DIW AR22 JORAY HILEY4
ABC   SFD RRE BAC                 TRIO SBL CRT
POLDA FARM OLE BID                ORDG BALL BINT LFV
YYT   PSS TRI BABA COL DMV        TEN NINE FIVE
SAL   PRT DUW PALO                VR22 NOL3
I did this in two steps. First I created a helper table, INP, with four columns (RN number, ORIG varchar2(20), DEST varchar2(20), ROUTE varchar2(20)), and I imported from the text file through a procedure. Then I processed this further and used the output to populate the final table. It is very unlikely that this is the most efficient way to do this (and perhaps there are very good reasons not to do it the way I did); I have no experience with UTL_FILE or with importing text files into Oracle in general. I did this for two reasons: to learn, and to show it can be done.
The procedure to import the text file into the helper table:
CREATE OR REPLACE PROCEDURE read_inp IS
  f  UTL_FILE.FILE_TYPE;
  s  VARCHAR2(200);
  rn NUMBER := 1;
BEGIN
  f := UTL_FILE.FOPEN('SANDBOX', 'inp.txt', 'r', 200);
  LOOP
    UTL_FILE.GET_LINE(f, s);
    INSERT INTO inp (rn, orig, dest, route)
    VALUES (rn, trim(substr(s, 1, 20)), trim(substr(s, 21, 20)), trim(substr(s, 41)));
    rn := rn + 1;
  END LOOP;
EXCEPTION
  WHEN no_data_found THEN    -- raised by GET_LINE at end of file
    UTL_FILE.FCLOSE(f);
END;
/
exec read_inp
And the further processing (after creating the REROUTE table):
create table reroute ( orig varchar2(20), dest varchar2(4000), route varchar2(4000) );
insert into reroute
select max(orig),
trim(listagg(dest , ' ') within group (order by rn)),
trim(listagg(route, ' ') within group (order by rn))
from (
select rn, orig, dest, route, count(orig) over (order by rn) as grp
from inp
where rn >= 3
)
group by grp
;

Related

Group by Hive Script Different csv files

Very new to hive scripting.
I have 5 csv files with stock information for AAPL, AMZN, FB, GOOG, and NFLX. Within each file the columns are Date, Open, High, Low, Close, Adj Close and Volume. I am trying to modify the script to display the one date for which trading volume across the 4 software companies was the greatest (in total). I know I need to Group by marketDate, sum the volume and sort appropriately.
sample of 1 of the files (all are like this)
Currently my code is as follows:
------------------------------------------------------------------
-- need when you write a single data file
------------------------------------------------------------------
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
------------------------------------------------------------------
--------------------------------------------------------
--- create a table with the stock data ---
--------------------------------------------------------
DROP TABLE IF EXISTS stockPrices;
CREATE EXTERNAL TABLE stockPrices(
marketDate STRING,
open DECIMAL(12,6),
high DECIMAL(12,6),
low DECIMAL(12,6),
close DECIMAL(12,6),
adjClose DECIMAL(12,6),
volume BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${INPUT}/stocks/data'
TBLPROPERTIES ("skip.header.line.count"="1");
--------------------------------------------------------
--- list the contents of the stockPrices table
--- including the virtual field INPUT__FILE__NAME to help identify the stock ticker
--- NOTE: INPUT__FILE__NAME requires 2 underscores before and after FILE (not 1 underscore)
--------------------------------------------------------
SELECT INPUT__FILE__NAME, * FROM stockPrices LIMIT 10;
--------------------------------------------------------
--- output summary of stock data ---
--------------------------------------------------------
INSERT OVERWRITE DIRECTORY '${OUTPUT}/stocks/output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT INPUT__FILE__NAME, SUM(volume), MIN(adjClose), MAX(adjClose) FROM stockPrices GROUP BY INPUT__FILE__NAME;
How can I modify the code to display the one date for which trading volume across the 4 software companies was the greatest (in total) for my output summary?
Thank you in advance
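The plan stated in the question (group by marketDate, sum the volume, sort) maps to something like SELECT marketDate, SUM(volume) AS total FROM stockPrices GROUP BY marketDate ORDER BY total DESC LIMIT 1 in HiveQL (an untested sketch). The same aggregation, illustrated outside Hive in Python with made-up sample rows:

```python
# Illustration of the aggregation the question describes: group rows by
# marketDate, sum volume across tickers, then take the date with the
# greatest total. The sample rows are hypothetical.
from collections import defaultdict

rows = [  # (ticker, marketDate, volume) -- hypothetical values
    ("AAPL", "2020-01-02", 100), ("AMZN", "2020-01-02", 300),
    ("AAPL", "2020-01-03", 500), ("AMZN", "2020-01-03", 200),
]

totals = defaultdict(int)
for ticker, market_date, volume in rows:
    totals[market_date] += volume   # SUM(volume) ... GROUP BY marketDate

# ORDER BY total DESC LIMIT 1
best_date, best_volume = max(totals.items(), key=lambda kv: kv[1])
print(best_date, best_volume)
```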

Not exists - should be returning something but isn't

I'm trying to make sure the selected info isn't in string values of another table when it's selected from a different table. Here's the abbreviated data info/examples in a nutshell:
create table #temp_VMC
(t_id varchar(25),
CID int,
conc_id int,
str_val nvarchar(3000))
Here's an example of data in that table... note that there are cases in this table that don't have a cm, and we don't care about those.
t_id  CID    conc_id  str_val
678   76543  501000   0070
789   80000  560000   0030
890   90000  530000   0001
o_info
subject body
Manual Task Created: 000789 <><Created by Mary Smith>
stamp <><question 1 true>
select * from #temp_vmc um
where not exists
(
select * from o_info with (NOLOCK) where subject LIKE '%Manual Task Created: '+um.t_id+'%'
)
So I know that the select * from o_info doesn't turn up anything, so if my select * from #temp_vmc has one row, shouldn't I still get one row with them together?
*Update: I tried selecting this, and nothing turns up, so the where not exists should return the row I know is in temp_vmc:
select * from o_info with (NOLOCK) where subject LIKE '%'+um.t_id+'%'
It looks like I'm using not exists correctly according to this link: not exists
In the EXISTS correlated subquery, you are searching for
LIKE '%Manual Task Created: '+um.t_id+'%'
Given your data, that becomes
LIKE '%Manual Task Created: 678%'
LIKE '%Manual Task Created: 789%'
LIKE '%Manual Task Created: 890%'
However, the data you are searching would appear to be in the form of
Manual Task Created: 000789
that is, with leading zeroes. Thus, the data will never match. This is probably throwing your query logic off.
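A quick illustration of the mismatch, using Python's substring test as a stand-in for LIKE. The fix shown - zero-padding t_id before building the pattern - is only one possible approach, and the padding width of six digits is an assumption based on the sample data:

```python
# Demonstration of why the LIKE pattern never matches: the subject stores
# the id with leading zeroes, but the pattern built from t_id does not.
def like_contains(haystack, needle):
    # stand-in for SQL:  haystack LIKE '%' + needle + '%'
    return needle in haystack

subject = "Manual Task Created: 000789 <><Created by Mary Smith>"
t_id = "789"

print(like_contains(subject, "Manual Task Created: " + t_id))           # no match
print(like_contains(subject, "Manual Task Created: " + t_id.zfill(6)))  # matches
```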

Split Text into Table Rows with Read-Only Permissions

I am a read-only user for a database with the following problem:
Scenario:
Call center employees for a company submit tickets to me through our database on behalf of our clients. The call center includes alphanumeric lot numbers of an exact length in their message for me to troubleshoot. Depending on how many times a ticket is updated, there could be several messages for one ticket, each of them having zero or more of these alphanumeric lot numbers embedded in the message. I can access all of these messages with Oracle SQL and SQL Tools.
How can I extract just the lot numbers to make a single-column table of all the given lot numbers?
Example Data:
-- Accessing Ticket 1234 --
SELECT *
FROM communications_detail
WHERE ticket_num = 1234;
-- Results --
TICKET_NUM | MESSAGE_NUM | MESSAGE
------------------------------------------------------------------------------
1234 | 1 | A customer recently purchased some products with
| | a lot number of vwxyz12345 and wants to know if
| | they have been recalled.
------------------------------------------------------------------------------
1234 | 2 | Same customer found lots vwxyz23456 and zyxwv12345
| | in their storage as well and would like those checked.
------------------------------------------------------------------------------
1234 | 3 | These lots have not been recalled. Please inform
| | the client.
So-Far:
I am able to isolate the lot numbers from a constant string with the following code, but the result goes to standard output and not into a table format.
DECLARE
msg VARCHAR2(200) := 'Same customer found lots vwxyz23456 and zyxwv12345 in their storage as well and would like those checked.';
cnt NUMBER := regexp_count(msg, '[[:alnum:]]{10}');
BEGIN
IF cnt > 0 THEN
FOR i IN 1..cnt LOOP
Dbms_Output.put_line(regexp_substr(msg, '[[:alnum:]]{10}', 1, i));
END LOOP;
END IF;
END;
/
Goals:
Output results into a table that can itself be used as a table in a larger query statement.
Somehow be able to apply this to all of the messages associated with the original ticket.
Update: Changed the example lot numbers from 8 to 10 characters long to avoid confusion with real words in the messages. The real-world scenario has much longer codes and very specific formatting, so a more complex regular expression will be used.
Update 2: Tried using a table variable instead of standard output. It didn't error, but it didn't populate my query tab... This may just be user error...!
DECLARE
TYPE lot_type IS TABLE OF VARCHAR2(10);
lots lot_type := lot_type();
msg VARCHAR2(200) := 'Same customer found lots vwxyz23456 and zyxwv12345 in their storage as well and would like those checked.';
cnt NUMBER := regexp_count(msg, '[[:alnum:]]{10}');
BEGIN
IF cnt > 0 THEN
FOR i IN 1..cnt LOOP
lots.extend();
lots(i) := regexp_substr(msg, '[[:alnum:]]{10}', 1, i);
END LOOP;
END IF;
END;
/
This is a regex format which matches the LOT mask you provided: '[a-z]{3}[0-9]{5}'. Using something like this will help you avoid the false positives you mention in your question.
Now here is a read-only, pure SQL solution for you.
with cte as (
select 'Same customer found lots xyz23456 and zyx12345 in their storage as well and would like those checked.' msg
from dual)
select regexp_substr(msg, '[a-z]{3}[0-9]{5}', 1, level) as lotno
from cte
connect by level <= regexp_count(msg, '[a-z]{3}[0-9]{5}')
;
I'm using the WITH clause just to generate the data. The important thing is the use of the CONNECT BY operator, which is part of Oracle's hierarchical query syntax but here generates a table from one row. The pseudo-column LEVEL allows us to traverse the string and pick out the different occurrences of the regex pattern.
Here's the output:
SQL> r
1 with cte as ( select 'Same customer found lots xyz23456 and zyx12345 in their storage as well and would like those checked.' msg from dual)
2 select regexp_substr(msg, '[a-z]{3}[0-9]{5}', 1, level) as lotno
3 from cte
4 connect by level <= regexp_count(msg, '[a-z]{3}[0-9]{5}')
5*
LOTNO
----------
xyz23456
zyx12345
SQL>
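For comparison, the same extraction that the CONNECT BY LEVEL query performs row by row can be sketched in Python, where re.findall returns every match of the lot mask at once:

```python
import re

# Same extraction as the CONNECT BY query: every match of the lot mask
# '[a-z]{3}[0-9]{5}' becomes one row of the result.
msg = ("Same customer found lots xyz23456 and zyx12345 in their "
       "storage as well and would like those checked.")

lots = re.findall(r"[a-z]{3}[0-9]{5}", msg)
print(lots)
```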

How to load grouped data with SSIS

I have a tricky flat file data source. The data is grouped, like this:
Country City
U.S.    New York
        Washington
        Baltimore
Canada  Toronto
        Vancouver
But I want it to be in this format when it's loaded into the database:
Country City
U.S.    New York
U.S.    Washington
U.S.    Baltimore
Canada  Toronto
Canada  Vancouver
Has anyone met such a problem before? Any idea how to deal with it?
The only idea I have now is to use a cursor, but it is just too slow.
Thank you!
The answer by cha will work, but here is another in case you need to do it in SSIS without temporary/staging tables:
You can run your dataflow through a Script Transformation that uses a DataFlow-level variable. As each row comes in the script checks the value of the Country column.
If it has a non-blank value, then populate the variable with that value, and pass it along in the dataflow.
If Country has a blank value, then overwrite it with the value of the variable, which will be the last non-blank Country value you got.
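The Script Transformation logic just described (carry the last non-blank Country forward) is easy to prototype outside SSIS; here is a minimal Python sketch of the same pass, with made-up rows:

```python
# Sketch of the carry-forward logic: remember the latest non-blank Country
# and write it onto rows where the Country column is blank.
def fill_country(rows):
    last = None
    out = []
    for country, city in rows:
        if country.strip():
            last = country        # remember the latest non-blank value
        out.append((last, city))  # blanks get the carried value
    return out

rows = [("U.S.", "New York"), ("", "Washington"), ("", "Baltimore"),
        ("Canada", "Toronto"), ("", "Vancouver")]
print(fill_country(rows))
```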
EDIT: I looked up your error message and learned something new about Script Components (the Data Flow tool, as opposed to Script Tasks, the Control Flow tool):
The collection of ReadWriteVariables is only available in the
PostExecute method to maximize performance and minimize the risk of
locking conflicts. Therefore you cannot directly increment the value
of a package variable as you process each row of data. Increment the
value of a local variable instead, and set the value of the package
variable to the value of the local variable in the PostExecute method
after all data has been processed. You can also use the
VariableDispenser property to work around this limitation, as
described later in this topic. However, writing directly to a package
variable as each row is processed will negatively impact performance
and increase the risk of locking conflicts.
That comes from this MSDN article, which also has more information about the Variable Dispenser work-around, if you want to go that route, but apparently I misled you above when I said you can set the value of the package variable in the script. You have to use a variable that is local to the script, and then change it in the Post-Execute event handler. I can't tell from the article whether that means that you will not be able to read the variable in the script, and if that's the case, then the Variable Dispenser would be the only option. Or I suppose you could create another variable that the script will have read-only access to, and set its value to an expression so that it always has the value of the read-write variable. That might work.
Yes, it is possible. First you need to load the data to a table with an IDENTITY column:
-- drop table #t
CREATE TABLE #t (id INTEGER IDENTITY PRIMARY KEY,
Country VARCHAR(20),
City VARCHAR(20))
INSERT INTO #t(Country, City)
SELECT a.Country, a.City
FROM OPENROWSET( BULK 'c:\import.txt',
FORMATFILE = 'c:\format.fmt',
FIRSTROW = 2) AS a;
select * from #t
The result will be:
id          Country              City
----------- -------------------- --------------------
1           U.S.                 New York
2                                Washington
3                                Baltimore
4           Canada               Toronto
5                                Vancouver
And now with a bit of recursive CTE magic you can populate the missing details:
;WITH a as(
SELECT Country
,City
,ID
FROM #t WHERE ID = 1
UNION ALL
SELECT COALESCE(NULLIF(LTrim(#t.Country), ''),a.Country)
,#t.City
,#t.ID
FROM a INNER JOIN #t ON a.ID+1 = #t.ID
)
SELECT * FROM a
OPTION (MAXRECURSION 0)
Result:
Country              City                 ID
-------------------- -------------------- -----------
U.S.                 New York             1
U.S.                 Washington           2
U.S.                 Baltimore            3
Canada               Toronto              4
Canada               Vancouver            5
Update:
As Tab Alleman suggested below the same result can be achieved without the recursive query:
SELECT ID
, COALESCE(NULLIF(LTrim(a.Country), ''), (SELECT TOP 1 Country FROM #t t WHERE t.ID < a.ID AND LTrim(t.Country) <> '' ORDER BY t.ID DESC))
, City
FROM #t a
BTW, the format file for your input data is this (if you want to try the scripts save the input data as c:\import.txt and the format file below as c:\format.fmt):
9.0
2
1 SQLCHAR 0 11 "" 1 Country SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "\r\n" 2 City SQL_Latin1_General_CP1_CI_AS

Renumber reference variables in text columns

Background
For a data entry project, a user can enter variables using a short-hand notation:
"Pour i1 into a flask."
"Warm the flask to 25 degrees C."
"Add 1 drop of i2 to the flask."
"Immediately seek cover."
In this case i1 and i2 are reference variables, where the number refers to an ingredient. The text strings are in the INSTRUCTION table, the ingredients in the INGREDIENT table.
Each ingredient has a sequence number for sorting purposes.
Problem
Users may rearrange the ingredient order, which adversely changes the instructions. For example, the ingredient order might look as follows, initially:
seq | label
1 | water
2 | sodium
The user adds another ingredient:
seq | label
1 | water
2 | sodium
3 | francium
The user reorders the list:
seq | label
1 | water
2 | francium
3 | sodium
At this point, the following line is now incorrect:
"Add 1 drop of i2 to the flask."
The i2 must be renumbered (because ingredient #2 was moved to position #3) to point to the original reference variable:
"Add 1 drop of i3 to the flask."
Updated Details
This is a simplified version of the problem. The full problem can have lines such as:
"Add 1 drop of i2 to the o3 of i1."
Where o3 is an object (flask), and i1 and i2 are water and sodium, respectively.
Table Structure
The ingredient table is structured as follows:
id | seq | label
The instruction table is structured as follows:
step
Algorithm
The algorithm I have in mind:
Repeat for all steps that match the expression '\mi([0-9]+)':
Break the step into word tokens.
For each token:
If the numeric portion of the token matches the old sequence number, replace it with the new sequence number.
Recombine the tokens and update the instruction.
Update the ingredient number.
Update
The algorithm may be incorrect as written. There could be two reference variables that must change. Consider before:
seq | label
1 | water
2 | sodium
3 | caesium
4 | francium
And after (swapping sodium and caesium):
seq | label
1 | water
2 | caesium
3 | sodium
4 | francium
Every i2 in every step must become i3; similarly i3 must become i2. So
"Add 1 drop of i2 to the flask, but absolutely do not add i3."
Becomes:
"Add 1 drop of i3 to the flask, but absolutely do not add i2."
Code
The code to perform the first two parts of the algorithm resembles:
CREATE OR REPLACE FUNCTION
renumber_steps(
p_ingredient_id integer,
p_old_sequence integer,
p_new_sequence integer )
RETURNS void AS
$BODY$
DECLARE
v_tokens text[];
BEGIN
FOR v_tokens IN
SELECT
t.tokens
FROM (
SELECT
regexp_split_to_array( step, '\W' ) tokens,
regexp_matches( step, '\mi([0-9]+)' ) matches
FROM
instruction
) t
LOOP
RAISE NOTICE '%', v_tokens;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
Question
What is a more efficient way to solve this problem (i.e., how would you eliminate the looping constructs), possibly leveraging PostgreSQL-specific features, without a major revision to the data model?
Thank you!
System Details
PostgreSQL 9.1.2.
You have to take care that you don't change ingredients and seq numbers back and forth. I introduce a temporary prefix for ingredients and negative numbers for seq for that purpose and exchange them for permanent values when all is done.
Could work like this:
CREATE OR REPLACE FUNCTION renumber_steps(_old int[], _new int[])
RETURNS void AS
$BODY$
DECLARE
_prefix CONSTANT text := ' i'; -- prefix, incl. leading space
_new_prefix CONSTANT text := ' ###'; -- temp prefix, incl. leading space
i int;
o text;
n text;
BEGIN
IF array_upper(_old,1) <> array_upper(_new,1) THEN
RAISE EXCEPTION 'Array length mismatch!';
END IF;
FOR i IN 1 .. array_upper(_old,1) LOOP
IF _old[i] <> _new[i] THEN
o := _prefix || _old[i] || ' '; -- leading and trailing blank!
-- new references are temporarily prefixed with _new_prefix
n := _new_prefix || _new[i] || ' ';
UPDATE instruction
SET step = replace(step, o, n) -- replace all instances
WHERE step ~~ ('%' || o || '%');
UPDATE ingredient
SET seq = _new[i] * -1 -- temporarily negative
WHERE seq = _old[i];
END IF;
END LOOP;
-- finally replace temp. prefix
UPDATE instruction
SET step = replace(step, _new_prefix, _prefix)
WHERE step ~~ ('%' || _new_prefix || '%');
-- .. and temp. negative seq numbers
UPDATE ingredient
SET seq = seq * -1
WHERE seq < 0;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT;
Call:
SELECT renumber_steps('{2,3,4}'::int[], '{4,3,2}'::int[]);
The algorithm requires ...
... that ingredients in the steps are delimited by spaces.
... that there are no permanent negative seq numbers.
_old and _new are ARRAYs of the old and new ingredient.seq values of the ingredients that change position. The length of both arrays has to match, or an exception will be raised. They can contain seq values that don't change; nothing will happen to those.
Requires PostgreSQL 9.1 or later.
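The two-phase replacement above can be illustrated outside the database. This Python sketch works under the same assumption as the function - references delimited by spaces - and shows why the temporary prefix keeps a swap like i2 -> i3 and i3 -> i2 from colliding mid-rename:

```python
# Two-phase rename: first move changed references to a temporary prefix
# (###) so a later replacement cannot clobber an earlier one, then turn
# the temporary prefix back into 'i'. Requires space-delimited references.
def renumber(step, old, new):
    for o, n in zip(old, new):
        if o != n:
            step = step.replace(f" i{o} ", f" ###{n} ")
    return step.replace(" ###", " i")

step = "Add 1 drop of i2 to the flask, then add i3 slowly. "
print(renumber(step, [2, 3], [3, 2]))
```

A naive single-pass replace (i2 -> i3 first, then i3 -> i2) would turn every reference into i2, which is exactly the hazard the temporary prefix avoids.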
I think your model is problematic... you should have the "real name" (id) (i1, o3 etc.) FIXED after creation and have a second field in the ingredient table providing the "sorting". The user enters the "sorting name" and you immediately replace it with the "real name" (id) on saving the entered data into the step table.
When you read it from the step table you just replace/map the "real name" (id) with the current "sorting name" for display purposes if need be...
This way you don't have to change the data already in the step table every time someone changes the sorting, which is a complex and expensive operation IMHO - it is prone to concurrency problems too...
The above option reduces the whole problem to a mapping operation (table ingredient) on INSERT/UPDATE/SELECT (table step) for the one entry currently worked on - it doesn't mess with any other entries already there.