Group by in a Hive script across different CSV files - hive

Very new to hive scripting.
I have 5 CSV files with stock information for AAPL, AMZN, FB, GOOG, and NFLX. Within each file the columns are Date, Open, High, Low, Close, Adj Close, and Volume. I am trying to modify the script to display the one date on which total trading volume across the companies was the greatest. I know I need to GROUP BY marketDate, sum the volume, and sort appropriately.
All five files share the same column layout.
Currently my code is as follows:
------------------------------------------------------------------
-- needed when you want the output written as a single data file
------------------------------------------------------------------
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
------------------------------------------------------------------
--------------------------------------------------------
--- create a table with the stock data ---
--------------------------------------------------------
DROP TABLE IF EXISTS stockPrices;
CREATE EXTERNAL TABLE stockPrices(
marketDate STRING,
open DECIMAL(12,6),
high DECIMAL(12,6),
low DECIMAL(12,6),
close DECIMAL(12,6),
adjClose DECIMAL(12,6),
volume BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${INPUT}/stocks/data'
TBLPROPERTIES ("skip.header.line.count"="1");
--------------------------------------------------------
--- list the contents of the stockPrices table
--- including the virtual field INPUT__FILE__NAME to help identify the stock ticker
--- NOTE: INPUT__FILE__NAME requires 2 underscores before and after FILE (not 1 underscore)
--------------------------------------------------------
SELECT INPUT__FILE__NAME, sp.* FROM stockPrices sp LIMIT 10;
--------------------------------------------------------
--- output summary of stock data ---
--------------------------------------------------------
INSERT OVERWRITE DIRECTORY '${OUTPUT}/stocks/output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT INPUT__FILE__NAME, SUM(volume), MIN(adjClose), MAX(adjClose) FROM stockPrices GROUP BY INPUT__FILE__NAME;
How can I modify the code to display the one date for which trading volume across the companies was the greatest in total for my output summary?
Thank you in advance.
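For what it's worth, a minimal sketch of the summary query the question describes (grouping by marketDate, summing volume, and sorting descending to keep the single busiest date); table and column names are taken from the script above:

```sql
-- Sketch only: total volume per date, keeping the one date with the
-- greatest combined volume. Assumes the stockPrices table defined above.
SELECT marketDate, SUM(volume) AS totalVolume
FROM stockPrices
GROUP BY marketDate
ORDER BY totalVolume DESC
LIMIT 1;
```

If only four of the five tickers should count, a WHERE clause filtering on INPUT__FILE__NAME (e.g. excluding the file for the unwanted ticker) can be added before the GROUP BY.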

Related

Regular expression parsing pseudo-columnized data in SQL

I'm working on a project where we're trying to parse information out of reroute advisories issued by the FAA. These advisories are issued as free text with a loosely-identified structure, with the goal being to allow viewers to print them out on a single sheet of paper.
The area of most interest is the final portion of the advisory that contains specific information related to a given reroute - the origin and destination airport as well as the specific required route that applies to any flight between them. Here's an example:
ORIG DEST ROUTE
---- --------------- ---------------------------
FRG MCO TPA PIE SRQ WAVEY EMJAY J174 SWL CEBEE
FMY RSW APF WETRO DIW AR22 JORAY HILEY4
What I'd like to do is be able to parse this into three entries like this:
ORIG: FRG
DEST: MCO TPA PIE SRQ FMY RSW APF
ROUTE: WAVEY EMJAY J174 SWL CEBEE WETRO DIW AR22 JORAY HILEY4
Here are the three code segments I'm currently using to parse this portion of the advisories:
Origin
regexp_substr(route_1,'^(([A-Z0-9]|\(|\)|\-)+\s)+')
Destination
regexp_substr(route_1,'(([A-Z0-9]|\(|\)|\-)+\s)+',1,2)
Route String
regexp_substr(route_1, '\s{2,}>?([A-Z0-9]|>|<|\s|:)+<?$')
While these expressions can deal with the majority of situations where the Origin and Destination portions are all on the first line, they cannot deal with the example provided earlier. Does anyone know how I might be able to successfully parse the text in my original example?
Proof of concept.
Input file is a plain text file (with no tabs, only spaces). Vertically it is divided into three columns, of fixed width: 20 characters, 20 characters, whatever is left (till the end of line).
The first two rows are always populated, with the headers and -----. These two rows can be ignored. Then the rest of the file (reading down the page) are "grouped" by the latest non-empty string in the ORIG column.
The input file looks like this:
ORIG DEST ROUTE
---- --------------- ---------------------------
FRG MCO TPA PIE SRQ WAVEY EMJAY J174 SWL CEBEE
FMY RSW APF WETRO DIW AR22 JORAY HILEY4
ABC SFD RRE BAC TRIO SBL CRT
POLDA FARM OLE BID ORDG BALL
BINT LFV
YYT PSS TRI BABA TEN NINE FIVE
COL DMV
SAL PRT DUW PALO VR22 NOL3
Notice the empty lines between blocks, the empty DEST in one block (I handle that, although perhaps in the OP's problem that is not possible), and the different number of rows used by DEST and ROUTE in some cases.
The file name is inp.txt and it resides in a directory which I have made known to Oracle: create directory sandbox as 'c:\app\sandbox'. (First I had to grant create any directory to <myself>, while logged in as SYS.)
The output looks like this:
ORIG DEST ROUTE
----- --------------------------- ------------------------------------------------------
FRG MCO TPA PIE SRQ FMY RSW APF WAVEY EMJAY J174 SWL CEBEE WETRO DIW AR22 JORAY HILEY4
ABC SFD RRE BAC TRIO SBL CRT
POLDA FARM OLE BID ORDG BALL BINT LFV
YYT PSS TRI BABA COL DMV TEN NINE FIVE
SAL PRT DUW PALO VR22 NOL3
I did this in two steps. First I created a helper table, INP, with four columns (RN number, ORIG varchar2(20), DEST varchar2(20), ROUTE varchar2(20)) and I imported from the text file through a procedure. Then I processed this further and used the output to populate the final table. It is very unlikely that this is the most efficient way to do this (and perhaps there are very good reasons not to do it the way I did); I have no experience with UTL_FILE and importing text files into Oracle in general. I did this for two reasons: to learn, and to show it can be done.
The procedure to import the text file into the helper table:
CREATE OR REPLACE PROCEDURE read_inp IS
  f  UTL_FILE.FILE_TYPE;
  s  VARCHAR2(200);
  rn NUMBER := 1;
BEGIN
  f := UTL_FILE.FOPEN('SANDBOX', 'inp.txt', 'r', 200);
  LOOP
    UTL_FILE.GET_LINE(f, s);
    INSERT INTO inp (rn, orig, dest, route)
    VALUES (rn, TRIM(SUBSTR(s, 1, 20)), TRIM(SUBSTR(s, 21, 20)), TRIM(SUBSTR(s, 41)));
    rn := rn + 1;
  END LOOP;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    UTL_FILE.FCLOSE(f);
END;
/
exec read_inp
/
And the further processing (after creating the REROUTE table):
create table reroute ( orig varchar2(20), dest varchar2(4000), route varchar2(4000) );
insert into reroute
select max(orig),
trim(listagg(dest , ' ') within group (order by rn)),
trim(listagg(route, ' ') within group (order by rn))
from (
select rn, orig, dest, route, count(orig) over (order by rn) as grp
from inp
where rn >= 3
)
group by grp
;
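The grouping hinges on count(orig) over (order by rn): the running count increments only on rows with a non-empty ORIG, so every continuation row inherits the group number of the last populated ORIG above it. A self-contained illustration (made-up rows, not the OP's data):

```sql
-- Illustration only: the running COUNT(orig) stays flat over NULL rows,
-- so each block of rows shares one grp value.
with inp as (
  select 1 as rn, 'FRG' as orig from dual union all
  select 2, null from dual union all
  select 3, null from dual union all
  select 4, 'ABC' from dual
)
select rn, orig, count(orig) over (order by rn) as grp
from inp;
-- grp comes out as 1, 1, 1, 2
```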

Split Text into Table Rows with Read-Only Permissions

I am a read-only user for a database with the following problem:
Scenario:
Call center employees for a company submit tickets to me through our database on behalf of our clients. The call center includes alphanumeric lot numbers of an exact length in their message for me to troubleshoot. Depending on how many times a ticket is updated, there could be several messages for one ticket, each of them having zero or more of these alphanumeric lot numbers embedded in the message. I can access all of these messages with Oracle SQL and SQL Tools.
How can I extract just the lot numbers to make a single-column table of all the given lot numbers?
Example Data:
-- Accessing Ticket 1234 --
SELECT *
FROM communications_detail
WHERE ticket_num = 1234;
-- Results --
TICKET_NUM | MESSAGE_NUM | MESSAGE
------------------------------------------------------------------------------
1234 | 1 | A customer recently purchased some products with
| | a lot number of vwxyz12345 and wants to know if
| | they have been recalled.
------------------------------------------------------------------------------
1234 | 2 | Same customer found lots vwxyz23456 and zyxwv12345
| | in their storage as well and would like those checked.
------------------------------------------------------------------------------
1234 | 3 | These lots have not been recalled. Please inform
| | the client.
So-Far:
I am able to isolate lot numbers of a constant length with the following code, but the results go to standard output rather than into a table format.
DECLARE
msg VARCHAR2(200) := 'Same customer found lots vwxyz23456 and zyxwv12345 in their storage as well and would like those checked.';
cnt NUMBER := regexp_count(msg, '[[:alnum:]]{10}');
BEGIN
IF cnt > 0 THEN
FOR i IN 1..cnt LOOP
Dbms_Output.put_line(regexp_substr(msg, '[[:alnum:]]{10}', 1, i));
END LOOP;
END IF;
END;
/
Goals:
Output results into a table that can itself be used as a table in a larger query statement.
Somehow be able to apply this to all of the messages associated with the original ticket.
Update: Changed the example lot numbers from 8 to 10 characters long to avoid confusion with real words in the messages. The real-world scenario has much longer codes and very specific formatting, so a more complex regular expression will be used.
Update 2: Tried using a table variable instead of standard output. It didn't error, but it didn't populate my query tab... This may just be user error...!
DECLARE
TYPE lot_type IS TABLE OF VARCHAR2(10);
lots lot_type := lot_type();
msg VARCHAR2(200) := 'Same customer found lots vwxyz23456 and zyxwv12345 in their storage as well and would like those checked.';
cnt NUMBER := regexp_count(msg, '[[:alnum:]]{10}');
BEGIN
IF cnt > 0 THEN
FOR i IN 1..cnt LOOP
lots.extend();
lots(i) := regexp_substr(msg, '[[:alnum:]]{10}', 1, i);
END LOOP;
END IF;
END;
/
This is a regex pattern which matches the lot mask you provided: '[a-z]{3}[0-9]{5}'. Using something like this will help you avoid the false positives you mention in your question.
Now here is a read-only, pure SQL solution for you.
with cte as (
select 'Same customer found lots xyz23456 and zyx12345 in their storage as well and would like those checked.' msg
from dual)
select regexp_substr(msg, '[a-z]{3}[0-9]{5}', 1, level) as lotno
from cte
connect by level <= regexp_count(msg, '[a-z]{3}[0-9]{5}')
;
I'm using the WITH clause just to generate the data. The important thing is the use of the CONNECT BY operator, which is part of Oracle's hierarchical query syntax but here generates multiple rows from a single row. The LEVEL pseudo-column lets us traverse the string and pick out successive occurrences of the regex pattern.
Here's the output:
SQL> r
1 with cte as ( select 'Same customer found lots xyz23456 and zyx12345 in their storage as well and would like those checked.' msg from dual)
2 select regexp_substr(msg, '[a-z]{3}[0-9]{5}', 1, level) as lotno
3 from cte
4 connect by level <= regexp_count(msg, '[a-z]{3}[0-9]{5}')
5*
LOTNO
----------
xyz23456
zyx12345
SQL>

Multi-Row Per Record SQL Statement

I'm not sure this is possible but my manager wants me to do it...
Using the below picture as a reference, is it possible to retrieve a group of records, where each record has 2 rows of columns?
So columns: Number, Incident Number, Vendor Number, Customer Name, Customer Location, Status, Opened and Updated would be part of the first row and column: Work Notes would be a new row that spans the width of the report. Each record would have two rows. Is this possible with a GROUP BY statement?
Record 1
Row 1 = Number, Incident Number, Vendor Number, Customer Name, Customer Location, Status, Opened and Updated
Row 2 = Work Notes
Record 2
Row 1 = Number, Incident Number, Vendor Number, Customer Name, Customer Location, Status, Opened and Updated
Row 2 = Work Notes
Record n
...
I don't think that's possible with the built-in report engine. You'll need to export the data and format it with something else.
You could have something similar to what you want on short description (list report, group by short description), but you can't group by work notes so that's out.
One thing to note is that the work_notes field is not actually a field on the table, the work_notes field is of type journal_input, which means it's really just a gateway to the actual underlying data model. "Modifying" work_notes actually just inserts into sys_journal_field.
sys_journal_field is the table which stores the work notes you're looking for. Given a sys_id of an incident record, this URL will give you all journal field entries for that particular record:
/sys_journal_field_list.do?sysparm_query=name=task^element_id=<YOUR_SYS_ID>
You will notice this includes ALL journal fields (comments + work_notes + anything else), so if you just wanted work notes, you could simply add a query against element thusly:
/sys_journal_field_list.do?sysparm_query=name=task^element=work_notes^element_id=<YOUR_SYS_ID>
What this means for you!
While you can't separate a physical row into multiple logical rows in the UI, in the case of journal fields you can join your target table against the sys_journal_field table using a Database View. This deviates from your goal in that you wouldn't get a single row for all work notes, but rather an additional row for each matched work note.
Given an incident INC123 with 3 work notes, your report against the Database View would look kind of like this:
Row 1: INC123 | markmilly | This is a test incident |
Row 2: INC123 | | | Work note #1
Row 3: INC123 | | | Work note #2
Row 4: INC123 | | | Work note #3

Flat File Import: Remove Data

(Posted a similar question earlier but HR department changed conditions today)
Our HR department has an automated export from our SAP system in the form of a flat file. The information in the flat file looks like so.
G/L Account 4544000 Recruiting/Job Search
Company Code 0020
--------------------------
| Posting Date| LC amnt|
|------------------------|
| 01/01/2013 | 406.25 |
| 02/01/2013 | 283.33 |
| 03/21/2013 |1,517.18 |
--------------------------
G/L Account 4544000 Recruiting/Job Search
Company Code 0020
--------------------------
| Posting Date| LC amnt|
|------------------------|
| 05/01/2013 | 406.25 |
| 06/01/2013 | 283.33 |
| 07/21/2013 |1,517.18 |
--------------------------
When I look at the data in the SSIS Flat File Source Connection, all of the information is in a single column. I have tried setting the delimiter to Pipe, but it will not separate the data, I assume due to the nonessential information at the top and middle of the file.
I need to remove the data at the top and middle and then have the Date and Total split into two separate columns.
The goal of this is to separate the data so that I can get a single SUM for the running year.
Year Total
2013 $5123.25
I have tried to do this in SSIS but I can't seem to separate the columns or remove the data. I want to avoid a Script Task, as I am not familiar with the code or operation of that component.
Any assistance would be appreciated.
I would create a temp table that can import the whole flat file, and then do the filtering at the SQL level.
An example:
CREATE TABLE tmp (txtline VARCHAR(MAX))
BCP or SSIS the file into the tmp table.
Then run a query like this to get the result (you may need to adjust the string offsets and lengths to fit your flat file):
WITH cte AS (
SELECT
CAST(SUBSTRING(txtline,2,10) AS DATE) AS PostingDate,
CAST(REPLACE(REPLACE(SUBSTRING(txtline,15,100),'|',''),',','') AS NUMERIC(19,4)) AS LCAmount
FROM tmp
WHERE ISDATE(SUBSTRING(txtline,2,10)) = 1
)
SELECT
YEAR(PostingDate),
SUM(LCAmount)
FROM cte
GROUP BY YEAR(PostingDate)
Maybe you could use MS Excel to open the flat file, using the pipe character as the delimiter, and then create a CSV from that, if needed.
Short of a script task/component (or a full-blown custom SSIS component), I don't think you'll be able to parse that specific format in SSIS. The Flat File Connection Manager does allow you to select how many rows of your text file are headers to be skipped, but the format you're showing has multiple sections (and thus multiple headers). There's also the issue of the horizontal lines, which the Flat File Connection won't be able to properly handle.
I'd first see if there's any way to get a normal CSV file with this data out of SAP. If that turns out to be impossible, then you'll need some sort of custom code to strip out the excess text.

List unique values from column based on an ID Field

Warning: I'm a newbie, so sorry if there is anything wrong with the question or the explanation...
I have a table 'XYZ' with a list of attachments (OrigFileName) and a field UniqueAttchID that also exists on the header table 'ABC', recording the link so you can query which attachments relate to a record.
I need to take all XYZ records whose UniqueAttchID matches the header and write them back into a field on 'ABC' called 'udAttch', a memo field formatted with a comma separator.
This is to get around a limitation of the reporting functionality available to me: I can only use an actual field from the database, not a related table.
Current Setup:-
XYZ Table
UniqueAttchID OrigFileName
---------- -------------
18181818181 | Filename1
18181818181 | Filename2
18181818181 | Filename3
18181818182 | Filename1
ABC Table -
Description|Field2|UniqueAttchID|
test item |test |18181818181
Test item 2|test2 |18181818182
Desired result:-
(XYZ table would remain unchanged)
ABC Table -
Description|Field2|UniqueAttchID|udAttch|
test item |test |18181818181 |Filename1, Filename2, Filename3|
Test item 2|test2 |18181818182 |Filename1|
I've tried using COALESCE, but that doesn't give me a separate record for each UniqueAttchID, just one row for all records; and SELECT DISTINCT only produced the first record in OrigFileName.
I can then generate a Stored Procedure to run as required and update the record when new files are added as attachments.
Please try:
SELECT
    a.*,
    STUFF((SELECT ', ' + OrigFileName
           FROM XYZ b
           WHERE b.UniqueAttchID = a.UniqueAttchID
           FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '') AS [udAttch]
FROM ABC a
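As a side note, on SQL Server 2017 and later the same result can be produced without the STUFF/FOR XML PATH idiom by using STRING_AGG. A hedged sketch reusing the table and column names from the question:

```sql
-- Alternative sketch for SQL Server 2017+: STRING_AGG builds the
-- comma-separated list directly. Table/column names from the question.
SELECT a.*,
       (SELECT STRING_AGG(b.OrigFileName, ', ')
        FROM XYZ b
        WHERE b.UniqueAttchID = a.UniqueAttchID) AS udAttch
FROM ABC a;
```

The correlated subquery keeps one output row per ABC record, matching the desired result shown above.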