Hive query result based on header record - hive

I have a single file, shown below, that contains data combined from 4 different files in the source system.
Lines starting with NEWFILE= separate the data. For example, all lines after NEWFILE=STUDENT and before NEWFILE=SUBJECT belong to the STUDENT file.
The problem is that the records themselves carry no pattern that identifies which file they belong to.
The source system also cannot split the file into 4 files.
I need to load this single input file and separate the records according to the header they follow.
I loaded the data into a Hive table and tried the ROW_NUMBER and random functions.
My idea was to use ROW_NUMBER to find the row of each header and then filter the records between the header rows, but the ROW_NUMBER output does not match the actual line order of the file. As a result, a row belonging to STUDENT may be assigned to SUBJECT.
The random function does not help either, because it also does not give the actual row number.
The file content is given below:
NEWFILE=STUDENT
100 XYZ
101 ABC
102 DEF
NEWFILE=SUBJECT
1 ENGLISH
2 MATHS
NEWFILE=TEACHERS
110 AAAAAAAA
111 BBBBBBB
222 CCCCCCC
333 DDDDDD
NEWFILE=CLASSES
1 CLASS-1
2 CLASS-2
Please advise how I can achieve the desired output.

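You can recover the original line order from Hive's virtual columns: input__file__name identifies the source file and block__offset__inside__file gives each line's offset within it. A running count of the NEWFILE= header lines, ordered by that offset, then tags every record with the section it belongs to.
create external table myfile (rec string)
row format delimited
fields terminated by ','
tblproperties ('serialization.last.column.takes.rest'='true')
;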
select  rec
       ,ifn
       ,ifn_newfile_seq
       ,row_number () over
        (
            partition by ifn_newfile_seq
            order by     boif
        ) as ifn_newfile_rec_seq
from   (select  rec
               ,input__file__name           as ifn
               ,block__offset__inside__file as boif
               ,count(case when rec like 'NEWFILE=%' then 1 end) over
                (
                    partition by input__file__name
                    order by     block__offset__inside__file
                ) as ifn_newfile_seq
        from    myfile
       ) l
;
+------------------+----------------------------------------------+-----------------+---------------------+
| rec | ifn | ifn_newfile_seq | ifn_newfile_rec_seq |
+------------------+----------------------------------------------+-----------------+---------------------+
| NEWFILE=STUDENT | file:/home/cloudera/local_db/myfile/file.txt | 1 | 1 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 100 XYZ | file:/home/cloudera/local_db/myfile/file.txt | 1 | 2 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 101 ABC | file:/home/cloudera/local_db/myfile/file.txt | 1 | 3 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 102 DEF | file:/home/cloudera/local_db/myfile/file.txt | 1 | 4 |
+------------------+----------------------------------------------+-----------------+---------------------+
| NEWFILE=SUBJECT | file:/home/cloudera/local_db/myfile/file.txt | 2 | 1 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 1 ENGLISH | file:/home/cloudera/local_db/myfile/file.txt | 2 | 2 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 2 MATHS | file:/home/cloudera/local_db/myfile/file.txt | 2 | 3 |
+------------------+----------------------------------------------+-----------------+---------------------+
| NEWFILE=TEACHERS | file:/home/cloudera/local_db/myfile/file.txt | 3 | 1 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 110 AAAAAAAA | file:/home/cloudera/local_db/myfile/file.txt | 3 | 2 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 111 BBBBBBB | file:/home/cloudera/local_db/myfile/file.txt | 3 | 3 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 222 CCCCCCC | file:/home/cloudera/local_db/myfile/file.txt | 3 | 4 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 333 DDDDDD | file:/home/cloudera/local_db/myfile/file.txt | 3 | 5 |
+------------------+----------------------------------------------+-----------------+---------------------+
| NEWFILE=CLASSES | file:/home/cloudera/local_db/myfile/file.txt | 4 | 1 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 1 CLASS-1 | file:/home/cloudera/local_db/myfile/file.txt | 4 | 2 |
+------------------+----------------------------------------------+-----------------+---------------------+
| 2 CLASS-2 | file:/home/cloudera/local_db/myfile/file.txt | 4 | 3 |
+------------------+----------------------------------------------+-----------------+---------------------+
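The query above works because every line inherits the running count of NEWFILE= headers seen so far, once the rows are ordered by their offset in the file. The same logic can be sketched in plain Python over the sample lines. This is only an illustration of the two window functions, not Hive code, and the function name tag_sections is hypothetical:

```python
def tag_sections(lines, prefix="NEWFILE="):
    """Return (rec, newfile_seq, rec_seq) for each line, in file order."""
    tagged = []
    newfile_seq = 0  # running count of header lines seen so far
    rec_seq = 0      # position of the line inside its section
    for rec in lines:
        if rec.startswith(prefix):
            newfile_seq += 1  # a header opens the next section
            rec_seq = 0
        rec_seq += 1
        tagged.append((rec, newfile_seq, rec_seq))
    return tagged


sample = [
    "NEWFILE=STUDENT", "100 XYZ", "101 ABC", "102 DEF",
    "NEWFILE=SUBJECT", "1 ENGLISH", "2 MATHS",
    "NEWFILE=TEACHERS", "110 AAAAAAAA", "111 BBBBBBB", "222 CCCCCCC", "333 DDDDDD",
    "NEWFILE=CLASSES", "1 CLASS-1", "2 CLASS-2",
]

tagged = tag_sections(sample)
for row in tagged:
    print(row)
```

Keeping only rows with rec_seq greater than 1 for a given newfile_seq yields the data records of one section without its header.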

Related

Postgresql: Group rows in a row and add array

Hi, I have a table like this:
+----+----------+-------------+
| id | room_id | house_id |
+----+----------+-------------+
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 1 | 2 |
| 5 | 2 | 2 |
| 6 | 3 | 2 |
| 7 | 1 | 3 |
| 8 | 2 | 3 |
| 9 | 3 | 3 |
+----+----------+-------------+
and I want to create a view like this:
+----+----------+-------------+
| id | house_id | rooms |
+----+----------+-------------+
| 1 | 1 | [1,2,3] |
| 2 | 2 | [1,2,3] |
| 3 | 3 | [1,2,3] |
+----+----------+-------------+
I tried many ways, but I can't group them into one row.
Thanks for any help.
You can use array_agg():
select house_id, array_agg(room_id order by room_id) as rooms
from t
group by house_id;
If you want the first column to be incremental, you can use row_number():
select row_number() over (order by house_id) as id, . . .
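To make the grouping concrete, here is a minimal Python sketch of what array_agg() with row_number() produces on the sample data. It only mirrors the logic of the query; the variable names are illustrative and not part of any PostgreSQL API:

```python
from collections import defaultdict

# (id, room_id, house_id) rows from the question
rows = [
    (1, 1, 1), (2, 2, 1), (3, 3, 1),
    (4, 1, 2), (5, 2, 2), (6, 3, 2),
    (7, 1, 3), (8, 2, 3), (9, 3, 3),
]

# group by house_id, collecting room_id values (the array_agg step)
rooms_by_house = defaultdict(list)
for _id, room_id, house_id in rows:
    rooms_by_house[house_id].append(room_id)

# number the groups in house_id order (the row_number step)
view = [
    (new_id, house_id, sorted(rooms))
    for new_id, (house_id, rooms) in enumerate(sorted(rooms_by_house.items()), start=1)
]

for row in view:
    print(row)
```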

SQL - Create number of categories based on pre-defined number of splits

I am using BigQuery, and trying to assign categorical values to each of my records, based on the number of 'splits' assigned to it.
The table has a cumulative count of records, grouped at the STR level - i.e., if there are 4 SKUs at 2 STR, the SKUs will be labeled 1,2,3,4. Each STR is assigned a SPLIT value, so if the STR has a SPLIT value of 2, I want it to split its SKUs into 2 categories. I want to create another column that would assign SKUs labeled 1-2 as '1', and SKUs labeled 3-4 as '2'. (The actual data is on a much larger scale, but thought this would be easier.)
+-----+------+---------------+--------+
| STR | SKU | SKU_ROW_COUNT | SPLITS |
+-----+------+---------------+--------+
| 1 | 1230 | 1 | 3 |
| 1 | 1231 | 2 | 3 |
| 1 | 1232 | 3 | 3 |
| 1 | 1233 | 4 | 3 |
| 1 | 1234 | 5 | 3 |
| 1 | 1235 | 6 | 3 |
| 2 | 1310 | 1 | 2 |
| 2 | 1311 | 2 | 2 |
| 2 | 1312 | 3 | 2 |
| 2 | 1313 | 4 | 2 |
| 3 | 2345 | 1 | 1 |
| 3 | 2346 | 2 | 1 |
| 3 | 2347 | 3 | 1 |
+-----+------+---------------+--------+
The SPLITS column is dynamic, ranging from 1 to 3. The number of SKUs in each category should be relatively equal, but that's not a priority as much as just the number of groups that are created. Ideally, the final table with the new column (HOST_NUMBER) would look something like this:
+-----+------+---------------+--------+-------------+
| STR | SKU | SKU_ROW_COUNT | SPLITS | HOST_NUMBER |
+-----+------+---------------+--------+-------------+
| 1 | 1230 | 1 | 3 | 1 |
| 1 | 1231 | 2 | 3 | 1 |
| 1 | 1232 | 3 | 3 | 2 |
| 1 | 1233 | 4 | 3 | 2 |
| 1 | 1234 | 5 | 3 | 3 |
| 1 | 1235 | 6 | 3 | 3 |
| 2 | 1310 | 1 | 2 | 1 |
| 2 | 1311 | 2 | 2 | 1 |
| 2 | 1312 | 3 | 2 | 2 |
| 2 | 1313 | 4 | 2 | 2 |
| 3 | 2345 | 1 | 1 | 1 |
| 3 | 2346 | 2 | 1 | 1 |
| 3 | 2347 | 3 | 1 | 1 |
+-----+------+---------------+--------+-------------+
You can use window functions and arithmetic:
select
t.*,
1 + floor((sku_row_count - 1) * splits / count(*) over(partition by str)) host_number
from mytable t
order by sku
Actually, ntile() seems to do exactly what you want - and you don't even need the sku_row_count column (which basically mimics row_number() anyway):
select
t.*,
ntile(splits) over(partition by str order by sku) host_number
from mytable t
order by sku
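As a sanity check of the arithmetic in the first query, the following Python sketch applies the same formula to the sample rows. It assumes the per-STR row count plays the role of count(*) over(partition by str), and it uses integer division, which equals floor() here because all values are non-negative:

```python
from collections import Counter

# (str, sku, sku_row_count, splits) rows from the question
rows = [
    (1, 1230, 1, 3), (1, 1231, 2, 3), (1, 1232, 3, 3),
    (1, 1233, 4, 3), (1, 1234, 5, 3), (1, 1235, 6, 3),
    (2, 1310, 1, 2), (2, 1311, 2, 2), (2, 1312, 3, 2), (2, 1313, 4, 2),
    (3, 2345, 1, 1), (3, 2346, 2, 1), (3, 2347, 3, 1),
]

# number of rows per STR, the equivalent of count(*) over(partition by str)
rows_per_str = Counter(store for store, _sku, _n, _k in rows)


def host_number(store, sku_row_count, splits):
    # 1 + floor((sku_row_count - 1) * splits / rows in this STR)
    return 1 + (sku_row_count - 1) * splits // rows_per_str[store]


hosts = [host_number(store, n, k) for store, _sku, n, k in rows]
print(hosts)  # [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 1, 1, 1]
```

On this sample the result matches the HOST_NUMBER column requested in the question.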
If the ordering of the values in the groups doesn't matter, just use modulo arithmetic:
select t.*, (SKU_ROW_COUNT % SPLITS) as split_group
from t
Below is for BigQuery Standard SQL
#standardSQL
SELECT *, 1 + MOD(SKU_ROW_COUNT, SPLITS) AS HOST_NUMBER
FROM `project.dataset.table`
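For comparison, a small Python sketch of the modulo variant on the same sample shows that it creates the right number of groups per STR, but assigns them in rotation rather than in contiguous blocks:

```python
# (str, sku_row_count, splits) taken from the sample table
rows = [
    (1, 1, 3), (1, 2, 3), (1, 3, 3), (1, 4, 3), (1, 5, 3), (1, 6, 3),
    (2, 1, 2), (2, 2, 2), (2, 3, 2), (2, 4, 2),
    (3, 1, 1), (3, 2, 1), (3, 3, 1),
]

# 1 + MOD(SKU_ROW_COUNT, SPLITS)
mod_hosts = [1 + n % k for _store, n, k in rows]
print(mod_hosts)  # [2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 1, 1, 1]

# number of distinct groups per STR equals SPLITS
groups_per_str = {}
for (store, _n, _k), host in zip(rows, mod_hosts):
    groups_per_str.setdefault(store, set()).add(host)
print({store: len(hosts) for store, hosts in groups_per_str.items()})  # {1: 3, 2: 2, 3: 1}
```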

Join and Group Three Tables On Multiple Criteria - SQL

I am trying to join three separate tables based on certain criteria. Here are table examples:
TABLE A
+----+------------+----------+---------+
| id | entry num | line num | inv line|
+----+------------+----------+---------+
| 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 2 |
| 3 | 2 | 1 | 1 |
| 4 | 2 | 2 | 1 |
| 5 | 3 | 1 | 1 |
| 6 | 3 | 1 | 2 |
| 7 | 3 | 1 | 3 |
+----+------------+----------+---------+
TABLE B
+----+------------+----------+---------+
| id | entry num | line num | code |
+----+------------+----------+---------+
| 1 | 1 | 1 | 100 |
| 2 | 2 | 1 | 370 |
| 3 | 2 | 2 | 120 |
| 4 | 3 | 1 | 300 |
+----+------------+----------+---------+
TABLE C
+----+------------+--------+-----------+
| id | rate | amt | code |
+----+------------+--------+-----------+
| 1 | 25% | $50 | 100 |
| 2 | 50% | $20 | 370 |
| 3 | 50% | $25 | 120 |
| 4 | 30% | $150 | 300 |
+----+------------+--------+-----------+
I need the final table to look like this, but I am at a loss on how to write the syntax:
FINAL TABLE
+----+------------+----------+---------+---------+---------+---------+
| id | entry num | line num | inv line| code | rate | amt |
+----+------------+----------+---------+---------+---------+---------+
| 1 | 1 | 1 | 1 | 100 | 25% | $50 |
| 2 | 1 | 1 | 2 | 100 | 25% | $50 |
| 3 | 2 | 1 | 1 | 370 | 50% | $20 |
| 4 | 2 | 2 | 1 | 120 | 50% | $25 |
| 5 | 3 | 1 | 1 | 300 | 30% | $150 |
| 6 | 3 | 1 | 2 | 300 | 30% | $150 |
| 7 | 3 | 1 | 3 | 300 | 30% | $150 |
+----+------------+----------+---------+---------+---------+---------+
Ultimately, I need table A and B joined where both entry num and line num match, but then I need to show each individual row for the inv line number.
For example, entry num 3 / line num 1 has 3 invoice lines. All rows for entry num 3 / line num 1 will have the code 300, the 30% rate, and the $150 amount, but I need to see that there are 3 separate invoice lines.
I've tried to join tables, group them, and get total counts, but to no avail. Thanks for your help!
I think that you need to join TableA and TableB on EntryNum and LineNum, and then join TableB and TableC on Code. Your SQL should look like:
SELECT A.ID, A.EntryNum, A.LineNum, A.InvLine, B.Code, C.Rate, C.Amt
FROM TableC AS C INNER JOIN (TableB AS B INNER JOIN TableA AS A ON (B.LineNum = A.LineNum) AND (B.EntryNum = A.EntryNum))
ON C.Code = B.Code;
This produces the result that you want.
Regards,
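One way to check the join on the sample data is an in-memory SQLite database driven from Python. This is a sketch under assumptions: the column names without spaces (EntryNum, LineNum, InvLine) follow the answer rather than the question's headers, the joins are written in the flat form instead of the nested parentheses, and Rate and Amt are stored as text exactly as shown in the tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TableA (ID INTEGER, EntryNum INTEGER, LineNum INTEGER, InvLine INTEGER);
CREATE TABLE TableB (ID INTEGER, EntryNum INTEGER, LineNum INTEGER, Code INTEGER);
CREATE TABLE TableC (ID INTEGER, Rate TEXT, Amt TEXT, Code INTEGER);
INSERT INTO TableA VALUES (1,1,1,1),(2,1,1,2),(3,2,1,1),(4,2,2,1),(5,3,1,1),(6,3,1,2),(7,3,1,3);
INSERT INTO TableB VALUES (1,1,1,100),(2,2,1,370),(3,2,2,120),(4,3,1,300);
INSERT INTO TableC VALUES (1,'25%','$50',100),(2,'50%','$20',370),(3,'50%','$25',120),(4,'30%','$150',300);
""")

# A joins B on both keys, so every invoice line of A keeps its own row;
# B then joins C on Code to pick up the rate and amount.
result = con.execute("""
SELECT A.ID, A.EntryNum, A.LineNum, A.InvLine, B.Code, C.Rate, C.Amt
FROM TableA AS A
INNER JOIN TableB AS B ON B.EntryNum = A.EntryNum AND B.LineNum = A.LineNum
INNER JOIN TableC AS C ON C.Code = B.Code
ORDER BY A.ID
""").fetchall()

for row in result:
    print(row)
```

Because the (entry num, line num) pair is unique in Table B and the code is unique in Table C, the join keeps exactly one output row per row of Table A.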

Update columns based on record count and count total

I need some help writing a script to update a table.
The table has the following:
| StudentID | Name | Record | Label |
| 1 | Ed | 1 | 1 |
| 1 | Ed | 1 | 1 |
| 1 | Ed | 1 | 1 |
| 1 | Ed | 1 | 1 |
| 2 | Bob | 1 | 1 |
| 2 | Bob | 1 | 1 |
| 2 | Bob | 1 | 1 |
| 2 | Bob | 1 | 1 |
I would like to update the Record and Label columns, so that the query would increment the Record column from 1 to n for the same StudentId. The Label column would also need to be updated to display the record # of total number of records for that StudentId.
The result for Ed should be:
| StudentID | Name | Record | Label |
| 1 | Ed | 1 | 1 of 4 |
| 1 | Ed | 2 | 2 of 4 |
| 1 | Ed | 3 | 3 of 4 |
| 1 | Ed | 4 | 4 of 4 |
Hoping someone can help me with this.
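The requested numbering can be sketched in Python as follows. This only illustrates the logic, assuming the rows of one student may be numbered in any order since they are otherwise identical. In SQL the same idea is usually expressed with ROW_NUMBER() and COUNT(*) OVER (PARTITION BY StudentID), but that query is not shown here and would need to be adapted to the actual database:

```python
from collections import Counter

rows = [
    {"StudentID": 1, "Name": "Ed", "Record": 1, "Label": "1"},
    {"StudentID": 1, "Name": "Ed", "Record": 1, "Label": "1"},
    {"StudentID": 1, "Name": "Ed", "Record": 1, "Label": "1"},
    {"StudentID": 1, "Name": "Ed", "Record": 1, "Label": "1"},
    {"StudentID": 2, "Name": "Bob", "Record": 1, "Label": "1"},
    {"StudentID": 2, "Name": "Bob", "Record": 1, "Label": "1"},
    {"StudentID": 2, "Name": "Bob", "Record": 1, "Label": "1"},
    {"StudentID": 2, "Name": "Bob", "Record": 1, "Label": "1"},
]

# total number of rows per student (the "of n" part of the label)
totals = Counter(row["StudentID"] for row in rows)

# running counter per student (the 1..n record number)
seen = Counter()
for row in rows:
    seen[row["StudentID"]] += 1
    row["Record"] = seen[row["StudentID"]]
    row["Label"] = f'{row["Record"]} of {totals[row["StudentID"]]}'

for row in rows:
    print(row)
```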

Ask about query in sql server

I have a table like this:
| ID | id_number | a | b |
| 1 | 1 | 0 | 215 |
| 2 | 2 | 28 | 8952 |
| 3 | 3 | 10 | 2000 |
| 4 | 1 | 0 | 215 |
| 5 | 1 | 0 |10000 |
| 6 | 3 | 10 | 5000 |
| 7 | 2 | 3 |90933 |
I want to sum a*b over all rows that share the same id_number. What query returns this value for every id_number? For example, the result should look like this:
| ID | id_number | result |
| 1 | 1 | 0 |
| 2 | 2 | 523455 |
| 3 | 3 | 70000 |
This is a simple aggregation query:
select id_number, sum(a*b)
from t
group by id_number
I'm not sure what the first column is for.
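The aggregation can be checked against the sample data with an in-memory SQLite database from Python. This is a sketch that assumes the table is named t, as in the query above, and that adds a result alias and an ORDER BY only to make the output easy to compare:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (ID INTEGER, id_number INTEGER, a INTEGER, b INTEGER)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, 1, 0, 215),
        (2, 2, 28, 8952),
        (3, 3, 10, 2000),
        (4, 1, 0, 215),
        (5, 1, 0, 10000),
        (6, 3, 10, 5000),
        (7, 2, 3, 90933),
    ],
)

# sum of a*b per id_number
totals = con.execute(
    "SELECT id_number, SUM(a * b) AS result FROM t GROUP BY id_number ORDER BY id_number"
).fetchall()

print(totals)  # [(1, 0), (2, 523455), (3, 70000)]
```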