Parsing Array<String> in Hive using OpenCSVSerde

I am getting data in the format below:
100|15|N-PS-GL-PSJOB|1,A|JFGLFX48|"AAAA"|102
100|15|N-PS-GL-PSJOB|2,A|JFGLFX48|"AAEE"|102
100|15|N-PS-GL-PSJOB|1,A|JFGLFX48|"AXXX"|102
100|15|N-PS-GL-PSJOB|2,A|JFGLFX48|"ABCH"|102
I need to:
parse each line using "|" as the delimiter and split the fourth column's value on ','
remove the quotes
I used array<string> as the data type for the 4th column and OpenCSVSerde to remove the quotes. But how can I specify the split (COLLECTION ITEMS TERMINATED BY ',') when using OpenCSVSerde?
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES("separatorChar" = "|","quoteChar" = "\"")
Thanks
Surya

There are a couple of ways to solve this. The best option is to pre-process the data in a Hive landing table and then load the result into a destination Hive table stored as ORC or Parquet, which generally reads faster than plain text parsed through a SerDe.
Create a landing table pointing to your text data:
create external table landing (c1 string, c2 string, c3 string, c4 string, c5 string, c6 string, c7 string) row format delimited fields terminated by '|' location '/user/hdfs/landing';
Create the destination table:
create external table destination(c1 string,c2 string,c3 string,c4 string,c5 string,c6 string ,c7 string,c8 string) stored as orc location '/user/hdfs/destination';
Insert data into the destination table by selecting records from the landing table:
insert into destination select c1, c2, c3, split(c4,',')[0] , split(c4,',')[1] ,c5, regexp_replace(c6,'\"','') , c7 from landing;
If the 4th column is supposed to be a collection, you can keep the result of split(c4,',') as a single array<string> column instead. For the sake of simplicity, I split it into 2 separate string columns.
If you only want to create a view on the first table, that is fine too. But a view still executes against the text data, which is slower than reading ORC/Parquet.
Results
select * from landing
100 15 N-PS-GL-PSJOB 1,A JFGLFX48 "AAAA" 102
100 15 N-PS-GL-PSJOB 2,A JFGLFX48 "AAEE" 102
100 15 N-PS-GL-PSJOB 1,A JFGLFX48 "AXXX" 102
100 15 N-PS-GL-PSJOB 2,A JFGLFX48 "ABCH" 102
select * from destination
100 15 N-PS-GL-PSJOB 1 A JFGLFX48 AAAA 102
100 15 N-PS-GL-PSJOB 2 A JFGLFX48 AAEE 102
100 15 N-PS-GL-PSJOB 1 A JFGLFX48 AXXX 102
100 15 N-PS-GL-PSJOB 2 A JFGLFX48 ABCH 102
Other options are to apply the same expressions directly when querying the landing table:
select c1, c2, c3, split(c4,',')[0] , split(c4,',')[1] ,c5, regexp_replace(c6,'\"','') , c7 from landing;
Or, if you want to hide the complexity of the replace and split from the query, create a view with the same query:
create view landingview as select c1, c2, c3, split(c4,',')[0] , split(c4,',')[1] ,c5, regexp_replace(c6,'\"','') , c7 from landing;
Both yield the same result:
100 15 N-PS-GL-PSJOB 1 A JFGLFX48 AAAA 102
100 15 N-PS-GL-PSJOB 2 A JFGLFX48 AAEE 102
100 15 N-PS-GL-PSJOB 1 A JFGLFX48 AXXX 102
100 15 N-PS-GL-PSJOB 2 A JFGLFX48 ABCH 102
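
For readers who want to check the intended transformation outside Hive, here is a minimal Python sketch of the same logic (pipe-delimited fields, quotes stripped, fourth field split on ','). It is only an illustration of what the split and regexp_replace expressions above produce, not a replacement for the Hive tables; the names raw_lines and parse_rows are made up for this example.

```python
import csv

# Sample records copied from the question.
raw_lines = [
    '100|15|N-PS-GL-PSJOB|1,A|JFGLFX48|"AAAA"|102',
    '100|15|N-PS-GL-PSJOB|2,A|JFGLFX48|"AAEE"|102',
    '100|15|N-PS-GL-PSJOB|1,A|JFGLFX48|"AXXX"|102',
    '100|15|N-PS-GL-PSJOB|2,A|JFGLFX48|"ABCH"|102',
]


def parse_rows(lines):
    """Split each line on '|', drop the quotes, and turn field 4 into a list."""
    rows = []
    # csv.reader strips the quoteChar for us, much like OpenCSVSerde does.
    for fields in csv.reader(lines, delimiter="|", quotechar='"'):
        # Equivalent of split(c4, ',') in the Hive query.
        fields[3] = fields[3].split(",")
        rows.append(fields)
    return rows


parsed = parse_rows(raw_lines)
for row in parsed:
    print(row)
```

Here the fourth field stays a list, which corresponds to keeping the column as an array rather than splitting it into two string columns.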

Related

create a new table from 2 other tables

I want to build table a from two other tables, b and c, where table a contains the columns (Parent, Style, Ending_Date, WeekNum, Net_Requirment), and calculate how much of each component is required to make product A on a given date.
The table should look like a BOM (Bill of Materials).
Can this be done with pandas?
Table b represents the demand for product A per date:
Style Date WeekNum Quantity
A 24/11/2019 0 600
A 01/12/2019 1 500
Table c represents the components and the quantities used to make product A:
Parent Child Q
A A1 2
A1 A11 3
A1 A12 2
So table a should be filled like this:
Parent Child Date WeekNum Net_Quantity
A A1 24/11/2019 0 1200
A1 A11 24/11/2019 0 3600
A1 A12 24/11/2019 0 2400
A A1 01/12/2019 1 1000
A1 A11 01/12/2019 1 3000
A1 A12 01/12/2019 1 2000

Welcome. To merge these tables you need a common key to merge on. You could add such a key (here, Style) to the BOM table like this:
import pandas as pd

data2 = {'Parent':['A','A1','A1'], 'Child':['A1','A11','A12'],
'Q':[2,3,2], 'Style':['A','A','A']}
df2 = pd.DataFrame(data2)
After this you can do a left join from the demand table, which gives you multiple rows for the same date.
(Note that in a left join, each row of the left table is repeated once for every row of the right table that matches its key.)
data = {'Style':['A','A'], 'Date':['24/11/2019', '01/12/2019'],
'WeekNum':[0,1], 'Quantity':[600,500]}
df = pd.DataFrame(data)
mergeDf = df.merge(df2,how='left', left_on='Style', right_on='Style')
mergeDf
Then, to calculate the net quantity:
mergeDf['Net_Quantity'] = mergeDf.Quantity * mergeDf.Q
mergeDf.drop(['Q'], axis=1, inplace=True)
Result:
Style Date WeekNum Quantity Parent Child Net_Quantity
0 A 24/11/2019 0 600 A A1 1200
1 A 24/11/2019 0 600 A1 A11 1800
2 A 24/11/2019 0 600 A1 A12 1200
3 A 01/12/2019 1 500 A A1 1000
4 A 01/12/2019 1 500 A1 A11 1500
5 A 01/12/2019 1 500 A1 A12 1000
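
Note that the merge above multiplies every BOM row by the top-level demand only, so A11 comes out as 1800 rather than the 3600 shown in the question. To get the figures in the question, the quantity has to be carried down the parent chain (600 x 2 for A1, then 1200 x 3 for A11). The following is a minimal sketch of that multi-level explosion in plain Python rather than pandas; it assumes the BOM has no cycles, and the names children, explode and table_a are made up for this example.

```python
from collections import defaultdict

# Table b: demand for the finished product (Style, Date, WeekNum, Quantity).
demand = [
    ("A", "24/11/2019", 0, 600),
    ("A", "01/12/2019", 1, 500),
]

# Table c: bill of materials (Parent, Child, Q).
bom = [("A", "A1", 2), ("A1", "A11", 3), ("A1", "A12", 2)]

# Index the BOM by parent so each level can look up its children.
children = defaultdict(list)
for parent, child, q in bom:
    children[parent].append((child, q))


def explode(parent, qty, date, week):
    """Return (Parent, Child, Date, WeekNum, Net_Quantity) rows below `parent`.

    Assumes the BOM is acyclic; a cycle would recurse forever.
    """
    rows = []
    for child, q in children.get(parent, []):
        net = qty * q  # requirement for this child given the parent's quantity
        rows.append((parent, child, date, week, net))
        rows.extend(explode(child, net, date, week))  # go one level deeper
    return rows


table_a = []
for style, date, week, qty in demand:
    table_a.extend(explode(style, qty, date, week))

for row in table_a:
    print(*row)
```

The resulting rows can be passed to pd.DataFrame with the column names from the question if a DataFrame is needed.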

Consolidating 5 tables into one

I have 5 tables that need to go into Result: 2-digit, 3-digit, 4-digit, 5-digit, and 6-digit. They all have the same structure. Would the following code accomplish the task?
Insert into Result select * from 2-digit, 3-digit, 4-digit, 5-digit, 6-digit;
Or does it need to look like this?
Insert into Result select * from 2-digit, select * from 3-digit, select * from 4-digit,select * from 5-digit,select * from 6-digit;
Below is some sample data. The desired result is simply to consolidate these three tables into one, with no manipulation of the data or rows. The end result should have 12 rows.
2 digit
x job code employment
32 10 4569
32 11 4521
3 digit
x job code employment
32 101 1203
32 102 3366
32 111 1000
32 112 3521
4 digit
x job code employment
32 1011 1203
32 1025 1000
32 1028 2366
32 1111 500
32 1112 500
32 1123 2899
32 1124 45
32 1125 577

You can solve this problem with SELECT ... UNION ALL, which keeps every row (plain UNION would remove duplicates):
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
UNION ALL
SELECT * FROM table3
. . . and so on
If the table structures are not exactly the same (a different number of columns, slightly different column names, or a different column order), you will have to replace * with explicitly listed column names.
Having five tables in the first place is arguably a flaw in the database design: why not keep all the rows in one table and just SELECT the rows you want?
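
As a rough check of the INSERT ... SELECT ... UNION ALL pattern, here is a small sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in for whatever DBMS the question uses, the column names x, job_code and employment are assumed from the sample data, and only the first two sample tables are loaded. Because the table names start with a digit and contain a hyphen, they are quoted here; the exact quoting syntax varies by DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same structure for the source tables and the target table.
for name in ("2-digit", "3-digit", "Result"):
    cur.execute(
        f'CREATE TABLE "{name}" (x INTEGER, job_code INTEGER, employment INTEGER)'
    )

cur.executemany(
    'INSERT INTO "2-digit" VALUES (?, ?, ?)',
    [(32, 10, 4569), (32, 11, 4521)],
)
cur.executemany(
    'INSERT INTO "3-digit" VALUES (?, ?, ?)',
    [(32, 101, 1203), (32, 102, 3366), (32, 111, 1000), (32, 112, 3521)],
)

# One INSERT fed by a UNION ALL of the source tables; add one
# "UNION ALL SELECT * FROM ..." branch per remaining table.
cur.execute(
    """
    INSERT INTO Result
    SELECT * FROM "2-digit"
    UNION ALL
    SELECT * FROM "3-digit"
    """
)
conn.commit()

result_rows = cur.execute("SELECT * FROM Result ORDER BY job_code").fetchall()
for row in result_rows:
    print(row)
```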

Use SUBSTRING():
INSERT INTO Results
SELECT SUBSTRING('102100',1,2) [2-digits],
SUBSTRING('102100',1,3) [3-digits],
SUBSTRING('102100',1,4) [4-digits],
SUBSTRING('102100',1,5) [5-digits],
'102100' [6-digits]
FROM tablename
Result
2-digits 3-digits 4-digits 5-digits 6-digits
10 102 1021 10210 102100
Substitute '102100' with the name of the corresponding column in your table.

SQL selecting values between two columns with a list

I'm attempting to find rows given a list of values where one of the values is in a range between two of the columns, as an example:
id column1 column2
1 1 5
2 6 10
3 11 15
4 16 20
5 21 25
...
99 491 495
100 496 500
I'd like to give a list of values, e.g. (23, 83, 432, 334, 344) which would return the rows
id column1 column2
5 21 25
17 81 85
87 431 435
67 331 335
69 341 345
The only way I can think of doing this so far has been to split each value into its own call by doing
SELECT * FROM TableA WHERE (column1 < num1 AND num1 < column2)
However this scales quite poorly when the list of numbers is around several million.
Is there any better way of doing this?
Thanks for the help.

Putting millions of numbers into the SQL command itself would be unwieldy.
Instead, load the numbers into a (temporary) table.
Then you can join the two tables (note that BETWEEN is inclusive at both ends, unlike the strict < comparisons in the question):
SELECT *
FROM TableA JOIN TempTable
ON TempTable.Value BETWEEN TableA.column1 AND TableA.column2;
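
A minimal sketch of this approach, run against an in-memory SQLite database from Python. SQLite is an assumption here, used only so the example is self-contained; TableA is filled with the 100 ranges from the question and TempTable with the five sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# TableA as in the question: id 1 covers 1..5, id 2 covers 6..10, and so on.
cur.execute(
    "CREATE TABLE TableA (id INTEGER PRIMARY KEY, column1 INTEGER, column2 INTEGER)"
)
cur.executemany(
    "INSERT INTO TableA VALUES (?, ?, ?)",
    [(i, 5 * (i - 1) + 1, 5 * i) for i in range(1, 101)],
)

# Load the lookup values into a temporary table instead of the SQL text.
cur.execute("CREATE TEMP TABLE TempTable (Value INTEGER)")
cur.executemany(
    "INSERT INTO TempTable VALUES (?)",
    [(v,) for v in (23, 83, 432, 334, 344)],
)

# One join replaces one query per value.
matches = cur.execute(
    """
    SELECT TableA.id, TableA.column1, TableA.column2
    FROM TableA JOIN TempTable
      ON TempTable.Value BETWEEN TableA.column1 AND TableA.column2
    ORDER BY TableA.id
    """
).fetchall()

for row in matches:
    print(row)
```

With millions of values, whether the join is fast in practice depends on the DBMS and on the indexes available on column1 and column2.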

Using sql query to print the result in a serialized format

I have a database table like this:
C1 C2 C3
---------------------
81 1 10
81 2 20
81 3 30
82 1 40
82 2 50
82 3 60
Note that it has no primary key.
I want to run a query that prints each C1 value followed by all of the C3 values that occur with it, i.e. the output in a serialised format. Something like this:
81 10 20 30
82 40 50 60
The one approach I can think of is using rownum, but I am not sure if that's the way to go about it. Is there a better way of doing this?

The query will depend on the DBMS you use.
In MySQL, you can use the group_concat function:
select c1, group_concat(c3 separator ' ')
from t
group by c1;
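
To try the idea without a MySQL server, here is a small sketch using Python's built-in sqlite3 module. SQLite also has a group_concat aggregate, but it takes the separator as a second argument instead of MySQL's SEPARATOR keyword, so the query text differs slightly from the one above. SQLite does not guarantee the order of the concatenated values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table from the question (no primary key).
cur.execute("CREATE TABLE t (c1 INTEGER, c2 INTEGER, c3 INTEGER)")
cur.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(81, 1, 10), (81, 2, 20), (81, 3, 30),
     (82, 1, 40), (82, 2, 50), (82, 3, 60)],
)

# SQLite spelling: group_concat(expr, separator).
serialized = cur.execute(
    "SELECT c1, group_concat(c3, ' ') FROM t GROUP BY c1 ORDER BY c1"
).fetchall()

for c1, c3_values in serialized:
    print(c1, c3_values)
```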

Rename data from Oracle Column

I would like to rename certain data from an Oracle table. Let's assume the data in table "Random Items" has the form
Day Item Total
12/3 102 12
12/3 423 28
12/4 102 48
I would like to rename the Item number to a specific string so when I grab the data from the table the output will look like
Day Item Total
12/3 Shoe 12
12/3 Orange 28
12/4 Shoe 48
so Shoe = 102 and Orange = 423
I have no write access to the tables. I've looked at commands such as rename, synonym and replace, but they all rename a specific table or column. I would like to rename the data in the table.
Thank you

select day, case ITEM when 102 then 'shoe'
when 423 then 'orange'
end itemname, total
from items
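
The CASE expression is standard SQL, so it can be tried outside Oracle. The sketch below runs the same mapping against an in-memory SQLite database from Python; SQLite and the items table definition are assumptions made only so the example is self-contained. The query only changes what is returned, so no write access to the stored data is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Sample data from the question.
cur.execute("CREATE TABLE items (day TEXT, item INTEGER, total INTEGER)")
cur.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("12/3", 102, 12), ("12/3", 423, 28), ("12/4", 102, 48)],
)

# Map item numbers to names at query time; unmatched items become NULL
# because there is no ELSE branch.
renamed = cur.execute(
    """
    SELECT day,
           CASE item WHEN 102 THEN 'Shoe'
                     WHEN 423 THEN 'Orange'
           END AS itemname,
           total
    FROM items
    ORDER BY rowid
    """
).fetchall()

for row in renamed:
    print(row)
```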

Try using DECODE, like this:
select day, decode(item, '102', 'Shoe', '423', 'Orange',...), total from items
