I am trying to load data into a regular table from a temporary table in Hive. Below are the first few rows from the temporary table.
ProdNo,ProdName,ProdMfg,ProdQOH,ProdPrice,ProdNextShipDate
P0036566,17 inch Color Monitor,ColorMeg Inc.,12,$169.00,2013-02-20
P0036577,19 inch Color Monitor,ColorMeg Inc.,10,$319.00,2013-02-20
P1114590,R3000 Color Laser Printer,Connex,5,$699.00,2013-01-22
I am using the below code to do so
insert overwrite table product
SELECT
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) ProdNo,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) ProdName,
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) ProdMfg,
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) ProdQOH,
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) ProdPrice,
regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) ProdNextShipDate
from product_temp;
After I run the above code, all the columns in the regular table are correct except ProdPrice, which is NULL in every row. So how do I extract the price from the temporary table without the $ symbol and load it into the regular table? Below is the current output, where ProdPrice is null.
ProdNo ProdName ProdMfg ProdQOH ProdPrice date
P0036566 17 inch Color Monitor ColorMeg Inc. 12 null 2013-02-20
P0036577 19 inch Color Monitor ColorMeg Inc. 10 null 2013-02-20
Here is the product table structure
CREATE TABLE `product`(
`prodno` string,
`prodname` string,
`prodmfg` string,
`prodqoh` int,
`prodprice` string,
`prodnextshipdate` date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sales_db.db/product'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'last_modified_by'='maria_dev',
'last_modified_time'='1488149236',
'numFiles'='1',
'numRows'='11',
'rawDataSize'='516',
'totalSize'='650',
'transient_lastDdlTime'='1488149304')
Thanks
You are trying to insert text, e.g. $169.00, into a numeric field.
Hive handles this kind of mismatch by inserting NULL values.
Either change ProdPrice to string or remove the '$' symbol (and if other currencies can appear, save the currency symbol in an additional column):
insert overwrite table product
select val[0],val[1],val[2],val[3],val[4],val[5]
from (select split (col_value,',') as val from product_temp) t
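If you would rather keep ProdPrice numeric, a sketch like the following strips the symbol before casting. This is a hedged sketch, not tested against your tables: it assumes prodprice has been redefined as DECIMAL(10,2) and that '$' is the only non-numeric character in the value.

```sql
-- Sketch: assumes product.prodprice is DECIMAL(10,2)
insert overwrite table product
select val[0], val[1], val[2],
       cast(val[3] as int),
       cast(regexp_replace(val[4], '\\$', '') as decimal(10,2)),  -- drop the '$'
       cast(val[5] as date)
from (select split(col_value, ',') as val from product_temp) t;
```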
Demo
create table product_temp (col_value string);
insert into product_temp values
('P0036566,17 inch Color Monitor,ColorMeg Inc.,12,$169.00,2013-02-20')
,('P0036577,19 inch Color Monitor,ColorMeg Inc.,10,$319.00,2013-02-20')
,('P1114590,R3000 Color Laser Printer,Connex,5,$699.00,2013-01-22' )
;
select val[0] as ProdNo
,val[1] as ProdName
,val[2] as ProdMfg
,val[3] as ProdQOH
,val[4] as ProdPrice
,val[5] as ProdNextShipDate
from (select split (col_value,',') as val
from product_temp
) t
;
+----------+---------------------------+---------------+---------+-----------+------------------+
| prodno | prodname | prodmfg | prodqoh | prodprice | prodnextshipdate |
+----------+---------------------------+---------------+---------+-----------+------------------+
| P0036566 | 17 inch Color Monitor | ColorMeg Inc. | 12 | $169.00 | 2013-02-20 |
| P0036577 | 19 inch Color Monitor | ColorMeg Inc. | 10 | $319.00 | 2013-02-20 |
| P1114590 | R3000 Color Laser Printer | Connex | 5 | $699.00 | 2013-01-22 |
+----------+---------------------------+---------------+---------+-----------+------------------+
I need to reconcile the article1 (top) and article2 tables into a view displaying differences. But before that I need to drop all leading zeros from the column 'type' and create a new ID column equal to filenumber + type, so the resulting column can be used as an index. All columns share the same data type.
Columns needed:
ID
C0016
C0029
C00311
You can use the script below in SQL Server to get the format you want:
Reference: SO post on removing leading zeros
SELECT CONCAT(filenumber,type) AS filenumber, type, cost
FROM
(
SELECT
filenumber,
SUBSTRING(type, PATINDEX('%[^0]%',type),
LEN(type)- PATINDEX('%[^0]%',type)+ 1) AS type, cost
FROM
(
VALUES
('C001','00006',40),
('C002','00009',80),
('C003','00011',120)
) as t(filenumber,type, cost)
) AS t
Resultset
+------------+------+------+
| filenumber | type | cost |
+------------+------+------+
| C0016 | 6 | 40 |
| C0029 | 9 | 80 |
| C00311 | 11 | 120 |
+------------+------+------+
You can use try_convert() :
alter table table_name
add id as concat(filenumber, try_convert(int, type)) persisted -- physical storage
If you want a view :
create view view_name
as
select t.*, concat(filenumber, try_convert(int, type)) as id
from table t;
try_convert() will return NULL when the conversion fails.
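To see the expression in action on the sample data, here is a hedged demo using a VALUES row constructor instead of the real table:

```sql
-- TRY_CONVERT drops the leading zeros ('00006' -> 6), and CONCAT
-- stitches the result back onto filenumber
SELECT filenumber, type,
       CONCAT(filenumber, TRY_CONVERT(int, type)) AS id
FROM (VALUES ('C001','00006'),
             ('C002','00009'),
             ('C003','00011')) AS t(filenumber, type);
-- id comes out as C0016, C0029, C00311
```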
I'm a student working on a project, and I have a problem I cannot resolve. I have one table with rows imported from a CSV file, one fact table (GunViolences), and one dimension table (Categories - dead and injured). In the original table (imported from the CSV file) there is a description column containing long strings, which have substrings like dead or injured. Now I have to connect the fact table with the dimension table in some way, depending on the category of violence. I've created another table called ViolenceCategories to connect these tables by IDs, but I don't know how to fill it.
Structure:
Table FromCSV
id, date, description, address
1,12-01-2002, Shot|shotgun, address1
2,19-04-2003, injured, address2
3, 21-10-2004, shot|injured, address3
Table GunViolence
id, date, address
1, 12-01-2002, address1
2,19-04-2003, address2
3, 21-10-2004, address3
Table DimCategories
id, category
1, shot
2, injured
Table ViolenceCategories
idFact, idDim
1,1
2,2
3,2
3,1
How can I fill the table ViolenceCategories?
EDIT
I've created another table to separate the values of the description column
Table DimDescription
id, desc1, desc2
1, Shot, shotgun
2, injured, null
3, shot, injured
In SQL Server 2016 and newer you can use the table operator STRING_SPLIT() to achieve this. It splits a delimited string in a record into a new row for each substring. Consider:
SELECT *
FROM FromCSV
CROSS APPLY STRING_SPLIT(description, '|') AS descriptions
+----+------------+---------------------+----------+-------------+
| id | date | description | address | value |
+----+------------+---------------------+----------+-------------+
| 1 | 12-01-2002 | Shot|shotgun | address1 | Shot |
| 1 | 12-01-2002 | Shot|shotgun | address1 | shotgun |
| 2 | 19-04-2003 | three shots|injured | address2 | three shots |
| 2 | 19-04-2003 | three shots|injured | address2 | injured |
| 3 | 21-10-2004 | shot|injured | address3 | shot |
| 3 | 21-10-2004 | shot|injured | address3 | injured |
+----+------------+---------------------+----------+-------------+
SQLFiddle here
Turning this into an Insert statement to fill your ViolenceCategories table would look like:
INSERT INTO ViolenceCategories
SELECT t1.id, t2.id
FROM
(
SELECT id, value
FROM FromCSV
CROSS APPLY STRING_SPLIT(description, '|') AS descriptions
) t1
INNER JOIN DimCategories t2 ON t1.value = t2.category
SQLFiddle here
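On versions older than SQL Server 2016, STRING_SPLIT() is not available. A common workaround (sketched here as an alternative, not part of the original answer) converts the delimited string into XML and shreds it with nodes():

```sql
-- Pre-2016 sketch: split description on '|' via XML.
-- Caveat: breaks if description contains XML-special characters (&, <, >),
-- which would need escaping first.
SELECT f.id, x.n.value('.', 'varchar(100)') AS value
FROM FromCSV f
CROSS APPLY (
    SELECT CAST('<v>' + REPLACE(f.description, '|', '</v><v>') + '</v>' AS xml) AS doc
) d
CROSS APPLY d.doc.nodes('/v') AS x(n);
```

The same INNER JOIN against DimCategories shown above then works unchanged on this derived table.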
I'm using postgres 9.1 with tablefunc:crosstab
I have a table with the following structure:
CREATE TABLE marketdata.instrument_data
(
dt date NOT NULL,
instrument text NOT NULL,
field text NOT NULL,
value numeric,
CONSTRAINT instrument_data_pk PRIMARY KEY (dt , instrument , field )
)
This is populated by a script that fetches data daily. So it might look like so:
| dt | instrument | field | value |
|------------+-------------------+-----------+-------|
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1 |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT | 2 |
I then use the following crosstab query to pivot the table:
select dt, instrument, vol, oi
FROM crosstab($$
select dt, instrument, field, value
from marketdata.instrument_data
where field = 'PX_VOLUME' or field = 'OPEN_INT'
$$::text, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$::text
) vol(dt date, instrument text, vol numeric, oi numeric);
Running this I get the result:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | 2 |
The problem:
When running this with a lot of real data in the table, I noticed that for some fields the function was splitting the result over two rows:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | |
| 2014-05-23 | SGX.MiniJGB.2014U | | 2 |
I checked that the dt and instrument fields were identical and produced a work-around by grouping the output of the crosstab.
Analysis
I've discovered that it's the presence of one other entry in the input table that causes the output to be split over 2 rows. If I have the input as follows:
| dt | instrument | field | value |
|------------+-------------------+-----------+-------|
| 2014-04-23 | EUX.Bund.2014M | PX_VOLUME | 0 |
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1 |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT | 2 |
I get:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-04-23 | EUX.Bund.2014M | 0 | |
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | |
| 2014-05-23 | SGX.MiniJGB.2014U | | 2 |
Where it gets really weird...
If I recreate the above input table manually then the output is as we would expect, combined into a single row.
If I run:
update marketdata.instrument_data
set instrument = instrument
where instrument = 'EUX.Bund.2014M'
Then again, the output is as we would expect, which is surprising as all I've done is set the instrument field to itself.
So I can only conclude that there is some hidden character/encoding issue in that Bund entry that is breaking crosstab.
Are there any suggestions as to how I can determine what it is about that entry that breaks crosstab?
Edit:
I ran the following on the raw table to try and see any hidden characters:
select instrument, encode(instrument::bytea, 'escape')
from marketdata.bloomberg_future_data_temp
where instrument = 'EUX.Bund.2014M';
And got:
| instrument | encode |
|----------------+----------------|
| EUX.Bund.2014M | EUX.Bund.2014M |
Two problems.
1. ORDER BY is required.
The manual:
In practice the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered, that is, values with the same row_name are brought together and correctly ordered within the row.
With the one-parameter form of crosstab(), ORDER BY 1,2 would be necessary.
2. One column with distinct values per group.
The manual:
crosstab(text source_sql, text category_sql)
source_sql is a SQL statement that produces the source set of data.
...
This statement must return one row_name column, one category column,
and one value column. It may also have one or more "extra" columns.
The row_name column must be first. The category and value columns must
be the last two columns, in that order. Any columns between row_name
and category are treated as "extra". The "extra" columns are expected
to be the same for all rows with the same row_name value.
Bold emphasis mine. One column. It seems like you want to form groups over two columns, which does not work as you desire.
Related answer:
Pivot on Multiple Columns using Tablefunc
The solution depends on what you actually want to achieve. It's not in your question; you silently assumed the function would do what you hoped for.
Solution
I guess you want to group on both leading columns: (dt, instrument). You could play tricks with concatenating or arrays, but that would be slow and / or unreliable. I suggest a cleaner and faster approach with a window function rank() or dense_rank() to produce a single-column unique value per desired group. This is very cheap, because ordering rows is the main cost and the order of the frame is identical to the required order anyway. You can remove the added column in the outer query if desired:
SELECT dt, instrument, vol, oi
FROM crosstab(
$$SELECT dense_rank() OVER (ORDER BY dt, instrument) AS rnk
, dt, instrument, field, value
FROM marketdata.instrument_data
WHERE field IN ('PX_VOLUME', 'OPEN_INT')
ORDER BY 1$$
, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$
) vol(rnk int, dt date, instrument text, vol numeric, oi numeric);
More details:
PostgreSQL Crosstab Query
You could run a query that replaces irregular characters with an asterisk:
select regexp_replace(instrument, '[^a-zA-Z0-9]', '*', 'g')
from marketdata.instrument_data
where instrument = 'EUX.Bund.2014M'
Perhaps the instrument = instrument assignment discards trailing whitespace. That would also explain why where instrument = 'EUX.Bund.2014M' matches two values that crosstab sees as different.
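One way to test that theory (a hedged diagnostic, assuming only standard Postgres functions) is to compare character and byte lengths and check explicitly for trailing whitespace:

```sql
-- Diagnostic sketch: trailing spaces show up as has_trailing_space = true,
-- and multibyte/hidden characters show up as char_len <> byte_len
SELECT instrument,
       length(instrument)       AS char_len,
       octet_length(instrument) AS byte_len,
       instrument LIKE '% '     AS has_trailing_space
FROM marketdata.instrument_data
WHERE btrim(instrument) = 'EUX.Bund.2014M';
```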
I get data from an Excel sheet monthly and need to import it into an Access table exactly as it appears in the sheet. Then I need to transform that Access table (the input table) into a normalized Access table (the output table).
Kindly provide suggestions on:
How to create a dynamic input table in Access, since new columns can be added or removed in the Excel sheet (ex: 201211).
How to convert the input table into the output table in Access.
Input Table : ( Column name : Product | Location | 201209 | 201210 )
Product | Location | 201209 | 201210
X | DK | 10 | 12
y | DK | 10 | 12
Output table :
Product | Location | Date | Quantity
X | DK |201209 | 10
X | DK |201210 | 12
Y | DK |201209 | 10
Y | DK |201210 | 12
My input table contains more columns ( ex : 201208 , 201209, 201210 ....... 201402)
You could get your desired output from a query like this:
Select Product, Location, '201209' as [Date], Table.[201209] as Quantity from Table
UNION ALL
Select Product, Location, '201210' as [Date], Table.[201210] as Quantity from Table
You mentioned that your column names could change. You could get around this by creating a VBA subroutine that looks at the TableDef, constructs a SQL query covering all the columns, and then inserts the records into a table.
I have a database of houses. Each record in the houses MSSQL database has a field called AreaID. A house can be in multiple areas, so entries could look like this:
+---------+----------------------+-----------+-------------+-------+
| HouseID | AreaID | HouseType | Description | Title |
+---------+----------------------+-----------+-------------+-------+
| 21 | 17, 32, 53 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
| 23 | 23, 73 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
| 24 | 53, 12, 153, 72, 153 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
| 23 | 23, 53 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
If I open a page that calls for houses only in area 53, how would I search for them? I know that in MySQL you can use FIND_IN_SET, but I am using Microsoft SQL Server 2005.
If your formatting is EXACTLY
N1, N2 (i.e. one comma and a space between each number)
then use this WHERE clause:
WHERE ', ' + AreaID + ',' LIKE '%, 53,%'
The addition of the prefix and suffix makes every number, anywhere in the list, consistently wrapped by comma-space and suffixed by comma. Otherwise, you may get false positives with 53 appearing in part of another number.
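As a hedged sanity check of the wrapping trick (the VALUES row constructor below needs SQL Server 2008 or newer, so on 2005 you would test against a real table instead):

```sql
-- '53' matches only as a whole list element; rows containing only
-- '153', '530' or '1531' do not match
SELECT AreaID,
       CASE WHEN ', ' + AreaID + ',' LIKE '%, 53,%' THEN 1 ELSE 0 END AS has_53
FROM (VALUES ('17, 32, 53'),
             ('23, 73'),
             ('53, 12, 153, 72, 153'),
             ('1531, 530')) AS t(AreaID);
-- has_53: 1, 0, 1, 0
```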
Note
A LIKE expression will be anything but fast, since it will always scan the entire table.
You should consider normalizing the data into two tables:
Tables become
House
+---------+-----------+-------------+-------+
| HouseID | HouseType | Description | Title |
+---------+-----------+-------------+-------+
| 21      | B         | data        | data  |
| 23      | B         | data        | data  |
| 24      | B         | data        | data  |
| 23      | B         | data        | data  |
+---------+-----------+-------------+-------+
HouseArea
+---------+-------
| HouseID | AreaID
+---------+-------
| 21 | 17
| 21 | 32
| 21 | 53
| 23 | 23
| 23 | 73
..etc
Then you can use
select * from house h
where exists (
select *
from housearea a
where h.houseid=a.houseid and a.areaid=53)
Two options: change the IDs of AreaID so that you can use the & (bitwise AND) operator, or create a table that links Houses and Areas....
What datatype is AreaID?
If it's a text field you could do something like:
WHERE (
AreaID LIKE '53,%' -- Covers: multi number seq w/ 53 at beginning
OR AreaID LIKE '% 53,%' -- Covers: multi number seq w/ 53 in middle
OR AreaID LIKE '% 53' -- Covers: multi number seq w/ 53 at end
OR AreaID = '53' -- Covers: single number seq w/ only 53
)
Note: I haven't used SQL Server in some time, so I'm not sure about the operators. PostgreSQL has a regex function, which would be better at condensing that WHERE clause. Also, I'm not sure whether the above example would include numbers like 253 or 531; it shouldn't, but you still need to verify.
Furthermore, there are a bunch of functions that iterate through arrays, so storing the values as an array instead of text might be better. Finally, this might be a good case for a stored procedure, so you can call your homebrewed function instead of cluttering your SQL.
Use a Split function to convert comma-separated values into rows.
CREATE TABLE Areas (AreaID int PRIMARY KEY);
CREATE TABLE Houses (HouseID int PRIMARY KEY, AreaIDList varchar(max));
GO
INSERT INTO Areas VALUES (84);
INSERT INTO Areas VALUES (24);
INSERT INTO Areas VALUES (66);
INSERT INTO Houses VALUES (1, '84,24,66');
INSERT INTO Houses VALUES (2, '24');
GO
CREATE FUNCTION dbo.Split (@values varchar(512)) RETURNS table
AS
RETURN
WITH Items (Num, Start, [Stop]) AS (
SELECT 1, 1, CHARINDEX(',', @values)
UNION ALL
SELECT Num + 1, [Stop] + 1, CHARINDEX(',', @values, [Stop] + 1)
FROM Items
WHERE [Stop] > 0
)
SELECT Num, SUBSTRING(@values, Start,
CASE WHEN [Stop] > 0 THEN [Stop] - Start ELSE LEN(@values) END) Value
FROM Items;
GO
CREATE VIEW dbo.HouseAreas
AS
SELECT h.HouseID, s.Num HouseAreaNum,
CASE WHEN s.Value NOT LIKE '%[^0-9]%'
THEN CAST(s.Value AS int)
END AreaID
FROM Houses h
CROSS APPLY dbo.Split(h.AreaIDList) s
GO
SELECT DISTINCT h.HouseID, ha.AreaID
FROM Houses h
INNER JOIN HouseAreas ha ON ha.HouseID = h.HouseID
WHERE ha.AreaID = 24