Loading null values to Hive table - hive

I have a .txt file that has the following rows:
Steve,1 1 1 1 1 5 10 20 10 10 10 10
When I created an external table, loaded the data, and ran select *, I got null values. How can I show the number values instead of null? I very much appreciate the help!
create external table Teller(Name string, Bill array<int>)
row format delimited
fields terminated by ','
collection items terminated by '\t'
stored as textfile
location '/user/training/hive/Teller';
load data local inpath '/home/training/hive/input/*.txt' overwrite into table Teller;
output:
Steve [null]

It seems the integers are separated by spaces and not tabs, so the collection items delimiter should be a space. To reproduce with your data:
hdfs dfs -mkdir -p /user/training/hive/Teller
echo Steve,1 1 1 1 1 5 10 20 10 10 10 10 | hdfs dfs -put - /user/training/hive/Teller/data.txt
hive> create external table Teller(Name string, Bill array<int>)
> row format delimited
> fields terminated by ','
> collection items terminated by ' '
> stored as textfile
> location '/user/training/hive/Teller';
OK
Time taken: 0.417 seconds
hive> select * from teller;
OK
Steve [1,1,1,1,1,5,10,20,10,10,10,10]
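As a quick sanity check, you can index into the parsed array and take its length (a sketch in the same session; size() is Hive's array-length function):
hive> select Name, Bill[0], size(Bill) from Teller;
OK
Steve 1 12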

Related

How to extract characters from a string stored as json data and place them in dynamic number of columns in SQL Server

I have a column of string in SQL Server that stores JSON data with all the braces and colons included.
My problem is to extract all the key/value pairs and store them in separate columns, with the key as the column header. What makes this challenging is that every record has a different number of key/value pairs.
For example, in the image below showing 3 records, the first record has 5 key/value pairs: EndUseCommunityMarket of 2, EndUseProvincialMarket of 0, and so on. The second record has 1 key/value pair, and the third record has two key/value pairs.
If I have to show how I want this in Excel, it would be like:
I have seen some SQL code examples that do something similar, but for a fixed number of columns; here the number varies for every record.
I need a SQL statement that can achieve this, as I am working with thousands of records.
Below is this data copied from sql server:
catch_ext
{"NfdsFadMonitoring":{"EndUseEaten":1}}
{"NfdsFadMonitoring":{"EndUseCommunityMarket":3}}
{"NfdsFadMonitoring":{"SpeciesComment":"","EndUseCommunityMarket":2}}
{"NfdsFadMonitoring":{"SpeciesComment":"mix reef fis","EndUseEaten":31}}
{"NfdsFadMonitoring":{"SpeciesComment":"10 fish with a total of 18kg","EndUseCommunityMarket":0,"EndUseProvincialMarket":0,"EndUseUrbanMarket":8,"EndUseEaten":1,"EndUseGivenAway":1}}
{"NfdsFadMonitoring":{"SpeciesComment":"mix reef fis","EndUseEaten":18}}
I expect you don't want to dynamically create a table; instead, you probably want to create a property mapping table. Here is a quick overview of the design.
Object table -- this stores the base information about your object
============
ID -- unique id field for every object.
Name
Property types table -- this stores all the property types
====================
Property_Type_ID -- unique type id
Description -- describes property
Object2Property -- stores the values for each property
===============
ObjectID -- the object
Property_Type_ID -- the property type
Value -- the value.
Using a model like this lets your properties be as dynamic as you wish but you don't have to create columns dynamically -- something that is hard and error prone.
Using your specific example, the tables would look like this:
OBJECT
ID NAME
1 WAHOO
2 RED SNAPPER
3 KAWAKAWA
Property Types
ID DESC
1 EndUseCommunityMarket
2 EndUseProvincialMarket
3 EndUseUrbanMarket
4 EndUseEaten
5 EndUseGivenAway
6 Comment
Map
ObjID TypeID Value
1 1 2
1 2 0
1 3 0
1 4 0
1 5 0
2 2 50
3 3 8
3 5 1
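A minimal DDL sketch of that design (the table and column names mirror the outline above; the types are illustrative assumptions):
CREATE TABLE Object (
    ID INT PRIMARY KEY,                 -- unique id field for every object
    Name VARCHAR(100)
);
CREATE TABLE Property_Types (
    Property_Type_ID INT PRIMARY KEY,   -- unique type id
    Description VARCHAR(100)            -- describes the property
);
CREATE TABLE Object2Property (
    ObjectID INT REFERENCES Object(ID),                                 -- the object
    Property_Type_ID INT REFERENCES Property_Types(Property_Type_ID),  -- the property type
    Value VARCHAR(100)                  -- the value, stored as text
);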
A. ROWS
Dynamic columns are a lot like rows.
You could use OPENJSON (Transact-SQL)
DECLARE @json2 NVARCHAR(4000) = N'{"NfdsFadMonitoring":{"SpeciesComment":"10 fish with a total of 18kg","EndUseCommunityMarket":0,"EndUseProvincialMarket":0,"EndUseUrbanMarket":8,"EndUseEaten":1,"EndUseGivenAway":1}}';
SELECT [key], value
FROM OPENJSON(@json2,'lax $.NfdsFadMonitoring')
Output
key value
SpeciesComment 10 fish with a total of 18kg
EndUseCommunityMarket 0
EndUseProvincialMarket 0
EndUseUrbanMarket 8
EndUseEaten 1
EndUseGivenAway 1
Your inputs
CREATE TABLE ForEloga (Id int,Json nvarchar(max));
Insert into ForEloga Values
(1,'{"NfdsFadMonitoring":{"EndUseEaten":1}}'),
(2,'{"NfdsFadMonitoring":{"EndUseCommunityMarket":3}}'),
(3,'{"NfdsFadMonitoring":{"SpeciesComment":"","EndUseCommunityMarket":2}}'),
(4,'{"NfdsFadMonitoring":{"SpeciesComment":"mix reef fis","EndUseEaten":31}}'),
(5,'{"NfdsFadMonitoring":{"SpeciesComment":"10 fish with a total of 18kg","EndUseCommunityMarket":0,"EndUseProvincialMarket":0,"EndUseUrbanMarket":8,"EndUseEaten":1,"EndUseGivenAway":1}}'),
(6,'{"NfdsFadMonitoring":{"SpeciesComment":"mix reef fis","EndUseEaten":18}}');
SELECT Id, [key], value
FROM ForEloga CROSS APPLY OPENJSON(Json,'lax $.NfdsFadMonitoring')
Output
Id key value
1 EndUseEaten 1
2 EndUseCommunityMarket 3
3 SpeciesComment
3 EndUseCommunityMarket 2
4 SpeciesComment mix reef fis
4 EndUseEaten 31
5 SpeciesComment 10 fish with a total of 18kg
5 EndUseCommunityMarket 0
5 EndUseProvincialMarket 0
5 EndUseUrbanMarket 8
5 EndUseEaten 1
5 EndUseGivenAway 1
6 SpeciesComment mix reef fis
6 EndUseEaten 18
B. COLUMNS: CROSS APPLY WITH WITH
If you know all possible properties, then I recommend CROSS APPLY with WITH, as shown in Example 3 - Join rows with JSON data stored in table cells using CROSS APPLY in OPENJSON (Transact-SQL).
SELECT store.title, location.street, location.lat, location.lon
FROM store
CROSS APPLY OPENJSON(store.jsonCol, 'lax $.location')
WITH (
street varchar(500),
postcode varchar(500) '$.postcode',
lon int '$.geo.longitude',
lat int '$.geo.latitude'
) AS location
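Applied to the ForEloga table above, a sketch of that WITH mapping might look like this (it assumes you know all the keys in advance; in the default lax mode, keys missing from a row simply come back as NULL):
SELECT f.Id, m.SpeciesComment, m.EndUseCommunityMarket, m.EndUseProvincialMarket,
       m.EndUseUrbanMarket, m.EndUseEaten, m.EndUseGivenAway
FROM ForEloga f
CROSS APPLY OPENJSON(f.Json, 'lax $.NfdsFadMonitoring')
WITH (
    SpeciesComment varchar(500),       -- matched by name, so no explicit path needed
    EndUseCommunityMarket int,
    EndUseProvincialMarket int,
    EndUseUrbanMarket int,
    EndUseEaten int,
    EndUseGivenAway int
) AS m;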
Try this:
Table Schema:
CREATE TABLE #JsonValue(sp_name VARCHAR(100),catch_ext VARCHAR(1000))
INSERT INTO #JsonValue VALUES ('WAHOO','{"NfdsFadMonitoring":{"EndUseEaten":1}}')
INSERT INTO #JsonValue VALUES ('RUBY SNAPPER','{"NfdsFadMonitoring":{"EndUseCommunityMarket":3}}')
INSERT INTO #JsonValue VALUES ('KAWAKAWA','{"NfdsFadMonitoring":{"SpeciesComment":"","EndUseCommunityMarket":2}}')
INSERT INTO #JsonValue VALUES ('XXXXXXXX','{"NfdsFadMonitoring":{"SpeciesComment":"mix reef fis","EndUseEaten":31}}')
INSERT INTO #JsonValue VALUES ('YYYYYYYY','{"NfdsFadMonitoring":{"SpeciesComment":"10 fish with a total of 18kg","EndUseCommunityMarket":0,"EndUseProvincialMarket":0,"EndUseUrbanMarket":8,"EndUseEaten":1,"EndUseGivenAway":1}}')
INSERT INTO #JsonValue VALUES ('ZZZZZZZZZZ','{"NfdsFadMonitoring":{"SpeciesComment":"mix reef fis","EndUseEaten":18}}')
Query:
SELECT sp_name
,ISNULL(MAX(CASE WHEN [Key]='EndUseCommunityMarket' THEN Value END),'') AS EndUseCommunityMarket
,ISNULL(MAX(CASE WHEN [Key]='EndUseProvincialMarket' THEN Value END),'') AS EndUseProvincialMarket
,ISNULL(MAX(CASE WHEN [Key]='EndUseUrbanMarket' THEN Value END),'') AS EndUseUrbanMarket
,ISNULL(MAX(CASE WHEN [Key]='EndUseEaten' THEN Value END),'') AS EndUseEaten
,ISNULL(MAX(CASE WHEN [Key]='EndUseGivenAway' THEN Value END),'') AS EndUseGivenAway
FROM(
SELECT sp_name, [key], value
FROM #JsonValue CROSS APPLY OPENJSON(catch_ext,'$.NfdsFadMonitoring')
)D
GROUP BY sp_name
Output:
sp_name EndUseCommunityMarket EndUseProvincialMarket EndUseUrbanMarket EndUseEaten EndUseGivenAway
------------- --------------------- ---------------------- ----------------- ----------- ---------------
KAWAKAWA 2
RUBY SNAPPER 3
WAHOO 1
XXXXXXXX 31
YYYYYYYY 0 0 8 1 1
ZZZZZZZZZZ 18
Hope this will help you.

How to import data into SQL Server from an Excel file with duplicate column names

I need to import an Excel file into SQL Server across multiple tables.
I want to insert the columns PARTNUMBER and PART_DESCRIPTION
into a table called PartNumbers; that's OK, I can do it.
But I don't know how to do it dynamically, or how to insert the other columns ("SUB_ITEM1", "SUB_ITEM2", "SUB_ITEM3").
For example, I want to insert them into a table called Results,
with each row holding information like this:
SUB_ITEM PASS FAILED IdPartNumber
Test 1 0 1
Test2 0 1 1
Test3 0 0 1
Test 0 1 2
Test2 1 0 2
Test3 1 1 2
How can I do it? I'm using OPENROWSET to read the Excel file from SQL Server,
and if I just use a query like ("SELECT * FROM MyExcelSheet"), it gives me an error because
I have repeated columns ("PASS" and "FAILED").
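For the reshaping part, here is a sketch of what the Results insert could look like, assuming the sheet has first been loaded into a staging table with the duplicate columns renamed (all names here, such as #Sheet, PASS1/FAILED1, and the IdPartNumber join, are illustrative assumptions, not your actual schema):
INSERT INTO Results (SUB_ITEM, PASS, FAILED, IdPartNumber)
SELECT v.SUB_ITEM, v.PASS, v.FAILED, p.IdPartNumber
FROM #Sheet s
JOIN PartNumbers p ON p.PARTNUMBER = s.PARTNUMBER
CROSS APPLY (VALUES                    -- unpivot the three column groups into rows
    (s.SUB_ITEM1, s.PASS1, s.FAILED1),
    (s.SUB_ITEM2, s.PASS2, s.FAILED2),
    (s.SUB_ITEM3, s.PASS3, s.FAILED3)
) v (SUB_ITEM, PASS, FAILED);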

External Tables (HIVE) Choose only a few columns from a file

How can I create an external table using only a few columns from a file?
Ex: in the file I have six columns, A, B, C, D, E, F, but in my table I want only A, C, F.
Is it possible?
I do not know of a way to selectively include columns from HDFS files for an external table. Depending on your use case, it may be sufficient to define a view based on the external table to only include the columns you want. For example, given the following silly example of an external table:
hive> CREATE EXTERNAL TABLE ext_table (
> A STRING,
> B STRING,
> C STRING,
> D STRING,
> E STRING,
> F STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/tmp/ext_table';
OK
Time taken: 0.401 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B row_1_col_C row_1_col_D row_1_col_E row_1_col_F
row_2_col_A row_2_col_B row_2_col_C row_2_col_D row_2_col_E row_2_col_F
row_3_col_A row_3_col_B row_3_col_C row_3_col_D row_3_col_E row_3_col_F
Time taken: 0.222 seconds, Fetched: 3 row(s)
Then create a view to only include the columns you want:
hive> CREATE VIEW filtered_ext_table AS SELECT A, C, F FROM ext_table;
OK
Time taken: 0.749 seconds
hive> DESCRIBE filtered_ext_table;
OK
a string
c string
f string
Time taken: 0.266 seconds, Fetched: 3 row(s)
hive> SELECT * FROM filtered_ext_table;
OK
row_1_col_A row_1_col_C row_1_col_F
row_2_col_A row_2_col_C row_2_col_F
row_3_col_A row_3_col_C row_3_col_F
Time taken: 0.301 seconds, Fetched: 3 row(s)
Another way to achieve what you want requires the ability to modify the HDFS files backing your external table: if the columns you are interested in are all near the beginning of each line (or can be moved there), then you can define your external table to capture only the first 3 columns, without regard for how many more columns are actually in the file. For example, with the same data file as above:
hive> DROP TABLE IF EXISTS ext_table;
OK
Time taken: 1.438 seconds
hive> CREATE EXTERNAL TABLE ext_table (
> A STRING,
> B STRING,
> C STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/tmp/ext_table';
OK
Time taken: 0.734 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B row_1_col_C
row_2_col_A row_2_col_B row_2_col_C
row_3_col_A row_3_col_B row_3_col_C
Time taken: 0.727 seconds, Fetched: 3 row(s)
I found an answer here: Oracle external tables can do this, because fields listed in the access parameters but omitted from the table's column list (DUMMY_1 and DUMMY_2 below) are simply skipped.
create table tmpdc_ticket(
SERVICE_ID CHAR(144),
SERVICE_TYPE CHAR(50),
CUSTOMER_NAME CHAR(200),
TELEPHONE_NO CHAR(144),
ACCOUNT_NUMBER CHAR(144),
FAULT_STATUS CHAR(50),
BUSINESS_GROUP CHAR(100)
)
organization external(
type oracle_loader
default directory sample_directory
access parameters(
records delimited by newline
nologfile
skip 1
fields terminated by '|'
missing field values are null
(DUMMY_1,
DUMMY_2,
SERVICE_ID CHAR(144),
SERVICE_TYPE CHAR(50),
CUSTOMER_NAME CHAR(200),
TELEPHONE_NO CHAR(144),
ACCOUNT_NUMBER CHAR(144),
FAULT_STATUS CHAR(50),
BUSINESS_GROUP CHAR(100)
)
)
location(sample_directory:'sample_file.txt')
)
reject limit 1
noparallel
nomonitoring;

Extra spaces after loading from a text file with SQL*Loader

I'm using Oracle 11g, and I'm trying to load data from a text file with SQL*Loader.
Here is a sample of the data (there are much more columns):
123456789876543212,100,333,432,02/05/2014,02/05/2014,02/05/2014,1.1,AA
I want to load the data into the DB first as VARCHAR2, and then convert it to the correct datatypes in the DB with a query. That's much easier, in my opinion.
Here is my table (MyTable):
create table MyTable
(
A varchar2(500),
B varchar2(500),
C varchar2(500),
D varchar2(500),
E varchar2(500),
F varchar2(500),
G varchar2(500),
H varchar2(500),
I varchar2(500)
);
Here is my loading script:
load data
infile 'D:\MyFile.txt'
into table MyTable
fields terminated by ','
trailing nullcols
(
A char(4000),
B char(4000),
C char(4000),
D char(4000),
E char(4000),
F char(4000),
G char(4000),
H char(4000),
I char(4000)
)
Here is what the data looks like after being loaded into the DB.
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2,1 0 0,3 3 3,4 3 2,0 2 / 0 5 / 2 0 1 4,0 2 / 0 5 / 2 0 1 4,0 2 / 0 5 / 2 0 1 4, 1 . 1,A A
Why does my data look like this? What are these spaces? I don't have a lot of experience with data loading.
I'm guessing that the problem is the datatypes defined in the DB table and in the loading script. What is the right way to define such data? I want to load the data as-is into the DB; I'll do the conversion in the DB with a query. Please note that the first column has 18 digits.
The normal reason for "spaces" being inserted between every character after loading is that there is a nul (ASCII 0) after every character in your original text file. If you look at your file in a text editor in hexadecimal mode you should be able to see this (it'll be represented as 00). You can also look at your table using the DUMP() function.
Without extra parameters, DUMP() is a useful function that returns the data-type code of the data you pass it, the length of the data in bytes, and the internal representation of the expression. There are a few other options, which are explained in the documentation.
From the example below you'll see that the data-type code is 96, which represents a CHAR; the length is 1, i.e. the string is 1 byte long; and the internal representation is 97, which is the ASCII code for 'a'.
SQL> select dump('a')
2 from dual;
DUMP('A')
----------------
Typ=96 Len=1: 97
In your case you'd expect to see 0s in the internal representation for the nuls.
After you've double checked, I'd go back to your supplier and ask them to remove the characters, as you won't otherwise be able to tell whether they're actual nul characters or part of a multi-byte character. I've previously written about strategies for removing nuls from the database, should you be unable to get the file fixed.
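If you can't get the file fixed, here is a cleanup sketch against the MyTable columns from the question (CHR(0) is the nul character; REPLACE with no third argument just removes the matches). Note that if the file is actually UTF-16 encoded, which is a common cause of a nul after every character, declaring the character set in the control file (LOAD DATA CHARACTERSET UTF16 ...) may fix the load itself:
-- inspect a suspect value: nuls show up as 0s in the internal representation
select dump(A) from MyTable where rownum = 1;
-- strip the embedded nuls column by column
update MyTable set A = replace(A, chr(0)), B = replace(B, chr(0));
commit;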

hive count and count distinct not correct

I have a table in Hive that has 20 columns and I want to count unique records and all records per hour.
Table looks like:
CREATE EXTERNAL TABLE test1(
log_date string,
advertiser_creatives_id string,
cookieID string
)
STORED AS ORC
LOCATION "/day1orc"
tblproperties ("orc.compress"="ZLIB");
And my query like this:
SELECT Hour(log_date),
Count(DISTINCT cookieid) AS UNIQUE,
Count(1) AS impressions
FROM test1
GROUP BY Hour(log_date);
But the results are not correct. I have about 70 million entries, and when I do a sum of impressions I only get around 8 million, so I suspect the distinct takes too many columns into account.
So how can I fix this so that I get the correct amount of impressions?
Extra information:
hive.vectorized.execution.enabled is undefined, so it is not active.
The same query in TEXT format returns even fewer rows (about 2.7 million).
Result of COUNT(*): 70643229
Result of COUNT(cookieID): 70643229
Result of COUNT(DISTINCT cookieID): 1440195
Cheers
I have an example that may be useful for you. I think your "row format delimited fields terminated by" clause has some problems.
I have a text file, separated by "\t", like below:
id date value
1 01-01-2014 10
1 03-01-2014 05
1 07-01-2014 40
1 05-01-2014 20
2 05-01-2014 10
but I create a table with only 2 columns, like below:
use tmp ;
create table sw_test(id string,td string) row format delimited fields terminated by '\t' ;
LOAD DATA LOCAL INPATH '/home/hadoop/b.txt' INTO TABLE sw_test;
What do you think the result of "select td from sw_test;" will be?
NOT
td
01-01-2014 10
03-01-2014 05
07-01-2014 40
05-01-2014 20
05-01-2014 10
BUT
td
01-01-2014
03-01-2014
07-01-2014
05-01-2014
05-01-2014
So, I think your cookieID column contains some special characters, including your defined separator.
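A quick sketch to test that theory, using the column names from your question: rows where the delimiter split went wrong usually surface as NULL or empty values.
select count(*) as suspect_rows
from test1
where cookieID is null or cookieID = '';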
I hope this can help you .
Good luck!