I have a dataset in a CSV file. I created a table in Hive with two fields, id and userid, and retrieved my data with a SELECT query. It shows id, userid, and NULL. I want only id and userid to be displayed. Could anyone please help me solve this issue?
Sample data:
116 Justin
582 Ivan
.....
.....
Queries:
hive> create table hive_comments (id string, userid string) row format delimited fields terminated by ',' ;
hive> load data local inpath '/home/edureka/Documents/Project/dataDec-12-2015.csv' into table hive_comments;
hive> select * from hive_comments;
Result:
116 Justin NULL
582 Ivan NULL
.....
.....
How can I remove the NULL from this output? Please help me solve this issue.
Many thanks.
Please provide a single row of your input data; we can only guess otherwise.
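That said, the sample rows above look tab-separated, while the table is declared with fields terminated by ','. If that guess is right, each whole line lands in the first column and the second comes back NULL, which matches the output shown. A minimal sketch of the fix, assuming tab-delimited input:
hive> create table hive_comments (id string, userid string) row format delimited fields terminated by '\t';
hive> load data local inpath '/home/edureka/Documents/Project/dataDec-12-2015.csv' into table hive_comments;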
I have a user table users containing id, name, and information of type jsonb.
User table:

id   | name  | information
-----|-------|--------------------------------------------
1001 | Alice | {"1":"Google","2":"1991-02-08"}
1002 | Bob   | {"1":"StackOverflow","3":"www.google.com"}
I have another table, ProfileFields, holding all the profile field values:

profilefieldid | value
---------------|------------
1              | Company
2              | DateOfBirth
3              | ProfileLink
The information jsonb column can only contain keys that exist in the ProfileFields table. The data comes from a real-world system, so the set of profile fields will change over time.
I would like to export this table in the following format:
id   | name  | Company       | DateOfBirth | ProfileLink
-----|-------|---------------|-------------|---------------
1001 | Alice | Google        | 1991-02-08  |
1002 | Bob   | StackOverflow |             | www.google.com
My trials:
I was able to map each profilefieldid to its respective value:
SELECT
  id,
  name,
  (SELECT STRING_AGG(CONCAT((SELECT "title" FROM "profile_fields" WHERE CAST("key" AS INTEGER) = "id"), ':', REPLACE("value", '"', '')), ',')
   FROM JSONB_EACH_TEXT("profile_fields")) AS "information"
FROM "users"
ORDER BY "id";
I tried to use jsonb_to_record(), but since the profile fields can have dynamic keys, I was not able to come up with a solution: in the AS clause I would need to specify the columns in advance.
I sometimes get errors from the subquery in the SELECT list, such as "subquery must return only one column".
Any suggestions and solutions are greatly appreciated and welcome.
Let me know if I need to improve my DB structure, e.g. if it is not in second normal form or otherwise not well designed. Thank you.
There is no way you can make this dynamic. A fundamental restriction of the SQL language is that the number, names, and data types of all columns of a query must be known before the database starts retrieving data.
What you can do though is to create a stored procedure that generates a view with the needed columns:
create or replace procedure create_user_info_view()
as
$$
declare
  l_columns text;
begin
  -- build one "u.information ->> '<key>' as <field name>" expression per profile field
  select string_agg(concat('u.information ->> ', quote_literal(profilefieldid), ' as ', quote_ident(value)), ', ')
    into l_columns
  from profile_fields;

  execute 'drop view if exists users_view cascade';
  execute 'create view users_view as select u.id, u.name, '||l_columns||' from users u';
end;
$$
language plpgsql;
After the procedure is executed, you can run select * from users_view and see all profile keys as columns.
If you want, you can create a trigger on the table profile_fields that re-creates the view each time the table is changed.
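A minimal sketch of such a trigger, assuming the procedure above (the function and trigger names here are made up):
create or replace function refresh_user_info_view()
returns trigger
as
$$
begin
  -- rebuild the view so it reflects the current profile fields
  call create_user_info_view();
  return null; -- AFTER statement-level trigger: the return value is ignored
end;
$$
language plpgsql;

create trigger profile_fields_changed
after insert or update or delete on profile_fields
for each statement
execute function refresh_user_info_view();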
I have a table like this.
create table help(
id number primary key,
number_s integer NOT NULL);
I had to set the value 0 for ids 1 through 915. I solved that one in a simple way:
update help set number_s=0 where id<=915;
That one was easy.
Now I have to set numbers (which change for every row) from id 916 to the last row.
I was doing
update help set number_s=51 where id=916;
update help set number_s=3 where id=917;
There are more than 1,000 rows to be updated. How can I do this quickly?
When I have had this kind of problem before, I used a sequence to auto-increment a value such as id, for example:
insert into help(id,number_s) values (id_sequence.nextval,16);
insert into help(id,number_s) values (id_sequence.nextval,48);
and so on. But that cannot be used here, because the ids start from 916, not 1. How can I do this quickly? I hope the problem is clear.
Since you have your ids and numbers in a file with a simple structure, it's a fairly small number of rows, and assuming this is something you're going to do once, honestly what I would do is pull the file into Excel, use the text functions to build the 1,000 UPDATE statements, and paste them wherever needed.
If those assumptions are incorrect, you could (1) use sqlldr to load the file into a temporary table and (2) run an update on your help table based on the rows in that temporary table.
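A sketch of step (2), assuming the file was loaded into a staging table tmp_numbers with the same id and number_s columns (the table name is made up):
update help h
set h.number_s = (select t.number_s from tmp_numbers t where t.id = h.id)
where exists (select 1 from tmp_numbers t where t.id = h.id);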
As mentioned in previous answers, and given your comment that the file is stored on your system, you can use an external table / SQL*Loader to achieve this.
Here is a demo:
-- Create an external table pointing to your file
CREATE TABLE EXT_SEQUENCES (
  ID NUMBER,
  NUMBER_S NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "<directory name>"
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    BADFILE 'bad_file.txt'
    LOGFILE 'log_file.txt'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' MISSING FIELD VALUES ARE NULL
  )
  LOCATION ( '<file name>' )
) REJECT LIMIT UNLIMITED;
-- Now update your help table
MERGE INTO help H
USING EXT_SEQUENCES E
ON (H.ID = E.ID)
WHEN MATCHED THEN
  UPDATE SET H.NUMBER_S = E.NUMBER_S;
Note: you need to change the access parameters of the external table according to the actual data in your file.
Hope this gives you the right direction.
Cheers!
I want to create a table in Hive using a SELECT statement that takes a subset of the data from another table. I used the following query to do so:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked at the HDFS location of this table, there were no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example, I am trying to do something like:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This does not work, though. I know the alternative is to create the table structure with the field names and FIELDS TERMINATED BY '|', and then load the data.
But is there any other way to combine the two into a single query, so that I can create a table with filtered data from another table and also with a field separator?
Put row format delimited ... in front of AS select. Do it like this (adapt the query to your own):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *, count(1) from t1 group by id, name;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
Here is the result:
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT clause (in fact, all the settings related to the new table) goes before the SELECT statement.
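Applied to the query from the question, a sketch would be:
create table sample_db.out_table
row format delimited fields terminated by '|'
as
select * from sample_db.in_table where country = 'Canada';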
I have a dataset (txt file) with 10 columns, of which the last column contains string data separated by tabs, for example -> abcdef lkjhj pqrst...wxyz
I created a new table defining column 10 as STRING, but after loading the data into this table and verifying it, only abcdef is populated in the last column and the rest is ignored.
Can someone please help me load the entire string into the Hive table? Do I need to write a UDF?
Thanks in advance.
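If the file is tab-delimited and the tenth field itself contains tabs, FIELDS TERMINATED BY '\t' cuts that field at its first tab. A minimal sketch of one possible fix, assuming that layout (table and column names are made up), using Hive's built-in RegexSerDe to capture everything after the ninth tab as the last column:
CREATE TABLE raw_data (
  c1 STRING, c2 STRING, c3 STRING, c4 STRING, c5 STRING,
  c6 STRING, c7 STRING, c8 STRING, c9 STRING, c10 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- nine tab-delimited fields, then everything that remains (tabs included) as c10
  "input.regex" = "([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t(.*)"
);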
My input data is as follows:
1,srinivas,courtthomas,memphis
2,vindhya,courtthomas,memphis
3,srinivas,courtthomas,kolkata
4,vindhya,courtthomas,memphis
And I have created the following queries:
create EXTERNAL table seesaw (id int, name string, location string)
partitioned by (address string)
row format delimited fields terminated by ','
lines terminated by '\n'
stored as textfile
LOCATION '/seesaw';
LOAD DATA INPATH '/sampledoc' OVERWRITE INTO TABLE seesaw PARTITION (address = 'Memphis');
When I fetch the data, it comes back as follows:
Select * from seesaw;
OK
1 srinivas courtthomas Memphis
2 vindhya courtthomas Memphis
3 srinivas courtthomas Memphis
4 vindhya courtthomas Memphis
I really don't understand why all the rows show Memphis at the end.
Read your code closely:
create EXTERNAL table seesaw (id int,name string,location string)
Notice that there are only three columns, id, name and location.
Your data, however,
1,srinivas,courtthomas,memphis
2,vindhya,courtthomas,memphis
3,srinivas,courtthomas,kolkata
4,vindhya,courtthomas,memphis
has four columns. Something's fishy here.
LOAD DATA INPATH '/sampledoc'
OVERWRITE INTO TABLE seesaw
PARTITION (address = 'Memphis');
you're telling Hive to put everything you load into the single partition address = 'Memphis'. The city values in the file are ignored (the table only declares three columns), and the constant partition value is stamped onto every row, so the result is, to little surprise, not what you want.
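One way to load such a file with per-row partition values (not mentioned in the original answer; the staging-table name and settings below are assumptions) is dynamic partitioning via a staging table:
create table seesaw_staging (id int, name string, location string, address string)
row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

load data inpath '/sampledoc' into table seesaw_staging;

-- allow Hive to derive partition values from the last SELECT column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table seesaw partition (address)
select id, name, location, address from seesaw_staging;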
If you are using an external table, you need to manually create a folder for each partition, i.e. in your case create the two folders address=Memphis and address=kolkata, copy the corresponding input data files into the matching folder (a sketch of those shell commands follows the statements below), and then add the partitions to the metadata as follows:
ALTER TABLE seesaw ADD PARTITION(address='Memphis');
ALTER TABLE seesaw ADD PARTITION(address='kolkata');
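The folder and copy steps might look like this (the file names are made up; the paths assume the LOCATION '/seesaw' from the question):
hadoop fs -mkdir -p /seesaw/address=Memphis
hadoop fs -mkdir -p /seesaw/address=kolkata
hadoop fs -put memphis_rows.csv /seesaw/address=Memphis/
hadoop fs -put kolkata_rows.csv /seesaw/address=kolkata/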
Refer to this article for a simple example of how to do this: hive-external-table-with-partitions