Retrieving JSON raw file data from Hive table - hive

I have a JSON File. I want to move only selected fields to Hive table. So below is the statement I used to create a new table to import the data from JSON file to HIVE Table. While creating it doesn't give any error but when i use select * from JsonFile1 or count(*) from JsonFile1 I get error as Failed with exception java.io.IOException:java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
I have browsed over the internet stuck with this since few days. I can't find a solution. I checked in the HDFS. I see there is a table created and complete file imported as-is(not just the fields I selected but all of it). I just provided the sample data, the actual data contains like 50+ field names. creating all the column names is cumbersome. Is that what we need to do? Thank you in advance.
CREATE EXTERNAL TABLE JsonFile1(user STRUCT<id:BIGINT,description:STRING, followers_count:INT>)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION 'link/data';
I have data as below
{filter_level":"low",geo":null,"user":{"id":859264394,"description":"I don’t want it. Building #techteam, #LetsTalk!!! def#abc.com",
"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name"krogmi",
"screen_name":"jkrogmi","id_str":"859264394",}}06:20:16 +0000 2012","default_profile_image":false,"followers_count":88,
"profile_sidebar_fill_color":"DDFFCC","screen_name":"abc_abc"}}

Answering my own question.
I have deleted the data in hdfs which I was pointing in the LOCATION '...', copied data again from local to hdfs and recreated the table again and it worked.
I am assuming that data was the problem.

Related

Trying to create a table and load data into same table using Databricks and SQL

I Googled for a solution to create a table, using Databticks and Azure SQL Server, and load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and sub-folders, and load data from files with a string pattern, like 'ABC*', and load recursively all these files into a table. The blocker, here, is that I need the file name loaded into a field as well. So, I want to load data from MANY files, into 4 fields of actual data, and 1 field that captures the file name. The only way I can distinguish the different data sets is with the file name. Is this possible? Or, is this an exercise in futility?
my suggestion is to use the Azure SQL Spark library, as also mentioned in documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
The 'Bulk Copy' is what you want to use to have good performances. Just load your file into a DataFrame and bulk copy it to Azure SQL
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!

where can I find the avro.schema.url within the hive meta store?

I am trying to locate the property avro.schema.url that is part of the table meta data when a table is created by specifying the location to a avro schema file for some avro data in s3 or hdfs. I am able to see it in the output when I run the describe extended table command, but within the metastore database, where is this property stored? I searched the table_params for that particular table_id and did not find it ?
found it, its in SERDE_PARAMS table

Load data from Drill table into Hive Table

I have created a table using Drill and it is located at
/user/abc/drill/Drilltable.
Now I would like to load the data from DrillTable into HiveTable which is located at path
/user/hive/warehouse/userxyz.db
I am using below statement to load data
INSERT INTO TABLE HiveTable select * from DrillTable;
I get the error
Table not found
and I am bit confused how to let Hive know the path of Drill table.
What would be the right way to handle this?
Hive might be confused about the schema of the drill data as well as the location. If you're willing to experiment, try something like this:
Store the data in a Drill format you can model in Hive, CSV for example, as described in this post.
In Hive, create an external table that defines the schema and location of the textual data. You can then convert the external table to a managed table (optional). For example ....

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns - id, name, position,
then I stored the data into s3 using orc format using spark.
When I query select * from person it returns everything.
But when I query from presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I have found the answer for the problem, when I stored the data in s3, the data inside the file was with one more column that was not defined in the hive table metastore.
So when Presto tried to query the data, it found that there are varchar instead of integer.
This also might happen if one record has a a type different than what is defined in the metastore.
I had to delete my data and import it again without that extra unneeded column

Hive Extended table

When we create using
Create external table employee (name string,salary float) row format delimited fields terminated by ',' location /emp
In /emp directory there are 2 emp files.
so when we run select * from employee, it get the data from both the file ad display.
What will be happen when there will be others file also having different kind of record which column is not matching with the employee table , so it will try to load all the files when we run "select * from employee"?
1.Can we specify the specific file name which we want to load?
2.Can we create other table also with the same location?
Thanks
Prashant
It will load all the files in emp directory even it doesn’t match with table.
for your first question. you can use Regex serde.if your data matches to regex.then it loads to the table.
regex for access log in hive serde
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
other options:I am pointing some links.these links has some ways.
when creating an external table in hive can I point the location to specific files in a direcotry?
https://issues.apache.org/jira/browse/HIVE-951
for your second question: yes we can create other tables also with the same location.
Here are your answers
1. If the data in the file dosent match with table format, hive doesnt throw an error. It tries to read the data as best as it could. If data for some columns are missing it will put NULL for them.
No we cannot specify the file name for any table to read data. Hive will consider all the files under the table directory.
Yes, we can create other tables with the same location.