How can I use a SerDe to build generic file ingestion into Hive?

I need to build generic file ingestion into Hive. The files are very large (2 GB+) and can be fixed-width or comma-separated, ASCII or EBCDIC. After trying various techniques using Talend, I am looking into SerDes. If I ingest the files as-is and use a schema file (containing ordinal position, column name, type, and length), can I create a custom SerDe to deserialize any input file into Hive rows? How performant would it be?

Since asking this question, I found that I could use a custom COBOL SerDe.
I am also looking at the regex SerDe for positional (fixed-width) files.
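As a sketch of that regex SerDe approach for a fixed-width file (the table name, column widths and HDFS location below are assumptions, not taken from the question):

-- one regex capture group per column; widths of 10, 30 and 12 characters are assumed
CREATE EXTERNAL TABLE fixed_width_landing (
  customer_id STRING,
  customer_name STRING,
  balance STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{10})(.{30})(.{12})")
STORED AS TEXTFILE
LOCATION '/data/landing/fixed_width';

Note that this SerDe expects every column to be declared as STRING, so any typing from the schema file would have to be applied later, for example in a view or an INSERT ... SELECT into a typed table.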

Related

Parquet file with more than one schema

I am used to Parquet files with a single schema. I came across a file which seemingly has more than one schema. I used pandas to convert it to a CSV file. The result is something like this:
table-1,table-2,table-3
0, {data for table-1} {data for table-2} {data for table-3}
I read about the Parquet file format and it looks like a single Parquet file has a single schema.
Does Parquet support more than one schema in a single file?
No, the Parquet format only supports a single schema per file. This schema is written into the footer of the file and accounts for all sections of the file. You could probably reread the CSV file into pandas and save that as a Parquet file, but ultimately you will be better off saving each table as a separate file. The latter should also be much more performant and space-efficient.

How can I load data into Snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. CSV, Parquet or JSON) into Snowflake by creating an external stage with file format type CSV and then loading it into a table with one column of type VARIANT. But this needs a manual step to cast the data into the correct types in order to create a view which can be used for analysis.
Is there a way to automate this loading process from S3 so that the table column data types are either inferred from the CSV file or specified elsewhere by some other means? (Similar to how a table can be created in Google BigQuery from CSV files in GCS with an inferred table schema.)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieve your goal, which, as I understand from your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know, through FILE_FORMAT, the structure of the file it is going to load data from.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
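If declaring the types yourself is acceptable, a minimal sketch of specifying them upfront instead of using VARIANT looks like this (the stage, file format, table and bucket names are assumptions, and the stage is assumed to already have access to the bucket, e.g. via a storage integration):

-- file format and stage describing the CSV files in S3
CREATE FILE FORMAT my_csv_format TYPE = CSV SKIP_HEADER = 1;
CREATE STAGE my_s3_stage URL = 's3://my-bucket/path/' FILE_FORMAT = my_csv_format;

-- target table with explicit column types instead of a single VARIANT column
CREATE TABLE my_table (id NUMBER, name STRING, created_at TIMESTAMP);

-- COPY parses the staged files using the stage's file format into the typed columns
COPY INTO my_table FROM @my_s3_stage;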

Hive ORC File Format

When we create an ORC table in Hive, we can see that the data is compressed and not exactly readable in HDFS. So how is Hive able to convert that compressed data into the readable format that is shown to us when we run a simple select * query against that table?
Thanks for suggestions!!
By using the ORC SerDe while creating the table. You have to provide the package name of the SerDe class, e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'.
What a SerDe does is deserialize data in a particular format into objects that Hive can process, and then serialize those objects to store the data back in HDFS.
Hive uses a "SerDe" (Serializer/Deserializer) to do that. When you create a table you mention the file format; in your case it's ORC ("STORED AS ORC"). Hive uses the ORC library (jar file) internally to convert the data into a readable format. To know more about Hive internals, search for "Hive SerDe" and you will see how the data is converted to objects and vice versa.
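For reference, a minimal sketch of how that plays out in DDL (hypothetical table name):

-- STORED AS ORC tells Hive to use the OrcSerde and the ORC input/output formats internally
CREATE TABLE orders_orc (order_id BIGINT, amount DOUBLE)
STORED AS ORC;

-- at query time Hive deserializes the compressed ORC data back into readable rows
SELECT * FROM orders_orc;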

ROW FORMAT Serde in hive

I am using Hadoop 2.0.4 and working on Twitter sentiment analysis. I have used Flume to ingest the data, but now the Twitter data must be stored in a Hive table.
I have created a table, but ROW FORMAT SERDE is giving the error
'Unable to validate'
Kindly tell me how to proceed.
Are you using a custom SerDe?
Please refer to the information below from the Hive Language Manual:
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
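For example, a table relying on the native SerDe via ROW FORMAT DELIMITED might look like this (the table name and delimiter are just illustrations):

-- the native LazySimpleSerDe handles parsing here; no SERDE clause is needed
CREATE TABLE tweets_delimited (id BIGINT, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;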
Hope the information is useful.
You can try adding this jar: hive-serdes-1.0-SNAPSHOT.jar.
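A minimal sketch of registering it in the Hive session before creating the table (the path is an assumption; use wherever the jar actually lives):

-- register the SerDe jar for the current session (path is hypothetical)
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;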
After adding the jar, you can create an external Hive table containing the tweet_id and the tweet_text, which refers to the tweets directory, for performing sentiment analysis, like this:
create external table load_tweets(id BIGINT, text STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
You can refer to the link below for performing sentiment analysis using Hive.
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
Check whether you have added hive-serdes-1.0-SNAPSHOT.jar under the lib folder in your Hive directory. Your Hive directory path will be the one you have mentioned in your .bashrc file.

How to create format files using bcp from flat files

I want to use a format file to help import a comma-delimited file using bulk insert. I want to know how you generate format files from a flat file source. The Microsoft guidance on this subject makes it seem as though you can only generate a format file from a SQL table. But I want it to look at a text file and tell me what the delimiters are in that file.
Surely this is possible.
Thanks
The format file can, and usually does, include more than just delimiters. It also frequently includes column data types, which is why it can only be automatically generated from the table or view the data is being retrieved from.
If you need to find the delimiters in a flat file, there are a number of ways to write a script that could accomplish that, as well as generate a format file.