Hive SerDe - One Record per File for External Tables

How can we specify one record per file when creating external tables?
The data I have is in this format: one row per file, and the format of the row is
compressed_bytebuffer(jackson.write(java pojo))
So how do we specify that Hive should treat the content of the file as one record and pass it to my SerDe?
I took a look into the code of the JsonSerDe, and it looks like if I can get the entire blob into my SerDe's serialize/deserialize methods, then I just have to uncompress it and the rest of the JsonSerDe code will work fine for my case. Any suggestions/ideas on whether this approach will work?

Not to say you can't do this, but you're going against the grain of Hadoop. Don't think of it as one file per input record. Instead, load all of your input with some sort of record delimiter (normally a \n), then let Hadoop drive.
As for the SerDe, Hive will read each record based on the delimiter of the source data. This means the blob handed to you will be (should be) your compressed JSON. So start by extending the JSON SerDe: first uncompress the blob, then hand the result to the JSON SerDe's super implementation.
Again, it feels like you're going against the system architecture. Let Hive manage the compression for you: load the data uncompressed and let the subsystem handle compression with a codec such as Snappy or LZO. This gives you options like block compression or recompression.

Related

Architectural design clarification

I built an API in Node.js + Express that allows React clients to upload CSV files (at most 1 GB each) to the server.
I also wrote another API which, given a filename and an array of row numbers as input, selects the corresponding rows from the previously stored file and writes them to a result file (via a write stream).
The resulting file is then piped back to the client (all via streaming).
Currently, as you can see, I am using files (basically Node.js read and write streams) to manage this asynchronously.
But I have faced serious latency (only 2 cores are used) and a memory leak (900 MB consumption) when I have 15 requests, each retrieving about 600 rows from files of approximately 150 MB.
I have also planned an alternative design.
Basically, I will store the entire file as a SQL table with the row number as the primary (indexed) key.
I will convert the user-supplied array of row numbers into another table using SQL unnest and then join the two tables to get the rows needed (see the sketch below).
Then I will send the resulting table back to the client as a CSV file.
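The lookup I have in mind would be roughly like this (sketched in Python with psycopg2 against PostgreSQL just to illustrate the unnest-and-join idea; the table, column, and connection names are placeholders):

import psycopg2

def fetch_rows(conn, row_numbers):
    # unnest() turns the array parameter into a one-column table, which is
    # joined against the stored CSV rows on the indexed row number.
    sql = """
        SELECT r.row_no, r.line
        FROM csv_rows AS r
        JOIN unnest(%s::int[]) AS wanted(row_no) USING (row_no)
        ORDER BY r.row_no
    """
    with conn.cursor() as cur:
        cur.execute(sql, (row_numbers,))
        return cur.fetchall()

conn = psycopg2.connect("dbname=uploads")  # placeholder connection string
rows = fetch_rows(conn, [3, 17, 42])
conn.close()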
Would this architecture be better than the previous one?
Any suggestions from devs are highly appreciated.
Thanks.
Use the client to do the heavy lifting: use the XLSX package for any manipulation of the content, then have an API that only saves information about the transaction. This removes the upload to and download from the server and helps you provide a better experience.

Updating Parquet datasets where the schema changes over time

I have a single parquet file that I have been incrementally building every day for several months. The file size is around 1.1 GB now, and when read into memory it approaches my PC's memory limit. So I would like to split it up into several files based on the year and month combination (i.e. Data_YYYYMM.parquet.snappy) that will all live in one directory.
My current process reads in the daily CSV that I need to append, reads in the historical parquet file with pyarrow and converts it to pandas, concats the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])), and then writes back to a single parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I overwrite the file each time.
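In code, the current process is roughly the following (paths and file names here are only illustrative):

import pandas as pd
import pyarrow.parquet as pq

# Read today's CSV and the full historical parquet file.
df_daily_csv = pd.read_csv("daily/2021-01-15.csv")
df_historical_parquet = pq.read_table("historical.parquet.snappy").to_pandas()

# pandas handles the differing schemas: a new column is added and
# older rows simply get NaN/None for it.
df_combined = pd.concat([df_daily_csv, df_historical_parquet])

# Overwrite the single historical file.
df_combined.to_parquet("historical.parquet.snappy", compression="snappy")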
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data that spans multiple months. I have tried this already and gotten errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.ParquetDataset. From the docs it looks like it takes a pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter, but got an error message (sorry, I forget the error right now and can't connect to my workstation to reproduce it - I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure if my use case fits what that covers, and the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case, and I wonder if I am missing something or if parquet isn't meant for schema changes over time like I'm experiencing. I've considered inspecting the schema of each new file, comparing it against the historical schema, and, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset, but I'm really hoping to avoid that.
So my questions are:
1. Will using a pyarrow parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data across multiple parquet files even if the schemas differ? To be specific, my expectation is that the new column would be appended and the values from before this column was available would be null. If so, how do I do this?
2. If the answer to 1 is no, is there another method or library for handling this?
Some resources I've been going through:
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset

How to parse an Aerospike backup file to regenerate data?

In the backup file there are a lot of encoded values. How do I get back the original data?
For example, there is
+ d q+LsiGs1gD9duJDbzQSXytajtCY=
which is of the format ["+"] [SP] ["d"] [SP] [{digest}] [LF], where q+LsiGs1gD9duJDbzQSXytajtCY= is the key digest. How would I get the primary key from this?
Also, Map and List values are represented as opaque byte values. How do we restore the original Map and List?
I would currently need to do all of this if I wanted to make a CSV dump of the backup.
asbackup is an open-source tool, as is asrestore. The file format is described in the aerospike/aerospike-tools-backup repo on GitHub.
Alternatively, you could use the Kafka connector to move data from Aerospike to another database via Kafka.
The easiest way to do what you're looking for is still to write a program that scans the target namespace and parses each record into CSV format. You can use predicate filtering to get only the records whose last-update-time is greater than a specific timestamp, giving you the progressive backup you want. See the PredExp class of the Java client and its examples.
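A rough sketch of that approach using the Python client (host, namespace, set, and bin names are placeholders; the Java client works the same way):

import csv
import aerospike

# Connect to the cluster (host is a placeholder).
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

with open("dump.csv", "w", newline="") as out:
    writer = csv.writer(out)

    def write_record(record_tuple):
        key, meta, bins = record_tuple
        # key is (namespace, set, primary_key, digest); the primary key is only
        # present if records were written with the send_key policy enabled.
        writer.writerow([key[2], bins.get("name"), bins.get("age")])  # placeholder bins

    # Scan the target namespace/set and parse each record into CSV.
    # A last-update-time predicate filter can be attached via the scan policy
    # to fetch only recently updated records (see the client docs).
    scan = client.scan("test", "demo")
    scan.foreach(write_record)

client.close()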

How to preserve row order in compressed Google Cloud Storage files

We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows, and we export it to Google Cloud Storage, which produces several compressed files.
These files are downloaded, and a Python application of ours loops through them to make some calculations using the correlations.
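For context, the consuming side looks roughly like this (the file-name pattern and the calculation step are only illustrative):

import csv
import glob
import gzip

# Loop over the compressed export shards downloaded from GCS.
correlations = {}
for path in sorted(glob.glob("exports/correlations-*.csv.gz")):
    with gzip.open(path, "rt", newline="") as shard:
        for sku_a, sku_b, value in csv.reader(shard):
            # Illustrative calculation step: index the correlations by SKU pair.
            correlations.setdefault(sku_a, {})[sku_b] = float(value)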
We then tried to take advantage of the fact that the first SKU column is already ordered, so we would not have to apply this ordering inside our application.
But we found that the files we get from GCS change the order in which the SKUs appear.
It looks like the files are created by several processes reading the results and saving them to different files, which breaks the ordering we wanted to maintain.
As an example, if we have 2 files created, the first file would look something like this:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.023
sku_1,sku_3,0.736
This is an example of two processes reading the results, each saving its current row to its own file, which changes the ordering of the rows.
So we looked for a flag we could use to force the ordering to be preserved, but haven't found one so far.
Is there a way to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance.
As far as I know, there is no flag to maintain order.
As a workaround, you can rethink your data output to use a NESTED type: make sure that what you want to group together is converted into nested rows, and then export to JSON.
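A rough sketch of that idea with the BigQuery Python client (project, dataset, table, column, and bucket names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
nested_table = bigquery.TableReference.from_string("my-project.my_dataset.sku_correlations_nested")

# Group each SKU's correlations into a single nested row.
sql = """
SELECT
  sku_0,
  ARRAY_AGG(STRUCT(sku_1, correlation) ORDER BY sku_1) AS correlations
FROM `my-project.my_dataset.sku_correlations`
GROUP BY sku_0
"""
job_config = bigquery.QueryJobConfig(
    destination=nested_table,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()

# Export as compressed newline-delimited JSON: each line now carries one SKU
# together with all of its correlations, so ordering across files no longer matters.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    nested_table,
    "gs://my-bucket/correlations-*.json.gz",
    job_config=extract_config,
).result()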
is there some workaround?
As an option, you can move your processing logic from Python into BigQuery, eliminating the need to move data out of BigQuery to GCS.
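For example, something along these lines (table and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Run the correlation query and iterate the results directly in the application,
# with no GCS export or download step; the heavier per-SKU calculations could
# also be pushed into the SQL itself.
query = """
SELECT sku_0, sku_1, correlation
FROM `my-project.my_dataset.sku_correlations`
ORDER BY sku_0, sku_1
"""
# Each row exposes the selected columns as attributes.
results = [(row.sku_0, row.sku_1, row.correlation) for row in client.query(query).result()]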

Customizing output from a database and formatting it

Say you have an average-looking database, and you want to generate a variety of text files, each with its own specific formatting (so the files may contain rudimentary tables and spacing). You'd be taking the data from the database, transforming it into a specified format (while applying some basic logic), and saving it as a text file; you could store it in XML as an intermediate step.
So if you had to create 10 of these unique files, what would be the ideal approach? I suppose you could create a class for each type of transformation, but then you'd need quite a few classes, and what if you needed to create another 10 files a year down the road?
What do you think is a good approach to this problem, one that keeps the output files customizable without creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other:
a) A query processor that runs a query against a given database and outputs the results in a well-known XML format.
b) An XSL stylesheet that interprets the well-known XML format from (a) and transforms it into the desired output format.
c) An XML-to-text transformer that reads the output of (a), applies the stylesheet from (b), and writes out the result (see the sketch below).
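As an illustration of piece (c), here is a minimal sketch in Python using lxml (the file names are placeholders, and pieces (a) and (b) are assumed to have produced them already; any XSLT processor would do):

from lxml import etree

def transform_to_text(xml_path: str, xsl_path: str, out_path: str) -> None:
    # Load the query output from (a) and the stylesheet from (b).
    doc = etree.parse(xml_path)
    transform = etree.XSLT(etree.parse(xsl_path))
    # Apply the stylesheet; with <xsl:output method="text"/> the result is plain text.
    result = transform(doc)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(str(result))

transform_to_text("report_data.xml", "invoice_report.xsl", "invoice_report.txt")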