How to parse an Aerospike backup file to regenerate data?

In the backup file there are a lot of encoded values. How do I get back the original data?
For example there is
+ d q+LsiGs1gD9duJDbzQSXytajtCY=
which is of the format ["+"] [SP] ["d"] [SP] [{digest}] [LF], where q+LsiGs1gD9duJDbzQSXytajtCY= is the key digest. How would I get the primary key from this?
Also, Map and List values are represented as opaque byte values. How do we restore the original Map and List?
I would currently need to do all this if I wanted to make a CSV dump out of the backup.
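For illustration, a minimal sketch of decoding that digest line in Python (the digest value is the one quoted above): the base64 string decodes to a 20-byte RIPEMD-160 digest computed from the set name and the user key, so it is a one-way hash and the original primary key cannot be derived from it.

    import base64

    # The third field of the "+ d" line is the record's key digest, base64-encoded.
    digest = base64.b64decode("q+LsiGs1gD9duJDbzQSXytajtCY=")
    print(len(digest))      # 20 bytes
    print(digest.hex())     # raw RIPEMD-160 digest of (set name, user key)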

asbackup is an open-source tool, as is asrestore. The backup file format is described in the aerospike/aerospike-tools-backup repo on GitHub.
Alternatively, you could use the Kafka connector to move data from Aerospike to another database via Kafka.
The easiest way to do what you're looking for is still to write a program that scans the target namespace and parses each record into a CSV format. You can use predicate filtering to get only the records whose last-update-time is greater than a specific timestamp, giving you the progressive backup you want. See the PredExp class of the Java client and its examples.
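For reference, a rough sketch of that scan-to-CSV approach using the Aerospike Python client (the host, namespace 'test', set 'demo', and CSV layout are placeholders, not anything from the original answer). A scan returns Map and List bins as native Python dicts and lists, and the user key is present in the key tuple only if it was stored with the record (send_key); otherwise only the digest is available, which is exactly what the backup file contains.

    import csv
    import json

    import aerospike

    # Placeholder connection settings.
    client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

    with open('dump.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['namespace', 'set', 'pk_or_digest', 'bins_json'])

        def to_csv(record):
            (namespace, setname, pk, digest), meta, bins = record
            writer.writerow([
                namespace,
                setname,
                pk if pk is not None else digest.hex(),  # digest is a bytearray
                json.dumps(bins, default=str),           # maps/lists serialize naturally
            ])

        # A full scan of the set; server-side predicate filtering on
        # last-update-time (PredExp in the Java client, aerospike.predexp in
        # Python) could narrow this to recently updated records.
        client.scan('test', 'demo').foreach(to_csv)

    client.close()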

Architectural design clarification

I built an API in Node.js + Express that allows React clients to upload CSV files (maximum size at most 1 GB) to the server.
I also wrote another API which, given the filename and an array of row numbers as input, selects the rows corresponding to those row numbers from the previously stored file and writes them to another result file (writeStream).
Then the resultant file is piped back to the client (all via streaming).
Currently, as you can see, I am using files (basically Node.js read and write streams) to manage this asynchronously.
But I have faced serious latency (only 2 cores are used) and a memory leak (900 MB consumption) when I have 15 requests, each asking for about 600 rows from files of approximately 150 MB.
I also have planned an alternate design.
Basically, I will store the entire file as an SQL table with the row number as the primary indexed key.
I will convert the user-supplied array of row numbers to another table using SQL unnest and then join both tables to get the rows needed.
Then I will supply back the resultant table as a csv file to the client.
Would this architecture be better than the previous architecture?
Any suggestions from devs are highly appreciated.
Thanks.
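For illustration, a minimal sketch of the unnest-and-join lookup proposed above, using PostgreSQL via psycopg2 (the csv_rows table and its file_id/row_num columns are hypothetical names): the requested row numbers are passed as an array parameter and joined against the indexed row_num column, and the matching rows are written out as CSV.

    import csv
    import psycopg2

    def rows_to_csv(conn, file_id, row_numbers, out_path):
        sql = """
            SELECT c.*
            FROM csv_rows AS c
            JOIN unnest(%s::int[]) AS wanted(row_num)
              ON c.row_num = wanted.row_num
            WHERE c.file_id = %s
            ORDER BY c.row_num
        """
        with conn.cursor() as cur, open(out_path, "w", newline="") as f:
            cur.execute(sql, (row_numbers, file_id))  # psycopg2 adapts the list to an int[]
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header from column names
            writer.writerows(cur)                                 # write every matched row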
Use the client to do the heavy lifting: use the XLSX package for any manipulation of the content, then have an API that only saves information about the transaction. This removes the upload to and download from the server and helps you provide a better experience.

Updating Parquet datasets where the schema changes over time

I have a single Parquet file that I have been incrementally building every day for several months. The file size is around 1.1 GB now, and when read into memory it approaches my PC's memory limit. So, I would like to split it up into several files based on the year and month combination (i.e. Data_YYYYMM.parquet.snappy) that will all be in a directory.
My current process reads in the daily CSV that I need to append, reads in the historical Parquet file with pyarrow and converts it to pandas, concatenates the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])) and then writes back to a single Parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I am overwriting the file each time.
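For illustration, a minimal sketch of the monthly split described above, assuming the concatenated frame has a DatetimeIndex (the monthly_parquet directory name is a placeholder):

    import pandas as pd

    # df = pd.concat([df_daily_csv, df_historical_parquet]) as in the current process.
    # Write one snappy-compressed Parquet file per year/month of the index.
    for (year, month), chunk in df.groupby([df.index.year, df.index.month]):
        chunk.to_parquet(f"monthly_parquet/Data_{year}{month:02d}.parquet.snappy",
                         compression="snappy")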
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data over multiple months. I have already tried this and gotten errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.Dataset. From the docs it looks like it takes a pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter, but got an error message (sorry, I forget the error right now and can't connect to my workstation to reproduce it; I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure if my use case fits what that covers, and the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case, and I wonder if I am missing something or if Parquet isn't meant for schema changes over time like I'm experiencing. I've considered inspecting the schema of the new file, comparing it with the historical schema, and then, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset, but I'm really hoping to avoid that.
So my questions are:
Will using a pyarrow Parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data in multiple Parquet files even if the schema is different? To be specific, my expectation is that the new column would be appended and the values prior to when this column was available would be null. If so, how do you do this?
If the answer to 1 is no, is there another method or library for handling this?
Some resources I've been going through (a rough pyarrow sketch follows these links):
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset
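A minimal sketch, assuming pyarrow >= 1.0 and a directory of the monthly files described above (the monthly_parquet path is a placeholder), of reading files whose schemas drifted using the, at the time experimental, pyarrow.dataset API: the per-file schemas are merged, and columns missing from older files come back as nulls.

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Collect each file's physical schema and merge them; new columns are appended.
    monthly = ds.dataset("monthly_parquet/", format="parquet")
    unified = pa.unify_schemas([frag.physical_schema for frag in monthly.get_fragments()])

    # Re-open the dataset against the unified schema and read everything;
    # months written before a column existed get nulls for that column.
    table = ds.dataset("monthly_parquet/", format="parquet", schema=unified).to_table()
    df = table.to_pandas()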

Save large BigQuery results to another project's BigQuery

I need to run a join query on BigQuery in one project that may return a large amount of data (which may not fit in the VM's memory), and then save the results in the BigQuery of another project.
Is there an easy way to do this without loading the data into the VM, given that the data size can vary and the VM may not have enough memory to hold it?
One method is to bypass the VM for the operation and use Google Cloud Storage instead.
The process looks like the following (a Python sketch follows these steps):
Create a GS bucket that both projects have access to
Source project - Export the table to the GS bucket (this is possible from the web interface; pretty sure the CLI tools can do it too)
Destination project - Create a new table from the files in the GS bucket
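For reference, a rough sketch of that export-and-load route with the google-cloud-bigquery Python client (the bucket, project, dataset, and table names are placeholders): the source project extracts the already-materialized result table to the shared bucket, the destination project loads the files into a new table, and nothing passes through the VM's memory.

    from google.cloud import bigquery

    SHARED_URI = "gs://shared-bucket/export/results-*.avro"

    # Source project: export the result table to the shared bucket.
    src = bigquery.Client(project="source-project")
    extract_cfg = bigquery.ExtractJobConfig(destination_format=bigquery.DestinationFormat.AVRO)
    src.extract_table(
        bigquery.TableReference.from_string("source-project.analytics.query_results"),
        SHARED_URI,
        job_config=extract_cfg,
    ).result()

    # Destination project: load the exported files into a new table.
    dst = bigquery.Client(project="dest-project")
    load_cfg = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
    dst.load_table_from_uri(
        SHARED_URI,
        bigquery.TableReference.from_string("dest-project.warehouse.query_results"),
        job_config=load_cfg,
    ).result()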
To save the result of a query to a table in any project, you do not need to save it to the VM first; you just need to set the destination property correctly, and of course you need write permissions on the dataset that contains that table.
The destination property varies depending on the client tool you use.
For example, if you are using the REST API's jobs.insert, you should set the properties below (a Python sketch follows them):
configuration.query.destinationTable (nested object) [Optional]: Describes the table where the query results should be stored. If not present, a new table will be created to store the results. This property must be set for large results that exceed the maximum response size.
configuration.query.destinationTable.datasetId (string) [Required]: The ID of the dataset containing this table.
configuration.query.destinationTable.projectId (string) [Required]: The ID of the project containing this table.
configuration.query.destinationTable.tableId (string) [Required]: The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.
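A minimal sketch, using the google-cloud-bigquery Python client rather than raw jobs.insert (the SQL and the project/dataset/table names are placeholders), of pointing a query's destination at a table in another project: the join runs entirely inside BigQuery and the result lands directly in the other project's dataset, so nothing is loaded into the VM.

    from google.cloud import bigquery

    client = bigquery.Client(project="source-project")  # project that runs the query

    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(
            "dest-project.dest_dataset.join_results"     # table in the other project
        ),
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    sql = """
        SELECT a.*, b.extra
        FROM `source-project.source_dataset.table_a` AS a
        JOIN `source-project.source_dataset.table_b` AS b USING (id)
    """
    client.query(sql, job_config=job_config).result()    # wait; results stay in BigQuery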

Hive SerDe - One Record per File for External Tables

How can we specify one record per file when creating external tables?
The data I have is of this format: one row per file, and the format of the row is
compressed_ bytebuffer(jackson.write(java pojo))
So how do we specify that Hive should treat the content of the file as one record and pass it to my SerDe?
I took a look into the code of the JSON SerDe, and it looks like if I can get the entire blob into my SerDe's serialize/deserialize methods, then I just have to uncompress it and the rest of the JSON SerDe code will work fine for my case. Any suggestions/ideas on whether this approach will work?
Not to say you can't do this, but you're going against the grain of Hadoop. Don't think of things as a 1-file-to-1-input. Instead, load all of your input with some sort of record delimiter (normally a \n). Then let Hadoop drive.
As for the SerDe, Hive will read the record based on the delimiter of the source data. This means that the blob will be (should be) your compressed JSON. So start by extending the JSON SerDe: first uncompress the blob, then just hand it to the JSON SerDe's super implementation.
Again, it feels like you're going against the system architecture. Let Hive manage the compression for you: load the data uncompressed and let the subsystem manage compression by setting things like Snappy or LZO. This gives you options like block compression or recompression.

Customizing output from a database and formatting it

Say you have an average-looking database, and you want to generate a variety of text files (each with its own specific formatting, so the files may have rudimentary tables and spacing). You'd be taking the data from the database, transforming it into a specified format (while applying some basic logic) and saving it as a text file (you could store it in XML as an intermediate step).
So if you had to create 10 of these unique files, what would be the ideal approach? I suppose you could create a class for each type of transformation, but then you'd need quite a few classes, and what if you needed to create 10 more of these files a year down the road?
What do you think is a good approach to this problem, one that keeps the output files customizable without creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other (a rough sketch follows the list):
a) A query processor which can run a query on a given database and output results in a well-known xml format.
b) An XSL stylesheet which can interpret the well-known xml format in (a) and transform it to the desired format.
c) An XML-to-Text transformer which can read the files in (a) and (b) and put out the result.
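A minimal sketch of those three pieces in Python (the report.xsl stylesheet, the orders query, and the SQLite database are hypothetical): (a) run a query and emit a well-known XML format, then (b)/(c) apply an XSL stylesheet with lxml and write the resulting text file.

    import sqlite3
    from lxml import etree

    def query_to_xml(conn, sql):
        # (a) Run the query and wrap the rows in a simple, well-known XML format.
        cur = conn.execute(sql)
        root = etree.Element("resultset")
        cols = [d[0] for d in cur.description]
        for row in cur:
            rec = etree.SubElement(root, "row")
            for name, value in zip(cols, row):
                etree.SubElement(rec, name).text = "" if value is None else str(value)
        return etree.ElementTree(root)

    def xml_to_text(xml_tree, xsl_path, out_path):
        # (b)/(c) Interpret the XML with an XSL stylesheet (xsl:output method="text")
        # and write the formatted result to a plain text file.
        transform = etree.XSLT(etree.parse(xsl_path))
        with open(out_path, "w") as f:
            f.write(str(transform(xml_tree)))

    conn = sqlite3.connect("app.db")
    xml_to_text(query_to_xml(conn, "SELECT id, name, total FROM orders"),
                "report.xsl", "report.txt")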