Keeping track of Datalake schemas - data-lake

I have a general question about keeping track of schemas in a data lake. Across our various logs, some fields exist in every log, while other fields differ by log type. My team has agreed to only add fields and never delete existing ones.
We first bring all the logs into AWS S3 in JSON format and then transform them into Parquet, and this is where the schema becomes important. For the fields that exist in every log, such as id or date, we enforce the original data types. The fields that differ by log type are converted into a JSON string and saved as a single column.
In this case, are there any tools that can be used to find out the exact schema of the data? AWS Glue doesn't seem to offer a way to catalog this kind of data.
Otherwise, please feel free to suggest an appropriate way of keeping track of schema evolution. Thanks very much in advance!
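To make the question concrete, here is a minimal sketch of one way the schema hidden inside the JSON string column could be tracked: scan the strings and record which types have been observed for each field. The function names and sample records are hypothetical, and the sketch assumes every string decodes to a JSON object; it is not a feature of AWS Glue or of any particular tool.

```python
import json
from collections import defaultdict

def type_name(value):
    """Map a decoded JSON value to a coarse type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def infer_schema(json_strings):
    """Return {field name: sorted list of observed types} across all records."""
    observed = defaultdict(set)
    for raw in json_strings:
        record = json.loads(raw)  # assumes each string holds a JSON object
        for field, value in record.items():
            observed[field].add(type_name(value))
    return {field: sorted(types) for field, types in observed.items()}

# Hypothetical values of the catch-all JSON column for one log type
extra_column = [
    '{"page": "/home", "duration_ms": 120}',
    '{"page": "/cart", "duration_ms": 87.5, "coupon": null}',
]
schema = infer_schema(extra_column)
```

Storing the result per log type and per day would show when a field first appeared or when its type changed.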

Related

How to save millions of files in S3 so that arbitrary future searches on key/path values are fast

My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is: given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more commonplace than the others?
Also, out of curiosity, why doesn't S3 support such a basic use case, given that it is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key name strings well at scale?
S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have, then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression, but PostgreSQL, MySQL, Aurora, and Elasticsearch all can. So start by listing every file and putting the file name and S3 location into a database-like structure. Then create a Lambda that is notified of any changes (see this link for more info) and does the appropriate thing with your backing store when a file is added or deleted.
Depending on your needs, Elasticsearch is very flexible with queries and is possibly better suited to this type of search. But a traditional relational database can be made to work too.
Lastly, you'll need an interface for querying the backing store. That will likely require some sort of server. It could be as simple as an API Gateway endpoint in front of a Lambda, or something far more complex.
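As a rough sketch of the idea, the snippet below keeps key names in a table and searches them by regular expression. It uses an in-memory SQLite database purely as a stand-in for PostgreSQL, MySQL, or Aurora; the table name, handler names, and sample keys are all hypothetical, and in a real setup the two handlers would be driven by the S3 change notifications.

```python
import re
import sqlite3

def regexp(pattern, value):
    """SQLite evaluates `value REGEXP pattern` by calling regexp(pattern, value)."""
    return re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute(
    "CREATE TABLE s3_keys (bucket TEXT, key TEXT, PRIMARY KEY (bucket, key))"
)

def on_object_created(bucket, key):
    """What the notification handler would do when a file is added."""
    conn.execute("INSERT OR REPLACE INTO s3_keys VALUES (?, ?)", (bucket, key))

def on_object_removed(bucket, key):
    """What the notification handler would do when a file is deleted."""
    conn.execute("DELETE FROM s3_keys WHERE bucket = ? AND key = ?", (bucket, key))

def search(pattern):
    """Return all keys matching the regular expression, sorted by key."""
    rows = conn.execute(
        "SELECT key FROM s3_keys WHERE key REGEXP ? ORDER BY key", (pattern,)
    )
    return [row[0] for row in rows]

# Initial load: in practice this would come from listing the bucket once
for k in ["logs/2020/01/app-report.csv", "images/report-final.png",
          "logs/2020/02/app-errors.csv"]:
    on_object_created("my-bucket", k)

csv_reports = search(r"report.*\.csv$")
on_object_removed("my-bucket", "images/report-final.png")
remaining_reports = search(r"report")
```

A regex search like this still scans every row, so it is fast only compared with listing the bucket; for heavier workloads a text index would be the next step.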
You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.
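A script to parse the inventory file could look roughly like the sketch below. It assumes a headerless CSV with the bucket name in the first column and a URL-encoded object key in the second; the actual columns depend on how the inventory is configured, so treat the layout, the function name, and the sample rows as assumptions to check against your own report.

```python
import csv
import io
import re
from urllib.parse import unquote

def matching_keys(inventory_csv_text, pattern):
    """Return decoded object keys from an inventory CSV that match the regex."""
    regex = re.compile(pattern)
    reader = csv.reader(io.StringIO(inventory_csv_text))
    matches = []
    for row in reader:
        key = unquote(row[1])  # assumed layout: bucket, key, then other fields
        if regex.search(key):
            matches.append(key)
    return matches

# Hypothetical inventory rows: bucket, URL-encoded key, size
sample = (
    '"my-bucket","logs/2020/report%201.csv","1024"\n'
    '"my-bucket","images/cat.png","2048"\n'
)
found = matching_keys(sample, r"report")
```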

What is the best way to structure this database?

So I am in the process of building a database from my client's data. Each month they create roughly 25 CSVs, each unique in its topic and attributes, but they all have one thing in common: a registration number.
The registration number is the only common variable across all of these CSVs.
My task is to move all of this into a database, for which I am leaning towards Postgres (if anyone believes NoSQL would be best for this, then please shout out!).
The big problem is structuring this within a database. Should I create one table per month that houses all the data, with column 1 being the registration and columns 2-200 being the attributes? Or should I put all the CSVs into Postgres as they are, and then join them later?
I'm struggling to get my head around how to structure this when there will be monthly updates to every registration, and we don't want to destroy historical data; we want to keep it for future benchmarks.
I hope this makes sense - I welcome all suggestions!
Thank you.
In some respects your question is too broad and asks for an opinion (SQL vs NoSQL).
However, the gist of the question is whether you should load your data one month at a time or into a well-developed data model. Definitely the latter.
My recommendation is the following.
First, design the data model around how the data needs to be stored in the database, rather than how it is being provided. There may be one table per CSV file. I would be a bit surprised, though. Data often wants to be restructured.
Second, design the archive framework for the CSV files.
You should archive all the incoming files in a nice directory structure with files from each month. This structure should be able to accommodate multiple uploads per month, either for all the files or some of them. Mistakes happen and you want to be sure the input data is available.
Third, COPY (this is the Postgres command) the data into staging tables. This is the beginning of the monthly process.
Fourth, process the data, including validation checks, to load it into your data model.
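The third and fourth steps can be sketched as follows. This is only an illustration under stated assumptions: SQLite stands in for Postgres (where the staging load would use COPY), and the table names, the `units` attribute, and the sample files are hypothetical. History is kept by keying the model table on registration plus month, so a corrected re-upload replaces only that month's rows.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_usage (registration TEXT, units TEXT, load_month TEXT);
    CREATE TABLE usage_history (
        registration TEXT NOT NULL,
        load_month   TEXT NOT NULL,
        units        INTEGER NOT NULL CHECK (units >= 0),
        PRIMARY KEY (registration, load_month)
    );
""")

def load_monthly_file(csv_text, month):
    """Stage one month's file, then move only the rows that pass validation.

    Returns the number of rejected rows."""
    conn.execute("DELETE FROM staging_usage")
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO staging_usage VALUES (?, ?, ?)",
        [(r["registration"], r["units"], month) for r in rows],
    )
    # Validation: registration present and units made up of digits only
    valid = "registration <> '' AND units <> '' AND units NOT GLOB '*[^0-9]*'"
    total = conn.execute("SELECT COUNT(*) FROM staging_usage").fetchone()[0]
    accepted = conn.execute(
        f"SELECT COUNT(*) FROM staging_usage WHERE {valid}"
    ).fetchone()[0]
    conn.execute(
        f"""INSERT OR REPLACE INTO usage_history (registration, load_month, units)
            SELECT registration, load_month, CAST(units AS INTEGER)
            FROM staging_usage WHERE {valid}"""
    )
    conn.commit()
    return total - accepted

jan = "registration,units\nAB12,10\nCD34,-5\n,7\nEF56,3\n"
feb = "registration,units\nAB12,12\n"
rejected_jan = load_monthly_file(jan, "2024-01")
rejected_feb = load_monthly_file(feb, "2024-02")
history = conn.execute(
    "SELECT load_month, units FROM usage_history "
    "WHERE registration = 'AB12' ORDER BY load_month"
).fetchall()
```

Whether rejected rows should block the whole load or just be reported is one of the decisions the process has to make.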
There may be tweaks to this process, based on questions such as:
Does the data need to be available 24/7 even during the upload process?
Does a validation failure in one part of the data prevent uploading any data?
Are SQL checks (referential integrity and CHECK constraints) sufficient for validating the data?
Do you need to be able to roll back the system to any particular update?
These are just questions that can guide your implementation. They are not intended to be answered here.

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV)
JOB 2 -- Use the transformed data stored in the delimited file by JOB 1 as the source, then transform it for BigQuery and send it.
I am using a delimited text/CSV file as intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to scale to millions of rows, what should I use as this intermediary source? Will a relational database help? Are delimited files good enough? Or is there anything else I can use?
PS: I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a secure cloud architecture.
Please share your views on this.
In data warehousing architecture, it's usually good practice to make the staging layer persistent. Among other things, this lets you trace data lineage back to the source, reload your final model from the staging point when business rules change, and get a full picture of the transformation steps the data went through, all the way from landing to reporting.
I'd also consider changing your design so that the staging layer persists under its own dataset in BigQuery, rather than just deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not for end-user reports, you will be paying only for storage for the most part.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whichever direction you choose, I recommend keeping the files you use to load the table rather than deleting them. Sooner or later there will be questions about, or failures in, your final report, and you'll likely need to trace back to the source to investigate.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially lets you reshape the data (for example, convert to arrays), filter out columns/rows, and perform the transformation in a single SQL statement.

Hadoop architecture for raw logs but also clicks and views

Not sure what architecture to use for the following data.
I'm looking at the following data formats and volumes:
raw API Apache logs that hold info in the query strings (~15 GB per day)
JSON clicks and views for ads - about 3 million entries per day.
This led me to look into setting up an HDFS cluster and using fluentd or Flume to load the Apache logs. This all looks good, but what I don't understand is when or how I could parse the Apache logs to extract info from the query strings and path. E.g., "/home/category1/?user=XXX&param1=YYY&param2=ZZZ" should be normalized to some info about the user "XXX" (that he visited "category1" while having the respective params). As I see it, my option here is to store the logs directly and then run a MapReduce job across the whole cluster to parse each log line and ... store the result back on HDFS. Isn't this a waste of resources, since the operation goes over the whole cluster each time? How about storing the results in HBase ...?
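For illustration, the per-line normalization described above could look like the sketch below, whichever framework ends up running it. It assumes the category is the last path segment and that each parameter appears once; the function name and the shape of the output record are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

def parse_request(request_path):
    """Turn a logged request path into a small record about the user."""
    parts = urlsplit(request_path)
    segments = [s for s in parts.path.split("/") if s]
    # parse_qs returns lists of values; keep the first value of each parameter
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return {
        "user": params.pop("user", None),
        "category": segments[-1] if segments else None,  # assumption: last segment
        "params": params,
    }

event = parse_request("/home/category1/?user=XXX&param1=YYY&param2=ZZZ")
```

The same function could be applied as a map step over the stored raw logs, or at ingestion time before the lines are written.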
Then there's the data that's JSON describing clicks and views for some ads. That should be stored in the same place and queried.
Query situations:
what a certain user has visited over the past day
all users with "param1" for the past X hours
There are so many tools available and I'm not really sure which might be of help, maybe you can help describe some in layman's terms.
Despite the storage usage, one significant advantage of storing the logs in their original (or almost original) format is that it lets you handle future requirements. You won't be locked into a rigid schema that was decided in a specific context. This approach is also known as the Schema on Read strategy. You can find many articles on this topic. Here is one:
[https://www.techopedia.com/definition/30153/schema-on-read]
Now, regarding the JSON manipulation, I would suggest having a look at Spark, because it provides very convenient mechanisms for that. In a few lines of code, you can load your JSON files into a DataFrame: the schema will automatically be inferred from the data. This DataFrame can then be registered as a temporary view and queried directly using SQL. Much easier than raw JSON manipulation.
// assumes an existing SparkSession named `spark` (as in spark-shell)
val df = spark.read.json("<your file>")
df.printSchema() // inspect the inferred schema
df.createOrReplaceTempView("mytable") // replaces the deprecated registerTempTable
val df2 = spark.sql("SELECT * FROM mytable")
Hope this helps!

incremental updates documentation is not clear enough

I have a database that needs to keep up with changes on Wikidata, and while I was looking for ways to do it, I found these three:
RSS
API Call
Socket.IO
I would like to know if there are other ways, and which one is the best or is recommended by Wikidata.
The answer depends on how up to date you need to keep your database.
As up to date as possible
If you need to keep your database as up to date with Wikidata as possible, then you will probably want to use a combination of the solutions that you have found.
Socket.IO will provide you with a stream of what has changed, but will not necessarily give you all of the information that you need.
(Note: there is an IRC stream that would allow you to do the same thing)
Based on the data provided by the stream you can then make calls to the Wikidata API retrieving the new data.
Of course, this could result in lots of API calls, so make sure you batch them, and don't retrieve each update immediately, in case lots of changes to the same item occur in a row.
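A minimal sketch of the batching step: collect the entity IDs reported by the stream for a while, drop duplicates, and build one `wbgetentities` request per chunk. The limit of 50 IDs per request is an assumption to verify against the API documentation for your account type, the function name is hypothetical, and the sketch only builds the URLs without sending them.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"
MAX_IDS_PER_REQUEST = 50  # assumed limit; check the API documentation

def batch_requests(changed_ids, batch_size=MAX_IDS_PER_REQUEST):
    """Build one wbgetentities URL per chunk of distinct entity IDs."""
    unique_ids = list(dict.fromkeys(changed_ids))  # dedupe, keeping first-seen order
    urls = []
    for start in range(0, len(unique_ids), batch_size):
        chunk = unique_ids[start:start + batch_size]
        query = urlencode(
            {"action": "wbgetentities", "ids": "|".join(chunk), "format": "json"}
        )
        urls.append(f"{API_URL}?{query}")
    return urls

# IDs seen on the change stream during one waiting period (hypothetical)
pending = ["Q42", "Q1", "Q42", "Q64", "Q1"]
urls = batch_requests(pending, batch_size=2)
```

Waiting before fetching is what makes the deduplication pay off: an item edited several times in a row is retrieved once.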
Daily or Weekly
As well as the three options you have listed above, you also have the database dumps!
https://www.wikidata.org/wiki/Wikidata:Database_download
The JSON & RDF dumps are generally recommended. The JSON dump contains the data exactly as it is stored. These dumps are made weekly.
The XML dumps are not guaranteed to contain JSON in the same format as the JSON dumps, because they use the internal serialization format. However, daily XML dumps are provided.