How to convert binary data under /var/lib/prometheus into readable JSON format - amazon-s3

Prometheus is used as a monitoring tool for scraping application metrics. We want to store these metrics in object storage in a readable JSON format, but the data is stored in binary format under /var/lib/prometheus on the Prometheus server. How can we do this?
I checked various options in the Prometheus official docs, but these are not useful since we want to store the data in JSON format in object storage like S3:
https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
I also checked Thanos, but it seems that only Prometheus can read from and write to Thanos.
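One possible approach, since none of the listed integrations write JSON to S3 directly, is to query the data back out through the Prometheus HTTP API (which already returns JSON) and upload the result to S3 yourself. Below is a minimal sketch of that idea in Python; the server URL, PromQL query, and bucket name are placeholders, not values from the original setup.

# Sketch: pull metrics from the Prometheus HTTP API as JSON and upload them to S3.
# The Prometheus URL, query, and bucket name below are placeholders.
import json
import time

import boto3
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder Prometheus server
QUERY = "up"                                     # placeholder PromQL query
BUCKET = "my-metrics-bucket"                     # placeholder S3 bucket

end = int(time.time())
start = end - 3600  # last hour

# /api/v1/query_range returns the samples already encoded as JSON
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "60s"},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()

# Write the JSON payload to S3
s3 = boto3.client("s3")
s3.put_object(
    Bucket=BUCKET,
    Key=f"prometheus/{QUERY}/{end}.json",
    Body=json.dumps(payload).encode("utf-8"),
    ContentType="application/json",
)

Run on a schedule (cron, Lambda, etc.), this would give you periodic JSON snapshots of the metrics in S3 without touching the binary TSDB files directly.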

Related

File conversion in AWS

I am trying to find the most efficient way to process files in AWS.
Read a JSON, XML, or CSV file from an S3 bucket
Map it to another type of JSON, XML, or CSV
Save it to an S3 bucket
Right now we are using Java with AWS Lambdas, but we end up writing lots of code.
AWS Glue looks good, but my experience with MS BizTalk is even better.
Is there any service that can help me with this?
There are many options available within AWS for reading data in one file format and writing it in another format to an S3 bucket. Below are some options:
A) AWS SDK for pandas (awswrangler, formerly AWS Data Wrangler), an open-source Python library from AWS ProServe. It provides several out-of-the-box connectors for reading and writing data from various sources and sinks, and you can run it from a Lambda function or any other server where the SDK can be installed. This option is suitable when the data volumes are low (see the sketch after this list).
B) AWS Glue, using either Spark or Python, which is a serverless data integration service. Glue Studio also provides a drag-and-drop option for generating data pipelines from many out-of-the-box transformations. You can control the processing window by choosing the desired number of Data Processing Units (DPUs), and Glue Workflows are available for orchestration.
C) Amazon EMR, a petabyte-scale AWS service for high-volume distributed data processing, machine learning, and interactive analytics using open-source frameworks like Apache Spark.
Which option to choose depends on the use case you are trying to solve and your requirements. Factors such as data volume, processing window, low-code/no-code options, and cost will help you decide which option to leverage.
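As a rough illustration of option A, here is a minimal sketch of a conversion using AWS SDK for pandas inside a Lambda handler. The bucket paths and the column mapping are placeholders, and the function would need the awswrangler package (or the managed layer) available to the Lambda.

# Sketch of option A: AWS SDK for pandas (awswrangler) in a Lambda handler.
# Reads a CSV from one S3 prefix and writes it back out as JSON Lines.
# Bucket paths and the transformation are placeholders.
import awswrangler as wr

def handler(event, context):
    # Read the source file into a pandas DataFrame
    df = wr.s3.read_csv("s3://source-bucket/input/data.csv")

    # Apply whatever mapping/transformation is needed here (placeholder)
    df = df.rename(columns=str.lower)

    # Write the transformed data back to S3 as JSON Lines
    wr.s3.to_json(
        df,
        "s3://target-bucket/output/data.json",
        orient="records",
        lines=True,
    )
    return {"rows": len(df)}

The same few lines work unchanged on any server where the SDK is installed, which is why this option is attractive for low-volume conversions compared to hand-written parsing code.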

Azure Sentinel referencing large sets of data

I've been trying to find the most effective (elegant) solution to achieve what I'm trying to do. I'd like to hear from the community, thank you.
Situation:
Need to geo-enrich IP address records in Sentinel. Example: successful SigninLogs, since the Microsoft enrichment sometimes generates "Unknown" results in the IP enrichment maps.
An external reference file (subnet, country_code, country_name) is available publicly; however, the size and number of records are rather large (~12 MB, 200K+ records).
Issue:
Tried using a storage account blob to host the "reference table", but apparently hit the limit on maximum blob size in the storage account.
It looks like there is a maximum of 30,000 records that Workbooks can read from external sources using the 'externaldata' command, so only partial reference data can be read and referred to.
Options considered:
Ingest the reference table into the Log Analytics workspace and do a join/lookup against this custom reference table for enrichment.
Export the IP addresses from the SigninLogs table to blob storage, enrich the IP addresses using Logic Apps, and then put them back into a 'reference' blob storage. Then read the 'reference' blob storage using the 'externaldata' syntax.
Limitation Observed:
Came to the realization that Sentinel can't perform API calls for enrichment from external data (correct me if I'm wrong). I've done similar stuff with Splunk, where we could enrich the data on the fly by making multiple API calls to an outside database.
Ingest the data - As you've mentioned, ingest the data and join the tables. You would need to ingest this regularly, though, to ensure you can look up the data within the desired time range (e.g. an Analytics Rule only looks up data over a 14-day period).
Use a Playbook - If you want the Geo-IP lookup post-incident, you can perform this with a Logic App.
Use Jupyter Notebooks - These have the flexibility to perform API calls against external locations and join the data with that hosted in Sentinel. An example is the IP Explorer Notebook; see "Use Jupyter notebooks to hunt for security threats".
Threat Intelligence - Microsoft enriches all imported threat intelligence indicators with GeoLocation and WhoIs data, which is displayed together with other indicator details.
Since March 2022, you can upload large CSV files into a Sentinel Watchlist. This way, you can upload a complete GeoIP database and perform ipv4_lookup queries against it. This blog post explains how to do this: https://cryptsus.com/blog/enrich-geolocation-sentinel-siem.html

Azure Data Factory - load Application Insights logs to Data Lake Gen 2

I have Application Insights configured with a retention period for logs of three months and I want to load them using Data Factory pipelines, scheduled daily, to a Data Lake Gen 2 storage.
The purpose of doing this is to not lose data after the retention period passes and to have the data stored for future purposes - Machine Learning and Reporting, mainly.
I am trying to decide which format to use for storing this data, from the many formats available in Data Lake Gen 2, so if anyone has a similar design, any information or reference to documentation would be greatly appreciated.
In my experience, most of the log files are .log files. If you want to keep the file type when moving them to Data Lake Gen 2, use the Binary format.
The Binary format lets you copy all the folders/sub-folders and all the files to the destination as they are.
HTH.

Send Bigquery Data to rest endpoint

I want to send data from BigQuery (about 500K rows) to a custom endpoint via the POST method. How can I do this?
These are my options:
A PHP process to read and send the data (I have already tried this, but it is too slow and hits the maximum execution time).
I was looking at Google Cloud Dataflow, but I don't know Java.
Running it in a Google Cloud Function, but I don't know how to send the data via POST.
Do you know another option?
As mentioned in the comments, 500K rows is far too much data for a POST method to be considered an option.
Dataflow is a product oriented toward pipeline development, intended to run several data transformations during its jobs. You can use BigQueryIO (with Python sample code), but if you just need to migrate the data to a certain machine/endpoint, creating a Dataflow job will add complexity to your task.
The suggested approach is to export to a GCS bucket and then download the data from it.
For instance, if the size of the data you are trying to retrieve is less than 1 GB, you can export to a GCS bucket from the command-line interface like: bq extract --compression GZIP 'mydataset.mytable' gs://example-bucket/myfile.csv. Otherwise, you will need to export the data to multiple files using a wildcard URI to define your bucket destination, as indicated ('gs://my-bucket/file-name-*.json').
Finally, using the gsutil command gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION], you can download the data from your bucket.
Note: there are more ways to do this, described in the Cloud documentation links provided, including the BigQuery web UI.
Also, bear in mind that there are no charges for exporting data from BigQuery, but you do incur charges for storing the exported data in Cloud Storage. BigQuery exports are subject to the limits on export jobs.
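For completeness, the same export can also be done programmatically with the BigQuery Python client instead of the bq CLI. A rough sketch follows; the project, dataset, table, and bucket names are placeholders.

# Sketch of the export using the BigQuery Python client instead of the bq CLI.
# Project, dataset, table, and bucket names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)

# Wildcard URI so exports larger than 1 GB are split across multiple files
destination_uri = "gs://example-bucket/file-name-*.json.gz"

extract_job = client.extract_table(
    "my-project.mydataset.mytable",
    destination_uri,
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish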

Pull data from HTTP request API to Google Cloud

I have an app that sends me data from an API. The data is semi-structured (JSON).
I would like to send this data to Google BigQuery in order to store all the information.
However, I'm not able to find out how to do it properly.
So far I have used Node.js on my own server to get the data using POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts.
You can also write the data to Pub/Sub or Google Cloud Storage and use Dataflow pipelines to load it into BigQuery (using either streaming inserts, which incur costs, or batch load jobs, which are free).
You can also log to Stackdriver and from there select and send to BigQuery (there are already direct options for this in GCP; note that under the hood it performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files in Cloud Storage and perform batch load jobs by directly calling the BigQuery API. Note that there is a limit on the number of batch loads you can run per day on a particular table (1,000 per day).
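For reference, here is a minimal sketch of a streaming insert with the BigQuery Python client, assuming the destination table already exists; the table ID and row payload are placeholders, and remember that streaming inserts are billed.

# Minimal sketch of a streaming insert with the BigQuery Python client.
# The table ID and row payload are placeholders; streaming inserts incur costs.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # placeholder table

rows = [
    {"user_id": "abc123", "event": "signup", "ts": "2024-01-01T00:00:00Z"},
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")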
There is a page in the official documentation that lists all the options for loading data into BigQuery.
For simplicity, you can just send data from your local data source. You should use the Google Cloud client libraries for BigQuery. Here you have a guide on how to do that as well as a relevant code example.
But my honest recommendation is to send the data to Google Cloud Storage and, from there, load it into BigQuery. This way the whole process will be more stable.
You can check all the options from the first link that I've posted and choose what you think that will fit best with your workflow.
Keep in mind the limitations of this process.
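As a closing illustration of the recommended path above (stage newline-delimited JSON files in Cloud Storage, then batch-load them into BigQuery), here is a short sketch; the bucket, file pattern, and table name are placeholders.

# Sketch of a batch load job from Cloud Storage into BigQuery.
# Bucket, file pattern, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the JSON data
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/events-*.json",
    "my-project.my_dataset.events",
    job_config=job_config,
)
load_job.result()  # batch load jobs are free; you only pay for storage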