airbyte ETL ,connection between http API source and big query - google-bigquery

i have a task in hand, where I am supposed to create python based HTTP API connector for airbyte. connector will return a response which will contain some links of zip files.
each zip file contains csv file, which is supposed to be uploaded to the bigquery
now I have made the connector which is returning the URL of the zip file.
The main question is how to send the underlying csv file to the bigquery ,
i can for sure unzip or even read the csv file in the python connector, but i am stuck on the part of sending the same to the bigquery.
p.s if you guys can tell me even about sending the CSV to google cloud storage, that will be awesome too

When you are building an Airbyte source connector with the CDK your connector code must output records that will be sent to the destination, BigQuery in your case. This allows to decouple extraction logic from loading logic, and makes your source connector destination agnostic.
I'd suggest this high level logic in your source connector's implementation:
Call the source API to retrieve the zip's url
Download + unzip the zip
Parse the CSV file with Pandas
Output parsed records
This is under the assumption that all CSV files have the same schema. If not you'll have to declare one stream per schema.
A great guide, with more details on how to develop a Python connector is available here.
Once your source connector outputs AirbyteRecordMessages you'll be able to connect it to BigQuery and chose the best loading method according to your need (Standard or GCS staging).

Related

Apache Flink - write Parquet file to S3

I have a Flink streaming pipeline that reads the messages from Kafka, the message has s3 path to the log file. Using the Flink async IO I download the log file, parse & extract some key information from them. I now need to write this extracted data (Hashmap<String, String>) as Parquet file back to another bucket in S3. How do I do it?
I have completed till the transformation, I have used 1.15 flink version. The Parquet format writing is unclear or some methods seem to be deprecated.
You should be using the FileSink. There are some examples in the documentation, but here's an example that writes protobuf data in Parquet format:
final FileSink<ProtoRecord> sink = FileSink
.forBulkFormat(outputBasePath, ParquetProtoWriters.forType(ProtoRecord.class))
.withRollingPolicy(
OnCheckpointRollingPolicy.builder()
.build())
.build();
stream.sinkTo(sink);
Flink includes support for Protobuf and Avro. Otherwise you'll need to implement a ParquetWriterFactory with a custom implementation of the ParquetBuilder interface.
The OnCheckpointRollingPolicy is the default for bulk formats like Parquet. There's no need to specify that unless you go further and include some custom configuration -- but I added it to the example to illustrate how the pieces fit together.

Azure Data Factory HTTP Connector Data Copy - Bank Of England Statistical Database

I'm trying to use the HTTP connector to read a CSV of data from the BoE statistical database.
Take the SONIA rate for instance.
There is a download button for a CSV extract.
I've converted this to the following URL which downloads a CSV via web browser.
[https://www.bankofengland.co.uk/boeapps/database/_iadb-fromshowcolumns.asp?csv.x=yes&Datefrom=01/Dec/2021&Dateto=01/Dec/2021 &SeriesCodes=IUDSOIA&CSVF=TN&UsingCodes=Y][1]
Putting this in the Base URL it connects and pulls the data.
I'm trying to split this out so that I can parameterise some of it.
Base
https://www.bankofengland.co.uk/boeapps/database
Relative
_iadb-fromshowcolumns.asp?csv.x=yes&Datefrom=01/Dec/2021&Dateto=01/Dec/2021 &SeriesCodes=IUDSOIA&CSVF=TN&UsingCodes=Y
It won't fetch the data, however when it's all combined in the base URL it does.
I've tried to add a "/" at the start of the relative URL as well and that hasn't worked either.
According to the documentation ADF puts the "/" in for you "[Base]/[Relative]"
Does anyone know what I'm doing wrong?
Thanks,
Dan
[1]: https://www.bankofengland.co.uk/boeapps/database/_iadb-fromshowcolumns.asp?csv.x=yes&Datefrom=01/Dec/2021&Dateto=01/Dec/2021 &SeriesCodes=IUDSOIA&CSVF=TN&UsingCodes=Y
I don't see a way you could download that data directly as a csv file. The data seems to be manually copied from the site, using their Save as option.
They have used read-only block and hidden elements, I doubt there would any easy way or out of the box method within ADF web activity to help on this.
You can just manually copy-paste into a csv file.

What is the best way or tool to open to read a raw clickstream blob storage data in azure

I have clickstream blob storage about 800mb average file sizes and when i open the file it defaults to text file. How do i open and read the data possibly json format or column format. I would also like to understand if i can build an API to consume that data. I recently built an azure function app http trigger but the file is too large to open it up and the function times out. So any suggestion on those two would be appreciated
Thank you

Extract data fom Marklogic 8.0.6 to AWS S3

I'm using Marklogic 8.0.6 and we also have JSON documents in it. I need to extract a lot of data from Marklogic and store them in AWS S3. We tried to run "mlcp" locally and them upload the data to AWS S3 but it's very slow because it's generating a lot of files.
Our Marklogic platform is already connected to S3 to perform backup. Is there a way to extract a specific database in aws s3 ?
It can be OK for me if I have one big file with one JSON document per line
Thanks,
Romain.
I don't know about getting it to s3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
S3:// is a native file type in MarkLogic. So you can also iterate through all your docs and export them with xdmp:save("s3://...).
If you want to make agrigates, then You may want to marry this idea into Sam's suggestion of CORB2 to control the process and assist in grouping your whole database into multiple manageable aggregate documents. Then use a post-back task to run xdmp-save
Thanks guys for your answers. I do not know about CORB2, this is a great solution! But unfortunately, due to bad I/O I prefer a solution to write directly on s3.
I can use a basic Ml query and dump to s3:// with native connector but I always face memory error even launching with the "spawn" function to generate a background process.
Do you have any xquey example to extract each document on s3 one by one without memory permission?
Thanks

query regarding cloud file storage services- can i append data to an existing file

I am working to create an application where some files will be stored in Amazon S3/Rackspace Cloud Files/other similar cloud file storage providers.
There are a couple of scenarios where it would be easier for me, if I could append data to an existing file... Is this possible? Or do I have to download the file from Amazon S3, then append data to it, and finally upload the modified file back to Amazon S3?
There is no way to append anything to existing files in S3.
You will have to download it and upload it again after modifying.
If you wish though, you can always upload the new data with a tag (a timestamp or a counter), e.g. file_201201011344. So when reading files, you get all files mactching your pattern and append them on the client side.