How to read .zip files in Synapse Spark notebooks

I'm new to Synapse and I'm stuck on a problem. I want to read a '.zip' file from ADLS Gen2 via Spark notebooks. As far as I can tell, spark.read.csv doesn't support '.zip' compression. I also tried reading it with Python's zipfile library, but it does not accept the ABFSS path. Thanks in advance.

I am fairly new to Synapse so this may not be a definitive answer, but I don't think it is possible.
I have tried the following approaches:
zipfile with abfss:// path (as you have)
zipfile with synfs:// path
shutil with synfs:// path
copy the file to a local temp dir and use shutil (could not get this to work; a rough sketch of that attempt is below)
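
For reference, here is roughly what that copy-to-local attempt looked like. This is only a sketch: it assumes mssparkutils.fs.cp can write to the driver's local disk via a file: URI, and the storage account, container and file names are placeholders:

import zipfile
from notebookutils import mssparkutils  # available inside Synapse notebooks

# Placeholder ABFSS path - adjust container, account and file name.
abfss_path = "abfss://container@account.dfs.core.windows.net/raw/archive.zip"

# Copy the zip from ADLS Gen2 onto the driver's local disk (assumes a
# file: destination is accepted), then open it with the standard zipfile
# module, which only understands local paths.
mssparkutils.fs.cp(abfss_path, "file:/tmp/archive.zip")

with zipfile.ZipFile("/tmp/archive.zip", "r") as zf:
    print(zf.namelist())
    zf.extractall("/tmp/archive")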

Related

Inspect Parquet in S3 from Command Line

I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file; instead, I'd like to point at the prefix: "s3://bucket/my-data.parquet"
Also, what if the schema differs across row groups in different partition files?
Following instructions here I downloaded a jar file and ran
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in the error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
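
Not from this thread, but as an alternative sketch: pyarrow can read the schema of a partitioned dataset straight from S3 without downloading it, assuming AWS credentials are already configured and the bucket/prefix below is a placeholder:

import pyarrow.dataset as ds

# Point at the dataset prefix rather than a single partition file.
dataset = ds.dataset("s3://bucket/my-data.parquet/", format="parquet")

print(dataset.schema)               # unified schema across partition files
print(dataset.head(5).to_pandas())  # quick peek at a few rows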

CSV/Pickle Files Too Large to Commit to GitHub Repo

I'm working on committing a project I've been working on for a while that we have not yet uploaded to GitHub. Most of it is Python/Pandas, where we do all our ETL work and save the results to CSV and pickle files that we then use for creating dashboards and running metrics on our data.
We are running into some issues with version control without using GitHub, so we want to get on top of that. I don't need version control on our CSV or pickle files, but I can't change the file paths or everything will break. When I try to make the initial commit to the repo, it won't let me because our pickle and CSV files are too big. Is there a way for me to commit the project without uploading the CSV/pickle files themselves (the largest is ~10 GB)?
I have this in my .gitignore file, but it's still not letting me get around it. Thanks for any and all help!
*.csv
*.pickle
*.pyc
*.json
*.txt
__pycache__/MyScripts.cpython-38.pyc
.Git
.vscode/settings.json
*.pm
*.e2x
*.vim
*.dict
*.pl
*.xlsx

How to upload my dataset into Google Colab?

I have my dataset on my local device. Is there any way to upload this dataset into Google Colab directly?
Note:
I tried this code :
from google.colab import files
uploaded = files.upload()
But it uploads files one by one. I want to upload the whole dataset at once.
Here's the workflow I used to upload a zip file and create a local data directory:
Zip the data directory locally. Something like: $ zip -r data.zip data
Upload the zip file of your data directory to Colab using their (Google's) instructions:
from google.colab import files
uploaded = files.upload()
Once the zip file is uploaded, perform the following operations:
import zipfile
import io
zf = zipfile.ZipFile(io.BytesIO(uploaded['data.zip']), "r")
zf.extractall()
Your data directory should now be in colab's working directory under a 'data' directory.
Zip or tar the files first, and then use tarfile or zipfile to unpack them.
Another way is to store the whole dataset in a NumPy object and upload it to Drive; from there you can easily retrieve it (zipping and unzipping also works, but I had difficulty with it).
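
A minimal sketch of that Drive-based approach, assuming the dataset was saved locally with np.save as dataset.npy and then uploaded to the top level of your Drive (the file name and path are placeholders):

import numpy as np
from google.colab import drive

# Mount Google Drive inside the Colab runtime.
drive.mount("/content/drive")

# Load the array that was uploaded to Drive beforehand,
# e.g. created locally with np.save("dataset.npy", data).
data = np.load("/content/drive/MyDrive/dataset.npy", allow_pickle=True)
print(data.shape)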

How to import .zip data to neo4j?

I have a CSV file that is zipped and stored on S3, and I'm planning to import the file directly from the URL. I'm not able to find any way of doing that in the official Neo4j docs.
LOAD CSV can do this. neo4j-import has the same underlying file reader and so can read zipped files directly, although it seems to be lacking URL support currently.
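As a rough illustration of the LOAD CSV route, here is a sketch that runs the statement through the official neo4j Python driver; the connection details, S3 URL, label and column names are all placeholders, and the URL is assumed to be publicly readable (or pre-signed):

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Per the answer above, LOAD CSV can read zipped files directly; a ZIP
# archive is expected to contain a single CSV file.
query = """
LOAD CSV WITH HEADERS FROM 'https://my-bucket.s3.amazonaws.com/data.csv.zip' AS row
CREATE (:Record {id: row.id, name: row.name})
"""

with driver.session() as session:
    session.run(query)
driver.close()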

Sqoop Process S3/CSV to S3/Parquet

Is Sqoop able to target a directory or CSV file stored in S3 and then import it to another S3 directory in Parquet? I have been testing and researching a bit but have found nothing. Any assistance would be appreciated.