How to import .zip data into Neo4j? - amazon-s3

I have a CSV file which is zipped and stored on S3, and I'm planning to import the file directly from its URL. I'm not able to find any way of doing that in the official Neo4j docs.

LOAD CSV can do this. neo4j-import has the same underlying file reader and so can read zipped files directly, although it seems to be lacking URL support currently.
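For illustration, here is a minimal sketch of what such an import might look like when run through the official Neo4j Python driver, assuming the zipped CSV is reachable over HTTPS (for example via a public or pre-signed S3 URL); the URL, credentials, labels, and property names below are placeholders, not values from the question:

```python
from neo4j import GraphDatabase

# LOAD CSV pointed at a zipped CSV served over HTTPS; the URL and
# property names are placeholders only.
query = """
LOAD CSV WITH HEADERS FROM 'https://my-bucket.s3.amazonaws.com/data.csv.zip' AS row
CREATE (:Record {id: row.id, name: row.name})
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(query)
driver.close()
```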

Related

How to read .zip files in Synapse Spark notebooks

I'm new to Synapse and am stuck on a problem. I want to read a '.zip' file from ADLS Gen2 via Spark notebooks. As far as I can tell, spark.read.csv doesn't support '.zip' compression. I also tried reading it with Python's zipfile library, but it does not accept the ABFSS path. Thanks in advance.
I am fairly new to Synapse so this may not be a definitive answer, but I don't think it is possible.
I have tried the following approaches:
zipfile with abfss:// path (as you have)
zipfile with synfs:// path
shutil with synfs:// path
copy file to tempdir and use shutil (could not get this to work; a different workaround is sketched after this list)
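One workaround that may be worth trying, offered only as a sketch under the assumption that the zip is small enough to fit in the driver's memory: read the raw bytes through Spark's binaryFile source (which accepts abfss:// paths) and unpack them with Python's zipfile. The path below is a placeholder, and `spark` is the SparkSession that Synapse notebooks provide.

```python
import io
import zipfile

import pandas as pd

# Read the raw bytes of the archive via Spark's binaryFile source (Spark 3.0+);
# the whole file is collected to the driver, so this only suits modest zip sizes.
path = "abfss://container@account.dfs.core.windows.net/folder/data.zip"  # placeholder
raw = spark.read.format("binaryFile").load(path).collect()[0]["content"]

# Unpack in the driver and read the first archive member as a CSV with pandas.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    with zf.open(zf.namelist()[0]) as member:
        df = pd.read_csv(member)
```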

CSV/Pickle Files Too Large to Commit to GitHub Repo

I'm working on committing a project I have been working on for a while that we have not yet uploaded to GitHub. Most of it is Python Pandas where we are doing all our ETL work and saving to CSV and Pickle files to then use in creating dashboards/running metrics on our data.
We are running into some issues with version control without using GitHub, so we want to get on top of that. I don't need version control on our CSV or Pickle files, but I can't change the file paths or everything will break. When I try to make the initial commit to the repo it won't let me, because our pickle and CSV files are too big. Is there a way for me to commit the project and not upload the whole CSV/pickle files (the largest is ~10 GB)?
I have this in my .gitignore file, but it's still not letting me get around it. Thanks for any and all help!
*.csv
*.pickle
*.pyc
*.json
*.txt
__pycache__/MyScripts.cpython-38.pyc
.Git
.vscode/settings.json
*.pm
*.e2x
*.vim
*.dict
*.pl
*.xlsx
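A note on the .gitignore above, offered as a hedged suggestion rather than a definitive fix: .gitignore only affects files that are not yet tracked, so if the CSV/pickle files were already staged with git add (or committed), they have to be removed from the index before the ignore rules take effect. A minimal sketch, driving git through Python's subprocess and reusing the patterns listed above:

```python
import subprocess

# Untrack files already in the index so the .gitignore patterns above can apply;
# --cached leaves the files on disk, --ignore-unmatch avoids errors if nothing matches.
for pattern in ["*.csv", "*.pickle"]:
    subprocess.run(
        ["git", "rm", "-r", "--cached", "--ignore-unmatch", pattern],
        check=True,
    )
subprocess.run(["git", "commit", "-m", "Stop tracking large data files"], check=True)
```

If the large files were already pushed in earlier commits they would also remain in the repository history, and removing them from that history (or moving them to something like Git LFS) is a separate step not covered by this sketch.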

Is there a way to load a turtle file directly from GitHub?

I currently store a turtle file in GitHub. I download that file and then upload it via the Import -> Upload RDF files option.
I tried to load it directly from GitHub using the 'Get RDF data from a URL' option, but that didn't work. I'm assuming it is not suited to be used in this way?
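One likely culprit, noted here as an assumption rather than a confirmed diagnosis: the regular github.com file page serves an HTML page, not the raw Turtle, so a URL-based import generally needs the raw.githubusercontent.com address instead. A quick way to check that the raw URL serves parseable Turtle is to load it with rdflib; the URL below is a placeholder.

```python
from rdflib import Graph

# raw.githubusercontent.com returns the file contents; the github.com page URL
# returns HTML, which an RDF importer cannot parse. The URL is a placeholder.
url = "https://raw.githubusercontent.com/<user>/<repo>/main/ontology.ttl"

g = Graph()
g.parse(url, format="turtle")
print(len(g), "triples loaded")
```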

Copying large BigQuery Tables to Google Cloud Storage and subsequent local download

My goal is to save a BigQuery table locally to be able to perform some analyses. To save it locally, I tried to export it to Google Cloud Storage as a CSV file. Alas, the dataset is too big to move as one file, so it is split into many different files, looking like this:
exampledata.csv000000000000
exampledata.csv000000000001
...
Is there a way to put them back together again in Google Cloud Storage? Maybe even change the format to CSV?
My approach was to download the files and try to change them manually. Clicking on a file does not work, as it is saved as a BIN file, and this is also very time consuming. Furthermore, I do not know how to assemble them back together.
I also tried to get the data via the gsutil command, and I was able to save the files on my machine, but as zipped files. When unzipping with WinRAR, it gives me exampledata.out files, which I do not know what to do with. Additionally, I am clueless about how to put them back together into one file.
How can I get the table to my computer, as one file, and as a csv?
The computer I am working with runs Ubuntu, but I need to have the data on a Google virtual machine running Windows Server 2012.
Try using the following to merge all the files into one from the Windows command prompt:
copy *.cs* merged.csv
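Another option, sketched here under the assumption that the google-cloud-storage Python client is available, is to stitch the shards back together inside Cloud Storage itself with object composition; bucket and object names are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-export-bucket")  # placeholder bucket name

# Collect the exported shards in order and compose them into one object.
shards = sorted(bucket.list_blobs(prefix="exampledata.csv"), key=lambda b: b.name)
merged = bucket.blob("exampledata_merged.csv")

# compose() accepts at most 32 source objects per call; with more shards,
# compose in batches of 32 and then compose the intermediate objects.
merged.compose(shards[:32])
```

If the export wrote a header row into each shard, the duplicate headers would still need to be stripped after merging.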
I suggest saving the files as GZIP files; then you can download them from Google Cloud easily as BIN files. If you export the split files from BigQuery as follows:
Export Table -> CSV format, compression as GZIP, URI: file_name*
Then you can combine them back with the following steps:
In Windows:
add .zip to the end of all these files.
use 7-Zip to unzip the first .zip file, the one named "...000000000000"; it will automatically detect all the rest of the .zip files. This is just like the normal way of unzipping a split .zip archive.
In Ubuntu:
I failed to unzip the files using the methods I could find on the internet. I will update the answer if I figure it out.
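For the Ubuntu side, one possible approach (a sketch only, assuming the shards were exported with GZIP compression and sit in the current directory) is to decompress each shard with Python's gzip module and append it to a single CSV; the file names below are placeholders based on the example above.

```python
import glob
import gzip
import shutil

# Decompress each gzipped shard and append its contents to one merged CSV.
with open("exampledata_merged.csv", "wb") as merged:
    for path in sorted(glob.glob("exampledata.csv0*")):
        with gzip.open(path, "rb") as shard:
            shutil.copyfileobj(shard, merged)
```

If each shard carries its own header row, those extra headers would need to be removed afterwards.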

Scrapy crawl appends locally, replaces on S3?

I implemented a Scrapy project that is now working fine locally. Using the crawl command, each spider appended its JSON lines to the same file if the file existed. When I changed the feed exporter to S3 using boto, it now overwrites the entire file with the data from the last-run spider instead of appending to the file.
Is there any way to get Scrapy/boto/S3 to append the JSON lines to the file like it does locally?
Thanks
There is no way to append to a file in S3. You could enable versioning on the S3 bucket and then each time the file was written to S3, it would create a new version of the file. Then you could retrieve all versions of the file using the list_versions method of the boto Bucket object.
From reading the feed exporter code (https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/feedexport.py), the file exporter opens the specified file in append mode, whilst the S3 exporter calls set_contents_from_file, which presumably overwrites the original file.
The boto S3 documentation (http://boto.readthedocs.org/en/latest/getting_started.html) doesn't mention being able to modify stored files, so the only solution would be to create a custom exporter that stores a local copy of the results that can be appended to first, before copying that file to S3.
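As an illustration of that download-append-reupload idea, here is a minimal sketch written against boto3 rather than the legacy boto library quoted above; the bucket, key, and local file names are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket, key = "my-feed-bucket", "items.jl"  # placeholders

# Fetch whatever is already in S3 (if anything), append the new JSON lines,
# and write the combined object back; S3 itself has no append operation.
try:
    existing = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
except ClientError:
    existing = b""

with open("items_local.jl", "rb") as f:  # output of the latest crawl
    new_lines = f.read()

s3.put_object(Bucket=bucket, Key=key, Body=existing + new_lines)
```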