Is there a way to load a Turtle file directly from GitHub? - graphdb

I currently store a Turtle file in GitHub. I download that file and then upload it via Import -> Upload RDF files.
I tried to load it directly from GitHub using the 'Get RDF data from a URL' option, but it didn't work. Am I right to assume this option is not suited to be used this way?

Related

Get zip files from one S3 bucket, unzip them to another S3 bucket

I have zip files in one S3 bucket.
I need to unzip them and copy the unzipped folders to another S3 bucket, keeping the source path.
For example, if the zip file in the source bucket is under
"s3://bucketname/foo/bar/file.zip"
then in the destination bucket it should be "s3://destbucketname/foo/bar/zipname/files.."
How can this be done?
I know it is somehow possible to do this with Lambda, so I won't have to download the files locally, but I have no idea how.
Thanks!
If your desire is to trigger the above process as soon as the zip file is uploaded into the bucket, then you could write an AWS Lambda function.
When the Lambda function is triggered, it will be passed the name of the bucket and object that was uploaded. The function should then:
Download the Zip file to /tmp
Unzip the file (beware: the maximum storage available is 500MB)
Loop through the unzipped files and upload them to the destination bucket
Delete all local files created (to free up space for any future executions of the function)
For a general example, see: Tutorial: Using AWS Lambda with Amazon S3 - AWS Lambda
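A rough sketch of such a handler with boto3 follows; the destination bucket name, the /tmp paths, and the key layout are assumptions taken from the question, not a tested implementation:
import os
import zipfile
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "destbucketname"  # placeholder destination bucket

def lambda_handler(event, context):
    # The S3 event notification carries the source bucket and object key
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])  # e.g. foo/bar/file.zip
    local_zip = os.path.join("/tmp", os.path.basename(key))

    s3.download_file(bucket, key, local_zip)          # 1. download the zip to /tmp
    with zipfile.ZipFile(local_zip) as z:             # 2. unzip it (mind the /tmp size limit)
        z.extractall("/tmp/unzipped")

    prefix = key.rsplit(".", 1)[0]                    # keep the source path, e.g. foo/bar/file
    for root, _, files in os.walk("/tmp/unzipped"):   # 3. upload every extracted file
        for name in files:
            path = os.path.join(root, name)
            dest_key = prefix + "/" + os.path.relpath(path, "/tmp/unzipped")
            s3.upload_file(path, DEST_BUCKET, dest_key)
            os.remove(path)                           # 4. free /tmp for future invocations
    os.remove(local_zip)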
You can use AWS Lambda for this. You can also set an event notification on your S3 bucket so that a Lambda function is triggered every time a new file arrives. You can write Python code that uses boto3 to connect to S3, read each zip file into a buffer, unzip it, gzip the extracted files, and then re-upload them to S3 under your desired folder/path:
import gzip
import io
import zipfile

import boto3

# Assumed names: the destination bucket and a zip file already fetched from the source bucket
destinationbucket = boto3.resource("s3").Bucket("destbucketname")

with zipfile.ZipFile("/tmp/file.zip") as zipped:
    for file in zipped.namelist():
        with zipped.open(file, "r") as f_in:
            gzipped_content = gzip.compress(f_in.read())
            final_file_path = file + ".gz"
            destinationbucket.upload_fileobj(io.BytesIO(gzipped_content),
                                             final_file_path,
                                             ExtraArgs={"ContentType": "text/plain"})
There is also a tutorial here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
Arguably Python is simpler to use for your Lambda, but if you are considering Java, I've made a library that manages unzipping of data in AWS S3 utilising stream download and multipart upload.
Unzipping is achieved without keeping the data in memory or writing it to disk, which makes it suitable for large data files; it has been used to unzip files of 100GB+.
It is available in Maven Central, here is the GitHub link: nejckorasa/s3-stream-unzip

Is there any way to load a local dataset folder directly from Google Drive into Google Colab?

I couldn't load a custom data folder from Google Drive into Google Colab, even though I mounted Google Drive. Instead of the MNIST dataset, I want to load my own image dataset folder. I have tried the PyDrive wrapper, but I need a simpler solution.
Suppose I have a dataset of images inside Google Drive. How do I load it into Google Colab?
from google.colab import drive
drive.mount('/content/gdrive')
then
with open('/content/gdrive/My Drive/foo.txt', 'w') as f:
    f.write('Hello Google Drive!')
!cat /content/gdrive/My\ Drive/foo.txt
Here, instead of foo.txt, I have an image folder called Dog inside an ml-data folder, but I can't load it. How do I load it in Google Colab directly from Google Drive, the way I would from my local hard drive?
To load data directly from the local machine, you need to follow these steps:
Go to files [left side menu]
Click on upload to session storage
Select file(s) from your machine to upload
It will show a prompt indicating that the file(s) will be available for the current session only; click OK.
The file(s) will be uploaded to the directory. Click on it (left-click and right-click both work the same).
And then:
Copy the path and use it inside the pd.read_csv() function, as in the short example below.
Note: After the session is terminated, the files will be lost from the Colab session. To use them again, you'll need to re-upload them.
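For instance, if the uploaded file were a CSV named data.csv (a hypothetical name), the copied path would be used like this:
import pandas as pd

# '/content/data.csv' is the path copied from the Files panel (hypothetical file name)
df = pd.read_csv('/content/data.csv')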
Often, we prefer to keep all our data in a GitHub repository or a Google Drive folder and fetch it from there.
Reading many files from Google Drive through Colab is less performant and more unreliable than first copying a .zip or similar single file from Drive to the Colab VM, unzipping it outside the Drive mount directory, and then using that copy of the data.
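One sketch of that pattern, assuming the dataset has been zipped up in Drive first; the ml-data/Dog paths come from the question above and the archive name is hypothetical:
from google.colab import drive
drive.mount('/content/gdrive')

# Copy the archive out of the Drive mount onto the Colab VM, then unzip it there.
# 'Dog.zip' inside 'ml-data' is an assumed archive of the Dog image folder.
!cp '/content/gdrive/My Drive/ml-data/Dog.zip' /content/
!unzip -q /content/Dog.zip -d /content/Dog
# Read images from /content/Dog, which now lives on the VM's local disk.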

How to import .zip data into Neo4j?

I have a CSV file which is zipped and stored on S3, and I'm planning to import the file directly from its URL. I'm not able to find any way of doing that in the official Neo4j docs.
LOAD CSV can do this. neo4j-import has the same underlying file reader and so can read zipped files directly, although it seems to be lacking URL support currently.
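A minimal sketch of running LOAD CSV from Python with the official neo4j driver, assuming the S3 object is publicly readable and compressed in a format LOAD CSV can read; the connection details, URL, labels, and column names are placeholders:
from neo4j import GraphDatabase

# Placeholders: connection details, the S3 URL, and the column/label names
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
LOAD CSV WITH HEADERS FROM 'https://my-bucket.s3.amazonaws.com/data.csv.gz' AS row
CREATE (:Record {id: row.id, name: row.name})
"""

with driver.session() as session:
    session.run(query)
driver.close()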

Copying large BigQuery Tables to Google Cloud Storage and subsequent local download

My goal is to save a BigQuery table locally to be able to perform some analyses. To save it locally, I tried to export it to Google Cloud Storage as a CSV file. Alas, the dataset is too big to move as one file, so it is split into many different files, looking like this:
exampledata.csv000000000000
exampledata.csv000000000001
...
Is there a way to put them back together again in Google Cloud Storage? Maybe even change the format to CSV?
My approach was to download the files and try to change them manually. Clicking on them does not work, as they get saved as BIN files, and it is also very time consuming. Furthermore, I do not know how to assemble them back together.
I also tried to get them via the gsutil command, and I was able to save them on my machine, but as zipped files. When unzipping with WinRAR, I get exampledata.out files, which I do not know what to do with. Additionally, I am clueless about how to put them back together into one file.
How can I get the table to my computer, as one file, and as a csv?
The computer I am working with runs on Ubuntu, but I need to have the data on a Google Virtual Machine, using Windows Server 2012.
Try using the following to merge all the files into one from the Windows command prompt:
copy *.cs* merged.csv
I suggest you save the file as a .gzip file; then you can download it from Google Cloud easily as a BIN file. If you export these split files from BigQuery as follows:
Export Table -> CSV format, compression as GZIP, URI: file_name*
Then you can combine them back by following the steps below:
In Windows:
Add .zip to the end of all these file names.
Use 7-Zip to unzip the first .zip file, the one named "...000000000000", and it will automatically detect all the rest of the .zip files. This is just like the normal way of unzipping a split .zip archive.
In Ubuntu:
I failed to unzip the files following the methods I could find on the internet. I will update the answer if I figure it out.
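As a rough sketch of reassembling the export on Ubuntu (or anywhere Python runs), assuming the shards were exported with GZIP compression and downloaded with gsutil; the file names are hypothetical and the header handling assumes the export included a header row:
import glob
import gzip

# Hypothetical shard names: exampledata.csv000000000000.gz, exampledata.csv000000000001.gz, ...
shards = sorted(glob.glob("exampledata.csv*.gz"))

with open("merged.csv", "w", encoding="utf-8") as out:
    for i, shard in enumerate(shards):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            lines = f.readlines()
        # keep the header only from the first shard
        out.writelines(lines if i == 0 else lines[1:])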

Scrapy crawl appends locally, replaces on S3?

I implemented a Scrapy project that is now working fine locally. Using the crawl command, each spider appended its JSON lines to the same file if the file existed. When I changed the feed exporter to S3 using boto, it now overwrites the entire file with the data from the last spider run instead of appending to the file.
Is there any way to get Scrapy/boto/S3 to append the JSON lines to the file like it does locally?
Thanks
There is no way to append to a file in S3. You could enable versioning on the S3 bucket and then each time the file was written to S3, it would create a new version of the file. Then you could retrieve all versions of the file using the list_versions method of the boto Bucket object.
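As a rough sketch of listing those versions with the current boto3 API rather than the older boto Bucket.list_versions call the answer mentions (the bucket and key names are placeholders):
import boto3

s3 = boto3.client("s3")
# List every stored version of the feed file once bucket versioning is enabled
response = s3.list_object_versions(Bucket="my-feed-bucket", Prefix="items.jl")
for version in response.get("Versions", []):
    print(version["VersionId"], version["LastModified"], version["Size"])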
From reading the feed exporter code (https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/feedexport.py), the file exporter opens the specified file in append mode whilst the S3 exporter calls set_contents_from_file which presumably overwrites the original file.
The boto S3 documentation (http://boto.readthedocs.org/en/latest/getting_started.html) doesn't mention being able to modify stored files, so the only solution would be to create a custom exporter that stores a local copy of results that can be appended to first before copying that file to S3.
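A minimal sketch of that "download, append, re-upload" idea using boto3 (the answer predates boto3, but the approach is the same; the bucket, key, and file names are placeholders):
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "my-feed-bucket", "items.jl"   # placeholders

def append_jsonlines_to_s3(local_path):
    # Fetch the existing feed if it exists, otherwise start from empty
    try:
        existing = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    except ClientError:
        existing = b""
    with open(local_path, "rb") as f:
        new_lines = f.read()
    # Re-upload the concatenation; S3 objects can only be replaced, not appended to
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=existing + new_lines)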