uploading large file to GraphDB - graphdb

I want to upload whole of the Dbpedia data set to the GraphDB. I have installed am using Docker container and have run it successfully. Now the problem is, I am unable to upload the .nt files because the maximum size of the file allowed is 200 MB. Can anybody help?

You can surpass the limit by using Import -> Server files. You can change the directory of the server files by passing -Dgraphdb.workbench.importDirectory=path_to_directory when starting GraphDB.

Related

Trying to download zip file from drive to instance in colab hits quota limit

I am trying to download the zip file dataset from drive using
from google.colab import drive
drive.mount('/content/drive')
!cp "/content/drive/Shared Drives/infinity-drive/datasets/coco.zip" ./data
I get the error that one of the quota limits is exceeded for 3 days in consecutively and file copying ends in I/O error.
I went over all the solutions. File size is 18GB. Although I could not understand the directive
Use drive.google.com to download the file.. What does it have to do with triggering limis in colab? For people trying to download file to colab instance and then to their local machine? Any way/
The file is in archive format.
The folder it is in has no any other files in it.
The file is private, although I am not the manager of the shared drive.
I am baffled at exactly which quota I am hitting. It does not say to me but I am 99% sure I should not be hitting any considering I can download the entire thing to my local machine. I can't keep track of exactly what is happening because even within the same day, it is able to copy about 3-5 GB of the file(even the file size is changing).
This is a bug in colab/Drive integration, and is tracked in #1607. See this comment for workarounds.

Cloud9 deploy hitting size limit for numpy, pandas

I'm building in Cloud9 to deploy to Lambda. My function works fine in Cloud9 but when I go to deploy I get the error
Unzipped size must be smaller than 262144000 bytes
Running du -h | sort -h shows that my biggest offenders are:
/debug at 291M
/numpy at 79M
/pandas at 47M
/botocore at 41M
My function is extremely simple, it calls a service, uses panda to format the response, and sends it on.
What is in debug and how do I slim it down/eliminate it from the deploy package?
How do others use libraries at all if they eat up most of the memory limit?
A brief background to understand the problem root-cause
The problem is not with your function but with the size of the zipped packages. As per AWS documentation, the overall size of zipped package must not exceed greater than 3MB. With that said, if the package size is greater than 3MB which happens inevitably, as a library can have many dependencies, then consider uploading the zipped package to a AWS S3 bucket. Note: even s3 bucket has a size limit of 262MB. Ensure that your package does not exceed this limit. The error message that you have posted, Unzipped size must be smaller than 262144000 bytes is referring to the size of the deployment package aka the libraries.
Now, Understand some facts when working with AWS,
AWS Containers are empty.
AWS containers have a linux kernel
AWS Cloud9 is only an IDE like RStudio or Pycharm. And it uses S3 bucket for saving the installed packages.
This means you'll need to know the following:
the package and its related dependencies
extract the linux-compiled packages from cloud9 and save to a folder-structure like, python/lib/python3.6/site-packages/
Possible/Workable solution to overcome this problem
Overcome this problem by reducing the package size. See below.
Reducing the deployment package size
Manual method: delete files and folders within each library folder that are named *.info and *._pycache. You'll need to manually look into each folder for the above file extensions to delete them.
Automatic method: I've to figure out the command. work in progress
Use Layers
In AWS go to Lambda and create a layer
Attach the S3 bucket link containing the python package folder. Ensure the lambda function IAM role has permission to access S3 bucket.
Make sure the un-zipped folder size is less than 262MB. Because if its >260 MB then it cannot be attached to AWS Layer. You'll get an error, Failed to create layer version: Unzipped size must be smaller than 262144000 bytes

File uploding failed when i was uploding the file on google-compute instance, now it's showing xyz.zip.csupload. Can i resume it or retry it?

Instance operating system is ubuntu 16.04.
I was uploading using the instance upload file option.
File size was 2.24 GB.
I didn't find anything useful on internet.
Thanks
The file "xyz.zip.ccsupload" is the file with the partial upload. Once the upload is complete, then the file will have the proper name. You cannot resume the upload from where it left off. If it fails, then you will have to attempt uploading the file again.
The reason why it failed is most likely due to the file size. Due to the size of the file, I would suggest using the "gcloud compute scp" command to upload the file to the VM instance as documented here.

how do I split a large csv.gz file in Google Cloud Storage?

I get this error when trying to load a table in Google BQ:
Input CSV files are not splittable and at least one of the files is
larger than the maximum allowed size. Size is: 56659381010. Max
allowed size is: 4294967296.
Is there a way to split the file using gsutil or something like that without having to upload everything again?
The largest compressed CSV file you can load into BigQuery is 4 gigabytes. GCS unfortunately does not provide a way to decompress a compressed file, nor does it provide a way to split a compressed file. GZip'd files can't be arbitrarily split up and reassembled in the way you could a tar file.
I imagine your best bet would likely be to spin up a GCE instance in the same region as your GCS bucket, download your object to that instance (which should be pretty fast, given that it's only a few dozen gigabytes), decompress the object (which will be slower), break that CSV file into a bunch of smaller ones (the linux split command is useful for this), and then upload the objects back up to GCS.
I ran into the same issue and this is how I dealt with it:
First, spin up a Google Compute Engine VM instance.
https://console.cloud.google.com/compute/instances
Then install the gsutil commands and then go through the authentication process.
https://cloud.google.com/storage/docs/gsutil_install
Once you have verified that the gcloud, gsutil, and bq commands are working then save a snapshot of the disk as snapshot-1 and then delete this VM.
On your local machine, run this command to create a new disk. This disk is used for the VM so that you have enough space to download and unzip the large file.
gcloud compute disks create disk-2017-11-30 --source-snapshot snapshot-1 --size=100GB
Again on your local machine, run this command to create a new VM instance that uses this disk. I use the --preemptible flag to save some cost.
gcloud compute instances create loader-2017-11-30 --disk name=disk-2017-11-30,boot=yes --preemptible
Now you can SSH into your instance and then run these commands on the remote machine.
First, copy the file from cloud storage to the VM
gsutil cp gs://my-bucket/2017/11/20171130.gz .
Then unzip the file. In my case, for ~4GB file, it took about 17 minutes to complete this step:
gunzip 20171130.gz
Once unzipped, you can either run the bq load command to load it into BigQuery but I found that for my file size (~70 GB unzipped), that operation would take about 4 hours. Instead, I uploaded the unzipped file back to Cloud Storage
gsutil cp 20171130 gs://am-alphahat-regional/unzipped/20171130.csv
Now that the file is back on cloud storage, you can run this command to delete the VM.
gcloud compute instances delete loader-2017-11-30
Theoretically, the associated disk should also have been deleted, but I found that the disk was still there and I needed to delete it with an additional command
gcloud compute disks delete disk-2017-11-30
Now finally, you should be able to run the bq load command or you can load the data from the console.

large file from ec2 to s3

I have a 27GB file that I am trying to move from an AWS Linux EC2 to S3. I've tried both the 'S3put' command and the 'S3cmd put' command. Both work with a test file. Neither work with the large file. No errors are given, the command returns immediately but nothing happens.
s3cmd put bigfile.tsv s3://bucket/bigfile.tsv
Though you can upload objects to S3 with sizes up to 5TB, S3 has a size limit of 5GB for an individual PUT operation.
In order to load files larger than 5GB (or even files larger than 100MB) you are going to want to use the multipart upload feature of S3.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html
(Ignore the outdated description of a 5GB object limit in the above blog post. The current limit is 5TB.)
The boto library for Python supports multipart upload, and the latest boto software includes an "s3multiput" command line tool that takes care of the complexities for you and even parallelizes part uploads.
https://github.com/boto/boto
The file did not exist, doh. I realised this after running the s3 commands in verbose mode by adding the -v tag:
s3cmd put -v bigfile.tsv s3://bucket/bigfile.tsv
s3cmd version 1.1.0 supports the multi-part upload as part of the "put" command, but its still in beta (currently.)