Here is an example of restoring Postgres from a single file on S3. It reads the file to stdout and pipes that stream into the pg_restore tool. But what if there are several gz files on S3? Is there a way to make pg_restore read them without downloading them into a temp folder first?
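For reference, that kind of single-file pipeline usually looks something like the sketch below (bucket, key and database name are placeholders, and it assumes a gzip-compressed custom-format dump):

aws s3 cp s3://my-bucket/backups/dump.gz - | gunzip | pg_restore -d mydb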
About restoring in a loop
First of all, it is unclear how Postgres handles this situation. I have 50-100 gz files of different sizes. Yes, they have names and can be sorted. But will Postgres perform a correct restore when I feed it only a single file at a time?
Also, a loop means downloading all the files into some folder first. The files can be big, so it would be better to restore them from S3 directly.
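If the gz files are sequential chunks of a single dump (for example, produced by split), one way to avoid a temp folder is to stream them in sorted order and concatenate the streams before handing them to pg_restore. A rough sketch, with bucket, prefix and database name as placeholders:

for key in $(aws s3 ls s3://my-bucket/dumps/ | awk '{print $4}' | sort); do
  aws s3 cp "s3://my-bucket/dumps/$key" -
done | gunzip | pg_restore -d mydb

If instead each file is an independent dump (one per table, say), each one would have to be piped into pg_restore separately inside the loop; which of the two applies depends entirely on how the dump was produced.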
Related
I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to just give the prefix: "s3://bucket/my-data.parquet"
Also what if the schema is different in different row groups across different partition files?
Following the instructions here, I downloaded a jar file and ran:
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in an error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
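As a side note, that "No FileSystem" error usually just means no S3 filesystem implementation is on Hadoop's classpath. With the hadoop-aws module (and the matching AWS SDK jar) available, the s3a:// scheme is normally used instead; a rough sketch, where the jar locations are placeholders:

export HADOOP_CLASSPATH=/path/to/hadoop-aws.jar:/path/to/aws-java-sdk-bundle.jar
hadoop jar parquet-tools-1.9.0.jar schema s3a://bucket/my-data.parquet/my-data-0000.snappy.parquet

You would still need to provide AWS credentials (e.g. via the fs.s3a.* configuration properties) for this to work.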
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
I'm working on committing a project I have been working on for a while that we have not yet uploaded to GitHub. Most of it is Python/Pandas, where we do all our ETL work and save to CSV and pickle files that we then use for creating dashboards and running metrics on our data.
We are running into some issues with version control without using GitHub, so we want to get on top of that. I don't need version control on our CSV or pickle files, but I can't change the file paths or everything will break. When I try to make the initial commit to the repo, it won't let me because our pickle and CSV files are too big. Is there a way for me to commit the project without uploading the whole CSV/pickle files (the largest is ~10 GB)?
I have this in my .gitignore file, but it still isn't letting me get around it. Thanks for any and all help!
*.csv
*.pickle
*.pyc
*.json
*.txt
__pycache__/MyScripts.cpython-38.pyc
.Git
.vscode/settings.json
*.pm
*.e2x
*.vim
*.dict
*.pl
*.xlsx
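Not a full answer, but a quick way to check whether a given file is actually matched by those patterns, and to untrack a file that was already staged before the pattern was added (the path below is just a placeholder):

git check-ignore -v data/output.csv
git rm --cached data/output.csv

.gitignore only affects untracked files, so anything already added to the index has to be removed from it (with --cached, so the file stays on disk) before the ignore rules take effect.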
I'm using rclone in order to copy some files to an S3 bucket (deep archive). The command I'm using is:
rclone copy --ignore-existing --progress --max-delete 0 "/var/vmail" foo-backups:foo-backups/vmail
This makes rclone copy files that I know for sure already exist in the bucket. I tried removing the --ignore-existing flag (which IMHO is badly named, as it does exactly the opposite of what you'd initially expect), but I still get the same behaviour.
I also tried adding --size-only, but that doesn't fix the "bug".
How can I make rclone copy only new files?
You could use rclone sync; check out https://rclone.org/commands/rclone_sync/:
Doesn’t transfer unchanged files, testing by size and modification time or MD5SUM. Destination is updated to match source, including deleting files if necessary.
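Applied to the command from the question, that would look roughly like this (same remote and paths as above):

rclone sync --progress /var/vmail foo-backups:foo-backups/vmail

Note that, unlike copy, sync will also delete destination files that no longer exist locally, so keeping the --max-delete 0 flag from the original command may still be a good idea.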
It turned out to be a bug in rclone. https://github.com/rclone/rclone/issues/3834
Hey, I need to copy all files from a local directory to HDFS using Pig.
In the Pig script I am using the copyFromLocal command with a wildcard in the source path,
i.e. copyFromLocal /home/hive/Sample/* /user
It says the source path doesn't exist.
When I use copyFromLocal /home/hive/Sample/ /user, it makes another directory in HDFS named 'Sample', which I don't need.
But when I include the file name, i.e. /home/hive/Sample/sample_1.txt, it works.
I don't need a single file. I need to copy all the files in the directory without creating an extra directory in HDFS.
PS: I've also tried *.txt, ?, ?.txt
No wildcards work.
Pig's copyFromLocal/copyToLocal commands work only on a single file or a directory; they will never take a series of files or a wildcard. Moreover, Pig concentrates on processing data from/to HDFS. To my knowledge you can't even loop over the files in a directory with ls, because it lists files in HDFS. So for this scenario I would suggest you write a shell script/action (i.e. an fs command) to copy the files from local to HDFS.
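For example, the plain Hadoop shell handles the wildcard fine; a sketch using the paths from the question:

hadoop fs -put /home/hive/Sample/* /user

The local shell expands the glob, and hadoop fs -put accepts multiple source files, so every file in the directory lands directly under /user without an extra Sample directory being created.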
check this link below for info:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#copyFromLocal
I'd like to sync a single file from my filesystem to s3.
Is this possible, or can only directories be synced?
Use the include/exclude options with the sync command on a directory:
e.g. To sync just /var/local/path/filename.xyz to S3 use:
s3 sync /var/local/path s3://bucket/path --exclude='*' --include='*/filename.xyz'
cp can be used to copy a single file to S3. If the filename already exists in the destination, this will replace it:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js
Keep in mind that per the docs, sync will only make updates to the target if there have been file changes to the source file since the last run: s3 sync updates any files that have a size or modified time that are different from files with the same name at the destination. However, cp will always make updates to the target regardless of whether the source file has been modified.
Reference: AWS CLI Command Reference: cp
Just to comment on pythonjsgeo's answer: that seems to be the right solution, but make sure to execute the command without the = symbol after the include and exclude flags. I was including the = symbol and getting weird behavior with the sync command.
s3 sync /var/local/path s3://bucket/path --exclude '*' --include '*/filename.xyz'
You can mount an S3 bucket as a local folder (using RioFS, for example) and then use your favorite tool to synchronize files or directories.