CSV/Pickle Files Too Large to Commit to GitHub Repo - pandas

I'm working on committing a project I have been working on for a while that we have not yet uploaded to GitHub. Most of it is Python pandas, where we do all our ETL work and save to CSV and pickle files that we then use for creating dashboards and running metrics on our data.
We are running into some issues with version control without using GitHub, so we want to get on top of that. I don't need version control on our CSV or pickle files, but I can't change the file paths or everything will break. When I try to make the initial commit to the repo, it won't let me because our pickle and CSV files are too big. Is there a way for me to commit the project without uploading the whole CSV/pickle files (the largest is ~10 GB)?
I have this in my .gitignore file, but it's still not letting me get around it. Thanks for any and all help!
*.csv
*.pickle
*.pyc
*.json
*.txt
__pycache__/MyScripts.cpython-38.pyc
.Git
.vscode/settings.json
*.pm
*.e2x
*.vim
*.dict
*.pl
*.xlsx
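One thing worth checking (a guess, since the exact error isn't shown): .gitignore rules only apply to untracked files, so if the CSV/pickle files were already added or committed, they stay in the index regardless of the ignore rules. A minimal sketch for untracking them while keeping them on disk (the patterns are illustrative; adjust them to your paths):
git rm -r --cached --ignore-unmatch '*.csv' '*.pickle'
git commit -m "Stop tracking large data files"
If a large file has already made it into an earlier commit, the push can still be rejected until that history is rewritten (e.g. with git filter-repo) or the file is moved to Git LFS.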

Related

Suggestions for backup

I use git to keep track of modifications to certain directories and to avoid file duplication or inconsistencies. For backup, I store these directories on GitHub. Now these directories are over 1 GB, and I am thinking of taking them out of GitHub, but I don't know the best way to back up these files while still keeping track of file duplication or inconsistency.
I thought of creating a git server where I store all my repos. I thought of using complex rsync scripts, or even using borg.
Do you have any suggestions?
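For reference, the "git server" idea mentioned above can be as small as a bare repository reachable over SSH; a rough sketch with placeholder host and paths:
git init --bare /srv/backups/myrepo.git        # on the backup machine
git remote add backup user@backupserver:/srv/backups/myrepo.git   # on the working machine
git push backup --all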

Colab truncates folders with more than 200k files

Is there a maximum number of files allowed per folder when reading Google Drive from Colab? I create a folder from Colab with more than 200k files and run an "ls" command just after creation, and everything is OK, but every time I close the session and open it again (remounting Google Drive) the folder gets truncated: I can't read that quantity anymore, actually not more than 20k, and I need to recreate/unzip the folder again. The folder contains images for training a DL model.
Update: I'm running drive.flush_and_unmount() from the notebook where I created the folder (without closing the session) and it is running smoothly. From another notebook I'm checking the number of files inside the folder (the same folder, but from another notebook), and the count seems to be slowly increasing, so it looks like the solution is to run drive.flush_and_unmount() to force a sync to Google Drive. I'm not yet sure whether the folder will still be complete after closing the session and reopening it; I will let you know! At least it is progress.
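For context, the drive helper mentioned above is the standard Colab API; a minimal sketch of the workflow (the target path is illustrative):
from google.colab import drive
drive.mount('/content/drive')
# ... create the ~200k files under /content/drive/MyDrive/dataset ...
drive.flush_and_unmount()  # force any pending writes to sync back to Google Drive before ending the session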

Prevent rclone from re-copying files to AWS S3 Deep Archive

I'm using rclone to copy some files to an S3 bucket (deep archive). The command I'm using is:
rclone copy --ignore-existing --progress --max-delete 0 "/var/vmail" foo-backups:foo-backups/vmail
This makes rclone copy files that I know for sure already exist in the bucket. I tried removing the --ignore-existing flag (which IMHO is badly named, as it does exactly the opposite of what you'd initially expect), but I still get the same behaviour.
I also tried adding --size-only, but that doesn't fix the "bug" either.
How can I make rclone copy only new files?
You could use rclone sync; see https://rclone.org/commands/rclone_sync/:
Doesn’t transfer unchanged files, testing by size and modification time or MD5SUM. Destination is updated to match source, including deleting files if necessary.
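For example, a sync equivalent of the original command might look like the following (keeping --max-delete 0 from the question so rclone refuses to delete anything on the destination; this is a sketch, not tested against that bucket):
rclone sync --progress --max-delete 0 "/var/vmail" foo-backups:foo-backups/vmail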
It turned out to be a bug in rclone. https://github.com/rclone/rclone/issues/3834

Copying large BigQuery Tables to Google Cloud Storage and subsequent local download

My goal is to save a BigQuery table locally to be able to perform some analyses. To save it locally, I tried to export it to Google Cloud Storage as a CSV file. Alas, the dataset is too big to move as one file, so it is split into many different files, looking like this:
exampledata.csv000000000000
exampledata.csv000000000001
...
Is there a way to put them back together again in Google Cloud Storage? Maybe even change the format to CSV?
My approach was to download them and try to merge them manually. Clicking on a file does not work, as it saves it as a BIN file, and it is also very time consuming. Furthermore, I do not know how to assemble them back together.
I also tried to get them via the gsutil command, and I was able to save them on my machine, but as zipped files. When unzipping with WinRAR, it gives me exampledata.out files, which I do not know what to do with. Additionally, I am clueless about how to put them back together into one file.
How can I get the table to my computer, as one file, and as a csv?
The computer I am working with runs on Ubuntu, but I need to have the data on a Google Virtual Machine running Windows Server 2012.
Try using the following to merge all the files into one from the Windows command prompt:
copy *.cs* merged.csv
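Since the asker's machine runs Ubuntu, a rough equivalent there would be the following (assuming the shard names from the question; note that if the export included headers, each shard repeats the header row, which you would need to strip from all but the first):
cat exampledata.csv* > merged.csv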
I suggest you save the file as a .gzip file; then you can easily download it from Google Cloud as a BIN file. If you export the split files from BigQuery as follows:
Export Table -> csv format, compression as GZIP, URI: file_name*
Then you can combine them back by following the steps below:
In Windows:
Add .zip to the end of all these file names.
Use 7-Zip to unzip the first .zip file, the one with the name "...000000000000"; it will automatically detect all the remaining .zip files. This is just like the normal way of unzipping a split .zip file.
In Ubuntu:
I failed to unzip the files following the methods I could find on the internet. I will update the answer if I figure it out.
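One sketch that should work on Ubuntu, assuming each exported shard is an ordinary gzip stream named as in the question: concatenated gzip streams decompress to the concatenation of their contents, so
cat exampledata.csv* | gunzip > merged.csv
should produce a single merged CSV (with the same repeated-header caveat as above).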

___jb_bak___ and ___jb_old___ files in PyCharm

When I got a PyCharm project from my colleague, I saw some backup files alongside the *.py files.
These files have the extensions *.___jb_old___ and *.___jb_bak___.
I opened the files in Notepad++ and saw that they are identical backup copies of the corresponding *.py files.
I asked my colleague, but he didn't know what they are.
Why are there TWO identical backup files for each *.py file?
How can I configure PyCharm? We want to turn off this backup.
Google gave me nothing :(
You can disable "safe write":
Use "safe write" (save changes to a temporary file first) If this
check box is selected, a changed file will be first saved to a
temporary file; if the save operation is completed successfully, the
original file is deleted, and the temporary file is renamed.
https://www.jetbrains.com/webstorm/help/system-settings.html
I had this problem in WebStorm when a script file was running and I was editing it in WebStorm. When I stopped the script and edited it, everything was fine.
These are temporary files used by PyCharm to make sure your changes will not be lost when editing files. It's safe to delete them manually; you will only lose very recent changes. IntelliJ IDEA works the same way as PyCharm.
How to delete them?
Deleting a file on a file system requires two things: 1) you have the permission, and 2) no program is using it.
So make sure you have the 'w' permission and stop every program that is using the file; then you can remove it.
How do you know which program is using it?
Normally you should already know. But sometimes background programs (like CrashPlan or Google Drive sync) may also hold it quietly, and then finding and killing every such program can be very tricky. The easiest way is to reboot your computer in safe mode, in which only the OS kernel is loaded.
I spent two hours figuring out why I could not delete the temp file even though I had full permissions: a CrashPlan service was holding it in the background. This may not be your issue, but if you cannot delete the temp file, this may save you some time.
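If rebooting into safe mode feels like overkill, on Linux or macOS something like the following lists the processes holding the file open (the path is illustrative):
lsof path/to/your_script.py.___jb_old___
On Windows, the "Find Handle or DLL" search in Sysinternals Process Explorer does roughly the same job.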
While JeremyWeir's solution probably does work, the real fix, IMO, is to enable write permission on the directory.
Saving a file would only need write permission on that file itself. But with "safe write", you need permission to create a file and rename it, which means you need write access to the directory.
In Linux this would be e.g. chmod ug+w DIR, if you want to give write access to user and group.
I had the exact same issue with PhpStorm after a system crash. The fix I found was to manually delete the *.___jb_old___ and *.___jb_bak___ files and reinstall PhpStorm.
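If there are many of them scattered through the project, a sketch for cleaning them up from a Unix-like shell (after making sure the IDE, and anything else that might hold the files, is closed):
find . -type f \( -name '*.___jb_old___' -o -name '*.___jb_bak___' \) -delete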