How to recursively copy all files except empty files including subdirectories in Databricks file system - apache-spark-sql

I have a requirement to recursively move all files except empty files (0-byte files) to a destination folder while preserving hierarchies in a Databricks file system.
Example:
folder_path/
    1/
        2/
            file1.json - 0 byte
            file2.json - 128 kb
    3/
        4/
            file3.json - 0 byte
            file4.json - 20 kb
Output:
folder_path/
    1/
        2/
            file2.json - 128 kb
    3/
        4/
            file4.json - 20 kb
I'm able to implement this using a shell script and AWK, but it seems Databricks does not support AWK.
Is there any way to implement this in either Python or Scala using a Databricks notebook?

The implementation below works for me. If an entry's size is 0 and its name does not end with "json", it is treated as a directory and recursed into; if it ends with "json" and has a non-zero size, it is moved; 0-byte json files match neither branch and are skipped.
src = 'dbfs:/<src_path>'

def createSubDir(entries):
    for entry in entries:
        if entry.size == 0 and not entry.name.endswith('json'):
            # 0-byte entry whose name does not end in "json": treat it as a directory,
            # recreate it under the destination and recurse into it
            dbutils.fs.mkdirs(entry.path.replace("<src_dir>", "<dest_dir>"))
            createSubDir(dbutils.fs.ls(entry.path))
        elif entry.size != 0 and entry.name.endswith('json'):
            # non-empty json file: move it to the mirrored destination path
            dbutils.fs.mv(entry.path, entry.path.replace("<src_dir>", "<dest_dir>"))
        # 0-byte json files match neither branch and are left behind

createSubDir(dbutils.fs.ls(src))
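If the files should be copied rather than moved (as the title asks), swapping dbutils.fs.mv for dbutils.fs.cp with the same arguments should behave the same way while leaving the source files in place.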

Related

create_pretraining_data.py is writing 0 records to tf_examples.tfrecord while training custom BERT model

I am training a custom BERT model on my own corpus. I generated the vocab file using BertWordPieceTokenizer and am then running the code below:
!python create_pretraining_data.py \
  --input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt \
  --output_file=/content/sample_data/tf_examples.tfrecord \
  --vocab_file=/content/sample_data/sifi_13sep-vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
I am getting this output:
INFO:tensorflow:*** Reading from input files ***
INFO:tensorflow:*** Writing to output files ***
INFO:tensorflow: /content/sample_data/tf_examples.tfrecord
INFO:tensorflow:Wrote 0 total instances
I am not sure why I always get 0 instances in tf_examples.tfrecord. What am I doing wrong?
I am using TF version 1.12.
FYI, the generated vocab file is 290 KB.
It cannot read the input file. Escape the space in the path and use My\ Drive instead of My Drive:
--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt
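Quoting the whole path, i.e. --input_file="/content/drive/My Drive/internet_archive_scifi_v3.txt", should also work.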

How to read a no header csv with variable length csv using pandas

I have a csv file which has no header row and variable-length records on each line.
Each record can have up to 398 fields, and I want to keep only 256 fields in my dataframe, as those are the only fields I need to process.
Below is a slimmed-down version of the file.
1,2,3,4,5,6
12,34,45,65
34,34,24
In the above I would like to keep only 3 fields (analogous to the 256 above) from each line while calling read_csv.
I tried the following:
import pandas as pd
df = pd.read_csv('sample.csv',header=None)
I get the following error, because pandas uses the first line to infer the number of columns:
File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 10
The only solution I can think of is passing
names = ['column1','column2','column3','column4','column5','column6']
while creating the data frame.
But for the real files, which can be up to 50 MB, I don't want to do that, as it takes a lot of memory, and I am trying to run this on AWS Lambda, which will incur more cost. I have to process a large number of files daily.
My question is: can I create a dataframe with just the slimmer 256 fields while reading the csv? Can that be my step one?
I am very new to pandas, so kindly bear with my ignorance. I looked for a solution for a long time but couldn't find one.
# only 3 columns
df = pd.read_csv('sample.csv', header=None, usecols=range(3))
print(df)
# 0 1 2
# 0 1 2 3
# 1 12 34 45
# 2 34 34 24
So just change the range value (e.g. range(256) for the real files).
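If the real files still trip the parser on rows that are longer than the first line, one common workaround is to pre-declare enough column names and then keep only the first 256. A sketch, assuming at most 398 fields per record as stated in the question:
import pandas as pd

# declare 398 column positions so longer rows tokenize cleanly,
# then keep only the first 256 fields of each record
df = pd.read_csv('sample.csv', header=None,
                 names=list(range(398)), usecols=range(256))
Shorter rows are padded with NaN, so the shape stays consistent across files.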

How can I read and manipulate large csv files in Google Colaboratory while not using all the RAM?

I am trying to import and manipulate compressed .csv files (that are each about 500MB in compressed form) in Google Colaboratory. There are 7 files. Using pandas.read_csv(), I "use all the available RAM" just after 2 files are imported and I have to restart my runtime.
I have searched forever on here looking for answers and have tried all the ones I came across, but none work. I have the files in my Google Drive, which is mounted.
How can I read all of the files and manipulate them without using all the RAM? I have 12.72 GB of RAM and 358.27 GB of disk.
Buying more RAM isn't an option.
To solve my problem, I created 7 cells (one for each data file). Within each cell I read the file, manipulated it, saved what I needed, then deleted everything:
import pandas as pd
import gc
df = pd.read_csv('Google drive path', compression = 'gzip')
filtered_df = df.query('my query condition here')
filtered_df.to_csv('new Google drive path', compression = 'gzip')
del df
del filtered_df
gc.collect()
After all 7 files, each about 500 MB, for a total row-by-column size of 7,000,000 by 100, my RAM usage stayed under 1 GB.
Just using del didn't free up enough RAM; I had to call gc.collect() afterwards in each cell.
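If even a single file is too big to load at once, another option is to read it in chunks and filter each chunk before concatenating. A sketch reusing the same placeholder path and query condition as above:
import pandas as pd

filtered_chunks = []
# read the compressed csv in pieces of 1,000,000 rows instead of all at once
for chunk in pd.read_csv('Google drive path', compression='gzip', chunksize=1000000):
    filtered_chunks.append(chunk.query('my query condition here'))

filtered_df = pd.concat(filtered_chunks, ignore_index=True)
filtered_df.to_csv('new Google drive path', compression='gzip')
This way only one chunk plus the filtered rows are ever held in memory at the same time.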

How to quickly read csv files and append them into single pandas data frame

I am trying to read 4684 csv files from a folder; each file consists of 2000 rows and 102 columns and is about 418 kB. I am reading and appending them one by one using the code below.
for file in allFiles:
    df = pd.read_csv(file, index_col=None, header=None)
    df2 = df2.append(df)
This takes 4 to 5 hours to read all 4684 files and append them into one dataframe. Is there any way to make this process faster? I am using an i7 with 32 GB of RAM.
Thanks
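Appending inside the loop is what makes this slow, because each append copies the whole accumulated frame. A common alternative, sketched here under the assumption that allFiles is the same list of paths as above, is to read everything into a list and concatenate once:
import pandas as pd

# read every file into its own frame, then combine them in a single pass
frames = [pd.read_csv(file, index_col=None, header=None) for file in allFiles]
df2 = pd.concat(frames, ignore_index=True)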

Save large numeric output to file natively in Julia 1.0.0

I am trying to run a program on an HPC cluster. Unfortunately, I am unable to install external packages (e.g., JLD2) on the cluster. This is a temporary problem and should get fixed.
I don't want to wait around all that time, and I am wondering if there is any way to save large output (2-3 GB) in Julia without external dependencies. Most of the output is matrices of numbers. I was previously using JLD2, which stores data in HDF5 format.
Bonus question: Is there a workaround using shell commands, like piping the output and using awk/grep to save the data (something like julia -p 12 main.jl | echo "file")?
You could try the Serialization standard library.
To work with multiple variables, you can just store them sequentially:
x = rand(10)
y = "foo"

using Serialization

# write to file
open("data.out", "w") do f
    serialize(f, x)
    serialize(f, y)
end

# load from file
open("data.out") do f
    global x2, y2
    x2 = deserialize(f)
    y2 = deserialize(f)
end
or you could put them in a Dict, and just store that.
You could also write it out as raw binary. Something along the lines of:
julia> x = rand(2,2);
julia> write("test.out", x)
julia> y = reshape(reinterpret(Float64, read("test.out")), 2,2)
julia> x == y
true
If it is just HDF5 that is missing, you could use, for example, NPZ.jl.