Ruby Compare Very Large Files - amazon-s3

I have two CSV files that are lists of S3 bucket objects:
The first CSV file represents the objects in the source S3 bucket.
The second CSV file represents the objects in the destination S3 bucket.
I need to know which files to copy from the source S3 bucket to the destination bucket by finding the objects that aren't already in the destination bucket. Each CSV line contains the path, size, and modified date. If any one of these differs, I need the source object copied to the destination bucket.
Here's the first example CSV file:
folder1/sample/test1,55,2019-07-19 19:36:56 UTC
folder2/sample/test5,55,2019-07-19 19:34:31 UTC
folder3/sample/test9,55,2019-07-19 19:32:12 UTC
Here's the second example CSV file:
folder1/sample/test1,55,2019-07-16 19:32:58 UTC
folder2/sample/test5,55,2019-07-14 19:34:31 UTC
folder3/sample/test9,55,2019-07-19 19:32:12 UTC
In this example the first and second lines would be returned.
The following code works on these three-line examples but fails on randomly generated files of 1,000+ lines:
f1 = File.open('file1.csv', 'r')
f2 = File.open('file2.csv', 'r')
f1.each.zip(f2.each).each do |line1, line2|
  if line1 != line2
    puts line1
  end
end
How can I accurately compare all lines - preferably with the least amount of CPU/Memory overhead?

You could load the destination list into an array (or a Set, for faster lookups) in memory, then step through the source list one line at a time. If a source line is not in that collection, the file needs to be copied.
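A minimal Ruby sketch of this in-memory approach, using the example data from the question (with real files you would stream them in with File.foreach, e.g. lines_to_copy(File.foreach('file1.csv'), File.foreach('file2.csv'))):

```ruby
require 'set'

# Returns the source lines that do not appear verbatim in the destination
# listing. Loading the destination into a Set gives O(1) membership checks,
# so memory use is proportional to the destination list only.
def lines_to_copy(source_lines, dest_lines)
  dest = Set.new(dest_lines.map(&:chomp))
  source_lines.map(&:chomp).reject { |line| dest.include?(line) }
end

source = [
  "folder1/sample/test1,55,2019-07-19 19:36:56 UTC",
  "folder2/sample/test5,55,2019-07-19 19:34:31 UTC",
  "folder3/sample/test9,55,2019-07-19 19:32:12 UTC"
]
dest = [
  "folder1/sample/test1,55,2019-07-16 19:32:58 UTC",
  "folder2/sample/test5,55,2019-07-14 19:34:31 UTC",
  "folder3/sample/test9,55,2019-07-19 19:32:12 UTC"
]
puts lines_to_copy(source, dest)  # the first two lines, as in the example
```

Because whole lines are compared, a difference in path, size, or modified date all flag the object for copying, which matches the requirement.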
If even one file is too big to load into memory, and the files are sorted in filename order, then you could step through both files together and compare lines. You'll need to use the filenames to determine whether to skip over lines to stay in sync.
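A sketch of that merge-style pass, assuming both listings are sorted by path. Plain arrays stand in for the files here; with real files you would pass File.foreach('file1.csv').each and File.foreach('file2.csv').each instead:

```ruby
# Walk two path-sorted listings in lockstep, using the path column to
# decide when to advance each side, so an extra or missing line on one
# side cannot desynchronize the comparison.
def lines_to_copy_sorted(source_lines, dest_lines)
  to_copy = []
  src, dst = source_lines.each, dest_lines.each
  s = (src.next rescue nil)   # rescue catches StopIteration at end of input
  d = (dst.next rescue nil)
  while s
    s_path = s.split(',', 2).first
    d_path = d && d.split(',', 2).first
    if d.nil? || s_path < d_path
      to_copy << s                  # path missing from destination
      s = (src.next rescue nil)
    elsif s_path > d_path
      d = (dst.next rescue nil)     # extra destination object: skip it
    else
      to_copy << s if s != d        # same path, but size or mtime differ
      s = (src.next rescue nil)
      d = (dst.next rescue nil)
    end
  end
  to_copy
end

source = [
  "folder1/sample/test1,55,2019-07-19 19:36:56 UTC",
  "folder2/sample/test5,55,2019-07-19 19:34:31 UTC",
  "folder3/sample/test9,55,2019-07-19 19:32:12 UTC"
]
dest = [
  "folder1/sample/test1,55,2019-07-16 19:32:58 UTC",
  "folder2/sample/test5,55,2019-07-14 19:34:31 UTC",
  "folder3/sample/test9,55,2019-07-19 19:32:12 UTC"
]
puts lines_to_copy_sorted(source, dest)  # the first two lines
```

Only one line from each file is held in memory at a time, so this handles listings of any size as long as the sort order holds.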
An alternative option is to use Amazon Athena, joining the data between files to find lines that don't match.

Related

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice (i.e., once with the actual data and once empty). I have checked the dataframe by printing it and it looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition doesn't necessarily mean that it will result in one file, as this can depend on the spark.sql.files.maxRecordsPerFile configuration as well. Assuming this config is set to 0 (the default), you should get only 1 file in the output.

how to read multiple text files into a dataframe in pyspark

I have a few txt files in a directory (I have only the path, not the names of the files) that contain JSON data, and I need to read all of them into a dataframe.
I tried this:
df=sc.wholeTextFiles("path/*")
but I can't even display the data, and my main goal is to perform queries in different ways on the data.
Instead of wholeTextFiles (which gives key-value pairs, with the filename as the key and the file contents as the value),
try read.json and give your directory name; Spark will read all the files in the directory into a dataframe.
df=spark.read.json("<directory_path>/*")
df.show()
From docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.

Avoiding multiple headers in pig output files

We use Pig to load files from directories containing thousands of files, transform them, and then output files that are a consolidation of the input.
We've noticed that the output files contain the header record of every file processed, i.e. the header appears multiple times in each file.
Is there any way to have the header only once per output file?
raw_data = LOAD '$INPUT'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
DO SOME TRANSFORMS
STORE data INTO '$OUTPUT'
USING org.apache.pig.piggybank.storage.CSVExcelStorage('|')
Did you try this option?
SKIP_INPUT_HEADER
See https://github.com/apache/pig/blob/31278ce56a18f821e9c98c800bef5e11e5396a69/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java#L85

Hive Reading external table from compressed bz2 file

This is my scenario:
I have a bz2 file in Amazon S3. Within the bz2 file there are files with .dat, .met, and .sta extensions. I am only interested in the files with the *.dat extension. You can download this sample file to take a look at the bz2 file.
create external table cdr (
anum string,
bnum string,
numOfTimes int
)
row format delimited
fields terminated by ','
lines terminated by '\n'
location 's3://mybucket/dir'; -- the bz2 file is inside here
The problem is that when I execute the above command, some of the records/rows have issues:
1) all the data from the *.sta and *.met files is also included;
2) the metadata of the filenames is also included.
The only idea I had was to show INPUT_FILE_NAME. But then all the records/rows had the same INPUT_FILE_NAME, which was the filename.tar.bz2.
Any suggestions are welcome. I am currently completely lost.

incrementally copy files from S3 to local hdfs

I have an app that writes data to S3 daily, hourly, or just randomly, and another app that reads the data from S3 into local HBase. Is there any way to tell which file was uploaded last, then read only the files after that; in other words, to incrementally copy the files?
for example:
Day 1: App1 writes files 1, 2, 3 to folder 1; App2 reads those 3 files into HBase.
Day 4: App1 writes files 4 and 5 to folder 1, and 6, 7, 8 to folder 2; App2 needs to read 4 and 5 from folder 1 and then 6, 7, 8 from folder 2.
Thanks
The LastModified header field can be used to process data based on the creation date. This requires built-in logic on the client side that keeps track of which items have already been processed. You can simply store the date up to which you have processed, so everything that comes after it is considered new.
Example:
s3cmd ls s3://test
2012-07-24 18:29 36303234 s3://test/dl.pdf
Note the date at the front of each file listing.
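A minimal Ruby sketch of that cutoff logic. Plain hashes stand in for real listing entries here; with the aws-sdk-s3 gem, each object returned by list_objects_v2 exposes .key and .last_modified instead (the keys and timestamps below are invented for illustration):

```ruby
require 'time'

# Keep only the objects modified after the stored cutoff timestamp.
def new_objects(objects, cutoff)
  objects.select { |o| o[:last_modified] > cutoff }
end

# Hypothetical listing entries; a real listing would come from S3.
objects = [
  { key: 'folder1/file1', last_modified: Time.parse('2023-01-01 09:00 UTC') },
  { key: 'folder1/file4', last_modified: Time.parse('2023-01-04 09:00 UTC') },
  { key: 'folder2/file6', last_modified: Time.parse('2023-01-04 10:00 UTC') }
]

cutoff = Time.parse('2023-01-01 12:00 UTC')  # last processed date, persisted by App2
fresh = new_objects(objects, cutoff)
fresh.each { |o| puts o[:key] }

# After copying, persist the newest timestamp as the next cutoff.
next_cutoff = fresh.map { |o| o[:last_modified] }.max
```

Persisting the cutoff (in a file, a database row, or another S3 object) is what makes the copy incremental across runs.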