I compared the logs in LogDNA with the files stored in S3 for the same time period (one hour, from 8 hours before), and only a portion of the lines appear in S3. Why?
I'm new to Databricks and just created a Delta Live Tables pipeline to ingest 60 million JSON files from S3. However, the input rate (the number of files it reads from S3) is stuck at around 8 records/s, which is very low IMO. I have increased the number of workers/cores in my Delta Live Tables pipeline, but the input rate stays the same.
Is there any config that I have to add to increase the input rate for my pipeline?
I want to pull a specified number of days from an S3 bucket that is partitioned by year/month/day/hour. This bucket has new files added every day and will grow rather large. I want to do spark.read.parquet(<path>).filter(<condition>); however, when I ran this it took significantly longer (1.5 hr) than specifying the paths explicitly (0.5 hr). I don't understand why it takes longer. Should I be adding a .partitionBy() when reading from the bucket, or is it because of the volume of data in the bucket that has to be filtered?
The problem you are facing is related to partition discovery. If you point Spark at the path where your Parquet files are, e.g. spark.read.parquet("s3://my_bucket/my_folder"), it will trigger a job in the Spark UI called
Listing leaf files and directories for <number> paths
This is the partition discovery step. Why does it happen? When you give Spark only a path, it has no other way to find out where the partitions are or how many there are, so it has to list everything under that path.
In my case if I run a count like this:
spark.read.parquet("s3://my_bucket/my_folder/").filter('date === "2020-10-10").count()
It will trigger the listing, which takes 19 seconds for around 1,700 folders. Adding the 7 seconds for the count itself, the total is 26 seconds.
To remove this overhead, you should use a metastore. AWS provides a good solution with AWS Glue, which can be used just like the Hive Metastore in a Hadoop environment.
With Glue you can store the table metadata and all its partitions. Instead of giving the Parquet path, you point to the table like this:
spark.table("my_db.my_table").filter('date === "2020-10-10").count()
For the same data with the same filter, the file-listing step is gone and the whole count took only 9 seconds.
In your case, where you partition by year, month, day and hour, we are talking about 8,760 folders per year.
I would recommend you take a look at this link and this link
They show how you can use Glue as your Hive Metastore, which will help a lot with the speed of partition queries.
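To make that concrete, here is a minimal PySpark sketch of the setup. The database/table names, column schema and bucket path are placeholders, and it assumes the cluster is already configured to use the Glue Data Catalog as its Hive metastore:

    # Register the partitioned S3 location as an external table in the
    # Glue/Hive metastore, then query it so partition pruning happens
    # against the catalog instead of listing S3.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("glue-partition-pruning")
             .enableHiveSupport()   # use the configured (Glue) metastore
             .getOrCreate())

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_table (id STRING, value DOUBLE)
        PARTITIONED BY (year INT, month INT, day INT, hour INT)
        STORED AS PARQUET
        LOCATION 's3://my_bucket/my_folder/'
    """)

    # Load the existing year=/month=/day=/hour= folders into the catalog once;
    # new partitions can then be added with ALTER TABLE ... ADD PARTITION.
    spark.sql("MSCK REPAIR TABLE my_db.my_table")

    # The filter is resolved against the metastore, so there is no
    # "Listing leaf files and directories" job over the whole bucket.
    count = (spark.table("my_db.my_table")
             .filter("year = 2020 AND month = 10 AND day = 10")
             .count())
    print(count)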
Background: I have 30 days of data in 30 separate compressed files stored in Google Cloud Storage. I have to write them to 30 different partitions of the same BigQuery table. Each compressed file is around 750MB.
I did 2 experiments on the same data set on Google Dataflow today.
Experiment 1: I read each day's compressed file using TextIO, applied a simple ParDo transform to prepare TableRow objects, and wrote them directly to BigQuery using BigQueryIO. So basically 30 pairs of parallel, unconnected sources and sinks got created. But I found that at any point in time, only 3 files were being read, transformed and written to BigQuery. The ParDo transformation and BigQuery writing speed was around 6,000-8,000 elements/sec at any point in time.
So only 3 of the 30 source/sink pairs were being processed at any time, which significantly slowed the process. In over 90 minutes, only 7 out of 30 files were written to separate BigQuery partitions of the table.
Experiment 2: Here I first read each day's data from the same compressed files for all 30 days, applied the ParDo transformation to the 30 PCollections, and stored the 30 resulting PCollections in a PCollectionList object. All 30 TextIO sources were set up to be read in parallel.
I then wrote each PCollection in the PCollectionList (corresponding to each day's data) to BigQuery using BigQueryIO directly. So again, 30 sinks were being written to in parallel.
I found that, out of the 30 parallel sources, again only 3 were being read and transformed via ParDo, at a speed of around 20,000 elements/sec. At the time of writing this question, 1 hour had already elapsed, reading of the compressed files had not even reached 50% of the files, and writing to the BigQuery table partitions had not even started.
These problems seem to occur only when Google Dataflow reads compressed files. I had asked a question about its slow reading from compressed files (Relatively poor performance when reading compressed files vis a vis normal text files kept in google storage using google dataflow) and was told that parallelizing the work would make reading faster, as only one worker reads a compressed file and multiple sources would mean multiple workers getting a chance to read multiple files. But this does not seem to be working either.
Is there any way to speed up this whole process of reading from multiple compressed files and writing to separate partitions of the same BigQuery table in a Dataflow job at the same time?
Each compressed file will be read by a single worker. The initial number of workers for a job can be increased with the numWorkers pipeline option, and the maximum number that can be scaled up to can be set with the maxNumWorkers pipeline option.
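For what it's worth, here is a minimal sketch of how those options are passed. It uses the Apache Beam Python SDK, where the equivalent options are spelled num_workers and max_num_workers (the Java SDK flags are --numWorkers and --maxNumWorkers); the project, bucket and file pattern are placeholders, and the pipeline body is just a stand-in for your ParDo/BigQuery steps:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="PROJECT",
        region="us-central1",
        temp_location="gs://BUCKET/tmp",
        num_workers=10,       # initial worker count (Java SDK: --numWorkers)
        max_num_workers=30,   # autoscaling ceiling  (Java SDK: --maxNumWorkers)
    )

    with beam.Pipeline(options=options) as p:
        # Each compressed file is still read by one worker, but with more
        # workers the 30 daily files can be processed concurrently.
        (p
         | "Read" >> beam.io.ReadFromText("gs://BUCKET/daily/*.gz")
         | "Count" >> beam.combiners.Count.Globally()
         | "Print" >> beam.Map(print))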
I have to come up with a proposal to use Amazon S3 with CloudFront as a CDN.
One of the important things is to do a cost estimate. I read over the AWS website and forums and used their calculator, but couldn't arrive at an approximate final number I'd be confident of. Honestly, I got confused between terms like "Data Transfer Out" and "GET and Other Requests", and whether I need to fill in the details both for Amazon S3 and for Amazon CloudFront and then sum the totals.
So I need help here to estimate my monthly bill.
I will be using S3 to store files (mostly images)
I will be configuring CloudFront with my S3 bucket to deliver the content.
Most of the client base (almost 95%) is in the US.
Average file size: 500KB
Average number of files stored on S3 monthly: 80000 (80K)
Approximate number of users requesting files monthly, or approximate number of total requests to fetch files from CloudFront: 30 million monthly
There will be some invalidation requests per month (let's say 1,000)
It would be great to get a better understanding of how my monthly bill will be calculated and approximately what it will be.
Also, with the above data and estimates, roughly what would the monthly bill be if I used Akamai or Rackspace?
I'll throw another number into the ring.
Using http://calculator.s3.amazonaws.com/calc5.html
CloudFront:
Data transfer out: 0.5MB x 30 million = ~15,000GB
Average object size: 500KB
Invalidation requests: 1,000
Edge locations: 95% US

S3:
Storage: 80K x 0.5MB = ~40GB
Requests: 30 million
My initial result is $1,413. As #user2240751 noted, a factor of safety of 2 isn't unreasonable, so that's in the $1,500 - $3,000/month range.
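For transparency, here is roughly how those line items combine. This is only a back-of-envelope sketch in Python: the unit prices are placeholder assumptions, not current AWS rates, so substitute the numbers from the pricing pages or calculator before relying on the total:

    # Rough monthly estimate from the inputs above. All *_PRICE constants are
    # assumed placeholders; check the current S3/CloudFront pricing pages.
    requests_per_month = 30_000_000
    avg_object_mb = 0.5
    files_stored = 80_000

    transfer_out_gb = requests_per_month * avg_object_mb / 1024   # ~14,648 GB
    storage_gb = files_stored * avg_object_mb / 1024              # ~39 GB

    TRANSFER_PRICE_PER_GB = 0.085        # assumed CloudFront US rate, $/GB
    REQUEST_PRICE_PER_10K = 0.0075       # assumed CloudFront rate, $/10k requests
    STORAGE_PRICE_PER_GB_MONTH = 0.023   # assumed S3 Standard rate, $/GB-month

    cloudfront = (transfer_out_gb * TRANSFER_PRICE_PER_GB
                  + requests_per_month / 10_000 * REQUEST_PRICE_PER_10K)
    s3 = storage_gb * STORAGE_PRICE_PER_GB_MONTH
    print(f"CloudFront ~${cloudfront:,.0f}/month, S3 storage ~${s3:,.2f}/month")

In this sketch, most of the bill is the data transfer out; storage and invalidations are comparatively tiny at these volumes.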
I'm used to working with smaller numbers, but the final amount is always more than you might expect because of extra requests and data transfer.
Corrections or suggestions for improvements welcome!
Good luck
The S3 PUT and GET request fields (in your case) should be restricted to the number of times you are likely to upload or update the files in S3 from your application itself.
To calculate the CloudFront service costs, you should work out the rough outbound bandwidth of your page loads (the number of objects served from CloudFront per page, then double it to give yourself some headroom), and fill in the rest of the fields.
Rough calc:
500GB data out (guess)
500KB average object size
1,000 invalidation requests
95% to US-based edge locations
5% to Europe-based edge locations
Comes in at $60.80 + your S3 costs.
I think the maths here is wrong: 0.5MB * 30,000,000 is 14,503GB, NOT 1,500GB. That's a factor of 10 out, unless I'm missing something.
Which means your monthly costs are going to be around $2,000, not $200.
I'm not quite sure if this is the correct Stack Exchange site for this question, but I've found no site that fits better.
I'm planning to use S3 for my next project, but I'm not sure how storage is actually billed. It wouldn't be a problem if I used S3 just for dumping gigabytes of data in and almost never deleting anything, but that's not the case.
What if I store a 1 megabyte file in S3, delete it after 1 hour, and put another 1 megabyte file onto S3? Will I be billed for 1 megabyte of storage for that month, or 2 megabytes?
Amazon states:
First 1 TB / month of Storage Used
I don't think they just look at what is stored in my S3 account at the end of the month and bill that. The other way around, billing every store request as "storage used", wouldn't work either, because a stored file might be kept for a long time, across multiple billing months.
I hope someone has the answer to that; I couldn't find anything :-)
Storage is billed as an average of all data stored per month. From the Amazon docs:
The volume of storage billed in a month is based on the average storage used throughout the month. This includes all object data and metadata stored in buckets that you created under your AWS account. We measure your storage usage in "TimedStorage-ByteHrs," which are added up at the end of the month to generate your monthly charges.

Storage Example: Assume you store 100GB (107,374,182,400 bytes) of standard Amazon S3 storage data in your bucket for 15 days in March, and 100TB (109,951,162,777,600 bytes) of standard Amazon S3 storage data for the final 16 days in March.

At the end of March, you would have the following usage in Byte-Hours:

Total Byte-Hour usage
= [107,374,182,400 bytes x 15 days x (24 hours/day)] + [109,951,162,777,600 bytes x 16 days x (24 hours/day)]
= 42,259,901,212,262,400 Byte-Hours.

Let's convert this to GB-Months:

42,259,901,212,262,400 Byte-Hours x (1 GB / 1,073,741,824 bytes) x (1 month / 744 hours) = 52,900 GB-Months
So in your example (assuming the 2nd megabyte is stored for the remainder of the month) you will be charged for roughly 1MB of storage for that month, not 2MB.
Remember though, that there are other charges to consider too, like data transfer in/out and total requests etc.
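To make the byte-hour accounting concrete, here is a small Python sketch that reproduces the quoted AWS example and then applies the same formula to the 1MB scenario from the question:

    # Billed storage volume only (no prices). Constants come from the quoted
    # docs: 1 GB = 1,073,741,824 bytes and a 744-hour (31-day) month.
    BYTES_PER_GB = 1_073_741_824
    HOURS_PER_MONTH = 744

    def gb_months(byte_hours):
        return byte_hours / BYTES_PER_GB / HOURS_PER_MONTH

    # AWS example: 100GB for 15 days, then 100TB for the final 16 days of March.
    example = (107_374_182_400 * 15 * 24) + (109_951_162_777_600 * 16 * 24)
    print(gb_months(example))            # -> 52900.0 GB-months

    # Question's case: 1MB for 1 hour, then another 1MB for the remaining ~743 hours.
    mb = 1_048_576
    print(gb_months(mb * 1 + mb * 743))  # -> ~0.00098 GB-months, i.e. about 1 MB-month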