Pig script new record - apache-pig

I am working on the following mail data in a file (data source: infochimps).
Message-ID: <33025919.1075857594206.JavaMail.evans#thyme>
Date: Wed, 13 Dec 2000 13:09:00 -0800 (PST)
From: john.arnold#enron.com
To: slafontaine#globalp.com
Subject: re:spreads
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: slafontaine#globalp.com # ENRON
X-cc:
X-bcc:
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf
saw a lot of the bulls sell summer against length in front to mitigate
margins/absolute position limits/var. as these guys are taking off the
front, they are also buying back summer. el paso large buyer of next winter
today taking off spreads. certainly a reason why the spreads were so strong
on the way up and such a piece now. really the only one left with any risk
premium built in is h/j now. it was trading equivalent of 180 on access,
down 40+ from this morning. certainly if we are entering a period of bearish
................]
I am loading the above data as:
A = load '/root/test/enron_mail/maildir/*/*/*' using PigStorage(':') as (f1:chararray,f2:chararray);
but for the message body I am getting separate tuples, because the message body spans multiple lines.
How can I consolidate those last lines into one?
I want the part below in a single tuple:
saw a lot of the bulls sell summer against length in front to mitigate
margins/absolute position limits/var. as these guys are taking off the
front, they are also buying back summer. el paso large buyer of next winter
today taking off spreads. certainly a reason why the spreads were so strong
on the way up and such a piece now. really the only one left with any risk
premium built in is h/j now. it was trading equivalent of 180 on access,
down 40+ from this morning. certainly if we are entering a period of bearish
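Pig reads these files line by line, so one workaround (not from the post, just a sketch in Python) is to flatten each mail file before loading it: keep the headers one per line and join the body into a single "Body: ..." line, so the existing PigStorage(':') load still yields two fields. It assumes the body starts right after the X-FileName: header and that the files are Latin-1 encoded; adjust as needed.
import glob
import os

# Flatten each mail file in place: keep the header lines as they are, but join
# everything after the X-FileName: header (the body) into one "Body: ..." line,
# so the PigStorage(':') load above still produces two fields per line.
# The header marker and Latin-1 encoding are assumptions based on the sample.
for path in glob.glob('/root/test/enron_mail/maildir/*/*/*'):
    if not os.path.isfile(path):
        continue
    with open(path, encoding='latin-1') as f:
        lines = f.read().splitlines()
    try:
        body_start = next(i for i, line in enumerate(lines)
                          if line.startswith('X-FileName:')) + 1
    except StopIteration:
        continue  # no recognizable header block; leave this file alone
    flattened = lines[:body_start] + ['Body: ' + ' '.join(lines[body_start:])]
    with open(path, 'w', encoding='latin-1') as f:
        f.write('\n'.join(flattened) + '\n')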


How can I configure CloudFront so it costs me a bit less?

I have a very static site, basically HTML and some JavaScript on S3. I serve this through CloudFront. My usage has gone up a bit, plus one of my JavaScript files is pretty large.
So what can I do to cut down the costs of serving those files? They need to have very good uptime, as the site has thousands of active users all over the world.
This is the usage for yesterday:
Looking at other questions about this, it seems like changing headers can help, but I thought I already had caching enabled. This is what curl returns when I get one of those files:
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200
< content-type: text/html
< content-length: 2246
< date: Fri, 03 Apr 2020 20:28:47 GMT
< last-modified: Fri, 03 Apr 2020 15:21:11 GMT
< x-amz-version-id: some string
< etag: "83df2032241b5be7b4c337f0857095fc"
< server: AmazonS3
< x-cache: Miss from cloudfront
< via: 1.1 somestring.cloudfront.net (CloudFront)
< x-amz-cf-pop: some string
< x-amz-cf-id: some string
This is what the cache is configured as on CloudFront:
This is what S3 says when I use curl to query the file:
< HTTP/1.1 200 OK
< x-amz-id-2: some string
< x-amz-request-id: some string
< Date: Fri, 03 Apr 2020 20:27:22 GMT
< x-amz-replication-status: COMPLETED
< Last-Modified: Fri, 03 Apr 2020 15:21:11 GMT
< ETag: "83df2032241b5be7b4c337f0857095fc"
< x-amz-version-id: some string
< Accept-Ranges: bytes
< Content-Type: text/html
< Content-Length: 2246
< Server: AmazonS3
So what can I do? I don't often update the files, and when I do, I don't mind if it takes a day or two for the change to propagate.
Thanks.
If your goal is to reduce CloudFront costs, then it's worth reviewing how it is charged:
Regional Data Transfer Out to Internet (per GB): From $0.085 to $0.170 (depending upon location of your users)
Regional Data Transfer Out to Origin (per GB): From $0.020 to $0.160 (data going back to your application)
Request Pricing for All HTTP Methods (per 10,000): From $0.0075 to $0.0090
Compare that to Amazon S3:
GET Requests: $0.0004 per 1000
Data Transfer: $0.09 per GB (Also applies for traffic coming from Amazon EC2 instances)
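As a rough back-of-the-envelope comparison using the per-unit prices quoted above (the traffic volume is a made-up example, not from the question):
# Rough monthly cost comparison for an assumed 100 GB and 1,000,000 requests,
# using the prices listed above (lowest CloudFront data transfer tier).
gb_out = 100
requests_count = 1_000_000

cloudfront = gb_out * 0.085 + (requests_count / 10_000) * 0.0075
s3_direct = gb_out * 0.09 + (requests_count / 1_000) * 0.0004

print(f"CloudFront: ${cloudfront:.2f}")  # ~ $9.25
print(f"S3 direct:  ${s3_direct:.2f}")   # ~ $9.40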
Therefore, some options for you to save money are:
Choose a lower Price Class that restricts which regions send traffic "out". For example, Price Class 100 only sends traffic from USA and Europe, which has lower Data Transfer costs. This will reduce Data Transfer costs for other locations, but will give them a lower quality of service (higher latency).
Stop using CloudFront and serve content directly from S3 and EC2. This will save a bit on requests (about half the price), but Data Transfer would be a similar cost to Price Class 100.
Increase the caching duration for your objects. However, the report is showing 99.9%+ hit rates, so this won't help much.
Configure the objects to persist longer in users' browsers so fewer requests are made. However, this only works for "repeat traffic" and might not help much; it depends on app usage. (I'm not familiar with this part. It might not work in conjunction with CloudFront. Hopefully other readers can comment.) A rough sketch of doing this with S3 object metadata follows this list.
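For example, a minimal boto3 sketch (bucket and key names are placeholders) that rewrites an object's metadata so it carries a long max-age, which both CloudFront and browsers will honor:
import boto3

s3 = boto3.client("s3")

# Copy the object onto itself, replacing its metadata so it carries a long
# Cache-Control max-age. Bucket and key names here are placeholders.
s3.copy_object(
    Bucket="my-static-site-bucket",
    Key="js/big-app-file.js",
    CopySource={"Bucket": "my-static-site-bucket", "Key": "js/big-app-file.js"},
    MetadataDirective="REPLACE",
    ContentType="application/javascript",
    CacheControl="public, max-age=2592000",  # 30 days
)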
Typically, most costs are related to the volume of traffic. If your app is popular, those Data Transfer costs will go up.
Take a look at your bills and try to determine which component is leading to most of the costs. Then, it's a trade-off between service to your customers and costs to you. Changing the Price Class might be the best option for now.

Google Finance: How big is a normal delay for historical stock data or is something broken?

I tried to download historical data from Google with this code:
import pandas_datareader.data as wb
import datetime
web_df = wb.DataReader("ETR:DAI", 'google',
datetime.date(2017,9,1),
datetime.date(2017,9,7))
print(web_df)
and got this:
             Open   High    Low  Close   Volume
Date
2017-09-01  61.38  62.16  61.22  61.80  3042884
2017-09-04  61.40  62.01  61.31  61.84  1802854
2017-09-05  62.01  62.92  61.77  62.42  3113816
My question: Is this a normal delay or is something broken?
I would also like to know: have you noticed that Google has removed the historical data pages at Google Finance? Is this a hint that they will remove, or already have removed, the download option for historical stock data too?
Google Finance via pandas has stopped working since last night; I am still trying to figure out why. I have also noticed that the links to the historical data on their website have been removed.
It depends on which stocks and which market.
For example, with the Indonesian market it is still able to get the latest data. Of course, it may soon follow the fate of the other markets that stopped updating on 5 September 2017. A very sad thing.
web_df = wb.DataReader("IDX:AALI", 'google',
datetime.date(2017,9,1),
datetime.date(2017,9,7))
               Open     High      Low    Close  Volume
Date
2017-09-04  14750.0  14975.0  14675.0  14700.0  475700
2017-09-05  14700.0  14900.0  14650.0  14850.0  307300
2017-09-06  14850.0  14850.0  14700.0  14725.0  219900
2017-09-07  14775.0  14825.0  14725.0  14725.0  153300

Is max age relative to last-modified date or request time?

When a server gives Cache-Control: max-age=4320000, is the freshness considered 4320000 seconds after the time of the request, or after the Last-Modified date?
RFC 2616 section 14.9.3:
When the max-age
cache-control directive is present in a cached response, the response
is stale if its current age is greater than the age value given (in
seconds) at the time of a new request for that resource. The max-age
directive on a response implies that the response is cacheable (i.e.,
"public") unless some other, more restrictive cache directive is also
present.
It is always based on the time of request, not the last modified date. You can confirm this behavior by testing on the major browsers.
tl;dr: the age of a cached object is either the time it has spent stored in any cache or now() - "Date" response header, whichever is bigger.
Full response:
The accepted response is incorrect. The mentioned RFC 2616 states in section 13.2.4 that:
In order to decide whether a response is fresh or stale, we need to compare its freshness lifetime to its age. The age is calculated as described in section 13.2.3.
And in section 13.2.3 it is stated that:
corrected_received_age = max(now - date_value, age_value)
date_value is the response header Date:
HTTP/1.1 requires origin servers to send a Date header, if possible, with every response, giving the time at which the response was generated [...] We use the term "date_value" to denote the value of the Date header.
age_value is for how long the item is stored on any cache:
In essence, the Age value is the sum of the time that the response has been resident in each of the caches along the path from the origin server, plus the amount of time it has been in transit along network paths.
This is why good cache providers will include a header called Age every time they cache an item, to tell any upstream caches for how long they cached the item. If an upstream cache decides to store that item, its age must start with that value.
A practical example: an item is stored in the cache. It was stored 5 days ago, and when this item was fetched, the response headers included:
Date: Sat, 1 Jan 2022 11:05:05 GMT
Cache-Control: max-age={30 days in seconds}
Age: {10 days in seconds}
Assuming now() is Feb 3 2022, the age of the item is calculated like this (rounding up a bit for clarity):
age_value=10 days + 5 days (age when received + age on this cache)
now - date_value = Feb 3 2022 - 1 Jan 2022 = 34 days
The corrected age is the biggest value, that is 34 days. That means that the item is expired and can't be used, since max-age is 30 days.
The RFC presents a tiny additional correction that compensates for the request latency (see section 13.2.3, "corrected_initial_age").
Unfortunately not all cache servers will include the "Age" response header, so it is very important to make sure all responses that use max-age also include the "Date" header, allowing the age to always be calculated.
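To make the calculation concrete, here is a small Python sketch using the numbers from the example above (an illustration of the formula only, not a full RFC-compliant cache):
from datetime import datetime, timedelta, timezone

def corrected_received_age(now, date_value, age_value):
    # RFC 2616 section 13.2.3: take the larger of (now - Date header) and
    # the accumulated Age header value.
    return max(now - date_value, age_value)

date_value = datetime(2022, 1, 1, 11, 5, 5, tzinfo=timezone.utc)  # Date header
age_value = timedelta(days=10 + 5)   # 10 days when received + 5 days in this cache
now = datetime(2022, 2, 3, tzinfo=timezone.utc)
max_age = timedelta(days=30)         # from Cache-Control: max-age

age = corrected_received_age(now, date_value, age_value)
print(age, "fresh" if age <= max_age else "stale")  # about 33 days -> stale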

RavenDB Document Deleted Before Expiration

I am attempting to write a document to RavenDB with an expiration 20 minutes in the future. I am not using the .NET client, just curl. My request looks like this:
PUT /databases/FRUPublic/docs/test/123 HTTP/1.1
Host: ravendev
Connection: close
Accept-encoding: gzip, deflate
Content-Type: application/json
Raven-Entity-Name: tests
Raven-Expiration-Date: 2012-07-31T22:23:00
Content-Length: 14
{"data":"foo"}
In the studio I see my document saved with Raven-Expiration-Date set exactly 20 minutes from Last-Modified; however, within 5 minutes the document is deleted.
I see this same behavior (deleted in 5 minutes) if I increase the expiration date. If I set an expiration date in the past, the document is deleted immediately.
I am using build 960. Any ideas about what I'm doing wrong?
I specified the time to a ten-millionth of a second and now documents are being deleted just as I would expect.
For example:
Raven-Expiration-Date: 2012-07-31T22:23:00.0000000
The date has to be in UTC, and it looks like you are sending local time.
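For reference, a small Python sketch of building and sending such a header, using the database and document id from the question; the format (UTC, seven fractional digits) is simply the one that worked above:
from datetime import datetime, timezone, timedelta
import requests

# Build an expiration 20 minutes from now, in UTC, with seven fractional
# digits as in the working example above. %f gives six digits, so a trailing
# zero is appended.
expires = datetime.now(timezone.utc) + timedelta(minutes=20)
expiration_header = expires.strftime("%Y-%m-%dT%H:%M:%S.%f") + "0"

requests.put(
    "http://ravendev/databases/FRUPublic/docs/test/123",
    json={"data": "foo"},
    headers={
        "Raven-Entity-Name": "tests",
        "Raven-Expiration-Date": expiration_header,
    },
)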

Omniture Data Warehouse API Not Allowing 'hour' Value for Date_Granularity

When using the Omniture Data Warehouse API Explorer ( https://developer.omniture.com/en_US/get-started/api-explorer#DataWarehouse.Request ), the following request returns a 'Date_Granularity is invalid' response. Does anyone have experience with this? The API documentation ( https://developer.omniture.com/en_US/documentation/data-warehouse/pdf ) states that the following values are acceptable: "none, hour, day, week, month, quarter, year."
{
  "Breakdown_List":[
    "evar14",
    "ip",
    "evar64",
    "evar65",
    "prop63",
    "evar6",
    "evar16"
  ],
  "Contact_Name":"[hidden]",
  "Contact_Phone":"[hidden]",
  "Date_From":"12/01/11",
  "Date_To":"12/14/11",
  "Date_Type":"range",
  "Email_Subject":"[hidden]",
  "Email_To":"[hidden]",
  "FTP_Dir":"/",
  "FTP_Host":"[hidden]",
  "FTP_Password":"[hidden]",
  "FTP_Port":"21",
  "FTP_UserName":"[hidden]",
  "File_Name":"test-report",
  "Metric_List":[ ],
  "Report_Name":"test-report",
  "rsid":"[hidden]",
  "Date_Granularity":"hour",
}
Response:
{
  "errors":[
    "Date_Granularity is invalid."
  ]
}
Old question, just noticing it now.
Data Warehouse did not support the Hour granularity correctly until Jan 2013 (the error you saw was a symptom of this). It was then corrected for date ranges of less than 14 days. In the July 2013 maintenance release of v15 the 14-day limit should be gone, but I have not verified that myself.
As always, the more data you request, the longer the DW processing will take. So I recommend keeping ranges to a maximum of a month and uncompressed file sizes to under 1 GB, though I hear 2 GB should now be supported.
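If you need hourly data over a longer span, one workaround is to split the range into windows of at most 14 days and submit one Data Warehouse request per window. A small Python sketch (the MM/DD/YY format follows the request in the question; the 14-day figure is the limit described above):
from datetime import date, timedelta

def request_windows(start, end, max_days=14):
    # Yield (Date_From, Date_To) pairs covering start..end in chunks of at
    # most max_days days, formatted as MM/DD/YY.
    cur = start
    while cur <= end:
        stop = min(cur + timedelta(days=max_days - 1), end)
        yield cur.strftime('%m/%d/%y'), stop.strftime('%m/%d/%y')
        cur = stop + timedelta(days=1)

for date_from, date_to in request_windows(date(2011, 12, 1), date(2011, 12, 31)):
    print(date_from, date_to)  # one "hour" granularity request per window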
If you still have issues please let us know.
Thanks C.