Extract Day from Timestamp in PIG - apache-pig

I am trying to analyze logs using Pig, and I need to extract the day from the timestamp. Below is a sample log line:
122.172.200.100 - - [17/Oct/2014:00:04:36 -0400] "GET /tag/hbase-sink/ HTTP/1.1" 200 15997 "https://www.google.co.in/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36"
I have loaded the log files using below command
logs = LOAD 'sample_log' USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader() AS (addr: chararray, logname: chararray, user: chararray, time: chararray,method: chararray, uri: chararray, proto: chararray,status: int, bytes: int, referer: chararray, userAgent: chararray);
Then I extracted the date from the time field using DateExtractor, as below:
foreach_logs = FOREACH logs GENERATE org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor(time);
Now I have to extract the day from the date. I have tried using GetDay but it is not working. Does anyone know how to extract the day from the date?

First ensure the field is loaded correctly into the variable time. Then convert the time to a datetime object using ToDate() with a pattern that matches the log's timestamp format, and call GetDay() on the result. Assuming you have loaded the data correctly:
foreach_logs = FOREACH logs GENERATE GetDay(ToDate(time,'dd/MMM/yyyy:HH:mm:ss Z'));
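Note that GetDay() returns the day of the month (1 to 31). If by "Day" you instead mean the day of the week, one option is to format the datetime with the built-in ToString() and a Joda-Time pattern; a minimal sketch, assuming the same time field as above:
-- 'EEE' yields the abbreviated weekday name, e.g. 'Fri' for 17/Oct/2014
days = FOREACH logs GENERATE ToString(ToDate(time, 'dd/MMM/yyyy:HH:mm:ss Z'), 'EEE');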

Related

Cloudwatch Logs Insights working with multiple #messages

I have the following query with the following output:
Query:
filter #message like /A:|B:/
Output:
[INFO] 2020-07-28T09:20:48.406Z requestid A: [{'Delivery': OK, 'Entry': 12323 }]
[INFO] 2020-07-28T09:20:48.407Z requestid B: {'MyValue':0}
I would like to print ONLY the A message when in the B message 'MyValue' = 0. For the above example, I would have to have the following output
Output:
[INFO] 2020-07-28T09:20:48.406Z requestid A: [{'Delivery': OK, 'Entry': 12323 }]
For the next example
[INFO] 2020-07-28T09:20:48.406Z requestid A: [{'Delivery': OK, 'Entry': 12323 }]
[INFO] 2020-07-28T09:20:48.407Z requestid B: {'MyValue':12}
The output should be empty
I can't do something like this because I miss the A message:
filter #message like /A:|B:/
filter MyValue = 0
Any ideas?
If anyone is still interested, there IS a way to get the first and last message when grouping by a field. So if you can fit your data into pairs of messages, it might help.
For example, given API Gateway access log (each row is a #message):
2021-09-14T14:09:00.452+03:00 (01c53288-5d25-*******) Extended Request Id: ***************
2021-09-14T14:09:00.452+03:00 (01c53288-5d25-*******) Verifying Usage Plan for request: 01c53288-5d25-*******. API Key: API Stage: **************/dev
2021-09-14T14:09:00.454+03:00 (01c53288-5d25-*******) API Key authorized because method 'ANY /path/{proxy+}' does not require API Key. Request will not contribute to throttle or quota limits
2021-09-14T14:09:00.454+03:00 (01c53288-5d25-*******) Usage Plan check succeeded for API Key and API Stage **************/dev
2021-09-14T14:09:00.454+03:00 (01c53288-5d25-*******) Starting execution for request: 01c53288-5d25-*******
2021-09-14T14:09:00.454+03:00 (01c53288-5d25-*******) HTTP Method: GET, Resource Path: /path/json.json
2021-09-14T14:09:00.468+03:00 (01c53288-5d25-*******) Method completed with status: 304
We can get the method, URI and return code from the last two rows.
To do this, I parse the relevant data into params and then aggregate them by the request id (which I also parse).
The magic is using the stats functions sortsFirst() and sortsLast() and grouping by #reqid (see the AWS docs on Logs Insights aggregation functions).
Note: IMO, don't use earliest() and latest(), as they depend on the built-in #timestamp and behaved oddly for me when two sequential messages had the same timestamp.
So, for example, using this query:
filter #message like "Method"
| parse #message /\((?<#reqid>.*?)\) (.*?) (Method: (?<#method>.*?), )?(.*?:)* (?<#data>[^\ ]*)/
| sort #timestamp desc
| stats sortsFirst(#method) as #reqMethod, sortsFirst(#data) as #reqPath, sortsLast(#data) as #reqCode by #reqid
| limit 20
We would get the following desired output:
#reqid #reqMethod #reqPath #reqCode
f42e2b44-b858-45cb-***************** GET /path-******.json 304
fecddb03-3804-4ff5-***************** OPTIONS /path-******.json 200
e8e47185-6280-4e1e-***************** GET /path-******.json 304
e4fa9a0c-6d75-4e26-***************** GET /path-******.json 304
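Applying the same pairing idea back to the original A/B question, a rough sketch (the parse patterns and the field names reqid, a_payload and myvalue are assumptions about your exact message format, and the trailing filter-after-stats step may need adjusting depending on what your Logs Insights version supports):
filter #message like /A:|B:/
| parse #message /\[INFO\] \S+ (?<reqid>\S+) [AB]:/
| parse #message /A: (?<a_payload>.*)/
| parse #message /'MyValue':\s*(?<myvalue>\d+)/
| stats sortsFirst(a_payload) as a_message, sortsLast(myvalue) as b_value by reqid
| filter b_value = 0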

Should user account be locked after X amount of failed logins?

I have almost finished developing my login system, and there is one more thing that I'm not sure about. I found so many debates on the internet about counting invalid logins and locking user accounts. My system stores user names and passwords (salted and hashed) in a database. If a user enters an invalid user name or password, I keep track of their Username, Password, LoginTime, SessionID, IP and Browser. Here is an example:
LoginID LoginTime LoginUN LoginPW LoginSessionID LoginIP LoginBrowser
1 2018-03-15 13:40:25.000 jpapis test E72E.cfusion 10.18.1.37 Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
98 2018-03-15 13:48:45.000 mhart mypass55 E72E.cfusion 10.12.1.87 Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
32 2018-03-15 14:29:14.000 skatre 1167mmB! 378E.cfusion 10.36.1.17 Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
I'm wondering if I should lock account after X attempts? If so what would be the best practice to do that? Here is one approach that I found:
SELECT COUNT(LoginID) AS countID
FROM FailedLogins
WHERE (LoginUN = '#username#' OR LoginSessionID = '#SESSION.sessionid#' OR LoginIP = '#REMOTE_ADDR#')
AND DATEDIFF(mi, LoginTime, GETDATE()) <= 60
HAVING COUNT(LoginID) >= 5;
The query above looks for the username, session ID or IP address. If any of these is found in the FailedLogins table five or more times within the last 60 minutes, I would lock the account. The only problem is that I'm not sure what this would prevent: a brute-force attack can send far more attempts than that in 60 minutes, so I'm not sure what the benefit of checking failed logins this way is. Is there a better way to handle failed logins nowadays? Should I even lock the account? If anyone can provide some thoughts and examples, please let me know. Thank you.
Agree with #Ageax on checking Information Security.
I'm not sure that I need this kind of security check in my system.
Yes, you do. You always do. It's those that don't that often appear on the news.
Some best practices
Lock account after X number of failed logins.
Keep a record of Y past passwords (hashed, not plain text). When someone updates their password, check the new one against the old ones so they can't reuse recent passwords (compare hashes).
Monitor failed attempts past X to determine if you need to block IP addresses if failed attempts become excessive.
When a user's login fails, never tell them if it was specifically the user name or the password that was incorrect. You're just helping hackers progress faster.
Do some reading on the other site and see what else is recommended.
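As a minimal sketch of the first bullet, counting recent failures with a prepared statement so no raw input is concatenated into the SQL (this uses MySQL/mysqli like the answer below; the FailedLogins table and the 5-attempts/60-minutes thresholds are taken from the question, while the $dbconnect connection and the $username/$remoteAddr variables are assumptions):
// Count this user's failed logins in the last 60 minutes.
$stmt = $dbconnect->prepare(
    'SELECT COUNT(*) FROM FailedLogins
     WHERE (LoginUN = ? OR LoginIP = ?)
       AND LoginTime >= DATE_SUB(NOW(), INTERVAL 60 MINUTE)'
);
$stmt->bind_param('ss', $username, $remoteAddr);
$stmt->execute();
$stmt->bind_result($failedCount);
$stmt->fetch();
$stmt->close();

if ($failedCount >= 5) {
    // Lock (or throttle) the account, and show a generic message that does
    // not reveal whether the user name or the password was wrong.
    exit('Too many failed attempts. Please try again later.');
}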
You only need to create a password lockout system for paid or premium accounts, or if the website you or someone else owns is extremely popular (more than 100,000 annual visitors), as those accounts are the most valuable and the most likely to be attacked. If you expect such a volume of people, then it is best practice to implement this. You can see many large corporations following this practice: Google locks people out of accounts because they can hold money, such as Google Play Store credit or Android Pay wallets. The same goes for Minecraft accounts, Netflix accounts, etc. The algorithm behind this is something like this:
if (md5($password) == $loginrow['login']) {
    // Do your login sequence here
} else {
    if ($loginrow['AttemptsInPastFifteenMinutes'] >= 15) {
        // Too many failures: schedule a one-shot reset of the counter in
        // 15 minutes and tell the user to come back later.
        mysqli_query($dbconnect,
            "CREATE EVENT IF NOT EXISTS reset_attempts " .
            "ON SCHEDULE AT CURRENT_TIMESTAMP + INTERVAL 15 MINUTE DO " .
            "UPDATE logins SET AttemptsInPastFifteenMinutes = 0 " .
            "WHERE user = '" . $loginrow['user'] . "'");
        echo '<script>alert("You have typed in invalid passwords too many times. Please try again later.");</script>';
    } else {
        // Otherwise just increment the failed-attempt counter (note that
        // concatenating values like this is still open to SQL injection).
        mysqli_query($dbconnect,
            "UPDATE logins SET AttemptsInPastFifteenMinutes = " .
            ($loginrow['AttemptsInPastFifteenMinutes'] + 1) .
            " WHERE user = '" . $loginrow['user'] . "'");
        echo '<script>alert("Invalid username or password");</script>';
    }
}

Determine Browser Type

First time using Google BigQuery/big data. I'm still getting used to queries and commands, but I'm trying to count the total number of browsers that are using our application.
So far, I have:
SELECT
user_agent_data, session_count,
SUM(LENGTH(user_agent_data)) as device_type
FROM [metal-filament-151915:ipc.intercomusers]
where
user_agent_data contains 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
group by user_agent_data, session_count
order by device_type DESC
Which is returning
[
{
"user_agent_data": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0",
"session_count": "4",
"device_type": "164"
},
{
"user_agent_data": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0",
"session_count": "2",
"device_type": "164"
}
]
This is counting separate instances of the same browser type. How would I be able to roll that up into a count of 2, instead of 2 separate rows from 2 separate users?
Not sure if this is what you need but does it work for you?
SELECT
user_agent_data,
count(user_agent_data) freq_browsers
FROM [metal-filament-151915:ipc.intercomusers]
where
user_agent_data contains 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
group by user_agent_data
order by freq_browsers DESC
In your query, because you are grouping by the session_count field, the results are split out by each value observed in session_count (in this case, the "2" and "4").
But you would still need the count operation to get how many browsers were observed (if I understood it correctly).
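If the eventual goal is to count by browser family rather than by one exact user-agent string, a rough variation in the same legacy BigQuery SQL (the alternation regex is a deliberate simplification; real user-agent parsing is messier, e.g. Chrome user agents also contain "Safari"):
SELECT
  REGEXP_EXTRACT(user_agent_data, r'(Firefox|Chrome|Safari|MSIE|Trident|Opera)') AS browser,
  COUNT(*) AS freq_browsers
FROM [metal-filament-151915:ipc.intercomusers]
GROUP BY browser
ORDER BY freq_browsers DESC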

Apache access log meaning

I'm trying to understand my access log, and I didn't find any format example like this one:
2162004 93.186.15.149 [25/Apr/2016:12:53:40 +0200] 4914163 200 www.example.org "GET /foto/376.JPG HTTP/1.1"
I don't understand the first long number before the IP or the second long number before the 200 status.
Thanks a lot :)
The first long number should be the bytes sent, and the second one the time spent (ms).
But you need to confirm that against the LogFormat in your httpd.conf to understand the meaning of each value.
Help yourself with the format-string list here to better understand your log:
http://httpd.apache.org/docs/current/mod/mod_log_config.html
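For reference, a hypothetical LogFormat that would produce fields in that order; this is only a guess at what to compare against in your httpd.conf (%O is total bytes sent, %D is the time taken in microseconds, and %{ms}T would give milliseconds on Apache 2.4.13+):
# guess at: bytes ip [time] duration status vhost "request"
LogFormat "%O %h %t %D %>s %v \"%r\"" mycustom
CustomLog "logs/access_log" mycustom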

RavenDB Document Deleted Before Expiration

I am attempting to write a document to RavenDB with an expiration 20 minutes in the future. I am not using the .NET client, just curl. My request looks like this:
PUT /databases/FRUPublic/docs/test/123 HTTP/1.1
Host: ravendev
Connection: close
Accept-encoding: gzip, deflate
Content-Type: application/json
Raven-Entity-Name: tests
Raven-Expiration-Date: 2012-07-31T22:23:00
Content-Length: 14
{"data":"foo"}
In the studio I see my document saved with Raven-Expiration-Date set exactly 20 minutes from Last-Modified, however, within 5 minutes the document is deleted.
I see this same behavior (deleted in 5 minutes) if I increase the expiration date. If I set an expiration date in the past the document deletes immediately.
I am using build 960. Any ideas about what I'm doing wrong?
I specified the time down to ten-millionths of a second (seven fractional digits) and now documents are being deleted just as I would expect.
For example:
Raven-Expiration-Date: 2012-07-31T22:23:00.0000000
The date has to be in UTC, and it looks like you are sending local time.
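For completeness, a hedged curl equivalent of the PUT above (host, database, document id and body are copied from the question; adjust the host/port to wherever your RavenDB server listens, and make sure the expiration is expressed in UTC):
# The expiration must be a UTC time, fully specified to seven fractional digits.
curl -X PUT "http://ravendev/databases/FRUPublic/docs/test/123" \
  -H "Raven-Entity-Name: tests" \
  -H "Raven-Expiration-Date: 2012-07-31T22:23:00.0000000" \
  -H "Content-Type: application/json" \
  -d '{"data":"foo"}'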