Get valuable insights from your Apache log data with grep!
I love shiny one-off solutions as much as the next guy, but sometimes it’s easy to overlook the simplicity of using built-in tools to accomplish a niche task. In this article, I am going to show you how to quickly pull information from your Apache log data using the grep command line utility. By the end, you should be able to pluck out stats in seconds and adapt the approaches I show you here to be as complex, or as simple, as you need them to be.
Getting 404 request statistics
As a webmaster, it’s good to know which pages on your site are throwing a 404 Not Found error. This can result from a few different things, but it’s always good to have access to the data so you can differentiate between bot requests - which offer insight on their own - and actual problems your users might be facing.
# Using the grep with Extended Regular Expression flag:
grep -E --color=auto "GET (.+)\"\s404" /var/www/html/mysite/access_log -o | sed "s/GET //" | sort | uniq -c | sort -rn
# Using egrep (usually an alias to the above)
egrep "GET (.+)\"\s404" /var/www/html/mysite/access_log -o | sed "s/GET //" | sort | uniq -c | sort -rn
The above example will present you with a list of 404 GET requests made against your site, counted up as a unique list and sorted from the highest to the lowest number of matches.
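If the full list is long, you can cap the output. Here is a minimal sketch, using the same log location, that keeps only the ten most frequent offenders:
# Sketch: keep only the ten most frequent 404 paths
grep -E --color=auto "GET (.+)\"\s404" /var/www/html/mysite/access_log -o | sed "s/GET //" | sort | uniq -c | sort -rn | head -n 10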
Getting Top Visited Endpoints
grep -E --color=auto "GET (.+) HTTP" /var/www/html/mysite/access_log -o | sed "s/GET //" | sed "s/ HTTP//" | sort | uniq -c | sort -rn
Using the same methodology, you can make some small edits to the pattern and easily grab all GET requests by path, giving you a reverse-sorted list of the top requested endpoints on your site. The basic idea in this example is to use the GET and HTTP terms as anchors, and then remove them, along with their associated spaces, before further processing.
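As a small variant - a sketch, assuming your URLs carry query strings - you can strip everything after the ? so that requests for the same path get counted together:
# Hypothetical variant: strip query strings so /page?a=1 and /page?a=2 count as one endpoint
grep -E --color=auto "GET (.+) HTTP" /var/www/html/mysite/access_log -o | sed "s/GET //" | sed "s/ HTTP//" | sed "s/?.*//" | sort | uniq -c | sort -rn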
Getting POST requests
grep -E --color=auto "POST (.+) HTTP" /var/www/html/mysite/access_log -o | sed "s/POST //" | sed "s/ HTTP//" | sort | uniq -c | sort -rn
This is basically the same, only this time we substitute GET with POST. This can often show you a list of malicious attempts, but you could extend it further to pull only the log entries that contain a 2xx or 3xx status code, since most bot requests will show up in the 4xx or 5xx categories. Both slices offer you insight into possible nefarious attempts by bots or bad actors.
# All POST requests resulting in a 2xx or 3xx response status code
grep -P --color=auto "POST\s(.+)\sHTTP\/\d\.\d\"\s(2\d{2}|3\d{2})" /var/www/html/mysite/access_log -o | head
# All POST requests resulting in a 4xx or 5xx response status code - usually bad actor stuff
grep -P --color=auto "POST\s(.+)\sHTTP\/\d\.\d\"\s(4\d{2}|5\d{2})" /var/www/html/mysite/access_log -o | head
I left the examples above with the pipe to head so you can get a quick glimpse at the data they pull back. Perl compatible regular expressions are very powerful, and the functionality to use them is built in to grep. Keep in mind, the -E and -P options are mutually exclusive, meaning you can’t use both! However, this doesn’t stop you from using -E earlier in your chain and piping the results to a more specific, Perl compatible call to grep. You can get very creative once you have the basics down. I highly recommend copying over some lines into the RegEx101 Tester Website to test your regular expressions. That way you get instant visual feedback on what you’re actually matching.
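Here is a minimal sketch of that chaining idea against the same log file: a broad -E pass narrows the stream first, then a stricter Perl compatible pass extracts the match:
# Sketch: a broad -E filter first, then a stricter Perl compatible match on the result
grep -E "POST" /var/www/html/mysite/access_log | grep -P --color=auto "POST\s(.+)\sHTTP\/\d\.\d\"\s(2\d{2}|3\d{2})" -o | head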
Getting the unusual other HTTP verb requests
Sometimes this can be valuable, especially if you are looking through API logs. More information about each of these HTTP verbs can be found here.
# All the Other Verb requests resulting in a 2xx or 3xx response status code
grep -P --color=auto "(HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)\s(.+)\sHTTP\/\d\.\d\"\s(2\d{2}|3\d{2})" /var/www/html/mysite/access_log -o | head
# All the Other Verb requests resulting in a 4xx or 5xx response status code
grep -P --color=auto "(HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)\s(.+)\sHTTP\/\d\.\d\"\s(4\d{2}|5\d{2})" /var/www/html/mysite/access_log -o | head
Notice in this example, we wrap the verbs in parentheses and separate each possible exact match with a | character - this means match any of HEAD, PUT, DELETE, CONNECT, OPTIONS, TRACE, or PATCH. Feel free to modify the list to meet your needs.
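As a quick sketch - assuming these words only ever appear as request verbs in your log lines - you can also tally how often each one shows up:
# Sketch: tally how many times each of the other verbs appears in the log
grep -P "(HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)\s" /var/www/html/mysite/access_log -o | sort | uniq -c | sort -rn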
Expectation Management
Don’t get me wrong, you can get a lot done with just what we covered here. However, if you want to take it to the next level, using a more mature, full-featured language like Python is your best bet. As alluded to above, you could easily save the output from any of the commands we just went over to a file, and then have a Python script regularly check and parse that file for integration with other reporting systems.
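For example, here is a minimal sketch of that hand-off; the report path is just my own hypothetical choice:
# Hypothetical hand-off: write the 404 report to a file another script can pick up
grep -E "GET (.+)\"\s404" /var/www/html/mysite/access_log -o | sed "s/GET //" | sort | uniq -c | sort -rn > /tmp/404_report.txt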
If you haven’t checked out my other article on using GoAccess for Apache metrics, be sure to read it! GoAccess is another free and open-source tool that makes parsing and looking through your Apache logs much less of a headache. It does have its limitations, though, being a Terminal User Interface (TUI) based application.
Additionally, if you want to see how I did something similar in a simple PHP command line application, check out my project: Feather on GitHub. It has some techniques I used to extract and parse Apache log data into usable CLI operations for JSON transforms and the like.
Wrapping Up
I hope this article got you excited to go and learn how grep can give you quick insights into your Apache website’s traffic. Keep in mind, learning is a journey, and there are always easier ways to get what you need without paying for some over-priced service or having to settle for not getting the data how you need it. Not to mention, while this article is about Apache logs, you could easily use what you’ve learned here to analyze any logs that use a repeating format.
As per usual - shameless plug coming - if you want to learn more, or want help achieving your visibility goals on a budget, feel free to book a training slot with me or get in contact to schedule a consultation and get your reporting pipeline built from the ground up!