
grep Tricks for Apache Logs

Get valuable insights from your Apache log data using grep and other built-in GNU/Linux utilities.

I love shiny one-off solutions as much as the next guy, but sometimes it’s easy to overlook the simplicity of using built-in tools to accomplish some niche task. In this article, I am going to show you how you can quickly pull information from your Apache log data using the grep command-line utility. By the end, you should be able to pluck out stats quickly, and you can adapt the approaches I show you here to be as complex, or as simple, as you need them to be.

You could easily throw any of these examples into a cron job and have the output saved to a file for further processing with your tool of choice, or perhaps another script written in something like Python.
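
For example, a crontab entry along these lines would run the first 404 pipeline from the next section every night and save the counts to a dated file. The schedule, log path, and output directory here are all placeholders - adjust them to your setup.

# Hypothetical crontab entry - runs nightly at 00:05. Note that % has special
# meaning inside a crontab, so it must be escaped as \% in the date format.
5 0 * * * grep -Eo 'GET (.+)"\s404' /var/www/html/mysite/access_log | sed 's/GET //' | sort | uniq -c | sort -rn > /var/reports/404s-$(date +\%F).txt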

Getting 404 request statistics

As a webmaster, it’s usually good to know which pages on your site are throwing a 404 Not Found error. Generally, these can result from a few things - broken links, stale bookmarks, or automated probing - but it’s always good to have the data on hand to differentiate between bot requests, which offer insight on their own, and actual problems your users might be facing.

# Using grep with the Extended Regular Expression flag:
grep -E --color=auto "GET (.+)\"\s404" /var/www/html/mysite/access_log -o | sed "s/GET //" | sort | uniq -c | sort -rn
# Using egrep (a shorthand for the above, deprecated in newer GNU grep releases)
egrep "GET (.+)\"\s404" /var/www/html/mysite/access_log -o | sed "s/GET //" | sort | uniq -c | sort -rn

The above example will present you with a list of the 404 GET requests made against your site, deduplicated with a count for each unique request, and sorted from the highest number of matches to the lowest.
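
If the one-liner looks dense, here is the same pipeline split into stages, with a comment on what each step contributes (same hypothetical log path as above):

grep -Eo 'GET (.+)"\s404' /var/www/html/mysite/access_log |  # print only the matching request text (-o)
  sed 's/GET //' |   # strip the leading verb
  sort |             # group identical requests together
  uniq -c |          # collapse duplicates, prefixing each line with its count
  sort -rn           # order numerically by count, highest first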

Getting Top Visited Endpoints

grep -E --color=auto "GET (.+) HTTP" /var/www/html/mysite/access_log -o | sed "s/GET //" | sed "s/ HTTP//" | sort | uniq -c | sort -rn

Using the same methodology, you can make some small edits to the pattern and easily grab all GET requests by path, using the same operation to get a reverse-sorted list of the top requested endpoints on your site. The basic idea in this example is to use the GET and HTTP terms as anchors, and then remove them and their associated spaces before further processing.

Keep in mind this will also include static file requests, e.g. CSS and PNG files.
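
If you only care about page views, one option is to slot an inverted match (grep -v) into the pipeline to drop common static asset extensions before counting. The extension list below is just a starting point, so tune it for your site:

# Same top-endpoints pipeline, with static assets filtered out before counting
grep -Eo "GET (.+) HTTP" /var/www/html/mysite/access_log |
  sed "s/GET //; s/ HTTP//" |
  grep -viE '\.(css|js|png|jpe?g|gif|ico|svg|woff2?)(\?|$)' |
  sort | uniq -c | sort -rn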

Getting POST requests

grep -E --color=auto "POST (.+) HTTP" /var/www/html/mysite/access_log -o | sed "s/GET //" | sed "s/ HTTP//" | sort | uniq -c | sort -rn

This is basically the same, only this time we substitute GET with POST (note the sed substitution changes too). This can often surface malicious attempts, but you could extend it further to pull only log entries that contain a 2xx or 3xx status code, as most bot requests will show in the 4xx or 5xx categories. Both views offer insight into possible nefarious activity from bots or bad actors.

# All POST requests resulting in a 2xx or 3xx response status code
grep -P --color=auto "POST\s(.+)\sHTTP\/\d\.\d\"\s(2\d{2}|3\d{2})" /var/www/html/mysite/access_log -o | head
# All POST requests resulting in a 4xx or 5xx response status code - usually bad actor stuff
grep -P --color=auto "POST\s(.+)\sHTTP\/\d\.\d\"\s(4\d{2}|5\d{2})" /var/www/html/mysite/access_log -o | head

Notice here I used the -P flag instead of -E. In this example, Perl-compatible regular expressions are needed for more advanced matching - for instance, the \d shorthand isn’t available with -E.

I left the examples above piped to head so you can get a quick glimpse at the data they pull back. Perl-compatible regular expressions are very powerful, and the functionality to use them is built into grep. Keep in mind, the -E and -P options are mutually exclusive, meaning you can’t use both in the same call! However, this doesn’t stop you from using -E earlier in your chain and piping the results to a more specific, Perl-compatible call to grep. You can get very creative once you have the basics down. I highly recommend copying some lines into the regex101 tester website to test your regular expressions - that way you get instant visual feedback on what you’re actually matching.
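
As a quick sketch of that chaining idea (same hypothetical log path as before), the first grep below narrows the log to POST lines using ERE, and the second uses the PCRE \K operator plus a lookahead so that -o prints only the request path:

# Pass 1 (-E): keep only lines containing a quoted POST request
# Pass 2 (-P): \K discards everything matched before it, and the lookahead
# ensures the path is followed by " HTTP/", so only the path is printed
grep -E '"POST .+ HTTP/' /var/www/html/mysite/access_log |
  grep -oP 'POST \K\S+(?= HTTP/)' |
  sort | uniq -c | sort -rn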

Getting the unusual other HTTP verb requests

Sometimes this can be valuable, especially if you are looking through API logs. More information about each of these HTTP verbs can be found in MDN’s HTTP request methods documentation.

# All the Other Verb requests resulting in a 2xx or 3xx response status code
grep -P --color=auto "(HEAD|PUT|DELETE|CREATE|OPTIONS|TRACE|PATCH)\s(.+)\sHTTP\/\d\.\d\"\s(2\d{2}|3\d{2})" /var/www/html/mysite/access_log -o | head
# All the Other Verb requests resulting in a 4xx or 5xx response status code
grep -P --color=auto "(HEAD|PUT|DELETE|CREATE|OPTIONS|TRACE|PATCH)\s(.+)\sHTTP\/\d\.\d\"\s(4\d{2}|5\d{2})" /var/www/html/mysite/access_log -o | head

Notice in this example, we wrap the verbs in parentheses and separate each possible match with a | character - this means match any of HEAD, PUT, DELETE, CONNECT, OPTIONS, TRACE, or PATCH. Feel free to modify the list to meet your needs.
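
Building on the same idea, here is a quick sketch that tallies every request in the log by verb. It assumes the default Apache log formats, where the quoted request line comes right after the bracketed timestamp:

# Extract just the verb from each request line and count the totals.
# \] " anchors on the end of the timestamp field; \K drops it from the match.
grep -oP '\] "\K[A-Z]+' /var/www/html/mysite/access_log | sort | uniq -c | sort -rn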

Expectation Management

Don’t get me wrong, you can get a lot done with just what we covered here. However, if you want to take it to the next level, using a more mature and full-featured language like Python is your best bet. As alluded to above, you could easily save the output from any of the commands we just went over to a file and then have another Python script regularly check and parse that file for integration with other reporting systems.

In truth, you could build a much more feature-rich matching and extraction application using something like Python to sort requests by date or by visitor IP. The options are truly limitless.

If you haven’t checked out my other article on using GoAccess for Apache metrics, be sure to read it! GoAccess is another free and open-source tool that makes parsing and looking through your Apache logs much less of a headache, though it does have its limitations as a Terminal User Interface (TUI) based application.

Additionally, if you want to see how I did something similar in a simple PHP command line application, check out my project: Feather on GitHub. It has some techniques I used to extract and parse Apache log data into usable CLI operations for JSON transforms etc.

Wrapping Up

I hope this article got you excited to go and learn how grep can give you quick insights into your Apache website’s traffic. Keep in mind, learning is a journey, and there are often easier ways to get what you need without paying for an overpriced service or settling for not getting the data the way you need it. Not to mention, while this article is about Apache logs, you could easily use what you’ve learned here to analyze any logs that use a repeating format.

As per usual - shameless plug coming - if you want to learn more, or get a consultation on how you can achieve your visibility goals on a budget, feel free to either book a training slot with me or get in contact with me to schedule a consultation to get your reporting pipeline built from the ground up!
