Get a quick keyword report using the built-in CLI: Spark
I am a huge fan of command-line interface (CLI) applications. They are fast, portable, and flexible: you can modify and chain streams of data to solve many problems without needing an HTML/CSS graphical user interface or JavaScript to glue it to the PHP backend. So it should be no surprise that I expect any framework I use to offer a full-featured CLI tool, and it just so happens that CodeIgniter 4 ships with one called Spark.
CodeIgniter 4 is simply an amazing framework. Laravel gets all the attention, but when you compare the two, CodeIgniter 4 shines in terms of the framework’s size, speed, and availability of the common features any web developer might need. It also feels very natural and intuitive, at least to me, for developing sites and applications. If you haven’t checked it out in a while, definitely take the time to, because it has changed a lot!
This snippet covers one way to get a keyword count from one of your routes. It’s an early prototype from a suite of SEO tools I am currently working on and plan to port to Ivory, which is built on CodeIgniter 4. The general flow starts by grabbing the routes service from CodeIgniter and then using the built-in web request library to fetch and parse a route supplied by name. As it stands, this example gives you a list of the keywords on a page ordered by their count. Here’s the general approach. Keep in mind that not all features are implemented here, and it is shown only for demonstration purposes:
General Approach
- Filter text
- Split into words
- Remove two character words and stop words
- Determine word frequency + density
- Determine word prominence
- Determine word containers (Not implemented here)
  - Title
  - Meta description
  - URL
  - Headings
  - Meta keywords
- Calculate keyword value
Filter text
The first thing you need to do is make sure the encoding is correct, so convert it to UTF-8:
$file = iconv ($encoding, "UTF-8", $file); // where $encoding is the current encoding; iconv returns the converted string
After that, you need to strip all HTML tags, punctuation, symbols, and numbers. I used the following to remove punctuation, but didn’t remove numbers:
$body = str_replace([".","?","!","@","/","'", '"', "&", "©",",",":",")","(",";","]","[","{","}"], "", $body);
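As a rough sketch, the whole filtering step could be wrapped in a single function. The name `filterText` and its parameters are hypothetical, and it swaps in `mb_convert_encoding` and Unicode character classes for the `iconv` call and the fixed punctuation list shown above. Like my version, it keeps numbers.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to lowercase words separated by single spaces.
function filterText(string $html, string $encoding = 'UTF-8'): string
{
    // Normalize to UTF-8 first (mb_convert_encoding is used here instead of iconv).
    $text = mb_convert_encoding($html, 'UTF-8', $encoding);
    // Drop script and style blocks entirely; non-greedy, so text between two blocks survives.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $text);
    // Put a space before each tag so adjacent elements do not fuse into one word.
    $text = strip_tags(str_replace('<', ' <', $text));
    // Decode entities such as &amp; before removing symbols.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace everything that is not a letter, a digit, or whitespace with a space.
    $text = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $text);
    // Lowercase, collapse whitespace runs, and trim the ends.
    return trim(preg_replace('/\s+/u', ' ', mb_strtolower($text, 'UTF-8')));
}

echo filterText('<p>Hello, World!</p><script>var x = 1;</script><p>Caf&eacute; &amp; bar, 2023.</p>'), "\n";
// prints "hello world café bar 2023"
```

Replacing symbols with a space rather than an empty string is a deliberate choice in this sketch: it keeps "web/design" as two words instead of fusing them into one.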
Split into words
$words = mb_split( ' +', $text );
Remove 2 character words and stopwords
Any word consisting of either one or two characters won’t be of any significance, so we remove all of them. To remove stop words, we first need to detect the language. There are a couple of ways we can do this:
- Checking the Content-Language HTTP header
- Checking lang="" or xml:lang="" attribute
- Checking the Language and Content-Language metadata tags
If none of those are set, you can use an external API like the AlchemyAPI.
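The three checks could be sketched roughly as follows. The function names `detectLanguage` and `primaryTag` are hypothetical, the `$headers` array is assumed to map header names to values, and the regular expressions are a simplification that a proper HTML parser would handle more reliably.

```php
<?php
// Hypothetical helper: returns a lowercase primary language tag such as "en",
// or null when the page declares no language.
function detectLanguage(string $html, array $headers = []): ?string
{
    // 1. Content-Language HTTP header (header names are case-insensitive).
    foreach ($headers as $name => $value) {
        if (strcasecmp((string) $name, 'Content-Language') === 0 && trim($value) !== '') {
            return primaryTag($value);
        }
    }
    // 2. lang or xml:lang attribute on the <html> element.
    if (preg_match('/<html\b[^>]*?\s(?:xml:)?lang\s*=\s*["\']([^"\']+)["\']/i', $html, $m)) {
        return primaryTag($m[1]);
    }
    // 3. <meta http-equiv="Content-Language" content="..."> tag.
    if (preg_match('/<meta\b[^>]*http-equiv\s*=\s*["\']content-language["\'][^>]*content\s*=\s*["\']([^"\']+)["\']/i', $html, $m)) {
        return primaryTag($m[1]);
    }
    return null;
}

// Hypothetical helper: "en-US, fr" becomes "en".
function primaryTag(string $value): string
{
    $first = trim(explode(',', $value)[0]);
    return strtolower(explode('-', $first)[0]);
}

var_dump(detectLanguage('<html lang="en-US"><head></head></html>')); // string(2) "en"
```

A null result is the cue to fall back to an external service or a default stop word list.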
You will need a list of stopwords per language, which can be easily found on the web. The author of the original list used this, and so did I: http://www.ranks.nl/resources/stopwords.html
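A minimal sketch of the splitting and filtering steps together, assuming the text has already been cleaned. The function name `removeNoise` is hypothetical, and the stop word list is a tiny sample rather than a complete one.

```php
<?php
// Hypothetical helper: drop words of one or two characters and anything in the stop word list.
function removeNoise(array $words, array $stopWords): array
{
    // Flip the list once so each lookup is a hash check instead of a linear scan.
    $stopLookup = array_flip($stopWords);
    $kept = [];
    foreach ($words as $word) {
        $word = mb_strtolower(trim($word), 'UTF-8');
        // mb_strlen counts characters rather than bytes, so "é" counts as one.
        if (mb_strlen($word, 'UTF-8') <= 2 || isset($stopLookup[$word])) {
            continue;
        }
        $kept[] = $word;
    }
    return $kept;
}

$stopWords = ['the', 'and', 'for', 'with']; // sample list only
// Splitting on \s+ also handles tabs and newlines, not just spaces.
$words = mb_split('\s+', 'The quick fox is at home with the dog');
print_r(removeNoise($words, $stopWords)); // quick, fox, home, dog
```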
Determine word frequency + density
To count the number of occurrences per word, use this:
$uniqueWords = array_unique ($keywords); // $keywords is the $words array after being filtered as mentioned in step 3
$uniqueWordCounts = array_count_values ( $words );
Now loop through the $uniqueWords array and calculate the density of each word like this:
$density = $frequency / count ($words) * 100;
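Put together, the frequency and density calculation might look like the sketch below. The function name `wordStats` and the shape of the returned array are assumptions made for illustration, and density here is measured against the list of words passed in.

```php
<?php
// Hypothetical helper: map each word to its count and its density in percent.
function wordStats(array $words): array
{
    $total = count($words);
    $stats = [];
    if ($total === 0) {
        return $stats;
    }
    foreach (array_count_values($words) as $word => $frequency) {
        $stats[$word] = [
            'frequency' => $frequency,
            'density'   => $frequency / $total * 100,
        ];
    }
    // Most frequent words first.
    uasort($stats, fn ($a, $b) => $b['frequency'] <=> $a['frequency']);
    return $stats;
}

$stats = wordStats(['design', 'web', 'design', 'studio']);
printf("%.1f\n", $stats['design']['density']); // prints 50.0
```

Here "design" appears 2 times out of 4 words, so its density is 2 / 4 * 100 = 50%.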
Determine word prominence
The word prominence is defined by the position of the words within the text. For example, the second word in the first sentence is probably more important than the 6th word in the 83rd sentence.
To calculate it, add this code within the same loop from the previous step:
$keys = array_keys ($words, $word); // $word is the word we're currently at in the loop
$positionSum = array_sum ($keys) + count ($keys);
$prominence = (count ($words) - (($positionSum - 1) / count ($keys))) * (100 / count ($words));
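To see what the formula produces, here is the same calculation wrapped in a function and run on a four-word list. The function name `prominence` is hypothetical, and the sketch assumes `$words` is a plain zero-indexed list (run it through `array_values` first if it is not).

```php
<?php
// Hypothetical wrapper around the prominence formula above.
function prominence(array $words, string $word): float
{
    $total = count($words);
    $keys = array_keys($words, $word, true); // zero-based positions of $word
    if ($total === 0 || $keys === []) {
        return 0.0;
    }
    // Adding count($keys) turns the zero-based positions into one-based ones.
    $positionSum = array_sum($keys) + count($keys);
    return ($total - (($positionSum - 1) / count($keys))) * (100 / $total);
}

$words = ['design', 'web', 'design', 'studio'];
echo prominence($words, 'design'), "\n"; // 62.5
echo prominence($words, 'web'), "\n";    // 75
echo prominence($words, 'studio'), "\n"; // 25
```

For "design" the one-based positions are 1 and 3, so the sum is 4 and the result is (4 - (4 - 1) / 2) * (100 / 4) = 62.5. A word that appears only once, as the very first word, scores 100, and scores fall as a word's occurrences move toward the end of the text.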
Determine word containers
A very important part is to determine where a word resides - in the title, description and more. So first, you need to grab the title, all metadata tags and all headings using something like DOMDocument or PHPQuery. Then you need to check, within the same loop, whether these contain the words.
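One way this could look with DOMDocument is sketched below. The function names `extractContainers` and `containersFor` are hypothetical, the URL container is left out, and the XPath queries assume conventional lowercase `name` attribute values on the meta tags.

```php
<?php
// Hypothetical helper: collect the lowercase text of each container.
function extractContainers(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common way to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $text = function (string $query) use ($xpath): string {
        $parts = [];
        foreach ($xpath->query($query) as $node) {
            $parts[] = $node->textContent; // works for elements and attributes alike
        }
        return mb_strtolower(implode(' ', $parts), 'UTF-8');
    };

    return [
        'title'       => $text('//title'),
        'description' => $text('//meta[@name="description"]/@content'),
        'keywords'    => $text('//meta[@name="keywords"]/@content'),
        'headings'    => $text('//h1 | //h2 | //h3 | //h4 | //h5 | //h6'),
    ];
}

// Hypothetical helper: names of the containers in which $word appears as a whole word.
function containersFor(string $word, array $containers): array
{
    $found = [];
    foreach ($containers as $name => $content) {
        if (preg_match('/\b' . preg_quote($word, '/') . '\b/u', $content) === 1) {
            $found[] = $name;
        }
    }
    return $found;
}

$html = '<html><head><title>Web Design Studio</title>'
      . '<meta name="description" content="Custom web design and development.">'
      . '</head><body><h1>Design services</h1><p>We start here.</p></body></html>';
$containers = extractContainers($html);
print_r(containersFor('design', $containers)); // title, description, headings
```

The resulting array is what `count ($containers)` would operate on when weighing a keyword.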
Calculate keyword value
The last step is to calculate a keyword’s value. To do this, you need to weigh each factor: density, prominence, and containers. For example:
$value = (double) ((1 + $density) * ($prominence / 10)) * (1 + (0.5 * count ($containers)));
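As a worked example of that weighting, the sketch below wraps the formula in a hypothetical `keywordValue` function. The input numbers are made up for illustration only.

```php
<?php
// Hypothetical wrapper around the example weighting formula above.
function keywordValue(float $density, float $prominence, array $containers): float
{
    return (1 + $density) * ($prominence / 10) * (1 + 0.5 * count($containers));
}

// Density 50%, prominence 62.5, found in three containers:
// (1 + 50) * (62.5 / 10) * (1 + 0.5 * 3) = 51 * 6.25 * 2.5
echo keywordValue(50.0, 62.5, ['title', 'description', 'headings']), "\n"; // 796.875

// The same kind of word with no containers gets no bonus multiplier:
echo keywordValue(25.0, 25.0, []), "\n"; // 65
```

Each container adds half of the base score again, so a word found in the title, description, and headings scores 2.5 times what it would from density and prominence alone.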
With this guide, I implemented some of those features using the following Command class:
<?php

namespace App\Commands;

use CodeIgniter\CLI\BaseCommand;
use CodeIgniter\CLI\CLI;
use Config\Services;

class CreateSeoKeywordReport extends BaseCommand {
    protected $group = 'SEO';
    protected $name = 'create:kwreport';
    protected $description = 'Will parse a web resource into a keyword analysis report.';
    protected $usage = 'create:kwreport route [options]';
    protected $arguments = [
        'route' => 'The CodeIgniter named routes resource.'
    ];
    protected $options = [
        'w' => 'Will just display the words not the count',
        'c' => 'Will display them as comma separated values order by count',
        'l' => 'Limit of words to return'
    ];
    protected static $stopwords = [
        "i","me","my","myself","we","our","ours","ourselves",
        "you","your","yours","yourself","yourselves",
        "he","him","his","himself",
        "she","her","hers","herself",
        "it","its","itself",
        "they","them","their","theirs","themselves",
        "what","which","who","whos","whose","whom",
        "this","that","these","those",
        "am","is","are","was","were",
        "be","been","being",
        "have","has","had","having",
        "do","does","did","doing",
        "a","an",
        "the","and","but","if","or","because",
        "as","until","while","of","at","by",
        "for","with","about","against","between",
        "into","through","during","before","after",
        "above","below","to","from","up","down","in","out","on","off",
        "over","under","again","further","then","once",
        "here","there","when","where","why","how","all","any","both",
        "each","few","more","most","other","some","such",
        "no","nor","not","only","own","same","so","than",
        "too","very","s","t","can","will","just","don","should","now",
    ];
    protected static $outputpath = WRITEPATH."kwreport.txt";

    public function run(array $params)
    {
        // $params also carries the options, so check the first positional argument explicitly
        if (empty($params[0])) {
            CLI::write("Missing route name!", "red");
            return null;
        }
        $route = $params[0];
        // Load the route definitions so url_to() can resolve the named route
        Services::routes()->loadRoutes();
        try {
            $client = Services::curlrequest();
            $response = $client->request("GET", url_to($route));
            if ($response->getStatusCode() == 200) {
                $content = $response->getBody();
                // Title and description are captured for the container checks (not used yet)
                preg_match("/<title[^>]*>(.*?)<\/title>/is", $content, $title);
                $title = trim($title[1] ?? "");
                // Backreferences make the closing quote match the opening one
                preg_match('/<meta\s+name=(["\'])description\1.*?content=(["\'])(.*?)\2/is', $content, $description);
                $description = $description[3] ?? "";
                // Fall back to the whole document if no <body> element is found
                $body = preg_match("/<body.*\/body>/is", $content, $match) ? $match[0] : $content;
                // Get rid of script tag content (non-greedy, so text between two script blocks is kept)
                $body = preg_replace("/<script.*?<\/script>/is", "", $body);
                $body = strip_tags($body);
                $body = str_replace([".","?","!","@","/","'", '"', "&", "©",",",":",")","(",";","]","[","{","}"], "", $body);
                // Split on any whitespace so words separated by newlines or tabs are not fused
                $words = mb_split('\s+', $body);
                $extracted_words = [];
                foreach ($words as $word) {
                    if (empty(trim($word))) {
                        continue;
                    }
                    // Skip short words (three characters or fewer)
                    if (strlen($word) <= 3) {
                        continue;
                    }
                    array_push($extracted_words, trim($word));
                }
                $usable_words = [];
                foreach ($extracted_words as $w) {
                    if (!in_array(strtolower($w), static::$stopwords)) {
                        array_push($usable_words, strtolower($w));
                    }
                }
                // All processing done - now stats
                $uniqueWordCounts = array_count_values($usable_words);
                // Sort by count, highest first, keeping the words as keys
                arsort($uniqueWordCounts);
                $i = 0;
                $limit = (CLI::getOption('l')) ? (int) CLI::getOption('l') : 20;
                if (CLI::getOption('c')) {
                    $wordString = "";
                    foreach ($uniqueWordCounts as $word => $count) {
                        if ($i > $limit) {
                            break;
                        }
                        $wordString .= $word.",";
                        $i++;
                    }
                    // Write after the loop so pages with fewer words than the limit still produce output
                    CLI::write(rtrim($wordString, ","));
                    return 0;
                }
                foreach ($uniqueWordCounts as $word => $count) {
                    if ($i > $limit) {
                        return 0;
                    }
                    if (CLI::getOption('w')) {
                        CLI::write($word);
                    } else {
                        CLI::write($word.": ".$count);
                    }
                    $i++;
                }
            } else {
                CLI::write("Unable to get resource. Status Code: {$response->getStatusCode()}", "red");
                return 0;
            }
        } catch (\Exception $e) {
            CLI::write("Error! {$e->getMessage()}", "red");
            CLI::write($e->getTraceAsString());
        }
    }
}
The usage output for this command, which you can get by running php spark help create:kwreport, is as follows:
CodeIgniter v4.3.5 Command Line Tool - Server Time: 2023-07-02 07:55:52 UTC+00:00
Usage:
create:kwreport route [options]
Description:
Will parse a web resource into a keyword analysis report.
Arguments:
route The CodeIgniter named routes resource.
Options:
w Will just display the words not the count
c Will display them as comma separated values order by count
l Limit of words to return
This command is far from complete. For example, I did not implement the keyword density yet (I’m still tinkering with it) and I didn’t include the weight system. These are all projects for you. When you run the following command:
php spark create:kwreport home
…where home is a named route in your application, you should see something similar to the following:
[allinnia@alienfedora zen-ci]$ php spark create:kwreport home
CodeIgniter v4.3.5 Command Line Tool - Server Time: 2023-07-02 07:59:49 UTC+00:00
development: 8
design: 6
services: 6
solutions: 6
online: 5
perfect: 4
business: 4
offer: 4
support: 3
applications: 3
products: 3
studio: 2
projects: 2
learn: 2
quote: 2
platform: 2
independent: 2
consulting: 2
provides: 2
product: 2
comprehensive: 2
This example uses my own home page. As you can see, as the class is now, we get a list of keywords by the number of times they are referenced in the content. If I wanted them in a copy and paste format for my HTML keywords markup, I could run the following:
php spark create:kwreport home -c
Which would output the following:
CodeIgniter v4.3.5 Command Line Tool - Server Time: 2023-07-02 08:11:41 UTC+00:00
development,design,services,solutions,online,perfect,business,offer,support,applications,products,studio,projects,learn,quote,platform,independent,consulting,provides,product,comprehensive
You could remove the CodeIgniter header by supplying the --no-header option, which would allow you to output the keywords directly to a file. The next steps are to implement the keyword density formula and the weight system by comparing each keyword to the title, description, and keywords meta fields. I hope you found this snippet helpful. As mentioned, the final product will be available in Ivory, which is set to be available on GitHub by mid 2024.