jjhale.com

Generating a sitemap for MediaWiki hosted on GoDaddy

I have just managed to get my wiki’s sitemap submitted to Google. It was a bit tricky, so I figured I’d write up what I did in case anyone else has similar difficulties. In this post I describe how I enabled automatic sitemap generation for a MediaWiki 1.12.0 installation hosted on GoDaddy. I used the generateSitemap.php script, a shell script to correct its output, and a cron job to run it once a day.


The simplest way to generate a sitemap is to use a web-based sitemap generator. With these generators you enter the URL of your site, and the generator crawls it and produces a sitemap.xml file which you can upload to the root of the site. Then you sign up for a Google webmaster account and tell Google where the sitemap is stored. You can find out about what an XML sitemap is here.

The problem with using a web-based generator is that it needs to be redone whenever your content changes. It also needs to read every page of your site – which could take a while for larger sites. In addition to this, the free versions are often limited to a certain number of pages.

A more elegant solution is to use MediaWiki’s built-in generateSitemap.php script. Having easily installed the sitemap plug-in for WordPress on my blog, I figured that it would be just as easy for MediaWiki: I only needed the generator script to run automatically at regular intervals. However, it turns out that a couple of extra steps are needed before it is complete. This method is not guaranteed to work, so always make sure that you have everything backed up before you begin.

Since my site is hosted by GoDaddy, I’ll be describing the process required for people using GoDaddy. I hope it is of some use to people using other hosting providers too. I’ll describe how to enable the maintenance scripts and do a test run of generateSitemap.php, then how to tweak a shell script that corrects the output from generateSitemap.php, and finally how to schedule this script to run at regular intervals.

Step 0 – Log on

The "My Hosting Account" on the GoDaddy homepage

Log into GoDaddy and choose “My Hosting Account” from the Hosting & Servers menu.

Select “Manage Account” for the hosting account that you have your wiki installed on. You should now see the Hosting Control Center (I had version 2.8.2). From here you can start the File Manager and the Cron Manager.

Step 1 – Enable the maintenance scripts

By default the maintenance scripts are not enabled. A settings file needs to be edited to fix this. In the File Manager, navigate to the directory in which the wiki is installed and rename “AdminSettings.sample” to “AdminSettings.php”. Then copy the database username and password from “LocalSettings.php” into “AdminSettings.php”:

$wgDBadminuser      = 'wikiadmin';
$wgDBadminpassword  = 'adminpass';

Note that this can all be done within the File Manager using the edit button.
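If you are on a host that does give you a shell, the two values to copy can be pulled out with grep instead of hunting through the file by eye. This is only a sketch: it assumes LocalSettings.php stores them in the usual $wgDBuser and $wgDBpassword variables, and the sample file and its values are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample standing in for a real LocalSettings.php.
workdir=$(mktemp -d)
cat > "$workdir/LocalSettings.php" <<'EOF'
<?php
$wgDBname           = "wikidb";
$wgDBuser           = "wikiuser";
$wgDBpassword       = "secret";
EOF

# Print only the database user and password lines.
creds=$(grep -E '^\$wgDB(user|password)[[:space:]]*=' "$workdir/LocalSettings.php")
echo "$creds"
rm -rf "$workdir"
```

The values printed are the ones that go into $wgDBadminuser and $wgDBadminpassword.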

Step 2 – Running generateSitemap.php

The File Manager and Cron Manager in the Control Center

On GoDaddy I do not have command line access to the web server, so to run any command you need to set up a cron job. A cron job is a command that is scheduled to run at a certain time. You set these up from the Hosting Control Center: click on “Content” and then on “Cron Manager”.

First, make sure that you have entered a valid email address to which the output from the job will be sent. This is in the account info panel on the left. All errors, warnings and text output from the command will be sent to that address.

Next, create a new job by hitting the “Create Cron Job” button.
Enter “Test Wiki sitemap” in the Cron Job Title and press the browse button for the command. Then browse to the directory where the wiki is installed and then to the “maintenance” subdirectory. Select the “generateSitemap.php” file and hit “Select”. The command line should be filled with something like:

/web/cgi-bin/php5 "$HOME/html/jjhalewiki/maintenance/generateSitemap.php"

This gets PHP to run the file “generateSitemap.php”. We will be using these details later, so copy the whole command and paste it into an empty text file.

Set the frequency to hourly and choose the minute so that it will run within the next 5 minutes (the current server time is shown in the account info panel). Hit save and your job should be ready to go. It will be added to the list of cron jobs and will run at the number of minutes past the hour that you specified. Make sure that its status is set to “enabled”. If it is not, select the tick box corresponding to the “Test Wiki sitemap” job and press the enable button at the top of the page.

Once the job’s time passes you will get an email from the cron manager letting you know the job ran. A typical email might look like:

Subject:
Cron /web/cgi-bin/php5 "$HOME/html/jjhalewiki/maintenance/generateSitemap.php"
Message:
<br />
<b>Warning</b>:  in_array() [<a href='function.in-array'>function.in-array</a>]:
 Wrong datatype for second argument in
<b>/home/content/s/c/r/scro0874/html/jjhalewiki/maintenance/generateSitemap.php</b>
on line <b>448</b><br />
<br />
<b>Warning</b>:  array_shift()
[<a href='function.array-shift'>function.array-shift</a>]:
The argument should be an array in
<b>/home/content/s/c/r/scro0874/html/jjhalewiki/maintenance/commandLine.inc</b>
on line <b>39</b><br />
<br />
<b>Warning</b>:  reset() [<a href='function.reset'>function.reset</a>]:
 Passed variable is not an array or object in
 <b>/home/content/s/c/r/scro0874/html/jjhalewiki/maintenance/commandLine.inc</b>
on line <b>49</b><br />
0 ()
sitemap-jjh0819204212248-mw_-NS_0-0.xml.gz
4 (wiki.jjhale.com)
sitemap-jjh0819204212248-mw_-NS_4-0.xml.gz
6 (Image)
sitemap-jjh0819204212248-mw_-NS_6-0.xml.gz
10 (Template)
sitemap-jjh0819204212248-mw_-NS_10-0.xml.gz
12 (Help)
sitemap-jjh0819204212248-mw_-NS_12-0.xml.gz
14 (Category)
sitemap-jjh0819204212248-mw_-NS_14-0.xml.gz

Don’t worry about the warnings. If you get errors, make sure that you have set up the username and password correctly in “AdminSettings.php” and that you are using “php5” rather than “php” in the command line.
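If you save the body of the cron email to a text file, a few greps will tell warnings and errors apart. This is a rough sketch: the file name cron_mail.txt and its contents are made up, and it assumes that PHP labels serious problems as “Fatal error” or “Parse error” in the same bold style as the warnings shown above.

```shell
#!/bin/sh
# Made-up excerpt standing in for a saved cron email body.
workdir=$(mktemp -d)
cat > "$workdir/cron_mail.txt" <<'EOF'
<b>Warning</b>:  in_array() [<a href='function.in-array'>function.in-array</a>]:
<b>Warning</b>:  reset() [<a href='function.reset'>function.reset</a>]:
0 ()
sitemap-jjh0819204212248-mw_-NS_0-0.xml.gz
EOF

# Count PHP warnings, PHP fatal/parse errors, and generated sitemap files.
# "|| true" keeps a zero count from being treated as a failed command.
warnings=$(grep -c '<b>Warning</b>' "$workdir/cron_mail.txt" || true)
errors=$(grep -c -E '<b>(Fatal error|Parse error)</b>' "$workdir/cron_mail.txt" || true)
files=$(grep -c '^sitemap-.*\.xml\.gz$' "$workdir/cron_mail.txt" || true)
echo "warnings: $warnings, errors: $errors, sitemap files: $files"
rm -rf "$workdir"
```

A non-zero error count, or no sitemap files at all, is the case worth investigating.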

Step 3 – Create the script to correct and move the files

You usually need to fix these files so that they have the website address.
Assuming the command ran OK, the files listed in your email will now be in the “maintenance” directory. You can download these with your FTP client to check them. They are gzipped files, and once uncompressed you should see a list of links wrapped in XML tags.

The script often fails to find the name of your site properly, using “localhost” instead.
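With a copy of one of the files on your own machine, you can check for this without unpacking it by hand. A minimal sketch, using a made-up sample file in place of a real generated sitemap (the file name and URLs are invented):

```shell
#!/bin/sh
# Hypothetical sample file standing in for one generated sitemap.
workdir=$(mktemp -d)
cat > "$workdir/sitemap-sample-NS_0-0.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://localhost/index.php/Main_Page</loc></url>
  <url><loc>http://localhost/index.php/Help:Contents</loc></url>
</urlset>
EOF
gzip "$workdir/sitemap-sample-NS_0-0.xml"

# Uncompress to stdout (the .gz file is left alone) and count the bad lines.
bad=$(gzip -dc "$workdir/sitemap-sample-NS_0-0.xml.gz" | grep -c 'localhost' || true)
echo "lines mentioning localhost: $bad"
rm -rf "$workdir"
```

If the count is zero for your own files, the script may have found your site name correctly and the “localhost” repair is not needed.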

This can be fixed using the following script (based on this one). Copy and paste it into a new plain text file. Now a couple of variables need to be replaced:

  1. Change “$HOME/html/jjhalewiki/maintenance/” on lines 3 and 6 to the path to your maintenance directory. This is the second part of the command you copied from the Cron Manager earlier; for line 3, leave off the “generateSitemap.php”.
  2. Do a search and replace for wiki.jjhale.com, replacing it with the URL of your wiki without the “http://” bit.
    • If your wiki is at http://www.mySite.com/wiki, use the string www.mySite.com\/wiki. The slash has to be escaped with a backslash because the sed commands in the script use “/” as their delimiter.
  3. Replace “sitemap-index-jjh0819204212248-mw_.xml” with the name of the generated file starting “sitemap-index” and ending “xml.gz”, but remove the “.gz”. You will find this file in the maintenance folder after you run the “Test Wiki sitemap” cron job.
    • E.g. once generateSitemap.php has been run by the Cron Manager, look in the maintenance directory (either FTP to it or use the GoDaddy File Manager) for a file starting “sitemap-index” and ending “xml.gz”. If you found “sitemap-index-myDBusername-mw_.xml.gz”, you would replace “sitemap-index-jjh0819204212248-mw_.xml” in the script with “sitemap-index-myDBusername-mw_.xml”. Note the lack of “.gz”.
  4. Save the script as “sitemap_fix”

#!/bin/sh
echo "generating the sitemap..."
cd $HOME/html/jjhalewiki/maintenance/
echo Current Dir
pwd
/web/cgi-bin/php5 $HOME/html/jjhalewiki/maintenance/generateSitemap.php
echo "prepping the files..."
# move the generated files up to the wiki root and uncompress them
mv -f *.gz ../
mv -f *.xml ../
cd ..
gzip -d -f *.xml.gz

# the index file becomes sitemap.xml
mv -f sitemap-index-jjh0819204212248-mw_.xml sitemap.xml

echo "repairing the index file..."
# prefix each <loc> entry in the index with the site address
sed 's/<loc>/<loc>http:\/\/wiki.jjhale.com\//g' sitemap.xml > sitemap.xml.sed
mv sitemap.xml.sed sitemap.xml

echo "repairing the files..."
# swap "localhost" for the real host name in every sitemap file
for i in *.xml; do
sed 's/localhost/wiki.jjhale.com/g' "$i" > "$i.sed"
mv -f "$i.sed" "$i"
done

echo "gzip it all back up..."
# recompress everything, then leave sitemap.xml itself uncompressed
gzip -f *.xml
gzip -d -f sitemap.xml.gz
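As a variation, the values from the list above could live in variables at the top of the script, so that there is only one place to edit. The sketch below is an illustration rather than a drop-in replacement for the script above: it runs the two sed repairs on made-up sample files in a temporary directory, and the variable names SITE_HOST and INDEX_NAME are invented for the example. It uses “|” as the sed delimiter, so a host containing a slash, such as www.mySite.com/wiki, needs no escaping here.

```shell
#!/bin/sh
# Illustrative names, not part of the original script.
SITE_HOST="www.mySite.com/wiki"
INDEX_NAME="sitemap-index-myDBusername-mw_.xml"

# Made-up sample files standing in for generateSitemap.php output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '<sitemap><loc>sitemap-myDBusername-mw_-NS_0-0.xml.gz</loc></sitemap>\n' > "$INDEX_NAME"
printf '<url><loc>http://localhost/index.php/Main_Page</loc></url>\n' > sitemap-myDBusername-mw_-NS_0-0.xml

# The index file becomes sitemap.xml, with the site address added to each <loc>.
mv -f "$INDEX_NAME" sitemap.xml
sed "s|<loc>|<loc>http://$SITE_HOST/|g" sitemap.xml > sitemap.xml.sed
mv -f sitemap.xml.sed sitemap.xml

# Replace localhost with the real host in every sitemap file.
for i in *.xml; do
    sed "s|localhost|$SITE_HOST|g" "$i" > "$i.sed"
    mv -f "$i.sed" "$i"
done

index_line=$(cat sitemap.xml)
page_line=$(cat sitemap-myDBusername-mw_-NS_0-0.xml)
echo "$index_line"
echo "$page_line"
cd / && rm -rf "$workdir"
```

Both printed lines should now carry the real site address instead of a bare file name or “localhost”.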

Step 4 – Upload the script

Upload the script that you just saved to the directory that you installed your wiki to. Go to the File Manager in the GoDaddy Hosting Control Center and find the file “sitemap_fix”. Check the box next to it and hit the “permissions” button. Check the “executable” box and uncheck the “web visible” box. Then hit OK.
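On a host where you can run chmod yourself, the rough equivalent of ticking “executable” is to add the execute bit for the file’s owner. A minimal sketch on a throwaway file; it covers only the executable part and makes no claim about how the “web visible” box is implemented on GoDaddy.

```shell
#!/bin/sh
# Throwaway stand-in for the uploaded sitemap_fix script.
workdir=$(mktemp -d)
script="$workdir/sitemap_fix"
printf '#!/bin/sh\necho hello from sitemap_fix\n' > "$script"

# Give the owner permission to execute the script.
chmod u+x "$script"

# Run it to confirm that the execute bit took effect.
result=$("$script")
echo "$result"
rm -rf "$workdir"
```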

Step 5 – Set the script to run

Now you need to go back to the Cron Manager. Edit the job that you previously created by selecting the check box beside “Test Wiki sitemap” and hitting the edit button.
Press the browse button next to the command text box, find the “sitemap_fix” file and hit select. Change the time the job executes so that it will run soon and press “Save”. (Note that it should still be set to run hourly while we test it.)

Step 6 – Check it is working

After the execution time of the cron job passes you should get an email saying the job has run. There should now be a file called “sitemap.xml” in the root directory of your wiki, e.g. http://wiki.jjhale.com/sitemap.xml. This file lists the other sitemap files, which actually contain the URLs of your wiki pages. Make sure that the URLs look right (i.e. they have your web address in them, not mine!). If they do not look right, go back and check the sitemap_fix file. If it does not work at all, send me a comment with the problem and the version number of your wiki (you can find it in the special pages). Note that this is only tested on 1.12.0 so far.
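If you download sitemap.xml, the check can be automated by counting the entries that do not start with your own address. A small sketch under stated assumptions: the function name count_bad_locs, the sample file and the address wiki.example.org are all made up, and grep -o is a common extension rather than something every grep is guaranteed to have.

```shell
#!/bin/sh
# count_bad_locs FILE PREFIX: count <loc> entries that do not start with PREFIX.
count_bad_locs() {
    grep -o '<loc>[^<]*' "$1" | sed 's/^<loc>//' |
        awk -v p="$2" 'index($0, p) != 1 { n++ } END { print n + 0 }'
}

# Made-up index file with one good entry and one that still says localhost.
workdir=$(mktemp -d)
cat > "$workdir/sitemap.xml" <<'EOF'
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://wiki.example.org/sitemap-NS_0-0.xml.gz</loc></sitemap>
<sitemap><loc>http://localhost/sitemap-NS_4-0.xml.gz</loc></sitemap>
</sitemapindex>
EOF

bad=$(count_bad_locs "$workdir/sitemap.xml" "http://wiki.example.org/")
echo "entries with the wrong address: $bad"
rm -rf "$workdir"
```

A result of zero means every entry starts with the address you passed in.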

Step 7 – Change the frequency of the job and tell Google

Once it is working properly you can change the frequency at which the cron job is run. This will depend on how frequently your site changes. Go back to the Cron Manager, select the job and hit edit.

You can then head over to Google Webmaster Tools and tell them where your “sitemap.xml” file is.

Good Luck!

jjh
