Posts Tagged ‘source code’

RedHat init Script For trac

January 15th, 2009

I found a few init scripts for trac, but they were all Debian-based and relied on ‘start-stop-daemon’, which does not exist on RedHat-based distros. Here’s an init script that works with RedHat.

#!/bin/bash
#
# tracd Start/Stop tracd.
#
# chkconfig: - 62 38
# description: tracd
#
# processname: tracd
#
# Author: Elliot (info@m7tech.net)

# Source function library
. /etc/init.d/functions

# Get network config
. /etc/sysconfig/network

RETVAL=0

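# Port tracd listens on, and the unprivileged user it runs as
TRACD_PORT=8000
TRACD_USER=tracd
DAEMON=/usr/bin/tracd
# Despite the name, this is the subsys lock file that marks the service
# as started, not a file containing the process ID
PIDFILE=/var/lock/subsys/tracd
# Parent directory that holds the trac environments
TRACD_DIR=/subversion/trac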

start() {
    echo -n $"Starting tracd: "
    daemon --user $TRACD_USER $DAEMON --port=$TRACD_PORT --daemonize \
                 --env-parent-dir=$TRACD_DIR
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && touch $PIDFILE
    return $RETVAL
}

stop() {
    echo -n $"Stopping tracd: "
    killproc tracd
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && rm -f $PIDFILE
    return $RETVAL
}

restart() {
    stop
    start
}

reload() {
    stop
    start
}

case "$1" in
  start)
      start
    ;;
  stop)
      stop
    ;;
  status)
    status tracd
    ;;
  restart)
      restart
    ;;
  condrestart)
      [ -f $PIDFILE ] && restart || :
    ;;
  reload)
    reload
    ;;
  *)
    echo $"Usage: $0 {start|stop|status|restart|condrestart|reload}"
    exit 1
esac

exit $?

Madlibs Site Generator Code

November 10th, 2008

Writing content sucks. So what if there were a way to generate locale-specific content (and ads!) using a template and a few small tricks? The idea is to write a small amount of content as a template, then populate that template using a database. The source code I’m providing here uses a “database” (really just a list) of US states, cities and keywords. These three variables are used to generate both a search-engine friendly URL and the content that shows location-specific ads as sponsored results.

This code is two years old, but it shows a few interesting tricks, like putting keywords into subdomains and then extracting them. It generates a site with thousands of pages using only one PHP script and some HTML templates. The generated URLs are SEO-friendly, and each page gets a distinct title and meta description. I used the following URI scheme:

http://<state>.<domain>/<keyword>/<city>/

You can change this by modifying the rewrites in the .htaccess file, and the getState, getCity and getKeyword methods in util.php. The subdomains actually worked really well in terms of indexing: Google loved them. Obviously you’ll need to write new templates, too. One modification that would really help is to have multiple templates, or to make the templates PHP-based so they can generate different content for the leaf pages at the very bottom (i.e. the ones that include keyword, city and state). As it stands, the content is so obviously generated that it might not even get indexed these days.

If you want to actually use this code, just search for YOURDOMAIN and the word “widget” (case insensitive) to find all the places you need to modify. If you’re super stuck, email me and I’ll try to give you a hand. All I would ask is that you let me know where you’ve used it and how [un]successful it is.
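
The real code is PHP, but the subdomain trick itself only takes a few lines. Here is a rough Python sketch of the idea, not a port of util.php: the function names parse_location and page_title, the base domain example.com and the fallback topic "widgets" are all placeholders made up for illustration.

```python
from urllib.parse import urlsplit


def parse_location(url, domain="example.com"):
    """Split http://<state>.<domain>/<keyword>/<city>/ into its three parts.

    Missing parts come back as None, so the same function can serve the
    state-only, state+keyword and full leaf pages.
    `domain` is a placeholder for the site's base domain.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    suffix = "." + domain.lower()
    # The state is whatever precedes the base domain in the host name.
    state = host[: -len(suffix)] if host.endswith(suffix) else None
    # The path holds <keyword>/<city>; either segment may be absent.
    segments = [s for s in parts.path.split("/") if s]
    keyword = segments[0] if len(segments) > 0 else None
    city = segments[1] if len(segments) > 1 else None
    return state or None, keyword, city


def page_title(state, keyword, city):
    """Build a distinct page title from whichever parts are present."""
    place = ", ".join(p.replace("-", " ").title() for p in (city, state) if p)
    # "widgets" is only a stand-in topic for pages without a keyword.
    topic = (keyword or "widgets").replace("-", " ").title()
    return f"{topic} in {place}" if place else topic
```

With these definitions, parse_location("http://texas.example.com/blue-widgets/austin/") gives ("texas", "blue-widgets", "austin"), and feeding those into page_title gives "Blue Widgets in Austin, Texas". The same three values could fill in a meta description template.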

Here’s the madlibs site generator source code.

DMOZ Expired Domain Finder

November 10th, 2008

Back in the days of yore, right after Wikipedia started adding nofollow to all their outbound links, I built a bunch of small scripts that could be piped together to find expired domains in DMOZ (the Open Directory Project, whose data feeds Google’s directory). The idea was to download a copy of the directory, scrape all the domains out of it, see which ones were expired, register them and put up the old content from archive.org. It’s a shame that these domains go to waste, since they have a Google directory link, and usually more backlinks than that.

Now that Google pays attention to domain expiry, I don’t think the tools are of much use for the original purpose. They are, however, a good example of how a bunch of small, special-purpose utilities can be combined on the UNIX command line to accomplish tasks in parallel. The scripts could also be used for something else, like checking which domains on your list are expiring, or quickly scanning through DMOZ for link-exchange candidates.

Here’s the link to the DMOZ directory as a single gzipped file. It’s over 300MB compressed, and almost 2GB uncompressed. The source code for finding expired DMOZ domains is as follows:

  • parsedmoz.rb - Ruby script to extract domains from the content.rdf.u8 file
  • findMistakes.pl - Perl script that checks each domain for a DNS ‘A’ record and prints the domain if it lacks one (a good indication that the domain no longer exists). It uses memoization (caching) to remember whether it has already seen a domain, since there are lots of dupes in the DMOZ dump.
  • checkDomain.sh - bash shell script that prints out which domains are available for registration.

Note: No, I don’t know why I wrote all three scripts in different languages. I think someone made a stupid remark to me that day about which language was Teh bEsT 3v4r! I should’ve written the whole thing in Haskell for fun.

Also, each script takes its input from stdin and sends its output to stdout, so you can chain them if you want. For example:

$ cat content.rdf.u8 | ./parsedmoz.rb

Will give you a list of all host names in the DMOZ dump.

$ cat content.rdf.u8 | ./parsedmoz.rb | ./findMistakes.pl

Will give you a list of all domains from the DMOZ dump that don’t resolve anymore (perhaps a trip to archive.org is in order?) Piping these through checkDomain.sh will then give you a list of domains that are available for registration.
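
To make the filter idea concrete, here is a minimal Python sketch of a stage in the same style as findMistakes.pl. It is not the Perl script itself: the names unresolved, resolves and main are made up, and a plain set stands in for the memoization. It reads host names line by line, skips ones it has already seen, and passes along only those that fail to resolve.

```python
import socket
import sys


def resolves(host):
    """Return True if the host name resolves to an address."""
    try:
        socket.gethostbyname(host)
        return True
    except (OSError, UnicodeError):
        # Lookup failures and malformed names both count as "does not resolve".
        return False


def unresolved(lines, check=resolves):
    """Yield each distinct host from `lines` for which `check` returns False."""
    seen = set()  # remembers hosts already handled, since the dump has dupes
    for line in lines:
        host = line.strip().lower()
        if not host or host in seen:
            continue
        seen.add(host)
        if not check(host):
            yield host


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Filter entry point: host names in on stdin, unresolved ones out on stdout."""
    for host in unresolved(stdin):
        print(host, file=stdout)
```

The check function is a parameter so that the filtering logic can be tried out with a fake resolver before hitting real DNS with a couple of million names.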

Unfortunately, the whois servers for .org domains tend to limit the number of queries that you can make per hour. To get around the limit, you can run your whois lookups through SOCKS proxies, split your list and run it on multiple servers, or simply add a delay between lookups.
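
The delay option is the simplest of the three. A minimal Python sketch, assuming you already have some lookup function of your own (the name throttled and the 5 second default are arbitrary choices, not values any whois server publishes):

```python
import time


def throttled(items, lookup, delay_seconds=5.0, sleep=time.sleep):
    """Call lookup(item) for each item, pausing between consecutive calls.

    `lookup` is whatever function performs the whois query; it is passed in
    because this sketch does not assume any particular whois client.
    """
    first = True
    for item in items:
        if not first:
            sleep(delay_seconds)  # wait before every call except the first
        first = False
        yield item, lookup(item)
```

Because it is a generator, results come out one at a time, so the stage can still sit in the middle of a pipeline and feed whatever comes next.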

Content Generation With N-Grams

November 7th, 2008

Although this is an outdated method, I thought I would post some content generation code I wrote a while ago. Google possesses the n-gram data (more on those later) and the algorithms to detect content generated in this fashion. It’s a cool method for text generation, but I haven’t found much available source code for it. Sure, there is code to take some text and generate n-grams (there’s a Perl module for it!), but no sample code to run the n-grams in “reverse” to generate statistically-equivalent text.

The steps for generating statistically-equivalent text to some document are as follows:

  1. Generate a database of n-grams from source document(s) that are similar in nature to what you want to generate. If you want to generate content about male pattern baldness, use articles and content about male pattern baldness. You must record how often each n-gram appears in your source text.
  2. For each n-gram, create a new record that has the first n-1 characters as the key, and the last character and how often it occurred as the value. For example, the 4-gram “then” occurred 15 times in your source text, so your new database entry would have the key “the” with the value ( “n”, 15 ).
  3. Group the same keys together, from step 2. This new database is what you will use to generate the content. For example, step 1 gave you the following 4-grams: ( “then” => 10, “ther” => 20, “thes” => 30 ). Grouping the results from step 2 would give you ( “the” => ( ( “n”, 10 ), ( “r”, 20 ), ( “s”, 30 ) ) ).
  4. Now, to generate text, simply start with a random key from your database at step 3, and use the occurrence values as weights for a random number generator to decide which character (n, r, or s above) should be chosen next. Then use the last n-1 characters of the text so far as the new key into your dictionary from step 3 and lather, rinse, repeat until you have enough text.
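
The four steps fit in a short program. What follows is a minimal Python sketch of them, not the Perl scripts linked from this post: the names build_dict and generate are made up, and the dead-end handling (jumping to a fresh random key) is an assumption of this sketch that the steps do not specify.

```python
import random
from collections import Counter, defaultdict


def build_dict(text, n):
    """Steps 1-3: count character n-grams, then group them by their first n-1 chars."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Step 1: how often each n-gram appears in the source text.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Steps 2 and 3: key = first n-1 chars, value = {last char: count}.
    table = defaultdict(dict)
    for gram, count in counts.items():
        table[gram[:-1]][gram[-1]] = count
    return dict(table)


def generate(table, length, rng=random):
    """Step 4: weighted random walk over the table until `length` chars exist."""
    if not table:
        raise ValueError("empty table: source text shorter than n?")
    keys = sorted(table)
    out = rng.choice(keys)          # start from a random key
    k = len(out)                    # key length, i.e. n-1
    while len(out) < length:
        followers = table.get(out[-k:])
        if followers is None:
            # Dead end (a key seen only at the very end of the source).
            # Restarting from a random key is this sketch's own choice.
            out += rng.choice(keys)
            continue
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights)[0]
    return out[:length]
```

For example, build_dict("then ther ther thes thes thes", 4) maps the key "the" to {"n": 1, "r": 2, "s": 3}, which is the grouped form from step 3, and generate(table, 1000) then walks that table to produce 1,000 characters.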

Here is the source code to generate content. To generate 1,000 characters of text, put all your source content into a file (we’ll call it source.txt), and do the following:

$ gendict.pl 8 source.txt > s_dict.txt
$ gentext.pl s_dict.txt 1000

Obviously you can play around with the ‘n’ parameter (I chose 8 as a starting point). If you go too small, you’ll end up generating garbage words; if you go too big, the output will make more sense, but only because you’ll be reproducing large portions of your source text verbatim.

I used character-level n-grams in this code, but word-level n-grams would also work well given a large enough body of source text. This is similar to the Dissociated Press algorithm, except that we do a pre-processing step and build an n-gram database first. This n-gram database can be used for other things, such as duplicate content detection, generated content detection, and source author recognition to detect cheaters and people using essay-writing services.

Modifying the code to stick the generated text into a MySQL database, and then generating an RSS feed from that, would allow you to use a technique like Affiliate Marketing through RSS Feeds easily. The key to this method is giving it enough source content, and playing around with the size of the n-grams.

Here is some sample text I generated using this document as source with n=8:

baldness, use articles and content detection, generated in this code, but
word-level n-gram appears in your source text, so your new database of
n-grams in thi. It's a cool method, I thought I would work well, for a
large source to generaly-equivalent text.
