Write Your Own Bayesian Classifier!

(LPW '07) (john melesky)

This is an hour-long talk squeezed into 30 minutes, so let's get going.

(oh, i'm available for freelance work in machine learning/natural language processing)

Wait, wait

Why the hell should i reinvent the wheel (er, write my own Bayesian classifier)?

Opening joke

Q: How many Londoners does it take to change a lightbulb?

A: 127

Really?

If this is true, that means that 8.415 million Londoners (99%!) are changing lightbulbs right now.

By contrast, only 1% of the rest of the world are currently changing lightbulbs.

Question

If you're changing a lightbulb right now, what's the likelihood you're a Londoner?

(hint: the answer is not 99%)

Bayes' Theorem

If you look it up on Wikipedia, you'll see something like this.
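The standard statement of the theorem, with A as the hypothesis and B as the evidence, reads as follows. Here P(A) is the prior, P(B|A) is the probability of the evidence given the hypothesis, and P(A|B) is what we actually want to know.

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]
```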

Bayes' Theorem

Translated, roughly:

Easy to write:

# Bayes' theorem: returns P(A|B) given P(A), P(B) and P(B|A), in that order.
sub bayes {
  my ($p_a, $p_b, $p_b_a) = @_;
  
  my $p_a_b = ($p_b_a * $p_a) / $p_b;
  
  return $p_a_b;
}
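As a rough worked example, here is the lightbulb question run through a bayes() with the same argument order. The population figures are assumptions, not from the talk: 8.5 million Londoners (so that 99% is 8.415 million) and a world population of about 6.6 billion.

```perl
use strict;
use warnings;

# Same argument order as the talk's bayes(): P(A), P(B), P(B|A).
sub bayes {
  my ($p_a, $p_b, $p_b_a) = @_;
  return ($p_b_a * $p_a) / $p_b;
}

# ASSUMPTION: both population figures are illustrative, not from the talk.
my $londoners        = 8_500_000;       # 99% of this is 8.415 million
my $world_population = 6_600_000_000;   # rough 2007 figure

my $p_londoner         = $londoners / $world_population;
my $p_bulb_if_londoner = 0.99;
my $p_bulb_if_other    = 0.01;

# P(B): overall probability of changing a lightbulb, summed over both groups.
my $p_bulb = $p_bulb_if_londoner * $p_londoner
           + $p_bulb_if_other    * (1 - $p_londoner);

my $p_londoner_if_bulb = bayes($p_londoner, $p_bulb, $p_bulb_if_londoner);

printf("P(Londoner | changing a lightbulb) = %0.1f%%\n",
       $p_londoner_if_bulb * 100);
```

Under those assumptions the answer comes out at roughly 11%, nowhere near 99%, because Londoners are such a small slice of the world to begin with.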

Right, that's the theory

To make a classifier...

  1. Tokenize your training set
  2. Build your model
  3. Test it

Tokenize your training set

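sub tokenize {
  my $contents = shift;
  
  # split ' ' skips leading whitespace, so no empty-string token sneaks in.
  # Each distinct token maps to 1: we record presence, not frequency.
  my %tokens = map { $_ => 1 } split(' ', $contents);
  return %tokens;
}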
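A quick usage sketch, using a whitespace tokenizer along the lines of tokenize(). The sample sentence is made up for illustration. Note that a repeated word still ends up with the value 1.

```perl
use strict;
use warnings;

# Whitespace tokenizer: each distinct token maps to 1.
sub tokenize {
  my $contents = shift;
  my %tokens = map { $_ => 1 } split(' ', $contents);
  return %tokens;
}

# Hypothetical sample text; "the" appears twice but is recorded once.
my %tokens = tokenize("the meeting is at the office");

print join(", ", sort keys %tokens), "\n";
# prints: at, is, meeting, office, the
```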

Build your model

my %work_tokens = ();
my %notwork_tokens = ();
  
foreach my $file (@work_files) {
  my %tokens = tokenize_file("training_set/" . $file);
  %work_tokens = combine_hash(\%work_tokens, \%tokens);
}
  
foreach my $file (@notwork_files) {
  my %tokens = tokenize_file("training_set/" . $file);
  %notwork_tokens = combine_hash(\%notwork_tokens, \%tokens);
}
  
my %total_tokens = combine_hash(\%work_tokens, 
                   \%notwork_tokens);

Build your model

# Returns a new hash whose counts are the per-key sums of both inputs.
sub combine_hash {
  my ($hash1, $hash2) = @_;
  
  my %resulthash = %{ $hash1 };
  
  # += treats a missing key as 0, so no exists/else branch is needed.
  foreach my $key (keys(%{ $hash2 })) {
    $resulthash{$key} += $hash2->{$key};
  }
  
  return %resulthash;
}
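To see what the model ends up holding, here is a small sketch of combine_hash() applied to two hand-made token hashes (the tokens are invented for illustration). Because each file contributes at most 1 per token, the combined value is the number of files containing that token.

```perl
use strict;
use warnings;

# Per-key sum of two count hashes, returned as a new hash.
sub combine_hash {
  my ($hash1, $hash2) = @_;
  my %resulthash = %{ $hash1 };
  $resulthash{$_} += $hash2->{$_} for keys %{ $hash2 };
  return %resulthash;
}

# Hypothetical token sets for two training files.
my %doc1 = (budget  => 1, meeting => 1);
my %doc2 = (meeting => 1, lunch   => 1);

my %counts = combine_hash(\%doc1, \%doc2);

foreach my $token (sort keys %counts) {
  print "$token: $counts{$token}\n";
}
# prints:
# budget: 1
# lunch: 1
# meeting: 2
```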

Build your model

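sub tokenize_file {
  my $filename = shift;
  
  # Three-argument open with a lexical handle; fail loudly on a bad path.
  open(my $fh, '<', $filename)
    or die "Cannot open $filename: $!";
  my $contents = do { local $/; <$fh> };  # slurp the whole file
  close($fh);
  
  return tokenize($contents);
}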

Build your model

my $total_work_files = scalar(@work_files);
my $total_notwork_files = scalar(@notwork_files);
my $total_files = $total_work_files + $total_notwork_files;
my $probability_work = $total_work_files / $total_files;
my $probability_notwork = $total_notwork_files / $total_files;

Test it

Wait a minute ...

What is P(B|A), when you have more than one B?

For that matter, what is P(B), when you have more than one B?

P(B|A)

P(B|A) = P(B1|A) * P(B2|A) * ... * P(Bn|A)

(that is, treat each token B1..Bn as independent of the others, given A)

P(B)

Let's, um, ignore that for now.

Trust me, it will work out.
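One way to see why this is safe, as a sketch: with two classes, the two scores get normalised against each other, and P(B) appears in every term. Writing W for "work" and \bar{W} for "not work" (notation chosen here for illustration):

```latex
\[
\frac{P(W \mid B)}{P(W \mid B) + P(\bar{W} \mid B)}
= \frac{\dfrac{P(B \mid W)\,P(W)}{P(B)}}
       {\dfrac{P(B \mid W)\,P(W)}{P(B)} + \dfrac{P(B \mid \bar{W})\,P(\bar{W})}{P(B)}}
= \frac{P(B \mid W)\,P(W)}
       {P(B \mid W)\,P(W) + P(B \mid \bar{W})\,P(\bar{W})}
\]
```

Since P(B) cancels, any non-zero value used in its place gives the same normalised result.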

Test it

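# %total_tokens was already declared when the model was built, so no "my" here.
%total_tokens = combine_hash(\%work_tokens, \%notwork_tokens);
  
# %test_tokens holds the tokens of the message being classified,
# e.g. from tokenize_file() on that message (how it is loaded is an assumption).
my $work_accumulator = 1;
my $notwork_accumulator = 1;
my $total_tokens = scalar(keys(%test_tokens));
  
foreach my $token (keys(%test_tokens)) {
  # Tokens never seen in training say nothing about either class; skip them.
  if (exists($total_tokens{$token})) {
    # The "+ 1" keeps a token missing from one class from zeroing the product.
    my $p_t_w = (($work_tokens{$token} || 0) + 1)
                 / ($total_work_files + $total_tokens);
    $work_accumulator = $work_accumulator * $p_t_w;
  
    my $p_t_nw = (($notwork_tokens{$token} || 0) + 1)
                  / ($total_notwork_files + $total_tokens);
    $notwork_accumulator = $notwork_accumulator * $p_t_nw;
  }
}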

Test it

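# The second argument stands in for P(B), the part we agreed to ignore.
# It is the same for both classes, so it cancels in the normalisation below.
my $score_work = bayes( $probability_work,
                        $total_tokens,
                        $work_accumulator);
  
my $score_notwork = bayes( $probability_notwork,
                           $total_tokens,
                           $notwork_accumulator);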
  
my $likelihood_work = $score_work / ($score_work + $score_notwork);
my $likelihood_notwork = $score_notwork / ($score_work + $score_notwork);
  
printf("likelihood of work email: %0.2f %%\n",
       ($likelihood_work * 100));
printf("likelihood of notwork email: %0.2f %%\n",
       ($likelihood_notwork * 100));

And, we're done!

Possible improvements

Gotchas

Questions?

Thanks, kindly