Machine Learning (a primer)

(Portland Python, March 2008) (john melesky)

In a nutshell

Take facts, turn them into knowledge, algorithmically.

Discovering things

Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.

The problem: check the spelling of things that aren't in the dictionary

Indigo Montoya
Inigo Montana
Inigo Montoya
Neego Montoya
Inigo Mantoya

Spellcheck

Numbers we have include: the number of times each query is made, and the distance between queries (e.g., Levenshtein distance).

When a new query comes in, find the most common query within a short distance and suggest it.

And that's it.
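That rule fits in a few lines of Python. This is just a sketch: the hand-rolled Levenshtein distance is standard, but the query log, its counts, and the distance cutoff of 2 are made up for illustration.

```python
from collections import Counter

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(query, query_counts, max_distance=2):
    # Among past queries within max_distance edits, pick the most common one.
    candidates = [(count, q) for q, count in query_counts.items()
                  if q != query and levenshtein(query, q) <= max_distance]
    return max(candidates)[1] if candidates else None

# Hypothetical query log: the counts are invented for illustration.
queries = Counter({"inigo montoya": 40, "indigo montoya": 3,
                   "inigo montana": 2, "neego montoya": 1})
print(suggest("inigo mantoya", queries))  # -> inigo montoya
```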

Clustering

Problem: given a big pile of documents, figure out what different categories there are.

Solution: (a whole lot of) simple (high-dimensional) geometry

Technique: k-Means Clustering

1. Pick some (k) random points in your vector space.
2. For each document, figure out the nearest point.
3. Move each point to the mean of the documents nearest it.
4. Lather, rinse, repeat until the points stop moving.
5. Voila! Slow-cooked category discovery.
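That recipe, as a plain-Python sketch. The tiny 2-D blobs here stand in for document vectors, which in practice would be high-dimensional term vectors.

```python
import random

def kmeans(vectors, k, iterations=100):
    # Step 1: pick k random documents as the initial "means".
    means = random.sample(vectors, k)
    for _ in range(iterations):
        # Step 2: assign each vector to its nearest mean
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for v in vectors:
            distances = [sum((a - b) ** 2 for a, b in zip(v, m))
                         for m in means]
            clusters[distances.index(min(distances))].append(v)
        # Step 3: move each mean to the centroid of its cluster.
        new_means = [[sum(dim) / len(c) for dim in zip(*c)] if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:  # nothing moved: we're done
            break
        means = new_means
    return means, clusters

# Toy 2-D data with two obvious blobs, standing in for document vectors.
data = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
        [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]]
means, clusters = kmeans(data, k=2)
```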

Supervised Learning

When you already know something about your data, and you want to apply that knowledge to more data that you know less about.

Classification

You have 100 documents in two different categories. Predict the category for the next 5000 documents.

Technique: Nearest Neighbor

1. Plot your knowns
2. Figure out the closest known to your unknown (geometrically)
3. Give the unknown that known's category
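Those steps fit in a few lines of Python. The vectors and the "spam"/"ham" labels here are invented for illustration.

```python
def nearest_neighbor(unknown, knowns):
    # knowns: list of (vector, label) pairs.
    # Return the label of the geometrically closest known point.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(knowns, key=lambda kv: sq_dist(unknown, kv[0]))[1]

# Hypothetical training set: 2-D points in two categories.
training = [([0.0, 0.0], "spam"), ([0.1, 0.2], "spam"),
            ([5.0, 5.0], "ham"), ([5.2, 4.9], "ham")]
print(nearest_neighbor([4.8, 5.1], training))  # -> ham
```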

Technique: Linear Separation

1. Plot your knowns
2. Figure out a line separating the categories
3. Use that line to classify the unknowns
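The slides don't name an algorithm for finding that line; one classic choice is the perceptron rule, sketched below under the assumption that the data really is linearly separable. The points and labels are invented.

```python
def train_perceptron(data, epochs=100):
    # data: list of (vector, label) pairs with labels +1 / -1.
    # Learns weights w and bias b so that sign(w.x + b) separates the classes.
    dims = len(data[0][0])
    w, b = [0.0] * dims, 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * activation <= 0:  # misclassified: nudge the line toward x
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                errors += 1
        if errors == 0:  # every point on the right side: done
            break
    return w, b

def classify(x, w, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Two made-up 2-D classes that a straight line can separate.
points = [([1.0, 1.0], -1), ([1.5, 0.5], -1), ([4.0, 4.0], 1), ([3.5, 4.5], 1)]
w, b = train_perceptron(points)
```

Once trained, `classify` puts any new point on one side of the line or the other.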

(figure slides: Linear Separation, steps 1 and 2; non-linearly separable data)

Technique: Support Vector Machines

Technique: Naive Bayesian Classifiers

Not geometric, but statistical.

Bayes' Theorem

Future probabilities derived from prior probabilities

Bayes' Theorem

If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?

(hint: it's not 95%)

Bayes' Theorem

Answer: it depends on how many people use drugs.

If the rate of drug use is 1%, then we have:

              test positive   test negative
  users       95% of 1%       5% of 1%
  non-users   5% of 99%       95% of 99%

Proportion of positive results: 0.95% + 4.95% == 5.9%

Proportion of positive results that are correct: 0.95% / 5.9% == 16.1%
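The same arithmetic as a tiny Python function, using the 95% accuracy and 1% base rate from the slides:

```python
def p_user_given_positive(accuracy, base_rate):
    # Bayes' theorem: P(user | positive) =
    #   P(positive | user) * P(user) / P(positive)
    true_positives = accuracy * base_rate                # users who test positive
    false_positives = (1 - accuracy) * (1 - base_rate)   # non-users who test positive
    return true_positives / (true_positives + false_positives)

print(round(p_user_given_positive(0.95, 0.01), 3))  # -> 0.161
```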

The basic process

1. Look at your data, figure out a good numeric representation
2. Turn your data into numbers (usually vectors of numbers)
3. Run your algorithms
4. Profit! (or Fun!)