(Portland Python, March 2008) (john melesky)

Take facts, turn them into knowledge.

Take facts, turn them into knowledge, algorithmically.

Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.

The problem: check the spelling of things that aren't in the dictionary

Indigo Montoya

Inigo Montana

Inigo Montoya

Neego Montoya

Inigo Mantoya

Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)

When a new query comes in, find the most common query within a short distance and suggest it.

And that's it.
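A minimal sketch of that suggestion step, assuming the query log is a plain dict of query-to-count. The counts, the `max_distance` cutoff, and the rule of only suggesting something more common than the query itself are illustrative assumptions, not part of the talk.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(query, query_counts, max_distance=3):
    """Most common known query within max_distance of `query`, or None."""
    own_count = query_counts.get(query, 0)
    nearby = [
        (count, known)
        for known, count in query_counts.items()
        # Assumption: only suggest queries more common than the one typed.
        if count > own_count and levenshtein(query, known) <= max_distance
    ]
    return max(nearby)[1] if nearby else None


# Hypothetical query log; the counts are made up for illustration.
query_counts = Counter({
    "Inigo Montoya": 120,
    "Indigo Montoya": 7,
    "Inigo Montana": 4,
    "Inigo Mantoya": 2,
})

print(suggest("Neego Montoya", query_counts))
```

Comparing against every known query is slow for a large log; it is only meant to show the idea.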

Problem: given a big pile of documents, figure out what different categories there are.

Solution: simple geometry

Solution: simple (high-dimensional) geometry

Solution: (a whole lot of) simple (high-dimensional) geometry

- Pick some (k) random points in your vector space.
- For each document, figure out the nearest point.
- Move each point to the center of the documents nearest to it.
- Lather, rinse, repeat.
- Voila! Slow-cooked category discovery
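The loop above is usually called k-means. A rough sketch in plain Python, assuming documents have already been turned into equal-length tuples of numbers. One labeled departure from the steps as listed: the starting points are sampled from the documents themselves rather than from anywhere in the space. The function names and the toy data are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise average of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(vectors, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k of the documents, not arbitrary points.
    centers = rng.sample(vectors, k)
    clusters = []
    for _ in range(max_iterations):
        # Assign every document to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: squared_distance(v, centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its documents (stay put if it has none).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # nothing moved, so the assignments are stable
        centers = new_centers
    return centers, clusters


# Toy 2-D "documents" in two obvious groups.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(vectors, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

Real document vectors have thousands of dimensions, which is where the "slow-cooked" part comes from.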

When you already know something about your data, and you want to apply that knowledge to more, less-known data

You have 100 documents in two different categories. Predict the category for the next 5000 documents.

- Plot your knowns
- Figure out the closest known to your unknown (geometrically)
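A minimal sketch of the closest-known approach (nearest neighbor), assuming each known document is a `(vector, label)` pair. The labels and coordinates below are made up for illustration.

```python
import math


def nearest_neighbor(unknown, knowns):
    """Return the label of the known point closest to `unknown`."""
    _, label = min(knowns, key=lambda pair: math.dist(unknown, pair[0]))
    return label


# Hypothetical labeled documents, already reduced to 2-D vectors.
knowns = [
    ((1.0, 1.0), "sports"),
    ((1.2, 0.8), "sports"),
    ((5.0, 5.0), "politics"),
    ((4.8, 5.3), "politics"),
]

print(nearest_neighbor((1.1, 1.1), knowns))
print(nearest_neighbor((4.0, 4.5), knowns))
```

`math.dist` needs Python 3.8 or later. Every prediction scans all the knowns, so this is cheap to set up and expensive to run.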

- Plot your knowns
- Figure out a line separating the categories
- Use that line to classify the unknowns
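The steps above do not say how to find the line. One simple option, swapped in here purely as an illustration, is the perceptron: nudge the line every time it puts a known on the wrong side. The sketch assumes labels are +1 or -1 and that the two categories really can be split by a line; the data is made up.

```python
def score(x, weights, bias):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_perceptron(examples, epochs=1000):
    """examples: list of (vector, label) with label +1 or -1."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in examples:
            if label * score(x, weights, bias) <= 0:
                # Wrong side (or on the line): shift the line toward x's label.
                weights = [w + label * xi for w, xi in zip(weights, x)]
                bias += label
                mistakes += 1
        if mistakes == 0:
            break  # every known is on the correct side
    return weights, bias


def classify(x, weights, bias):
    return 1 if score(x, weights, bias) > 0 else -1


# Hypothetical knowns: category -1 lower left, category +1 upper right.
examples = [
    ((-2, -1), -1), ((-1, -2), -1), ((-1, -1), -1),
    ((1, 1), 1), ((2, 1), 1), ((1, 2), 1),
]
weights, bias = train_perceptron(examples)
print(classify((3, 3), weights, bias), classify((-3, -2), weights, bias))
```

Unlike nearest neighbor, classifying an unknown here costs one dot product no matter how many knowns there were.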

Not geometric, but statistical.

Future probabilities derived from prior probabilities

If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?

(hint: it's not 95%)

Answer: Depends on how many people use drugs.

If the rate of drug use is 1%, then we have:

| | test positive | test negative |
|---|---|---|
| users | 95% of 1% | 5% of 1% |
| non-users | 5% of 99% | 95% of 99% |

Share of all tests that come back positive: 0.95% + 4.95% == 5.9%

Share of positive results that are *correct*: 0.95% / 5.9% == 16.1%
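The same arithmetic as a small function, assuming (as the table does) that "95% accuracy" means the test is right 95% of the time for users and non-users alike. The function and parameter names are illustrative.

```python
def probability_user_given_positive(accuracy, base_rate):
    """P(uses drugs | tests positive), by Bayes' rule."""
    true_positives = accuracy * base_rate               # users who test positive
    false_positives = (1 - accuracy) * (1 - base_rate)  # non-users who test positive
    return true_positives / (true_positives + false_positives)


p = probability_user_given_positive(accuracy=0.95, base_rate=0.01)
print(f"{p:.1%}")  # 16.1%
```

Raising `base_rate` to 0.5 brings the answer back to 95%, which is why the intuitive guess only works when half the population uses drugs.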

- Look at your data, figure out a good numeric representation
- Turn your data into numbers (usually vectors of numbers)
- Run your algorithms
- Profit! (or Fun!)