Cameron Smith: welcome to my personal site

In this article I’ll teach you how to build a text classification model from scratch. You’ll enter some text, and the app will identify which language it is written in (English, Spanish, Vietnamese, etc.). All you need to get started is a basic knowledge of Python, the command line, and Git.

Completing this project will give you a good sense of the “full stack” of technical concepts that data scientists encounter: data acquisition, data analysis, modeling, visualization, app development, etc.

This project can roughly be divided into three steps. The first involves downloading data from Wikipedia and saving it to a database. The second step is building a model that predicts a language when given some text. The third step is deploying the app to the web.
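To make the second step concrete, here is a minimal sketch of one common way to guess a language: compare character trigram counts against a small profile per language. This is a stand-in technique chosen for illustration, not necessarily the model built in the article. The names `SAMPLES`, `char_ngrams`, `cosine`, and `predict_language` are hypothetical, and the two training sentences are toy data rather than the Wikipedia text the project actually downloads.

```python
import math
from collections import Counter

# Toy training text (an assumption for this sketch; the real project uses Wikipedia data).
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and this is the way that we write things",
    "Spanish": "el rápido zorro marrón salta sobre el perro perezoso y esta es la manera en que escribimos",
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with spaces to mark word edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One n-gram profile per language, built once from the toy samples.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def predict_language(text):
    """Return the language whose profile is most similar to the input text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


print(predict_language("the dog is lazy"))
print(predict_language("el perro salta"))
```

With profiles this small the guesses are fragile; a usable model would need far more text per language.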

You can find the source code on GitHub. If you’d like a sneak peek at what the finished application looks like in the wild, click here.

When I’m trying to decipher some hairy math formula, I find it helpful to translate the equation into code. In my experience, it’s often easier to follow the logical flow of a programming function than an equivalent function written in mathematical notation. This guide is intended for programmers who want to gain a deeper understanding of both mathematical notation and the concepts behind it. I decided to use Python as it’s closer to pseudo-code than any other language I’m familiar with.

This is not meant to be a reference manual or encyclopedia. Instead, this document is intended to give a general overview of mathematical concepts and their relationship with code. All code snippets were typed into the interpreter, so I am omitting the canonical >>>.
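As a small illustration of the idea (my own example, not one taken from the guide), consider the sum of i squared for i running from 1 to n, usually written with a capital sigma. A loop makes the bounds and the running total explicit. The function names below are hypothetical.

```python
def sum_of_squares(n):
    """Sigma notation as a loop: add up i**2 for i = 1, 2, ..., n."""
    total = 0
    for i in range(1, n + 1):  # range() excludes its upper bound, hence n + 1
        total += i ** 2
    return total


def sum_of_squares_closed_form(n):
    """The same quantity via the closed form n(n + 1)(2n + 1) / 6."""
    return n * (n + 1) * (2 * n + 1) // 6


print(sum_of_squares(4))              # 1 + 4 + 9 + 16 = 30
print(sum_of_squares_closed_form(4))  # 30
```

Checking the loop against the closed form for a few values of n is a quick way to confirm that the notation was read correctly.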

Please see the update post for a sleeker and more modern approach to the same problem!

I’m obsessed with learning different writing systems, but keeping things organized can be difficult. Fortunately, very few writing systems are independent inventions: most are derived from other scripts. To make things easier for myself, I created a taxonomic tree of all writing systems descended from Egyptian hieroglyphs [1, 2]. Also included are some inspired orthographies such as Cherokee, which was invented by Sequoyah through the process of “stimulus diffusion”. Click here for the full-screen version (recommended).

Mouse over a node in this tree to see some information about the script.

1. Data taken from Wikipedia

2. Some taxonomic groupings such as North Brahmic and South Brahmic are used for convenience of organization even though they are not scripts themselves.

Pangrams are sentences that contain every character in a writing system at least once. The pangram most familiar to English speakers is “The quick brown fox jumps over the lazy dog.” Although they’re primarily used in typography, pangrams can also be useful for learning new writing systems. They also illustrate some of the fundamental differences between the world’s writing systems, which is what I’ll be exploring in this blog post. This will be a journey deep down the orthography rabbit-hole, with detours into history, linguistics, religion, and even poetry, so hold onto your seatbelts…
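The definition above is easy to check in code. Here is a minimal sketch, assuming an alphabet can be passed in as a plain string of characters; the helper names `missing_letters` and `is_pangram` are hypothetical, and the default covers only the 26 unaccented English letters.

```python
import string


def missing_letters(sentence, alphabet=string.ascii_lowercase):
    """Return the characters of `alphabet` that never appear in `sentence`."""
    seen = set(sentence.lower())
    return sorted(set(alphabet) - seen)


def is_pangram(sentence, alphabet=string.ascii_lowercase):
    """A sentence is a pangram if no character of the alphabet is missing."""
    return not missing_letters(sentence, alphabet)


print(is_pangram("The quick brown fox jumps over the lazy dog"))
print(missing_letters("The quick brown fox jumps over the lazy cat"))
```

Passing a different `alphabet` string extends the check to other alphabetic scripts, though writing systems that are not simple alphabets would need a different notion of "every character".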