Cameron Smith. Data, language, code. Welcome to my personal site.

This is how I create modeling servers on Google Cloud. Below I’ve listed the individual steps. Keep in mind, however, that I tend to bundle the programmatic portions of these steps into a bash script so that I’m not copy-pasting each line. The Google Cloud CLI can also be used to create projects and instances instead of the GUI, but I find that the GUI doesn’t take much more time, and it’s a useful tool for visualizing your projects and data usage. Get familiar with both!
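Since the bundling usually lives in a bash script, here’s the same idea sketched in Python (the language used elsewhere on this site), shelling out to the `gcloud` CLI. The project name, instance name, zone, and machine type are placeholders, and `gcloud_create_instance` is a hypothetical helper, not code from my actual script:

```python
# Sketch: bundling the instance-creation step into a script so each
# command isn't copy-pasted by hand. Placeholders throughout.
import shlex
import subprocess

def gcloud_create_instance(project, name, zone="us-central1-a",
                           machine_type="n1-standard-4", dry_run=True):
    """Build (and optionally run) a `gcloud compute instances create` command."""
    cmd = [
        "gcloud", "compute", "instances", "create", name,
        "--project", project,
        "--zone", zone,
        "--machine-type", machine_type,
    ]
    if dry_run:
        return shlex.join(cmd)  # inspect the command before running it
    return subprocess.run(cmd, check=True)

print(gcloud_create_instance("my-modeling-project", "model-server-1"))
```

With `dry_run=True` the helper only prints the command it would run, which is a handy way to sanity-check a script before it touches your cloud account.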

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

–John von Neumann

Models are simplified representations of the world around us. Sometimes by simplifying something, we’re able to see connections that we wouldn’t be able to notice with all the messy details still present. Because models can represent relationships, they can also be used to predict the future. However, sometimes modeling something too closely becomes a problem. After all, a model should be a simplified version of reality: when you try to create an over-complicated model, you might begin noticing connections that don’t actually exist! In this blog post I’ll discuss how regularization can be used to make sure our models reflect reality in meaningful ways.
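As a toy illustration of the idea: an L2 (“ridge”) penalty on a one-parameter model y ≈ w·x has the closed-form solution w = Σxᵢyᵢ / (Σxᵢ² + λ), so a larger penalty λ shrinks the fitted weight toward zero, pulling the model away from chasing noise. The `ridge_weight` helper below is illustrative, not code from the post:

```python
# Closed-form ridge estimate for the one-parameter model y ≈ w * x:
#     w = sum(x_i * y_i) / (sum(x_i ** 2) + lam)
# Larger lam => smaller |w|: the penalty discourages overfitting noise.

def ridge_weight(xs, ys, lam):
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + lam
    return num / den

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]   # roughly y = 2x plus noise

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_weight(xs, ys, lam))  # the weight shrinks as lam grows
```

With λ = 0 this is ordinary least squares; as λ grows, the estimate trades a little fit for a lot of stability.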

I struggled with algebra in middle school. I still don’t like algebra. When I was studying for math tests as a kid, I kept asking myself “why does this matter?” and “what does this mean?” After all, mathematics is emphatically not reality: it’s only connected to the world and our experiences through clever metaphors (or isomorphisms, if you want to use math jargon). Yes, a perfect sphere has never existed, but the Earth is close enough to spherical that we can pretend it is one.

The biggest reason people struggle to understand academic topics is that those topics aren’t given a sense of meaningfulness. This isn’t the fault of learners; it’s the fault of educators. When we navigate the world, we can’t help but try to discover the meaning of the things we hear, think, and see. Humans are great at extracting meaning: we understand the gist of an article, the plot of a story, or the intentions of someone talking to us. So it’s only natural that when we don’t find algebra meaningful, we lose interest.

You’ll never understand or appreciate mathematics, poetry, or chemistry if you view them as the meaningless manipulation of numbers, words, or chemicals. Unfortunately, this is exactly how all of these fields are taught in school. In this blog post I want to show how to make things more meaningful. Along the way, we’ll develop some techniques to help us extract meaning from the world around us.

When I first encountered multi-threading and multi-processing, I couldn’t distinguish the two. To me, both were some sort of magical way to make your programs run faster. However, understanding how multi-threading and multi-processing work is critical for many medium- and large-sized software projects. In this post, I’ll explain how each works.
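As a quick preview: in Python, threads share one interpreter and memory space (useful for I/O-bound work, since the GIL is released while waiting on I/O), while processes each get their own interpreter (useful for CPU-bound work). The standard library exposes both through nearly identical APIs; the helpers below are illustrative, not from the post itself:

```python
# Same map-style API, two very different execution models:
# ThreadPoolExecutor shares one interpreter; ProcessPoolExecutor forks
# or spawns separate interpreters.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def square(n):
    return n * n

def run_in_threads(fn, items):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fn, items))

def run_in_processes(fn, items):
    with ProcessPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fn, items))

if __name__ == "__main__":
    print(run_in_threads(square, range(5)))    # [0, 1, 4, 9, 16]
    print(run_in_processes(square, range(5)))  # same result, separate processes
```

For a trivial function like `square` the thread version can even be faster, because it skips the overhead of starting worker processes; the process version pays off when the work is genuinely CPU-bound.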

When I’m trying to decipher some hairy math formula, I find it helpful to translate the equation into code. In my experience, it’s often easier to follow the logical flow of a programming function than an equivalent function written in mathematical notation. This guide is intended for programmers who want to gain a deeper understanding of both mathematical notations and concepts. I decided to use Python as it’s closer to pseudo-code than any other language I’m familiar with.

This is not meant to be a reference manual or encyclopedia. Instead, this document gives a general overview of mathematical concepts and their relationship to code. All code snippets were typed directly into the interpreter, so I omit the canonical >>> prompt.

In this article, I’m going to teach you how to build a text classification application from scratch. To get started, all you need is a little Python, the rudiments of Bash, and a working knowledge of Git. The finished application will have a simple interface that lets users enter blocks of text and returns the predicted category of that text.

This project has three steps. The first is constructing a corpus of language data. The second is training and testing a language classifier model to predict categories. The third step is deploying the application to the web along with an API.
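To give a feel for the train/predict shape of step two, here is a deliberately crude bag-of-words scorer in pure Python. The actual model, corpus, and categories used in the project aren’t shown here; every name below is illustrative:

```python
# Toy text classifier: count words per label at training time, then
# score new text by (crudely smoothed) log word frequencies per label.
from collections import Counter, defaultdict
import math

def train(labeled_texts):
    """labeled_texts: iterable of (text, label). Returns per-label word counts."""
    counts = defaultdict(Counter)
    for text, label in labeled_texts:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Return the label whose word distribution best matches `text`."""
    words = text.lower().split()
    def score(label):
        total = sum(counts[label].values())
        # add-one smoothing so unseen words don't zero out the score
        return sum(math.log((counts[label][w] + 1) / (total + 1)) for w in words)
    return max(counts, key=score)

corpus = [("the cat sat on the mat", "english"),
          ("le chat est sur le tapis", "french")]
model = train(corpus)
print(predict(model, "the cat"))  # english
```

A real classifier would use proper features and a trained model rather than raw counts, but the interface — a `train` step producing a model and a `predict` step consuming it — is the same shape the three steps above build toward.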

You can find the source code on GitHub. If you’d like a sneak peek at what the application looks like in the wild, click here.

I’m obsessed with learning different writing systems, but keeping things organized can be difficult. Fortunately, very few writing systems are independent inventions: most are derived from other scripts. To make things easier for myself, I created a taxonomic tree of all writing systems descended from Egyptian Hieroglyphs [1, 2]. Also included are some inspired orthographies such as Cherokee, which was invented by Sequoyah. Click here for the full screen version (recommended).

1. Data taken from Wikipedia

2. Some taxonomic groupings such as North Brahmic and South Brahmic are used for convenience of organization even though they are not scripts themselves.

Amazon Mechanical Turk is an extremely useful resource for conducting basic human-subjects research. It allows researchers to quickly gather data from large numbers of participants at very low cost. MTurk was an essential ingredient in the creation of the ImageNet dataset used throughout the computer vision community. MTurk’s built-in platform is generally used for conducting surveys; however, I find MTurk more useful as a recruiting platform. After recruiting participants, I can redirect them to a website I’ve created that hosts my user study.

However, some researchers have criticized MTurk for its lack of transparency. It’s not always easy to verify whether participants actually meet certain background criteria. For instance, I’ve conducted several language-learning studies on MTurk that required native fluency in English. While MTurk is useful for initial hypothesis testing in language studies, it isn’t particularly reliable for longer-term studies where language fluency must be confirmed.

Visualizations of the world's growing (and changing) population.

I’ll be exploring pangrams: sentences that contain every character in a writing system at least once. The pangram most familiar to English speakers is “the quick brown fox jumps over the lazy dog.” Although they’re primarily used in typography, pangrams can also be useful for learning new writing systems.
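In code terms, checking an English pangram is just a set-containment test: does the sentence’s set of letters cover the whole alphabet? (`is_pangram` is an illustrative sketch; alphabets for other scripts would need their own character sets.)

```python
# A sentence is a pangram if its letters cover the entire alphabet.
import string

def is_pangram(sentence, alphabet=string.ascii_lowercase):
    return set(alphabet) <= set(sentence.lower())

print(is_pangram("the quick brown fox jumps over the lazy dog"))  # True
print(is_pangram("hello world"))                                  # False
```

The same function works for any script whose “alphabet” can be listed as a string of characters, which is exactly what makes some writing systems easier to check than others.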

In this blog post I’ll be looking at pangrams in Korean, Japanese, Arabic, and Hindi. Although all languages are created equal, some writing systems are more user-friendly than others: in the process of exploring pangrams, I’ll have the chance to contrast the relative merits of different orthographies.