Cameron Smith welcome to my personal site

Question answering with BERT

16 APR 2021 • code

Now that BERT is available to the public through libraries like transformers, it’s super easy to build question answering (QA) and dialogue systems. In this blog post I’ll introduce you to a simple QA system I built and show you how to use it. Enjoy 🥰

Which languages sound the most similar?

28 MAR 2021 • language

When you start learning a new language, the first thing you’ll notice is how it sounds. Specifically, you’ll notice the places where it sounds different from your native language: maybe there are tones, or unfamiliar consonant clusters, or speech sounds you’ve never heard before. I’ve always been curious about which languages sound similar. This blog post emerged from a project where I used math to quantify the difference between languages.

In the next section I’ll talk about how I framed this problem and the datasets I used. Then I’ll discuss the algorithm I developed and its results. I’ll conclude by talking about the significance of the results and the specific languages that turned up.

If you want to dive really deep into this topic, all the data analysis I did is in this notebook.

How to Build a Machine Learning App from Scratch

19 FEB 2021 • data

In this article I’ll teach you how to build a text classification app from scratch. You’ll enter some text from a language, and the app will identify which language it comes from (English, Spanish, Vietnamese, etc). All it’ll take you to get started is a rudimentary knowledge of Python, the command line, and Git.

Completing this project will give you a good sense of the “full stack” of technical concepts that data scientists encounter: data acquisition, data analysis, modeling, visualization, app development, and deployment. I don’t go super deeply into any of these topics, but if you’re a beginner I think it’s more important to know how data science works from end to end than it is to be an expert in any one topic.

This project can roughly be divided into three steps. The first involves downloading data from Wikipedia and saving it to a database. The second step is building a model that predicts a language when given some text. The third step is creating an app and deploying it to the web.

You can find the source code on Github. If you’d like a sneak peek at what the finished application looks like in the wild, click here.
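The second step, predicting a language from a piece of text, can be sketched in a few lines. This is a minimal illustration rather than the model from the post: it assumes a character-trigram profile per language, and the toy training sentences and the function names `trigrams`, `train`, and `predict` are hypothetical.

```python
from collections import Counter

def trigrams(text):
    """Count overlapping 3-character sequences in the lowercased, padded text."""
    text = f" {text.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(samples):
    """Build one trigram profile per language from {language: [sentences]}."""
    profiles = {}
    for language, sentences in samples.items():
        profile = Counter()
        for sentence in sentences:
            profile.update(trigrams(sentence))
        profiles[language] = profile
    return profiles

def predict(profiles, text):
    """Pick the language whose normalized profile overlaps most with the text."""
    query = trigrams(text)

    def overlap(profile):
        total = sum(profile.values())
        return sum(count * profile[gram] / total for gram, count in query.items())

    return max(profiles, key=lambda language: overlap(profiles[language]))

# Hypothetical toy data; the post itself uses text downloaded from Wikipedia.
SAMPLES = {
    "English": ["the cat is in the house", "where is the train station"],
    "Spanish": ["el gato está en la casa", "dónde está la estación de tren"],
}

profiles = train(SAMPLES)
print(predict(profiles, "the station is near the house"))
```

With only two sentences per language this is a toy, but the shape (collect text, build a profile per language, score new text against each profile) mirrors the data, model, and prediction steps described above.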

The Statistical Argument for Intersectionality

4 DEC 2020 • data

“Intersectionality” is one of those divisive buzzwords adopted by people on the left and ridiculed by people on the right. If you believe the alt-right narrative, intersectionality is a tool that liberals use in order to feel bad about themselves: not only am I a downtrodden African American, the narrative goes, but I’m intersectionally both African American and gay, a double victim.

However, the actual definition of intersectionality is simpler and much more meaningful: discrimination might not appear when you look at individual factors, like race, gender, or income: discrimination (or differences in outcomes) might only appear when multiple factors are taken into account.

In this blog post I’m going to argue that anyone who disagrees with intersectionality as a tool for analyzing discrimination is ignorant of basic marginal statistics. I’ll start off by introducing a made-up dataset and tools to analyze it. It will become apparent that the most useful tools for understanding this kind of data are “intersectional.”
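To make the marginal-versus-joint point concrete, here is a small sketch. The counts below are hypothetical and are not the made-up dataset from the post; the group labels and the `hire_rates` helper are assumptions chosen so that each single factor looks perfectly even while the combined groups do not.

```python
from collections import defaultdict

# Hypothetical counts: (race, gender) -> (applicants, hired).
DATA = {
    ("A", "X"): (100, 80),
    ("A", "Y"): (100, 20),
    ("B", "X"): (100, 20),
    ("B", "Y"): (100, 80),
}

def hire_rates(data, key_func):
    """Sum applicants and hires per group, then return the hire rate per group."""
    totals = defaultdict(lambda: [0, 0])
    for cell, (applicants, hired) in data.items():
        group = key_func(cell)
        totals[group][0] += applicants
        totals[group][1] += hired
    return {group: hired / applicants for group, (applicants, hired) in totals.items()}

by_race = hire_rates(DATA, lambda cell: cell[0])    # marginal over gender
by_gender = hire_rates(DATA, lambda cell: cell[1])  # marginal over race
by_both = hire_rates(DATA, lambda cell: cell)       # joint ("intersectional") view

print(by_race)
print(by_gender)
print(by_both)
```

Both marginal views report a 50% rate for every group, so neither factor shows a gap on its own. Only the joint view reveals that two of the four combined groups are hired at 20% and the other two at 80%.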

The Geometry of Social Distancing

22 APR 2020 • data

Public health organizations have stated that social distancing is one of the most effective ways to reduce the spread of the coronavirus. Social distancing entails active separation from in-person social gatherings: avoiding parties, public transport, crowded streets, and any other source of fun. When people are forced to go around others, the CDC offers specific guidelines about how to keep yourself and others safe: one of these guidelines is to keep six feet away from others. While this seems like a simple recommendation, it has specific, geometric implications…
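One geometric implication can be sketched directly; the post’s own analysis may take a different route. If everyone must stay at least six feet from everyone else, the densest arrangement on a large open plane is a hexagonal lattice, where each person accounts for (√3 / 2) · d² of area. The function name `estimate_capacity` and the 10,000 square foot example are hypothetical.

```python
import math

def area_per_person(spacing_ft=6.0):
    """Area each person accounts for in a hexagonal lattice with this spacing."""
    return (math.sqrt(3) / 2) * spacing_ft ** 2

def estimate_capacity(area_sq_ft, spacing_ft=6.0):
    """Rough head count for a large open area, ignoring walls and edge effects."""
    return math.floor(area_sq_ft / area_per_person(spacing_ft))

print(round(area_per_person(), 1))  # square feet per person at 6 ft spacing
print(estimate_capacity(10_000))    # e.g. a 100 ft by 100 ft plaza
```

At six feet this works out to roughly 31 square feet per person. The estimate is only meaningful for areas that are large compared with the spacing, because people standing along the boundary change the count in small spaces.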

How to Code (Part 2)

18 APR 2020 • code

“The idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer.”

Alan Turing, Computing Machinery and Intelligence

“Once men turned their thinking over to machines in the hope that this would set them free. But that only permitted other men with machines to enslave them.”

Frank Herbert, Dune

The science fiction novel Dune is set thousands of years after a cybernetic revolt led to the near-extinction of the human race. In the wake of this disaster, a religious cult has spread a dogma across the galaxy, proclaiming that Thou shalt not make a machine in the likeness of a man’s mind: a permanent ban on Artificial Intelligence and computers. In order to compensate for the absence of this technology, mentats have been trained to mimic the abilities of digital computers. But there’s a problem with this: digital computers themselves mimic the same logical reasoning that humans perform. The only difference between a computer and a human is that the computer is capable of holding a larger sequence of logical events in memory and can act faster. There are only a small number of fundamental building blocks that software is made out of: concepts like variables, data types, loops, and conditional statements – all of which are easily comprehensible to humans: you can memorize a program, write it on a piece of paper, or run it in a computer. This chapter will explore some of these fundamental building blocks in the Python language, leaving you with a strong foundation to build upon.
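As a taste of those building blocks, here is a short sketch that uses a variable, a couple of data types, a loop, and a conditional in one place. The function name `classify_temperatures` and the thresholds are hypothetical, chosen only for illustration.

```python
def classify_temperatures(readings_celsius):
    """Label each temperature reading as freezing, mild, or hot."""
    labels = []                       # a variable holding a list (a data type)
    for reading in readings_celsius:  # a loop visits every reading in turn
        if reading < 0:               # a conditional picks one branch
            labels.append("freezing")
        elif reading < 25:
            labels.append("mild")
        else:
            labels.append("hot")
    return labels

print(classify_temperatures([-5, 12.5, 30]))
```

Nothing here is beyond what a person could trace by hand on paper, which is the point of the passage above.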

How to Code (Part 1)

16 APR 2020 • code

I know a lot of people who work in tech but not as software engineers. They often ask me the question “how do I code?” This isn’t the kind of question you can answer with a yes or a no or in a single paragraph. Instead, I have to answer this question by pointing to books or online resources that promise to teach coding. I also have to point out which kinds of things aren’t coding (e.g. online interfaces, HTML, and SQL are not programming languages in the conventional imperative sense). But when I give an answer like this, it’s not really my answer: it’s someone else’s answer that I’m just packaging and reselling. In this blog post I’m going to give my answer to the question how do I code? – this is an answer that I personally find satisfying (and so will you!). This will be a long post, equivalent to a meaty textbook chapter or even a short monograph. In it I’ll first explain what programming means using very abstract language, then I’ll show you the environment that programmers work in. Lastly I’ll show you the core logical concepts that constitute programming.

Everything You Ever Wanted to Know About Linear Regression

26 MAR 2020 • data

This article is here to answer all the questions you’ve ever had about linear regression. I’ll walk you through every step of the process, from acquiring data and describing it to creating a model, evaluating it, optimizing it, and preventing it from overfitting. Along the way I’ll introduce generally applicable data science techniques like data munging, visualizations, and some nifty Python code. I take a two-pronged approach to introducing linear regression: code and math. Every problem I present in this article can be described from an abstract, mathematical perspective as well as concretely in code. Both of these perspectives are important, and shed light on different aspects of linear regression, so I do a careful dance back and forth between them. At the end of this article, you’ll have a deep understanding of linear regression. If you’re a beginner, you’ll find yourself comfortable. And if you’ve already encountered linear regression before, you’ll leave with all of your nagging questions answered.

Before reading this you should have a basic understanding of coding principles. All the code in this article is in Python, but if you know another high-level language it should be easy to follow along even if you’ve never written a single line of Python before. I also use Python’s matplotlib and pandas libraries frequently, and it’ll be worth your while to check out their documentation if you find yourself getting lost. I also introduce some mathematical formalisms from calculus and linear algebra. However, I explain these topics in such a way that they will make intuitive (if not explicit) sense even to people who’ve never taken a calculus class before.

Although linear regression is often introduced in a formal, statistical way, I’m going to talk about the topic from a less formal, more machine-learning-oriented perspective. This doesn’t mean that I’m any less rigorous, but that the rigour comes from the practical perspective of a data scientist rather than the theoretical perspective of a mathematician. Hold onto your seats and enjoy the ride!

This blog post also exists as a Jupyter notebook. Check it out here.
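For a flavour of the code-and-math pairing, here is a minimal sketch of simple linear regression with one feature, fitted with the closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. It uses plain Python rather than the pandas and matplotlib workflow of the article, and the function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the ratio below.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

On noisy data the same two lines of arithmetic give the best-fitting line in the least-squares sense. The function assumes at least two distinct x values, since otherwise the variance is zero.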

How to Create a Modeling Server on Google Cloud

15 SEP 2019 • code

This is how I create modeling servers on Google Cloud. Below I’ve listed the individual steps. However, keep in mind that I tend to bundle the programmatic portions of these steps into a bash script so that I’m not copy-pasting each line. Also, the Google Cloud CLI can be used to create projects and instances instead of the GUI. However, I find that using the GUI doesn’t add much more time, and it can be a useful tool for visualizing your data usage and projects. Get familiar with both!

Penalizing Complexity

31 AUG 2019 • data

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

–John Von Neumann

Models are simplified representations of the world around us. Sometimes by simplifying something, we’re able to see connections that we wouldn’t be able to notice with all the messy details still present. Because models can represent relationships, they can also be used to predict the future. However, sometimes modeling something too closely becomes a problem. After all, a model should be a simplified version of reality: when you try to create an over-complicated model, you might begin noticing connections that don’t actually exist! In this blog post I’ll discuss how regularization can be used to make sure our models reflect reality in meaningful ways.
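One common form of regularization is an L2 (ridge) penalty, sketched below for the simplest possible model, y ≈ w · x with no intercept. This is an illustrative assumption rather than the specific method from the post. Minimizing the squared error plus λ · w² gives the closed form w = Σxy / (Σx² + λ), so a larger λ pulls the weight toward zero. The function name `ridge_weight` and the data are hypothetical.

```python
def ridge_weight(xs, ys, penalty=0.0):
    """Weight w minimizing sum((y - w*x)**2) + penalty * w**2."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

# Hypothetical points on the line y = 2x.
xs, ys = [1, 2, 3], [2, 4, 6]

for penalty in (0.0, 1.0, 14.0):
    print(penalty, ridge_weight(xs, ys, penalty))
```

With no penalty the fit recovers the slope of 2 exactly. As the penalty grows the weight shrinks, trading a little accuracy on the training points for a simpler model.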

Multi-threading and Multi-processing in Python

31 JAN 2018 • code

When I first encountered multi-threading and multi-processing, I wasn’t able to distinguish the two. For me, both were some sort of magical way to make your programs run faster. However, understanding how multi-threading and multi-processing work is critical for many medium- and large-sized software projects. In this post, I’ll explain how each works.
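A minimal sketch of the distinction, using the standard library’s `concurrent.futures` module. The helper names `wait_for_io`, `count_down`, `run_in_threads`, and `run_in_processes` are hypothetical, and the post may use different tools.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def wait_for_io(seconds):
    """Stand-in for an I/O-bound task such as a network request."""
    time.sleep(seconds)
    return seconds

def count_down(n):
    """Stand-in for a CPU-bound task that keeps the interpreter busy."""
    while n > 0:
        n -= 1
    return n

def run_in_threads(func, inputs, workers=4):
    """Run func over inputs in threads, which share one process and its memory."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, inputs))

def run_in_processes(func, inputs, workers=4):
    """Run func over inputs in separate processes, each with its own interpreter."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, inputs))

start = time.perf_counter()
run_in_threads(wait_for_io, [0.2] * 4)
print(f"four 0.2 s waits in threads: {time.perf_counter() - start:.2f} s")
```

The four waits overlap, so the threaded run takes about as long as a single wait. For CPU-bound work such as `count_down`, threads in a standard CPython build take turns holding the global interpreter lock, so `run_in_processes` is usually the better fit. Call it from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.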

Math to Code

27 DEC 2017 • code

When I’m trying to decipher some hairy math formula, I find it helpful to translate the equation into code. In my experience, it’s often easier to follow the logical flow of a programming function than an equivalent function written in mathematical notation. This guide is intended for programmers who want to gain a deeper understanding of both mathematical notations and concepts. I decided to use Python as it’s closer to pseudo-code than any other language I’m familiar with.

This is not meant to be a reference manual or encyclopedia. Instead, this document is intended to give a general overview of mathematical concepts and their relationship with code. All code snippets were typed into the interpreter, so I am omitting the canonical >>>.
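As a small illustration of the idea, summation notation maps directly onto a loop. The sum of i² for i from 1 to n can be written as the sketch below; the function name `sum_of_squares` is hypothetical.

```python
def sum_of_squares(n):
    """Compute the sum of i**2 for i = 1, ..., n with an explicit loop."""
    total = 0
    for i in range(1, n + 1):  # range stops before its upper bound, hence n + 1
        total += i ** 2
    return total

print(sum_of_squares(4))  # 1 + 4 + 9 + 16
```

The index under the sigma becomes the loop variable, the bounds become the arguments to `range`, and the expression being summed becomes the body of the loop.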

How to Build a Machine Learning App from Scratch

10 OCT 2017 • data

Please see the updated post for a sleeker and more modern approach to the same problem!

Orthography Tree

5 DEC 2016 • language

I’m obsessed with learning different writing systems, but keeping things organized can be difficult. Fortunately, very few writing systems are independent inventions: most are derived from other scripts. To make things easier for myself, I created a taxonomic tree of all writing systems descended from Egyptian Hieroglyphs.¹,² Also included are some inspired orthographies such as Cherokee, which was invented by Sequoyah through the process of “stimulus diffusion”. Click here for the full screen version (recommended).

Mouse over a node in this tree to see some information about the script.

¹ Data taken from Wikipedia.
² Some taxonomic groupings such as North Brahmic and South Brahmic are used for convenience of organization even though they are not scripts themselves.

Population Projections

4 NOV 2016 • data

In science fiction, it’s a common theme (maybe even a cliché!) for the future world to be greatly influenced by East Asian culture. This is a reasonable assumption for writers to make, given the enormous economic and demographic expansion that East Asia has undergone in the past half-century. However, as East Asia’s population growth flattens, it’s becoming clear that this cliché is outdated. The future planet will be proportionally more African, South Asian, and Muslim than it is today. People interested in depicting future societies might want to turn away from East Asia and pay more attention to these under-appreciated societies.

In the next section I present some data visualizations that demonstrate the planet’s changing demographic composition. In the section after that, I reflect on how this information might inform speculative fiction.

Everything you ever wanted to know about pangrams

4 NOV 2016 • language

Pangrams are sentences that contain every character in a writing system. The pangram most familiar to English-speakers is the quick brown fox jumps over the lazy dog - every letter of the alphabet occurs at least one time. Pangrams usually appear in typography; when you want to show off a new font succinctly, you can write out a pangram. However, I think pangrams are interesting for another reason: they show how the world’s writing systems differ from one another. In this blog post, I’m going to try and create pangrams in some of the world’s major writing systems: Chinese, Devanagari, Latin, Korean, Arabic and others. In the process I’ll explain how each system works, its origins, and its merits and limitations. In the end, you’ll have a much greater appreciation of the world’s writing systems.
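Checking whether a sentence is a pangram is a set comparison: every character of the writing system must appear at least once. The sketch below assumes the 26-letter English alphabet by default; the function names `is_pangram` and `missing_letters` are hypothetical, and other scripts would need their own character inventory passed in as `alphabet`.

```python
import string

def is_pangram(sentence, alphabet=string.ascii_lowercase):
    """Return True if every character in alphabet occurs in the sentence."""
    return set(alphabet) <= set(sentence.lower())

def missing_letters(sentence, alphabet=string.ascii_lowercase):
    """Return the sorted characters of alphabet that the sentence lacks."""
    return sorted(set(alphabet) - set(sentence.lower()))

print(is_pangram("The quick brown fox jumps over the lazy dog"))
print(missing_letters("The quick brown fox jumped over the lazy dog"))
```

The second call shows why the verb matters: swapping “jumps” for “jumped” drops the only “s” in the sentence.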