Cameron Smith welcome to my personal site

Everything You Ever Wanted to Know About Linear Regression

This article is here to answer all the questions you’ve ever had about linear regression. I’ll walk you through every step of the process: acquiring data, describing it, creating a model, evaluating it, optimizing it, and preventing it from overfitting. Along the way I’ll introduce generally applicable data science techniques like data munging and visualization, along with some nifty Python code. I take a two-pronged approach to introducing linear regression: code and math. Every problem I present in this article can be described from an abstract, mathematical perspective as well as concretely in code. Both perspectives are important and shed light on different aspects of linear regression, so I do a careful dance back and forth between them. By the end of this article, you’ll have a deep understanding of linear regression. If you’re a beginner, you’ll find yourself comfortable. And if you’ve already encountered linear regression, you’ll leave with all of your nagging questions answered.
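To give a flavor of that back-and-forth, here is a minimal sketch (not the code from the full article). For a single feature, the least-squares line has a slope equal to the covariance of x and y divided by the variance of x, and an intercept chosen so the line passes through the mean point. The function name `fit_line` and the toy data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the toy points sit exactly on a line, the fit recovers that line; with noisy data the same formulas give the line that minimizes the sum of squared errors.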

Before reading this you should have a basic understanding of coding principles. All the code in this article is in Python, but if you know another high-level language it should be easy to follow along even if you’ve never written a single line of Python before. I use Python’s matplotlib and pandas libraries frequently, and it’ll be worth your while to check out their documentation if you find yourself getting lost. I also introduce some mathematical formalisms from calculus and linear algebra. However, I explain these topics in such a way that they will make intuitive (if not explicit) sense even to people who’ve never taken a calculus class before.

Although linear regression is often introduced in a formal, statistical way, I’m going to talk about the topic from a less formal, more machine-learning-oriented perspective. This doesn’t mean that I’m any less rigorous, but that the rigour comes from the practical perspective of a data scientist rather than the theoretical perspective of a mathematician. Hold onto your seats and enjoy the ride!

This blog post also exists as a Jupyter notebook. Check it out here.

Read More >

When Humanity Retreats

After the Chernobyl nuclear disaster in 1986, a Zone of Alienation (The Zone) was established in the vicinity of the Chernobyl nuclear power plant. Humans were evacuated by the Soviet military and never allowed to return. But The Zone is not entirely abandoned: it’s been re-inhabited by wolves, badgers, moose, and foxes – all species that had gone extinct in The Zone long ago…

Despite its name, the Korean Demilitarized Zone (DMZ) is the most militarized place on the planet. Like a long jagged scar, it traces a path from the Yellow Sea to the Sea of Japan, splitting the Korean Peninsula into Communist North and Capitalist South. Humans are wary of entering the 2.5-mile-wide gap between the two countries: towns, villages, farms, and factories were all abandoned in the wake of the Korean War. But animal and plant life are less apprehensive: today the area is one of the most biodiverse regions in North-East Asia. There are even claims that locally extinct Siberian Tigers have re-colonized the DMZ…

In 1962, life in the mining town of Centralia, Pennsylvania changed forever. A small fire spread to a coal strip mine. Like a metastasizing tumor, the fire spread beneath the city, following veins of anthracite coal. Smoldering pits began opening up across the city, swallowing up homes, animals, and people. In a town of devout Protestants, it didn’t take much convincing for the inhabitants to correctly interpret the metaphor of fire and brimstone, and Centralia was abandoned. A half century later, grasses poke up through cracks in asphalt roads, frogs lay eggs in gutted churches, and deer stroll confidently down Main Street in the middle of the day…

The United Nations Buffer Zone separates Turkish Northern Cyprus from its Southern (Greek) neighbor. Established following the Turkish invasion of Northern Cyprus and subsequent displacement of Greeks from the northern half of the island, it’s much more permeable than the Korean DMZ or the Ukrainian Zone of Alienation. Despite a limited human presence, animal and plant life still flourish: in a region that’s been victimized by human civilization longer than anywhere else on the planet, this twisted strip of land has been allowed to return to nature…

Each of these is an example of an “involuntary park,” created unintentionally by humans as the result of war, conflict, or industrial catastrophe. During my itinerant mid-20s, I happened to stumble across all of these places. Nowhere in the world is free from human influence. Even the most “virgin” forests have been radically transformed by humanity over the past hundred thousand years: where are the mammoths, mastodons, and sabre-tooth tigers? The special thing about involuntary parks is that they admit to having been influenced by mankind. This is a kind of honesty you don’t experience in Yosemite, Yellowstone, or the Grand Canyon, all places that, despite their beauty, have nonetheless been subjected to dramatic human alteration.

As the coronavirus suppresses economic activity across the world, we’re entering a phase when the entire planet is being turned into a (temporary) involuntary park. The skies over China are clearing of smog, the rivers are de-acidifying, there are fewer cars on the street and planes in the air… I’m not sure if this is a good thing or a terrible thing, but hopefully it makes you pause to think, like it made me.

Read More >

How to Create a Modeling Server on Google Cloud

This is how I create modeling servers on Google Cloud. Below I’ve listed the individual steps. However, keep in mind that I tend to bundle the programmatic portions of these steps into a bash script so that I’m not copy-pasting each line. Also, the Google Cloud CLI can be used to create projects and instances instead of the GUI. However, I find that using the GUI doesn’t add much more time, and it can be a useful tool for visualizing your data usage and projects. Get familiar with both!

Read More >

Penalizing Complexity

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

–John von Neumann

Models are simplified representations of the world around us. Sometimes by simplifying something, we’re able to see connections that we wouldn’t be able to notice with all the messy details still present. Because models can represent relationships, they can also be used to predict the future. However, sometimes modeling something too closely becomes a problem. After all, a model should be a simplified version of reality: when you try to create an over-complicated model, you might begin noticing connections that don’t actually exist! In this blog post I’ll discuss how regularization can be used to make sure our models reflect reality in meaningful ways.
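One common way to penalize complexity is an L2 (ridge) penalty, which adds the squared size of the model's parameters to the error being minimized. The snippet below is a minimal sketch under stated assumptions, not necessarily the technique used in the full post: it fits a one-feature model with no intercept, where the penalized slope has the closed form sum(x*y) / (sum(x*x) + penalty). The function name `ridge_slope`, the data, and the penalty values are all illustrative.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimizing sum((y - w*x)**2) + penalty * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # With penalty = 0 this is plain least squares; a larger penalty
    # inflates the denominator and pulls the slope toward zero.
    return sxy / (sxx + penalty)


# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
for penalty in (0.0, 1.0, 10.0):
    print(penalty, ridge_slope(xs, ys, penalty))
```

The stronger the penalty, the more the fitted slope shrinks toward zero, trading a slightly worse fit on the training data for a simpler model.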

Read More >

The Meaning of Meaning

I struggled with algebra in middle school. I still don’t like algebra. When I was studying for math tests as a kid I kept asking myself “why does this matter?” and “what does this mean?” After all, mathematics is emphatically not reality: it’s only connected to the world and our experiences through clever metaphors (or isomorphisms if you want to use math jargon). Yes, a perfect sphere has never existed, but the shape of the Earth is spherical enough that we can pretend it’s one.

The biggest reason that people struggle with understanding academic topics is that those topics aren’t given a sense of meaningfulness. This isn’t the fault of learners, it’s the fault of educators. When we navigate the world we can’t help but try to discover the meaning of the things we hear, think, and see. Humans are great at extracting meaningfulness: we understand the gist of articles, the plot of a story, or the intentions of someone talking to us. So it’s only natural that when we don’t find algebra meaningful we lose interest.

You’ll never understand or appreciate mathematics, poetry, or chemistry if you view them as the meaningless manipulation of numbers, words, or chemicals. Unfortunately, this is exactly how all of these fields are taught in school. In this blog post I want to show how to make things more meaningful. Along the way, we’ll develop some techniques to help us extract meaning from the world around us.

Read More >

Poisonous Plants from the Streets of San Francisco

People associate cities with traffic, asphalt, and buildings. San Francisco often seems like one enormous parking lot: disconnected from nature. But there’s a real jungle coexisting with the urban jungle. Hiding in plain sight is an abundance of deadly plants. These range from the terrifying, like the deliriant Brugmansia aurea, to the lethal: Ricinus communis. I spent a day walking around San Francisco identifying some of the deadly plants that live among us. None of these plants are in inaccessible areas: all are easily spotted from streets or parks. Anyone can identify, observe, and maybe even bring home some of these poisonous plants.

1. American Black Nightshade

Solanum americanum

South-West corner of Geary and Webster

The Solanum genus contains many edible plants like tomatoes, potatoes, and eggplants. But it also contains several poisonous species. The fruits of the American Black Nightshade contain the glycoalkaloids solanine and solamargine. Consumption can cause nausea, vomiting, cardiac dysrhythmia, paralysis, and death. Children have died from eating unripe berries, which are especially toxic.

Early European colonists in the Americas avoided consuming tomatoes because of their association with the nightshade family: American Black Nightshade strongly resembles a tomato plant. Confusing the two can have disastrous results — in fact, there were tomatoes growing less than 10 meters from where I identified nightshade in Japantown.

This plant is especially sinister because of its appealing fruits and widespread distribution. Among all the plants on this list, American Black Nightshade is probably the easiest to accidentally consume.

2. Poison Hemlock

Conium maculatum

Stanyan and Kezar

While I was walking from Haight-Ashbury to Golden Gate Park I came across a tall, withered plant with prominent floral umbels that resemble those of a wild carrot and pinnate leaves that look like parsley. This plant is poison hemlock.

Poison hemlock is native to the Mediterranean and traveled to the Americas along with European colonists. Hemlock contains high levels of the alkaloids coniine and coniceine. These alkaloids attack the central nervous system, suppressing the respiratory muscles and leading to vasoconstriction, especially in the lower body (more on that below), and death.

Hemlock is infamous as the plant that killed the philosopher Socrates. His reaction to the poison was typical: “the attendant examined Socrates’ feet and legs, then the man pinched his foot hard and asked if he felt it. Socrates said ‘No’; then after that, his thighs; and passing upwards in this way he showed us that he was growing cold and rigid. And then again he touched Socrates and said that when it reached his heart, he would be gone. The chill had now reached the region about the groin, and uncovering his face the attendant saw that Socrates’ eyes were fixed.”

3. White Snakeroot

Ageratina altissima

Lily Pond, Golden Gate Park

White snakeroot contains tremetol, which is named after the “trembles” caused by ingesting the poison. While it’s extremely uncommon to be poisoned by white snakeroot today, there was a time when milk sickness, caused by drinking milk from cows that had eaten snakeroot, killed thousands of American settlers in the Midwest. Nancy Lincoln, the mother of Abraham Lincoln, died of milk sickness.

Although white snakeroot is not as dangerous as the previous two entries in this list, it deserves a place here because of the historical damage it caused. More people have likely died of snakeroot poisoning than all the other poisonous plants on this list combined.

4. Angel’s Trumpets

Brugmansia aurea

520 Oak Street

The alluring trumpets of the Brugmansia genus are found on the streets of every neighborhood in San Francisco. The genus is represented by aromatic species like Brugmansia arborea, Brugmansia insignis, and Brugmansia aurea, although the last seems to be the most common.

All parts of these plants contain high levels of the tropane alkaloids hyoscyamine and scopolamine. The former is the toxin that gives Deadly Nightshade the title “deadly,” and the latter, scopolamine, is a powerful deliriant.

Deliriants are a subcategory of hallucinogens that induce a state of delirium in the user. Consuming as little as one leaf or flower can cause confusion, tremors, cycloplegia, and auditory and visual hallucinations. Users have reported seriously injuring themselves, attacking other people, and wandering around naked for hours — all without any awareness of their behavior.

An Erowid user reported that after consuming Angel’s Trumpets he “went to the bathroom to look at himself in the mirror… The reflection was normal, except for one thing: its eyes were closed! When I moved, the reflection moved. And when I touched my eyes, they were closed too.” …creepy…

Deliriant psychedelics are common both in nature and in your medicine cabinet. Plants like Jimson Weed (which I have come across frequently in San Jose, and which is often considered the worst drug in the world), henbane, and mandrake are all deliriants, along with over-the-counter drugs like Benadryl and prescription pills like Ambien.

5. Castor Bean

Ricinus communis

400 Douglass St

The castor bean is the most sinister-looking of any plant on my list: it has large red palmate leaves with serrated edges and small clumps of seed pods covered in blood-red tentacles. It’s also the most poisonous: only a dozen of its small, grain-like seeds are enough to kill an adult.

Castor beans are deadly because they contain ricin. Ricin can be sprayed as a toxic dust and was weaponized by the US military during World War II (though it was never actually used). Ricin has been used in several murders and murder attempts. The Bulgarian dissident Georgi Markov was stabbed with an umbrella tipped with a ricin pellet, and President Bush was mailed letters containing ricin residue.

Despite their lethality, castor beans are common as ornamental plants and are grown to produce castor oil, which you can safely rub all over your body in the form of castor oil soap. Because of their usefulness and interesting appearance, castor beans can be found in parks and gardens throughout San Francisco.

6. Pokeweed

Phytolacca americana

4307 20th St

I discovered American Pokeweed growing in the shade of a small tree on 20th Street. Pokeweed fruits look like shiny, purple blueberries organized into grape-like clumps. Among all the plants on the list, it looks the most appealing and edible. However, it’s highly toxic to humans and other mammals.

All parts of the pokeweed plant are poisonous, containing phytolaccatoxin and phytolaccigenin, which are poisonous to mammals (but not birds). The juice of the berries can be absorbed through the skin. Some people have mistakenly thought that pokeweed berries are safe to eat after observing birds eating them, only to experience the plant’s powerful emetic and purgative effects. In high enough doses, pokeweed poisoning also causes convulsions and paralysis of the respiratory muscles, which can lead to death.

7. Field Bindweed

Convolvulus arvensis

599 McAllister St

The Convolvulaceae family (also called the Morning Glory family) contains over a thousand species of plants, including the sweet potato and hundreds of species of morning glory vines. Field Bindweed contains the alkaloid pseudotropine and can be dangerous for grazing animals, but human poisonings are not documented.

Morning glory species can be found on almost every block in San Francisco. An especially interesting species (which I was not able to identify) is the Mexican Morning Glory (Ipomoea tricolor). Ipomoea tricolor contains the alkaloid ergine (LSA) which has similar effects to the psychedelic LSD, although it has additional symptoms associated with alkaloid poisoning like nausea and vomiting.

Other Dangers

In seeking out the most poisonous plants in San Francisco, I also managed to identify many of the most poisonous plants in the world. Castor Bean, Poison Hemlock, and White Snakeroot belong on any list of poisonous plants. However, some especially deadly plants don’t seem to grow in San Francisco. These include Rosary Peas (Abrus precatorius), Deadly Nightshade (Atropa belladonna), and the Yew tree (Taxus baccata).

But the most deadly plant in the world is also by far the most familiar: Nicotiana, a genus of nightshades, is the source of tobacco, which kills almost half a million Americans each year.

Most plants growing in San Francisco are not toxic. And those that are are rarely deadly. Furthermore, the plants that are deadly do a good job of advertising their lethality: most people would never even approach a castor bean, much less put it in their mouth. During this journey I learned to stop worrying and love the poisonous plants that live among us.

Read More >

Multi-threading and Multi-processing in Python

When I first encountered multi-threading and multi-processing, I wasn’t able to distinguish the two. For me, both were some sort of magical way to make your programs run faster. However, understanding how multi-threading and multi-processing work is critical for many medium- and large-sized software projects. In this post, I’ll explain how each works.
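As a rough preview, here is a minimal sketch using the standard library's `concurrent.futures` module, which offers the same interface for a pool of threads and a pool of processes. The helper names `count_primes` and `run_in_pool` and the workload sizes are illustrative assumptions, not code from the full post.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_in_pool(executor_cls, limits, workers=4):
    """Map count_primes over `limits` using the given executor class."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    limits = [20_000] * 4
    # Threads live inside one interpreter and share its memory.
    print(run_in_pool(ThreadPoolExecutor, limits))
    # Each process gets its own interpreter and its own memory.
    print(run_in_pool(ProcessPoolExecutor, limits))
```

Both calls return the same results; what differs is how the work is scheduled. In standard CPython builds the global interpreter lock lets only one thread execute Python bytecode at a time, so threads mainly help with I/O-bound work, while separate processes can spread CPU-bound work like this across multiple cores.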

Read More >

Math to Code

When I’m trying to decipher some hairy math formula, I find it helpful to translate the equation into code. In my experience, it’s often easier to follow the logical flow of a programming function than an equivalent function written in mathematical notation. This guide is intended for programmers who want to gain a deeper understanding of both mathematical notations and concepts. I decided to use Python as it’s closer to pseudo-code than any other language I’m familiar with.

This is not meant to be a reference manual or encyclopedia. Instead, this document is intended to give a general overview of mathematical concepts and their relationship with code. All code snippets were typed into the interpreter, so I am omitting the canonical >>>.
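For example, sigma (summation) and pi (product) notation both translate directly into a loop over an index. This is a small sketch of the idea; the function names `sum_of_squares` and `factorial` are illustrative choices rather than snippets from the full guide.

```python
import math


def sum_of_squares(n):
    """Sigma notation: the sum of i**2 for i = 1, ..., n."""
    return sum(i ** 2 for i in range(1, n + 1))


def factorial(n):
    """Pi notation: the product of i for i = 1, ..., n."""
    return math.prod(range(1, n + 1))


print(sum_of_squares(3))
print(factorial(5))
```

The lower and upper limits written below and above the symbol become the bounds of `range`, with the upper bound bumped by one because `range` excludes its endpoint.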

Read More >

How to Build a Machine Learning App from Scratch

In this article, I’m going to teach you how to build a text classification application from scratch. To get started, all you need to know is a little Python, the rudiments of Bash, and how to use Git. The finished application will have a simple interface that allows users to enter blocks of text and then returns the predicted category of that text.

This project has three steps. The first is constructing a corpus of language data. The second is training and testing a language classifier model to predict categories. The third step is deploying the application to the web along with an API.
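The second step can be sketched in a few lines. This is not the code from the project: it is a toy multinomial naive Bayes classifier with add-one smoothing, a technique assumed here purely for illustration. The function names `train` and `predict`, the four-sentence corpus, and the labels `english` and `spanish` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(corpus):
    """Count words per label. `corpus` is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in corpus:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        # Start from the log prior of the label.
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus; a real one would be far larger.
corpus = [
    ("the cat sat on the mat", "english"),
    ("the dog ate my homework", "english"),
    ("el gato duerme en la cama", "spanish"),
    ("el perro come mi tarea", "spanish"),
]
word_counts, label_counts = train(corpus)
print(predict("the dog sat", word_counts, label_counts))
print(predict("el perro", word_counts, label_counts))
```

A real application would hold out part of the corpus for testing instead of judging the model on sentences that resemble its training data.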

You can find the source code on GitHub. If you’d like a sneak peek at what the application looks like in the wild, click here.

Read More >

A Month Without Buying Food

100% freegan for a month.

Read More >

Orthography Tree

I’m obsessed with learning different writing systems, but keeping things organized can be difficult. Fortunately, very few writing systems are independent inventions: most are derived from other scripts. To make things easier for myself, I created a taxonomic tree of all writing systems descended from Egyptian Hieroglyphs.1,2 Also included are some inspired orthographies such as Cherokee, which was invented by Sequoyah through the process of “stimulus diffusion”. Click here for the full screen version (recommended).

Mouse over a node in this tree to see some information about the script.

  1. Data taken from Wikipedia

  2. Some taxonomic groupings such as North Brahmic and South Brahmic are used for convenience of organization even though they are not scripts themselves. 

Read More >

User Studies with Amazon Mechanical Turk

Amazon Mechanical Turk is an extremely useful resource for conducting basic human research. It allows researchers to quickly gather data from large numbers of participants at a very low cost. MTurk was an essential ingredient in the creation of the ImageNet dataset used throughout the computer vision community. MTurk’s built-in platform is generally used for conducting surveys; however, I find MTurk more useful as a recruiting platform. After recruiting participants, I can then redirect them to a website I’ve created that hosts my user study.

However, some researchers have criticized MTurk for its lack of transparency. It’s not always easy to confirm whether participants are lying about certain background criteria. For instance, I’ve conducted several language-learning studies using MTurk where native fluency in English is required. While MTurk is useful for initial hypothesis-testing in language studies, it isn’t particularly reliable for longer-term studies where language fluency must be confirmed.

Read More >

Population Projections

Visualizations of the world's growing (and changing) population.

Read More >

Pangrams

I’ll be exploring pangrams, which are sentences that contain every character in a writing system at least once. The pangram most familiar to English speakers is “the quick brown fox jumps over the lazy dog.” Although they’re primarily used in typography, pangrams can also be useful for learning new writing systems.
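The definition above is easy to check in code. Here is a minimal sketch, assuming the writing system can be passed in as a plain string of characters; the function name `is_pangram` and its `alphabet` parameter are illustrative choices.

```python
import string


def is_pangram(sentence, alphabet=string.ascii_lowercase):
    """True if every character of `alphabet` appears in `sentence`."""
    # Set comparison: the alphabet must be a subset of the sentence's characters.
    return set(alphabet) <= set(sentence.lower())


print(is_pangram("The quick brown fox jumps over the lazy dog"))
print(is_pangram("Hello, world"))
```

For another script, pass its characters as `alphabet`. Scripts that compose syllable blocks or attach vowel signs to consonants would need extra handling that this sketch does not attempt.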

In this blog post I’ll be looking at pangrams in Korean, Japanese, Arabic, and Hindi. Although all languages are created equal, some writing systems are more user-friendly than others: in the process of exploring pangrams, I’ll have the chance to contrast the relative merits of different orthographies.

Read More >

Yazidi Symbols

Symbols from Lalish, holy mountain for the Yazidi people.

Read More >

Hitchhiking Kurdistan

Pictures from my trip to Northern Iraq.

Read More >

Strawberry Trees

My favorite fruit.

Read More >

Climbing the Great Pyramid of Giza at Night

An unlawful ascent.

Read More >