Deepu Asok


How to Build a Learning Machine

From Calculator to Pattern Finder

Computers started as calculation machines. Literally. The first ones were built to do math faster than humans could. Addition, multiplication, ballistic trajectories. You gave it numbers and an operation. It gave you an answer.

Then people started building logic on top of the math. If this, then that. Loops, conditions, branching paths. That's what we call a traditional algorithm. You write the rules. The computer follows them. "If the customer's credit score is above 700 and their income is above $50,000, approve the loan." You decided the threshold. You decided the factors. You decided the logic. The computer just executes it.

This worked incredibly well for decades. It still does. Most of the software you use every day runs on hand-written rules.

But over time, people ran into a problem. Some patterns are too complex for a human brain to write rules for. You could look at a thousand loan applications and never figure out the exact combination of 50 different factors that separates the ones that default from the ones that don't. You know the pattern exists. You just can't articulate it.

That's where machine learning comes in. Instead of you writing the rules, you give the computer examples and let it figure out the rules on its own. You show it 10,000 past loan applications with their outcomes (defaulted or didn't), along with every attribute you have about each one (income, credit score, employment history, zip code, debt ratio). The model finds the patterns that predict the outcome. Patterns you might never have written yourself.
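To make the contrast concrete, here's a minimal sketch using scikit-learn. The loan data and labels below are invented for illustration; the point is that nobody hand-writes the "credit score above 700" rule, the model infers a threshold from the examples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy loan history: [credit_score, income]. Data is made up for illustration.
X = np.array([[720, 60000], [650, 40000], [710, 80000],
              [580, 30000], [690, 55000], [750, 90000]])
y = np.array([1, 0, 1, 0, 0, 1])  # 1 = repaid, 0 = defaulted

# Nobody writes the rule. The tree finds a split that separates the outcomes.
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(model.predict([[700, 52000]]))  # a prediction for a new applicant
```

On this tiny dataset the tree rediscovers a credit-score cutoff on its own, which is exactly the rule a human might have written, except no one had to write it.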

How ML Actually Works

The learning process is less mysterious than people think.

  1. Start with data. Rows of examples. Each row has input attributes and an outcome you care about.
  2. The model starts knowing nothing. It assigns random importance to each attribute.
  3. Using those random weights, it predicts outcomes for your training data.
  4. It measures how wrong it is. This error is called the loss.
  5. It nudges the weights slightly in the direction that reduces the error.
  6. Repeat thousands of times. Each pass gets slightly less wrong.
  7. Stop when good enough. The model has now encoded patterns from your data into its weights.

That's it. That's the "learning."
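The seven steps above fit in a few lines of code. This is plain gradient descent on a linear model with synthetic data; real models have more weights and fancier architectures, but the loop is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: data. 200 examples, 3 input attributes, one outcome each.
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])            # the pattern hiding in the data
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = rng.normal(size=3)                         # step 2: random weights
lr = 0.1
for step in range(1000):                       # step 6: repeat thousands of times
    pred = X @ w                               # step 3: predict with current weights
    error = pred - y
    loss = np.mean(error ** 2)                 # step 4: measure how wrong (the loss)
    grad = 2 * X.T @ error / len(y)
    w -= lr * grad                             # step 5: nudge weights to reduce error

# Step 7: the data's pattern is now encoded in the weights.
print(np.round(w, 2))                          # approximately [2.0, -1.0, 0.5]
```

The model never "saw" `true_w`. It recovered it purely by repeatedly reducing its own error.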

A traditional algorithm is you encoding your knowledge into rules. A machine learning model is the data encoding its patterns into weights.

This is why ML is powerful when the patterns are too complex for a human to write out. Nobody could write "if attribute 38 is above 0.7 AND attribute 12 is below 0.3 AND..." for 125 attributes. But a model can find those combinations.

The tradeoff is simple. Traditional algorithms: you understand exactly why they made every decision, but they only know what you told them. ML models: they can discover patterns you'd never see, but explaining why they made a specific decision gets harder. That's the "black box" problem.

The Real Question: What Do You Feed the Model?

Here's the thing nobody talks about. The algorithm is the easy part. The hard part is figuring out the right inputs.

I work in clinical trials. We built a model that predicts how well a research site will perform at enrolling patients. The model takes in attributes about each site (staff size, geographic region, investigator experience, past performance) and predicts whether they'll be a top performer or not.

But who decided those were the right attributes? The model can only learn from what it receives. If nobody thinks to include "distance to nearest patient population," the model will never learn that it matters. Even if it's the most predictive factor in the world.

So how do you figure out the right inputs?

1. Start With What You Have

Most teams start here. Look at your existing databases and ask "what do we already capture about sites?" Past enrollment numbers, therapeutic area, sponsor history. Throw it all in and see what sticks.

This is fast but limited. You're constrained by what someone decided to collect years ago for completely different reasons.

2. Subject Matter Expertise

This is the big one. A clinical operations expert with 20 years of experience says: "The PI's publication count matters. Whether the site has a dedicated research coordinator matters. Whether the site is near a major metro area matters."

These are hypotheses based on domain knowledge. The model can't discover a factor that was never given to it. This is where product managers and domain experts are irreplaceable. The data scientist knows how to build the model. The domain expert knows what the model should be looking at.

3. Feature Engineering

The most underappreciated part. Sometimes the raw data isn't predictive, but a transformation of it is.

You have "number of patients enrolled in Study A" and "number of patients enrolled in Study B." Neither alone is that useful. But someone creates "average patients per month across the last 3 studies." That's a new factor built from existing data. It didn't exist in any database. Someone had to think it up.

I saw this firsthand. The single most predictive feature in an enrollment model I work with turned out to be the minimum performance rate for a given sponsor and therapeutic area. Not the average. The floor. Someone hypothesized that the worst-case performance might be more telling than the mean, built that factor, and the model confirmed it was the strongest signal in the entire dataset.

Nobody had that sitting in a table. It had to be imagined into existence.
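A feature like that is a few lines of code once someone has imagined it. Here's a sketch with pandas; the table and column names are hypothetical, not from a real trial database.

```python
import pandas as pd

# Hypothetical site-performance history. Columns are illustrative only.
df = pd.DataFrame({
    "sponsor": ["A", "A", "A", "B", "B"],
    "therapeutic_area": ["oncology", "oncology", "oncology", "cardio", "cardio"],
    "enrollment_rate": [1.2, 0.4, 2.0, 0.9, 1.1],
})

# The engineered feature: the floor, not the average, for each
# sponsor / therapeutic-area combination.
df["min_rate_sponsor_ta"] = (
    df.groupby(["sponsor", "therapeutic_area"])["enrollment_rate"]
      .transform("min")
)
print(df)
```

The raw `enrollment_rate` column existed in the database. The `min_rate_sponsor_ta` column did not, until someone hypothesized that the worst case carries signal and wrote the `groupby`.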

See how all three layers feed into a model:

[Interactive demo: Predict Home Prices. Pick the features you think matter, then step through Source → Engineer → Train → Rank → Predict.]

How Do You Know What You Don't Know?

You don't. Not fully. You close the gap through:

  • Domain expert interviews. "What do you actually look at when you pick a site?" Their answers surface factors no database captures.
  • Literature review. Published research from other organizations uses different factors. Those differences are hypotheses worth testing.
  • Error analysis. When the model gets a prediction wrong, investigate why. "Their coordinator quit mid-study." Now you know staff turnover might matter.
  • Analogies from other fields. Retail uses foot traffic and demographics to predict store performance. Could patient density within 30 miles work for clinical sites?

You can never be sure you have all the right factors. One model I work with went from 38 features to 125, and predictive power jumped dramatically. But there's still a large share of variation the model can't explain. Likely because the factors that would explain it haven't been thought of yet, or the data doesn't exist.

The Model Is Trained. Now What Do You Ask the User?

You've got a trained model. It learned from 80 features. Now someone shows up with a new house to price, or a new clinical site to evaluate. Do you ask them for all 80 inputs?

No. Most of those inputs the system already knows.

This is the product decision that separates a useful ML tool from a form nobody wants to fill out. For every feature the model uses, ask one question: can the system look this up, or does only the user know?

If the system can look it up, don't ask. Auto-populate it behind the scenes. The user shouldn't even know it exists.

If only the user knows it and it's important to the prediction, make it a required field. That's where you invest your UX effort.

Think about Zillow. Their model uses the median home price in your zip code. They don't ask you to type that in. They already have it. They ask you for what only you know: square footage, number of bedrooms, recent renovations.

The same logic applies everywhere. A clinical trial model that predicts site performance might use dozens of historical features behind the scenes. But all the user needs to enter are the study parameters: therapeutic area, phase, target enrollment. The system already has each site's track record. It combines what it knows with what the user provides and makes the prediction.

Feature importance tells you which inputs matter most to accuracy. But the product question is narrower. Of the inputs that matter, which ones require a human? That's your input form.

Get this wrong in either direction and you lose. Ask too much and nobody uses the tool. Ask too little and the predictions are mediocre. The sweet spot is: auto-populate everything the system knows, and ask only for the high-importance factors that require human judgment.
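In code, that sweet spot is a simple partition of the feature list. This is a sketch under assumed names: the feature sets, the lookup store, and `build_model_input` are all hypothetical.

```python
# Features the system can look up on its own: never shown to the user.
SYSTEM_KNOWN = {"median_zip_price", "site_track_record", "past_enrollment"}
# High-importance features only the user knows: these become the form.
USER_ONLY = {"therapeutic_area", "phase", "target_enrollment"}

def build_model_input(user_answers: dict, lookup: dict) -> dict:
    """Merge auto-populated features with the user's required fields."""
    missing = USER_ONLY - user_answers.keys()
    if missing:
        # Only user-known, high-importance fields are ever required.
        raise ValueError(f"required fields: {sorted(missing)}")
    features = {name: lookup[name] for name in SYSTEM_KNOWN}
    features.update(user_answers)
    return features

row = build_model_input(
    {"therapeutic_area": "oncology", "phase": 3, "target_enrollment": 120},
    lookup={"median_zip_price": 410_000, "site_track_record": 0.82,
            "past_enrollment": 14},
)
print(sorted(row))  # the model sees all six features; the user typed three
```

The model consumes the full merged row either way. The partition only decides which fields cost the user effort.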

The PM's Role in ML

This is what most people miss.

The PM's job isn't to build the model. It's to ask the questions the model builders aren't asking.

Data scientists are brilliant at optimization within a defined problem space. Give them the inputs and a target, and they'll find the best algorithm, tune it relentlessly, squeeze every fraction of accuracy out of the data.

But deciding what the inputs should be? Hypothesizing that a completely new type of feature might matter? That requires someone who lives in the domain, who understands the messy operational reality behind the numbers, who talks to the people doing the actual work.

ML models are only as smart as the factors humans think to give them. The algorithm is the engine, but domain expertise is the fuel.