AWS GPU instance baseline

A minimal cloud baseline focused on identity, networking, and cost control.

Key takeaways

  • Cloud baselines fail at identity, networking, and instance drift.
  • A small, repeatable path beats a big, flexible one.
  • Verify endpoints before wiring the app.

This run was about learning the AWS shape, not squeezing performance. For system boundaries and failure loops, see Systems 001: Foundations.

Architecture map

The AWS mental model is a VPC boundary with security groups and IAM as the two control planes.

[Figure: AWS baseline map. A VPC boundary holds compute (GPU instance) and database (Postgres), with IAM and security groups as control layers.]

What happened

My first run failed even with valid keys. The issue was not the model. It was a security group with no rule allowing the inbound port, plus a missing IAM permission that left the instance looking healthy but unusable.

The two gates

The first gate is IAM. If the role is wrong, you can see the instance but cannot use it.

The second gate is networking. Security groups block inbound traffic unless a rule allows it, so the port your app listens on must be explicitly opened.
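
Both gates can be inspected from a terminal before touching the app. This is a minimal sketch, assuming the AWS CLI is installed and has credentials; the function names and the security group ID you pass in are hypothetical placeholders, not values from this run.

```shell
# Gate 1 (IAM): print the ARN of the identity this shell is actually using.
# Run on the instance, this shows which role it has assumed.
check_identity() {
  aws sts get-caller-identity --query Arn --output text
}

# Gate 2 (networking): print the inbound rules of one security group.
# $1 = security group ID (placeholder, supplied by you)
list_inbound_rules() {
  aws ec2 describe-security-groups --group-ids "$1" \
    --query 'SecurityGroups[0].IpPermissions' --output json
}
```

An empty list from `list_inbound_rules` means nothing is allowed in, which matches the "healthy but unreachable" symptom.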

Portal walkthrough

  1. Create a VPC or reuse the default one for the first pass.
  2. Launch a GPU instance and attach a minimal IAM role.
  3. Open the inbound ports your app and database need.
  4. Create a Postgres instance inside the same VPC.
  5. Collect the endpoint, key, and host values for config.
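
Step 3 is the one most easily repeated from the command line. A rough sketch, assuming the AWS CLI is configured; `open_inbound_port` is a hypothetical helper, and the group ID, port, and CIDR range are placeholders you supply.

```shell
# Sketch: allow inbound TCP on one port from one CIDR range.
#   $1 = security group ID, $2 = port, $3 = source CIDR (e.g. 203.0.113.10/32)
open_inbound_port() {
  sg_id=$1 port=$2 cidr=$3
  # Reject empty, non-numeric, or overly long port values before calling AWS.
  case $port in
    ''|*[!0-9]*|??????*) echo "invalid port: $port" >&2; return 2 ;;
  esac
  if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
    echo "port out of range: $port" >&2
    return 2
  fi
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port "$port" --cidr "$cidr"
}
```

Passing a narrow CIDR such as a single `/32` address keeps the rule as small as the first pass needs.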

First-time config

export LLM_HOST="http://<instance-ip>:<port>"
export DATABASE_URL="postgresql://<user>:<password>@<db-hostname>:5432/<db-name>"
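
It is easy to leave one of the angle-bracket placeholders in place. The sketch below, with a hypothetical `require_config` helper, fails early if either variable is unset or still contains a `<`.

```shell
# Sketch: return non-zero if LLM_HOST or DATABASE_URL is missing
# or still holds an unfilled <placeholder>.
require_config() {
  status=0
  for name in LLM_HOST DATABASE_URL; do
    eval "value=\${$name:-}"   # read the variable whose name is in $name
    case $value in
      '')    echo "$name is not set" >&2; status=1 ;;
      *'<'*) echo "$name still contains a <placeholder>" >&2; status=1 ;;
    esac
  done
  return "$status"
}
```

Calling `require_config || exit 1` at the top of a start script is one way to use it.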

Quick checks

nc -vz <instance-ip> <port>
psql "postgresql://<user>:<password>@<db-hostname>:5432/<db-name>"
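
The same two checks can be wrapped with timeouts so a blocked port fails fast instead of hanging. This is a sketch, assuming `LLM_HOST` has the `http://host:port` form shown above; `url_hostport` and `quick_checks` are hypothetical helper names.

```shell
# Split "http://10.0.0.5:8000/path" into "10.0.0.5 8000".
url_hostport() {
  rest=${1#*://}          # drop the scheme
  hostport=${rest%%/*}    # drop any path
  echo "${hostport%%:*} ${hostport##*:}"
}

# Run the port check, then a trivial query against the database.
quick_checks() {
  set -- $(url_hostport "$LLM_HOST")
  nc -z -w 5 "$1" "$2" \
    || { echo "port $2 on $1 unreachable" >&2; return 1; }
  # PGCONNECT_TIMEOUT is the libpq connection timeout in seconds.
  PGCONNECT_TIMEOUT=5 psql "$DATABASE_URL" -c 'SELECT 1' >/dev/null \
    || { echo "database check failed" >&2; return 1; }
  echo "both checks passed"
}
```

Unlike a bare connection test, running `SELECT 1` also confirms that the credentials in `DATABASE_URL` are accepted.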

Failure modes

  • IAM role looks correct but lacks the specific service permission.
  • Security group allows the port but the instance sits in a private subnet, so there is no route in from outside.
  • Using a different region for the database adds latency that is easy to miss.
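
The first two failure modes can be probed directly. A hedged sketch, assuming the AWS CLI is configured and the caller is allowed to run the IAM policy simulator; `role_can` and `subnet_is_public` are hypothetical helpers, and the ARN, action, and subnet ID are placeholders.

```shell
# Ask IAM whether a role is allowed one specific action.
#   $1 = role ARN, $2 = action name (e.g. s3:GetObject)
# Prints allowed, explicitDeny, or implicitDeny.
role_can() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" --action-names "$2" \
    --query 'EvaluationResults[0].EvalDecision' --output text
}

# A subnet is public only if its route table routes through an
# internet gateway (gateway IDs start with "igw-").
#   $1 = subnet ID
# Caveat: a subnet with no explicit association uses the VPC's main
# route table, which this filter does not return.
subnet_is_public() {
  gateways=$(aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query 'RouteTables[].Routes[].GatewayId' --output text)
  case $gateways in
    *igw-*) return 0 ;;
    *)      return 1 ;;
  esac
}
```

A result of `implicitDeny` from `role_can` is the "looks correct but lacks the permission" case: no policy grants the action, even though nothing forbids it.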

What made the difference

I treated IAM and security groups as first-class parts of the architecture, not afterthoughts. Once those gates were clear, the rest was predictable.

What I would do next time

I would pin the AMI, keep everything in a single region, and log network rules alongside app config.
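
One way to log network rules alongside app config is to dump them into files that live in the same repository. A minimal sketch, assuming the AWS CLI is configured; `snapshot_baseline` and the file names are hypothetical, and the IDs are placeholders.

```shell
# Write the security group rules and the instance's AMI ID into a directory
# that can be committed next to the app config.
#   $1 = security group ID, $2 = instance ID, $3 = output directory
snapshot_baseline() {
  mkdir -p "$3" || return 1
  aws ec2 describe-security-groups --group-ids "$1" --output json \
    > "$3/security-group.json" || return 1
  aws ec2 describe-instances --instance-ids "$2" \
    --query 'Reservations[0].Instances[0].ImageId' --output text \
    > "$3/ami-id.txt" || return 1
}
```

Diffing these files between runs shows when a rule or the AMI has drifted from the baseline.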