Day 6: Give It a Memory — Build Your First AI App in 10 Days

Your assistant is brilliant at patterns. It knows nothing about you.

It doesn’t know your notes, your projects, your writing, your study material. Every time you reference something personal, it either guesses or admits it doesn’t know. That ceiling is real. This is the day you break through it.

RAG stands for Retrieval-Augmented Generation. The idea: before the model answers, you search your own documents for relevant information and inject it into the prompt. The model answers using your data. Not its training. Yours.

By the end of today, you will be able to drop a text document into your app and have a genuine conversation with it.

What Embeddings Are

An embedding is a list of numbers that encodes the meaning of text. Not the words. The meaning.

“The exam is tomorrow” and “I have a test in the morning” share almost no words. But their embeddings are close together in high-dimensional space because they mean similar things. Cosine similarity measures that closeness: a score near 1.0 means nearly identical meaning, and a score near 0 means unrelated.

You do not need to understand the math. You need to understand the consequence: you can search by meaning, not just keyword. Ask “what did I write about focus?” and retrieval finds the chunk about deep work even if you never used the word “focus.”
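
To make "close together" concrete, here is a small sketch in plain Python. The three-number vectors are invented for illustration and are not output from any model; real sentence embeddings have hundreds of dimensions. The `cosine` helper and the variable names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, hand-written to stand in for embeddings.
exam_tomorrow = [0.9, 0.1, 0.2]
test_in_morning = [0.8, 0.2, 0.3]
pizza_recipe = [0.1, 0.9, 0.1]

similar_score = cosine(exam_tomorrow, test_in_morning)
unrelated_score = cosine(exam_tomorrow, pizza_recipe)

print(f"exam vs. test:  {similar_score:.2f}")
print(f"exam vs. pizza: {unrelated_score:.2f}")
```

The two "exam" vectors point in almost the same direction, so their score lands near 1.0, while the unrelated vector scores much lower.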

Setup

uv add sentence-transformers ipython

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

all-MiniLM-L6-v2 downloads once (~80MB) and runs locally. Fast, free, and accurate enough for everything you will build this week.

Step 1: Chunk Your Document

Models have context limits. You cannot paste an entire book. Chunking splits your document into pieces small enough to fit, with a little overlap so ideas don’t get cut in half.

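def chunk_text(text, chunk_size=150, overlap=30):
    # Sizes are in words. all-MiniLM-L6-v2 truncates input past 256 word pieces,
    # so chunks stay well under that and the whole chunk gets embedded.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i : i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this chunk reached the end; another would only repeat its tail
        i += chunk_size - overlap
    return chunks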

# Use your own text here: paste notes, a README, a textbook chapter, anything
my_document = """
Paste any text you own here. Your course notes from college.
A chapter from something you are reading. Your own writing.
The more personal, the more satisfying the result will be.
"""

chunks = chunk_text(my_document)
print(f"{len(chunks)} chunks created.")
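
To see what the overlap does, here is a minimal sketch that applies the same sliding window to eleven placeholder words with tiny settings. `sliding_chunks` is a hypothetical helper written for this illustration (it takes a word list rather than a string), and the sizes are chosen only to keep the output readable.

```python
def sliding_chunks(words, chunk_size, overlap):
    """Hypothetical helper: slide a window of chunk_size words, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start : start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the window reached the end of the text
    return chunks

# Placeholder words w0..w10 so the window positions are easy to read.
words = [f"w{n}" for n in range(11)]
windows = sliding_chunks(words, chunk_size=4, overlap=1)

for window in windows:
    print(window)
```

Each window begins with the last word of the previous one, so text near a boundary appears in both neighboring chunks.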

Step 2: Build the Index

Embed each chunk once. Store the chunks and their embeddings together. This is your search index.

def build_index(chunks):
    embeddings = embedder.encode(chunks)
    return list(zip(chunks, embeddings))

index = build_index(chunks)
print(f"Index ready: {len(index)} entries.")
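
Because each chunk only needs to be embedded once, it can be worth saving the index so a restart does not recompute it. The following is a sketch under assumptions, not part of the original app: `save_index`, `load_index`, and the file name are hypothetical, and the toy vectors stand in for real embeddings.

```python
import json
import os
import tempfile

def save_index(index, path):
    """Write (chunk, vector) pairs to a JSON file. Vectors become plain float lists."""
    records = [
        {"chunk": chunk, "vector": [float(x) for x in vec]}
        for chunk, vec in index
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

def load_index(path):
    """Read the pairs back in the same (chunk, vector) shape."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [(record["chunk"], record["vector"]) for record in records]

# Stand-in index with made-up vectors; in the app this would be the built index.
toy_index = [("first chunk", [0.25, 0.5]), ("second chunk", [0.75, 0.125])]

path = os.path.join(tempfile.gettempdir(), "toy_index.json")
save_index(toy_index, path)
restored = load_index(path)
print(restored == toy_index)
```

The loaded vectors are plain lists rather than NumPy arrays, which is fine for a similarity function that converts its inputs with `np.array(...)` first.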

Step 3: Retrieve What’s Relevant

At query time, embed the question and find the chunks closest to it.

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

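def retrieve(query, index, top_k=3):
    # Embed the question with the same model used for the chunks
    query_vec = embedder.encode(query)
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    # Sort on the score only, highest first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]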
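
When retrieval returns something surprising, printing the score next to each chunk usually shows why. Here is a minimal sketch of the same ranking step that keeps the scores. `rank_chunks` is a hypothetical variant, and the two-number vectors are made up so the example runs without the embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity for two plain lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_chunks(query_vec, index, top_k=3):
    """Return the top_k (score, chunk) pairs, highest score first."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical 2-D vectors standing in for real embeddings.
toy_index = [
    ("notes on deep work", [0.9, 0.1]),
    ("grocery list", [0.1, 0.9]),
    ("notes on attention", [0.8, 0.3]),
]
query_vec = [1.0, 0.0]  # pretend this is the embedded question

ranked = rank_chunks(query_vec, toy_index, top_k=2)
for score, chunk in ranked:
    print(f"{score:.2f}  {chunk}")
```

With `top_k=2`, the grocery list is dropped because its vector points away from the query, while the two notes chunks rank first and second.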

Step 4: Answer With Context

Inject the retrieved chunks into the prompt before the model sees the question.

from groq import Groq
import os

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def ask(question, index):
    context_chunks = retrieve(question, index)
    context = "\n\n---\n\n".join(context_chunks)

    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": """Answer the question using only the provided context.
If the context doesn't contain enough information to answer, say so directly.
Do not make up information that isn't in the context."""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

answer = ask("What does my document say about [something you know is in it]?", index)
print(answer)

The model answers using your document. Not guessing. Not its training data. Your actual text.
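
When an answer looks wrong, it helps to see exactly what the model was given. This sketch builds the same message list as the code above without calling the API, so the assembled prompt can be printed and inspected. `build_messages`, `SYSTEM_PROMPT`, and the sample chunks are hypothetical names and data for illustration.

```python
SYSTEM_PROMPT = (
    "Answer the question using only the provided context.\n"
    "If the context doesn't contain enough information to answer, say so directly.\n"
    "Do not make up information that isn't in the context."
)

def build_messages(question, context_chunks):
    """Assemble the chat messages; no network call is made here."""
    context = "\n\n---\n\n".join(context_chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Made-up chunks standing in for whatever retrieval returned.
messages = build_messages(
    "When is the exam?",
    ["The exam is on Friday at 9am.", "Bring a calculator."],
)
print(messages[1]["content"])
```

If the chunk that holds the answer is missing from the printed context, the problem is in retrieval, not in the model.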

Add It to Your Marimo App

Wire retrieval in as a toggle. When the user uploads a document, switch from “general assistant” to “document assistant.”

# Each block below is its own marimo cell.

# Cell: upload widget
import marimo as mo

upload = mo.ui.file(label="Upload a document (.txt)", filetypes=[".txt"])
upload

# Cell: index the uploaded file (halts until a file is present)
mo.stop(not upload.value, mo.md("*Upload a document to enable document mode.*"))

content = upload.value[0].contents.decode("utf-8")
doc_index = build_index(chunk_text(content))
mo.md(f"Document loaded. {len(doc_index)} chunks indexed. Ask anything about it.")

# Cell: question form
mo.stop(not upload.value)

question = mo.ui.text(placeholder="Ask something about your document...", full_width=True).form(
    submit_button_label="Ask →"
)
question

# Cell: answer (halts until a non-empty question is submitted)
mo.stop(not question.value)

doc_answer = ask(question.value, doc_index)
mo.md(f"**Answer:**\n\n{doc_answer}")

Now your Marimo app has two modes: general assistant and document assistant. Uploading a document switches it to document mode.

Upload your own study notes, a textbook chapter you are working through, or your own writing. Have a real conversation with it. Ask it to explain a concept from the notes. Ask it what you argued in a specific paragraph. Ask it to quiz you on the content.

This is the moment the app stops feeling like a demo. It knows your stuff.

What Just Changed

The context window is no longer your limit. Your document can be a hundred pages: retrieval finds the three most relevant chunks, and the model reads only those before answering.
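
A quick back-of-the-envelope calculation shows how little of a long document the model actually reads. Every number below is an assumption for illustration (words per page, chunk settings); substitute the values you pass to your own chunking and retrieval functions.

```python
import math

pages = 100
words_per_page = 500   # assumption: a rough figure for a dense page
chunk_size = 150       # example setting: words per chunk
overlap = 30           # example setting: words shared between neighboring chunks
top_k = 3              # chunks handed to the model per question

total_words = pages * words_per_page
step = chunk_size - overlap
approx_chunks = math.ceil(total_words / step)  # approximate size of the index
words_sent = top_k * chunk_size                # upper bound on context words per question
share = words_sent / total_words

print(f"Document: {total_words} words in about {approx_chunks} chunks")
print(f"Sent to the model: at most {words_sent} words ({share:.1%} of the document)")
```

The amount of context sent per question depends only on `top_k` and `chunk_size`, so it stays the same however long the document grows.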

You are not using a chatbot anymore. You are building a knowledge system. Tomorrow it starts taking actions on its own.
