On Day 8, the curriculum shifted from simple backend routes to integrating a generative model. The goal was to build a complete web-based AI chatbot: user input travels from the HTML page to a Flask server, the server sends it to an LLM, and the model's response is returned and rendered on the page dynamically.
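The round trip described above can be sketched roughly as follows. This is only an illustration, not the lab's actual code: the `/chat` route, the `message` and `reply` JSON keys, and the `create_app` and `echo_model` helpers are all hypothetical names, and `echo_model` merely stands in for the real LLM call.

```python
def create_app(ask_model):
    """Build a Flask app whose /chat route forwards user text to `ask_model`.

    `ask_model` is any callable that takes a string and returns a string.
    """
    # Imported inside the factory so this file still loads where Flask
    # is not installed; a real project would import at the top.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/chat", methods=["POST"])
    def chat():
        # Assumption: the page POSTs JSON such as {"message": "hello"}.
        data = request.get_json(silent=True) or {}
        message = str(data.get("message", ""))
        reply = ask_model(message)
        # The page's JavaScript can read this JSON and update the chat window.
        return jsonify(reply=reply)

    return app


def echo_model(message):
    """Hypothetical stand-in for the LLM call, useful while wiring up the page."""
    return "You said: " + message
```

To try it, something like `create_app(echo_model).run(debug=True)` starts the development server, and the stand-in can later be swapped for a function that calls the real model.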
1. Local vs. API-Driven Inference
We explored two ways of connecting an AI model: running a local pipeline with TinyLlama, which loads all model weights directly into host memory, or calling a remote inference endpoint through the Hugging Face API to run Qwen2.5-7B-Instruct. A local pipeline allows offline execution, but the download is large (2GB+) and tokens generate very slowly on a standard student laptop CPU.
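A rough side-by-side sketch of the two options is shown here. Several details are assumptions rather than facts from the lab: the endpoint URL pattern, the exact TinyLlama checkpoint name, the `max_new_tokens` value, and the response shape `[{"generated_text": ...}]` should all be checked against the current Hugging Face documentation. The function names are hypothetical.

```python
# Assumed URL pattern for the hosted endpoint; verify against current docs.
API_URL = "https://api-inference.huggingface.co/models/Qwen/Qwen2.5-7B-Instruct"


def build_request(prompt, token, max_new_tokens=200):
    """Return the headers and JSON payload for a text-generation request."""
    headers = {"Authorization": "Bearer " + token}
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return headers, payload


def ask_remote(prompt, token, post=None, timeout=30):
    """Send the prompt to the remote endpoint and return the generated text.

    `post` defaults to requests.post; passing a fake makes offline testing easy.
    """
    if post is None:
        import requests  # third-party: pip install requests
        post = requests.post
    headers, payload = build_request(prompt, token)
    response = post(API_URL, headers=headers, json=payload, timeout=timeout)
    response.raise_for_status()  # turn HTTP 4xx/5xx into an exception
    # Assumed response shape: [{"generated_text": "..."}]
    return response.json()[0]["generated_text"]


def load_local_generator(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Local alternative: downloads the weights and loads them into memory.

    The checkpoint name is an assumption; the lab only said "TinyLlama".
    """
    from transformers import pipeline  # heavy: pulls in torch as well
    return pipeline("text-generation", model=model_name)
```

The remote path needs only one small dependency and an access token, while the local path pays the download and memory cost up front in exchange for working offline.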
2. Input Validation and Exception Boundaries
To make the chatbot robust, we focused on error boundaries: validating that user input is not empty before posting, stripping the prompt prefix that the model can echo back at the start of its output, and wrapping requests in try-except blocks to catch API timeouts and dropped network connections. With these in place, the server returns a user-friendly message instead of failing with an unhandled exception in the Flask route.
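A minimal sketch of these three safeguards might look like the following. The helper names and the user-facing messages are hypothetical, and the sketch assumes that a model which echoes the prompt places it at the very start of the generated text. Catching `OSError` is a deliberate simplification: the built-in `TimeoutError` and `ConnectionError` derive from it, and so do the exceptions raised by the `requests` library.

```python
def validate_message(text):
    """Return the trimmed message, or None if it is empty or only whitespace."""
    cleaned = (text or "").strip()
    return cleaned or None


def strip_prompt_prefix(prompt, generated):
    """Remove the echoed prompt from the start of the model output, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


def safe_reply(user_text, ask_model):
    """Validate input, call the model, and never let a network error escape.

    `ask_model` is any callable that takes the prompt and returns raw model text.
    """
    message = validate_message(user_text)
    if message is None:
        return "Please type a message first."
    try:
        raw = ask_model(message)
    except OSError:
        # Covers timeouts, refused connections, and requests' exceptions.
        return "The model could not be reached. Please try again."
    return strip_prompt_prefix(message, raw)
```

Inside a Flask route, the handler would then return `safe_reply(...)` directly, so every outcome reaches the browser as an ordinary response.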
3. The Developer Speed Gap: Torch Downloads vs. Fast Requests
The lab highlighted a stark difference in developer speed. While my classmates struggled to install transformers, torch, and accelerate (over 2GB in dependencies) and ran into memory allocation crashes on their slow dual-core laptops, I skipped the local download entirely. I set up a simple requests loop calling the remote Qwen2.5-7B-Instruct endpoint. When Rashmi checked my screen, she was amazed. While others were still installing PyTorch, I was already chatting with a responsive chatbot. She praised my choice, explaining to the class that engineering is about matching resource constraints, not loading massive dependencies when a simple remote request does the job.