Gen AI
AI concepts
- Artificial Intelligence is the broad field of building systems that perform tasks requiring human-like intelligence.
- Machine Learning is a subset of AI where systems learn patterns from data instead of being explicitly programmed.
- Deep Learning is a subset of ML that uses multi-layer neural networks to learn complex representations.
- Relationship: AI ⊃ ML ⊃ Deep Learning; each is a subset of the one before it.
- Supervised learning uses labeled data. Example: spam detection, house price prediction.
- Unsupervised learning uses unlabeled data to find hidden structure. Example: clustering customers.
- Reinforcement learning learns by reward and punishment through interaction with an environment.
- Use supervised for prediction, unsupervised for discovery, and reinforcement for sequential decision-making.
- A neural network is made of layers of connected nodes that learn weights from data.
- Input layers receive data, hidden layers learn features, and output layers produce predictions.
- Transformers are deep learning architectures designed to process sequences efficiently.
- They rely on self-attention to understand relationships between tokens in a sequence.
- Transformers power modern LLMs because they scale well and capture long-range context better than older sequence models.
- An LLM is a large language model trained on massive text data to predict the next token.
- Tokens are chunks of text processed by the model. A word may be one token or several tokens.
- The context window is the amount of input and output text a model can handle in one request.
- Prompting means giving instructions, context, constraints, and examples to guide model output.
- Better prompts improve relevance, structure, and accuracy, but they do not guarantee correctness.
- RAG stands for Retrieval-Augmented Generation. It fetches relevant external information before generating an answer.
- RAG is useful when knowledge changes often or must come from private documents.
- Fine-tuning updates model behavior by training it further on task-specific examples.
- Use RAG to improve factual grounding and access current data. Use fine-tuning to improve style, format, or repeated task behavior.
- They solve different problems and are often combined in production systems.
- Hallucination means the model generates confident but incorrect or unsupported information.
- Embeddings convert text into numeric vectors that capture semantic meaning.
- Vector databases store embeddings and allow similarity search for related content.
- Embeddings and vector search are commonly used in semantic search and RAG pipelines.
- Hallucinations can be reduced with better prompts, retrieval, verification, and narrower task scope.
- Evaluation measures how well a model performs on quality, accuracy, latency, cost, and reliability.
- Bias occurs when model behavior unfairly favors or disadvantages certain groups or viewpoints.
- AI safety focuses on reducing harmful, insecure, or misleading outputs.
- Common controls include guardrails, human review, content filters, and domain-specific validation.
- In real systems, model quality is not enough. Safety, monitoring, and fallback handling matter just as much.
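The embedding and vector-search bullets above can be sketched in a few lines of numpy. The 3-dimensional vectors below are invented for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1.0 means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (made up; real models use far more dimensions).
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

# Semantically related words score higher than unrelated ones.
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```

This is the core operation behind semantic search: embed everything once, then compare by cosine similarity rather than keyword overlap.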
Text is a sequence of words or sentences used as input for models.
Tokens are smaller units of text (words, subwords, or characters).
Probability represents how likely a token is to appear next.
Neural networks learn patterns from data to make predictions.
An LLM is a neural network trained to predict the next token in an input sequence. For instance, if provided with the phrase "all that glitters," the model predicts and returns the completion "is not gold".
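Next-token prediction can be illustrated with a toy bigram model: count which word follows which in a small corpus, then predict the most frequent continuation. This is a stand-in for the probability distribution a real LLM produces; the corpus is made up.

```python
from collections import Counter, defaultdict

# A toy "language model": next-token counts from a tiny corpus.
corpus = "all that glitters is not gold all that ends well".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(token):
    # Pick the most frequent continuation of this token.
    return bigram_counts[token].most_common(1)[0][0]

print(predict_next("glitters"))  # -> "is"
```

A real LLM does the same thing at vastly larger scale, over tokens rather than words, with a neural network instead of a count table.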
A token is the smallest unit of text processed by a model, which may or may not represent a complete meaningful unit.
Tokenization is the initial processing step where input text is broken into discrete tokens. This process goes beyond simple space-based splitting; it identifies meaningful suffixes like "-ing" or "-ers" (e.g., "glitters," "eating"), which helps the model understand that a specific action is being performed by an object.
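A hand-rolled suffix splitter gives the flavour of this. Real tokenizers (BPE, WordPiece) learn their vocabulary from data; the suffix list and the "##" continuation marker below are borrowed conventions used purely for illustration.

```python
# Toy suffix-aware tokenizer (not a real BPE/WordPiece implementation).
SUFFIXES = ["ing", "ers", "s"]

def tokenize(text):
    tokens = []
    for word in text.lower().split():
        for suffix in SUFFIXES:
            stem = word[: -len(suffix)]
            if word.endswith(suffix) and len(stem) >= 3:
                # Split off the suffix so the model sees stem + action marker.
                tokens.extend([stem, "##" + suffix])
                break
        else:
            tokens.append(word)
    return tokens

print(tokenize("all that glitters"))  # ['all', 'that', 'glitt', '##ers']
```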
Vectors are mathematical coordinates in an n-dimensional space used to represent the inherent meaning of tokens. Words with similar semantic meanings are clustered close together, while words with opposite meanings are placed far apart, allowing the model to construct sentences effectively.
Attention is a mechanism that allows a model to derive contextual meaning by looking at nearby words. It solves the problem of ambiguity (e.g., distinguishing "apple" the fruit from "Apple" the company) by pushing a word's vector toward its relevant context based on surrounding terms like "tasty" or "revenue".
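The mechanism described above is scaled dot-product attention, which can be written in a few lines of numpy. The three 2-d token vectors are invented; in a real transformer the queries, keys, and values come from learned projections of the token embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each token's output is a weighted mix
    # of all value vectors, weighted by query-key similarity.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V, weights

# Three toy token vectors (invented for illustration).
x = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])

out, weights = attention(x, x, x)
# Tokens 0 and 1 point in similar directions, so token 0 attends
# to token 1 more strongly than to token 2.
print(weights[0, 1] > weights[0, 2])  # True
```

The output vector for each token is thus "pushed toward" the tokens it attends to, which is how context resolves ambiguity.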
Self-supervised learning uses the inherent structure of data, such as predicting blanked-out words in a sentence or masked patches in an image, to learn without human-labeled data. This approach is highly scalable and makes training data significantly cheaper to acquire, because it can be scraped directly from existing sources like the internet.
A Transformer is a specific algorithm or engine used within an LLM to predict the next token. It utilizes stacked layers of attention mechanisms and feed-forward neural networks to extract, manipulate, and refine the meaning of input tokens.
Fine-tuning is the process of taking a base model and training it on a specific set of questions and answers to specialize it for a particular domain, such as medicine or finance. This step ensures the model provides helpful and desirable responses while penalizing incorrect or unhelpful behavior.
Few-shot prompting augments a query at inference time by including examples within the prompt. Providing these examples helps the model understand the desired context and format, thereby increasing the quality of the response.
RAG enhances responses by fetching relevant documents (like company policies or terms and conditions) in real-time and adding them to the model's context. This allows the LLM to provide high-quality, company-specific answers without having that data in its original training set.
A Vector Database is used to store documents and perform similarity searches efficiently. It identifies which stored documents are semantically closest to a user's query even if they don't share exact keywords by comparing their vector coordinates.
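A minimal in-memory version of this is a brute-force nearest-neighbour search over stored document vectors. Real vector databases use approximate indexes (e.g. HNSW or IVF) to scale; the document names and vectors below are invented.

```python
import numpy as np

# A toy "vector database": document name -> embedding vector.
docs = {
    "refund policy":  np.array([0.9, 0.1, 0.0]),
    "shipping times": np.array([0.1, 0.9, 0.1]),
    "office address": np.array([0.0, 0.1, 0.9]),
}

def search(query_vec, top_k=1):
    # Rank every stored document by cosine similarity to the query.
    def score(item):
        name, vec = item
        return float(np.dot(query_vec, vec) /
                     (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
    ranked = sorted(docs.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:top_k]]

# A query about "getting my money back" embeds near "refund policy"
# even though they share no keywords.
query = np.array([0.85, 0.15, 0.05])
print(search(query))  # ['refund policy']
```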
MCP is a standardized protocol that allows an LLM to connect with external tools and databases. This enables the model to retrieve real-time information (like flight details) and even execute actions (like booking a flight) through an MCP client.
Context Engineering is the discipline of managing the information sent to an LLM, including Few-Shot-Prompting, RAG, MCP, and user preferences. It also involves context summarization techniques, such as using a "sliding window" to summarize long chat histories to fit within the model's token limits.
New challenges faced by context engineers include handling user preferences and context summarization. For example, to keep a long conversation within the token limit, a system might send the last 100 messages verbatim along with a summary of all earlier messages.
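The sliding-window idea can be sketched as follows. The `summarize` function here is a placeholder returning a stub string; in practice it would be another LLM call that compresses the older messages.

```python
# Sketch of sliding-window context management.
def summarize(messages):
    # Placeholder: a real system would call an LLM to compress these.
    return f"[summary of {len(messages)} earlier messages]"

def build_context(history, window=100):
    # Keep the last `window` messages verbatim and compress the rest.
    if len(history) <= window:
        return history
    older, recent = history[:-window], history[-window:]
    return [summarize(older)] + recent

history = [f"msg {i}" for i in range(250)]
context = build_context(history, window=100)
print(len(context))  # 101: one summary line plus the last 100 messages
```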
Agents are long-running processes capable of querying LLMs, external systems, and other agents to autonomously execute complex requirements. For example, a travel agent could monitor flight prices and book a trip based on a user's pre-defined preferences.
Reinforcement Learning is a training method where models learn to follow optimal paths in a vector space to maximize a reward score. While powerful for optimizing behavior based on outcomes, it does not necessarily build a mental model or a physical understanding of how the world works.
RL can fail in scenarios like the following: suppose a fair coin has come up heads on every toss so far. A learner trained purely on past outcomes would predict heads again, even though an unbiased coin should still give heads and tails 50 percent of the time each. A human avoids this mistake because the brain has a mental model of how a coin works.
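The coin example can be simulated directly: a pure frequency learner that has only ever seen heads predicts heads, while the true process remains 50/50.

```python
import random

random.seed(0)  # make the simulation reproducible

observed = ["heads"] * 10  # every toss seen so far came up heads

def frequency_predict(observations):
    # Predict the most frequent past outcome, with no model of the coin.
    return max(set(observations), key=observations.count)

print(frequency_predict(observed))  # 'heads'

# But the underlying process is still fair: over many fresh tosses the
# empirical rate converges to 0.5, which the learner never modelled.
tosses = [random.choice(["heads", "tails"]) for _ in range(100_000)]
rate = tosses.count("heads") / len(tosses)
print(round(rate, 2))  # close to 0.5
```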
RLHF uses human preferences (e.g., choosing which of two responses is better) to assign positive or negative scores to the model's output paths. This feedback helps reinforce good outputs, training the model to prioritize responses that make the end user happy.
Chain of Thought involves training or prompting a model to break a problem down step-by-step. By explaining the reasoning process rather than providing a direct answer, the model can solve more complex problems with higher accuracy.
Reasoning models, such as those from DeepSeek or OpenAI, are designed to solve difficult problems by adding more reasoning steps as the task complexity increases. These models can figure out how to solve a problem step-by-step using various algorithms.
Tree of Thought and Graph of Thought techniques also exist. Examples of reasoning models include OpenAI's o1 and o3.
Multimodal models can process and generate multiple types of data, including text, images, and video. These models often perform better than text-only models because they gain a deeper understanding of objects by analyzing them across different formats.
SLMs are smaller neural networks (3M to 300M parameters) trained for task-specific use cases to reduce costs and keep data private. They are often created through distillation, where a smaller "student" model is trained to mimic the output of a larger "teacher" LLM.
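The distillation objective can be sketched with numpy: the student is trained to match the teacher's softened probability distribution rather than hard labels. The logits and the temperature value below are invented for illustration.

```python
import numpy as np

def softmax_t(logits, T=1.0):
    # Temperature-scaled softmax: higher T gives a softer distribution.
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between softened teacher and student distributions.
    p_teacher = softmax_t(teacher_logits, T)
    p_student = softmax_t(student_logits, T)
    return float(-np.sum(p_teacher * np.log(p_student)))

teacher = np.array([4.0, 1.0, 0.5])
good_student = np.array([3.8, 1.1, 0.4])   # mimics the teacher
bad_student = np.array([0.5, 4.0, 1.0])    # disagrees with the teacher

# The student that mimics the teacher incurs a lower loss.
print(distill_loss(good_student, teacher) < distill_loss(bad_student, teacher))  # True
```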
Quantization is a post-training technique used to reduce inference costs by compressing model weights (e.g., from 32-bit to 8-bit numbers). This process saves significant memory, making the models easier to host and faster to run in production.
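A minimal sketch of symmetric int8 quantization with numpy: map 32-bit float weights onto 8-bit integers plus a single scale factor, then dequantize at inference time. The random weight tensor stands in for real model weights.

```python
import numpy as np

# Fake "model weights" to quantize (normally these come from a trained model).
weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)

# One scale factor for the whole tensor maps the float range onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At inference, multiply back by the scale to approximate the originals.
dequantized = q.astype(np.float32) * scale

print(weights.nbytes // q.nbytes)                          # 4x smaller
print(float(np.abs(weights - dequantized).max()) < scale)  # small rounding error
```

Going from 32-bit floats to 8-bit integers cuts memory by roughly 4x, at the cost of a rounding error bounded by half the scale factor.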