Git Your Code: Advanced Semantic Repository Analysis System using RAG
Git Your Code: Advanced Semantic Repository Analysis System
Git Your Code implements a cutting-edge Retrieval-Augmented Generation (RAG) architecture designed for deep semantic analysis of GitHub repositories. The system leverages vector embeddings, natural language processing, and machine learning to provide intelligent code comprehension and query capabilities.
![Cloud Architecture Diagram showing the complete system design](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fthumbnail.9cd0dab0.png&w=3840&q=75)
Vector Processing Engine
The system employs sophisticated vector query processing utilizing cosine similarity metrics for precise document retrieval. The implementation leverages SQL with vector operations, maintaining a similarity threshold of 0.5 for optimal precision-recall balance.
![Cloud Architecture Diagram showing the complete system design](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Frag-architecture.2bba40cd.png&w=1080&q=75)
Document Pipeline Architecture
The scalable document processing pipeline demonstrates integration with GitHub's API infrastructure, managing repository indexing, document processing, and embedding generation through an event-driven architecture.
![Cloud Architecture Diagram showing the complete system design](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fdocument-processing-pipeline.15d56f00.png&w=3840&q=75)
Persistence Layer
PostgreSQL-based vector store implementation showcasing the high-dimensional embedding space. The system utilizes pg_vector for efficient similarity computations through Hierarchical Navigable Small World (HNSW) indexing.
![Cloud Architecture Diagram showing the complete system design](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fsource-code-embeddings-in-pg-vector.7cf2dc5d.png&w=1920&q=75)
Core System Components
Vector Embedding Engine
The system implements a sophisticated vector embedding pipeline utilizing transformer-based architectures for semantic encoding:
export const generateEmbeddings = async (docs: Document[]): Promise<EmbeddingVector[]> => {
return await Promise.all(docs.map(async (doc) => {
const summary = await summarizeCodeByAI(doc);
const embedding = await generateVectorEmbeddings(summary || doc.pageContent);
return {
summary,
embedding,
sourceCode: JSON.stringify(doc.pageContent),
filename: doc.metadata.source,
};
}));
};
Query Processing
The query processing system implements a sophisticated vector similarity search with configurable thresholds:
const result = await prisma.$queryRaw`
SELECT "fileName", "sourceCode", "summary",
1 - ("summaryEmbedding" <=> ${vectorQuery}::vector) AS CosineSimilarity
FROM "SourceCodeEmbedding"
WHERE 1 - ("summaryEmbedding" <=> ${vectorQuery}::vector) > .5
ORDER BY CosineSimilarity DESC
LIMIT 10
`;
Prompt Templates
Code Analysis Template
The system employs a sophisticated template hierarchy for code analysis:
export const GetCodeSummarizePrompt = (doc: Document): string[] => {
const code = doc.pageContent.slice(0, 10000);
return [
'You are an intelligent senior software engineer specializing in onboarding junior software engineers onto projects',
`You are onboarding a junior software engineer and explaining to them the purpose of the ${doc.metadata.source} file`,
'Here is the code:',
'---',
`${code}`,
'---',
'Give a summary no more than 100 words of the code above'
];
};
This template implements a pedagogical approach to code explanation, positioning the AI system as an experienced mentor facilitating knowledge transfer to junior developers.
Differential Analysis Template
The system includes a comprehensive template for git differential analysis:
export const GetDiffSummarizePrompt = (diff: string): string => {
return `
# Git Commit Summary Prompt
Analyze the provided git diff and create a structured summary following these guidelines:
1. High-Level Changes
2. Core Features (✨ New Features)
3. Schema Changes (🗃️ Database)
4. Dependencies (📦 Dependencies)
5. Infrastructure (🛠️ Infrastructure)
...`;
};
Contextual Query Template
The system implements a sophisticated context-aware query processing template:
export function GetAskAQuestionPrompt({ context, question }: PromptParams): string {
return `You are an AI code assistant who answers questions about the codebase...
${context ? `START CONTEXT BLOCK\n${context}\nEND OF CONTEXT BLOCK\n` : ''}
${question ? `START QUESTION\n${question}\nEND OF QUESTION\n` : ''}`;
}
System Architecture Components
Document Processing Pipeline
The system implements a sophisticated document processing pipeline:
-
Repository Ingestion
- GithubRepoLoader integration for systematic content extraction
- Metadata preservation and provenance tracking
- Rate-limited API interaction management
-
Document Vectorization
- Transformer-based embedding generation
- Dimensional reduction for optimization
- Efficient vector storage and indexing
-
Query Processing
- Vector similarity computation
- Context window assembly
- Response generation orchestration
Database Schema
The system utilizes a sophisticated PostgreSQL schema with vector extensions:
CREATE TABLE "SourceCodeEmbedding" (
id TEXT PRIMARY KEY,
fileName TEXT NOT NULL,
sourceCode TEXT NOT NULL,
summary TEXT,
summaryEmbedding vector(1536)
);
CREATE INDEX embedding_idx ON "SourceCodeEmbedding"
USING ivfflat (summaryEmbedding vector_cosine_ops)
WITH (lists = 100);
Conclusion
The Git Your Code system represents a significant advancement in semantic repository analysis. By leveraging cutting-edge RAG architecture, sophisticated vector processing, and intelligent prompt engineering, it provides developers with powerful tools for code comprehension and analysis. As the system continues to evolve, it promises to revolutionize the way developers interact with and understand complex codebases.