Data Flow and Storage
Synthesis is designed to handle your research data efficiently and securely, from initial document ingestion to the generation and storage of complex research artifacts. This section outlines how your data flows through the system, how it's processed, and where it's stored.
Overview
At its core, Synthesis combines a relational database (PostgreSQL via Prisma ORM) for structured project data and a vector database for semantic search and Retrieval-Augmented Generation (RAG). As you interact with the platform, your uploaded documents and the insights generated by AI agents are meticulously processed and stored, forming a comprehensive knowledge base for your projects.
Data Ingestion: Document Upload
When you upload a document to a project, the following data flow is initiated:
- File Upload: Your document is sent to the server via the `/api/upload` endpoint.
- Temporary Storage: The raw file is temporarily saved to the local file system. In a production environment, this would typically involve secure cloud storage (e.g., AWS S3, Google Cloud Storage).

  ```typescript
  // app/api/upload/route.ts
  // ...
  const filepath = join(uploadsDir, `${Date.now()}-${file.name}`);
  await writeFile(filepath, buffer);
  // ...
  ```

- Text Extraction & Metadata Collection: The `fileProcessor` service extracts the raw text content and relevant metadata (e.g., file type, size) from the uploaded file.

  ```typescript
  // app/api/upload/route.ts
  // ...
  const extracted = await fileProcessor.processFile(filepath, file.type);
  // ...
  ```

- Database Storage (Prisma/PostgreSQL): The extracted text and file metadata are then stored in the primary database as a `Document` record, linked to your specific project.

  ```typescript
  // app/api/upload/route.ts
  // ...
  const document = await prisma.document.create({
    data: {
      projectId,
      filename: file.name,
      filepath,
      filesize: file.size,
      mimetype: file.type,
      extractedText: extracted.text, // The full extracted text
      metadata: JSON.stringify(extracted.metadata),
    },
  });
  // ...
  ```

- Vector Indexing (Asynchronous): Critically, the extracted text is also processed for semantic search:
  - It's chunked into smaller, meaningful segments.
  - Each chunk is converted into a numerical vector embedding.
  - These embeddings are added to an in-memory (or external) vector store, along with metadata linking them back to the original document and project. This process runs asynchronously to avoid blocking the user interface.

  ```typescript
  // app/api/upload/route.ts
  // ...
  (async () => {
    try {
      const chunks = vectorStore.chunkText(extracted.text || '');
      await Promise.all(
        chunks.map((chunk, i) =>
          vectorStore.addDocument({
            id: `${document.id}-chunk-${i}`,
            content: chunk,
            metadata: {
              projectId,
              documentId: document.id,
              type: 'paragraph',
              title: file.name,
            },
          })
        )
      );
    } catch (err) {
      console.error('Background indexing error:', err);
    }
  })();
  // ...
  ```
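The chunking step above can be sketched as follows. This `chunkText` is a simplified, hypothetical stand-in for `vectorStore.chunkText`: it splits on blank lines and merges paragraphs up to a size cap, whereas the real implementation may use token counts or overlapping windows.

```typescript
// A minimal text chunker: split on blank lines, then merge paragraphs
// until a chunk reaches roughly `maxChars` characters. This is a sketch,
// not the actual vectorStore.chunkText implementation.
function chunkText(text: string, maxChars = 500): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // Start a new chunk when appending would exceed the cap.
    if (current && current.length + para.length + 1 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Keeping chunks aligned with paragraph boundaries tends to preserve coherent units of meaning, which improves the quality of the similarity search later on.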
Agent-Driven Data Generation and Transformation
When you initiate an AI agent pipeline for a project (e.g., via `/api/agents/run`), a series of intelligent agents work together to process your research data and generate new insights:
- Agent Orchestration: The `orchestrator` service manages the execution of various agents (e.g., `outliner`, `writer`, `presenter`, `hypothesis-generator`).

  ```typescript
  // app/api/agents/run/route.ts
  // ...
  orchestrator.runPipeline(projectId)
  // ...
  ```

- Data Interaction: Agents retrieve context from your project by:
  - Querying the vector store to find semantically similar document chunks.
  - Fetching existing project data (documents, hypotheses, previous agent outputs) from the PostgreSQL database via Prisma.

- Generated Data Storage: The outputs of these agents are stored as `AgentRun` records in the PostgreSQL database. These outputs often contain structured JSON data representing:
  - Research paper outlines.
  - Full research paper drafts (sections, full text, references).
  - Presentation slides.
  - Hypotheses (`prisma.hypothesis`).
  - Concept nodes (`prisma.conceptNodes`).
  - Project statistics (`prisma.statistics`).
For example, saving a paper draft:
```typescript
// app/api/projects/[projectId]/paper/route.ts
// ...
await prisma.agentRun.update({
  where: { id: writerRun.id },
  data: {
    output: JSON.stringify({
      ...currentOutput,
      fullText,
      updatedAt: new Date().toISOString(),
    }),
  },
});
// ...
```
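The pipeline pattern described above can be sketched as a sequential loop in which each agent receives the outputs of its predecessors. The `Agent` interface and `runPipeline` below are illustrative stand-ins, not the actual `orchestrator` API.

```typescript
// Sketch of a sequential agent pipeline. Each agent receives the project id
// plus a context holding the outputs of earlier agents, and its result is
// captured as an AgentRun-style record. Names are illustrative.
interface Agent {
  name: string;
  run(projectId: string, context: Record<string, unknown>): Promise<unknown>;
}

interface AgentRunRecord {
  agent: string;
  projectId: string;
  output: string; // JSON-serialized agent output
}

async function runPipeline(projectId: string, agents: Agent[]): Promise<AgentRunRecord[]> {
  const context: Record<string, unknown> = {};
  const runs: AgentRunRecord[] = [];
  for (const agent of agents) {
    const output = await agent.run(projectId, context);
    context[agent.name] = output; // later agents can read earlier outputs
    runs.push({ agent: agent.name, projectId, output: JSON.stringify(output) });
  }
  return runs;
}
```

Running agents in sequence and threading a shared context through them is what lets, for example, a writer agent build on the outliner's output rather than starting from scratch.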
Data Retrieval and Interaction
Synthesis provides various ways to access and utilize your stored data:
- Chat with Research: The `/api/chat` endpoint allows you to converse with your research.
  - Your query is used to search the vector store for the most relevant document chunks and agent outputs.
  - These retrieved contexts, along with your conversation history, are sent to an LLM (Gemini) to generate a coherent response, providing accurate answers grounded in your project's data.

  ```typescript
  // app/api/chat/route.ts
  // ...
  const searchResults = await searchProject(projectId, query, 3);
  // ...
  const agentRuns = await prisma.agentRun.findMany({ /* ... */ });
  // ...
  const response = await geminiClient.generateText(prompt);
  // ...
  ```

- Project Details & Analytics:
  - The `/api/projects/[projectId]` endpoint fetches all related data for a specific project, including its documents, agent runs, hypotheses, concept nodes, and statistics.
  - The `/api/analytics` endpoint retrieves aggregate data across all projects to provide high-level insights, quality metrics, and trends, all sourced from the PostgreSQL database.

  ```typescript
  // app/api/analytics/route.ts
  // ...
  const projects = await prisma.project.findMany({
    include: {
      documents: true,
      hypotheses: true,
      conceptNodes: true,
    },
  });
  // ...
  ```
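As a rough illustration of the kind of aggregation the analytics endpoint performs over the projects fetched from PostgreSQL, the sketch below computes a few cross-project metrics. The shapes are simplified stand-ins for the Prisma models, not the actual schema.

```typescript
// Simplified stand-ins for the Prisma models (illustrative only).
interface ProjectSummary {
  documents: unknown[];
  hypotheses: { confidence: number }[];
}

// Aggregate document counts and hypothesis quality across all projects.
function computeAnalytics(projects: ProjectSummary[]) {
  const totalDocuments = projects.reduce((n, p) => n + p.documents.length, 0);
  const allHypotheses = projects.flatMap((p) => p.hypotheses);
  const avgConfidence =
    allHypotheses.length === 0
      ? 0
      : allHypotheses.reduce((s, h) => s + h.confidence, 0) / allHypotheses.length;
  return { projectCount: projects.length, totalDocuments, avgConfidence };
}
```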
- Export and Download:
  - Endpoints like `/api/export/[format]/[projectId]` and `/api/download/ppt/[projectId]` allow you to download generated content.
  - The system retrieves the latest paper draft (from `AgentRun` outputs) or presentation data, processes it into the requested format (PDF, LaTeX, DOCX, Markdown, PPT), and streams it to your browser.

  ```typescript
  // app/api/export/[format]/[projectId]/route.ts
  // ...
  const project = await prisma.project.findUnique({
    where: { id: projectId },
    include: { agentRuns: { /* ... */ } },
  });
  const paperData = JSON.parse(project.agentRuns[0].output || '{}');
  // ...
  // Calls exportToPDF, exportToLaTeX, etc.
  // ...
  ```
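As an illustration of the final formatting step, here is a hypothetical `exportToMarkdown` in the spirit of the export helpers mentioned above. The `PaperDraft` field names are assumptions for the sketch, not the actual stored shape of an `AgentRun` output.

```typescript
// Assumed shape of a parsed paper draft (illustrative, not the real schema).
interface PaperDraft {
  title: string;
  sections: { heading: string; content: string }[];
  references: string[];
}

// Render a stored draft as a Markdown document for download.
function exportToMarkdown(paper: PaperDraft): string {
  const body = paper.sections
    .map((s) => `## ${s.heading}\n\n${s.content}`)
    .join("\n\n");
  const refs = paper.references.map((r, i) => `${i + 1}. ${r}`).join("\n");
  return `# ${paper.title}\n\n${body}\n\n## References\n\n${refs}\n`;
}
```

The other formats (PDF, LaTeX, DOCX, PPT) follow the same pattern: parse the JSON output, walk the sections, and serialize into the target format before streaming the result to the browser.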
Storage Mechanisms
Synthesis leverages a hybrid storage approach to manage different types of research data:
- PostgreSQL Database (via Prisma ORM):
  - This is the primary relational database for all structured and semi-structured data.
  - It stores `Project` metadata, `Document` details (including extracted text), `AgentRun` records (with JSON outputs for papers, outlines, presentations), `Hypothesis` data, `ConceptNode` data, and `Statistics`.
  - Prisma ORM ensures type safety and efficient interaction with the database, maintaining data integrity and the relationships between entities.
- Vector Database (In-memory or Persistent):
  - This specialized database stores the numerical vector embeddings of your document chunks.
  - It's optimized for fast similarity search, which is crucial for the RAG capabilities of the chat interface and for agents that need to retrieve relevant context.
  - For simplicity in development, an in-memory vector store is used; for large-scale production deployments, a persistent vector database solution (like Pinecone, Weaviate, or Qdrant) would be integrated.
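A minimal version of such an in-memory store's similarity search can be sketched with plain cosine similarity. This is illustrative only; a production deployment would delegate this work to a dedicated vector database.

```typescript
// A stored embedding with its identifier (illustrative shape).
interface VectorEntry { id: string; embedding: number[]; }

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1); // guard against zero vectors
}

// Return the k entries most similar to the query embedding.
function topK(query: number[], entries: VectorEntry[], k: number): VectorEntry[] {
  return [...entries]
    .sort((x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, k);
}
```

A linear scan like this is fine for small development datasets; dedicated vector databases replace it with approximate nearest-neighbor indexes that stay fast as the corpus grows.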
- Local File System (for raw uploads):
  - Raw uploaded files are temporarily stored on the server's local disk.
  - As noted, for a production environment it's recommended to integrate a robust cloud storage solution for scalability, durability, and better security practices.
This robust data flow and storage architecture ensures that your research data is not only processed intelligently but also stored reliably and made accessible for advanced analysis and interaction.