Shandu - Deep Research
https://github.com/jolovicdev/shandu
Shandu is a cutting-edge AI research assistant that performs in-depth, multi-source research on any topic using advanced language models, intelligent web scraping, and iterative exploration to generate comprehensive, well-structured reports with proper citations.
Shandu is an intelligent, LLM-powered research system that automates the comprehensive research process - from initial query clarification to in-depth content analysis and report generation. Built on LangGraph's state-based workflow, it recursively explores topics with sophisticated algorithms for source evaluation, content extraction, and knowledge synthesis.
Typical use cases:
- Academic Research: Generate literature reviews, background information, and complex topic analyses
- Market Intelligence: Analyze industry trends, competitor strategies, and market opportunities
- Content Creation: Produce well-researched articles, blog posts, and reports with proper citations
- Technology Exploration: Track emerging technologies, innovations, and technical developments
- Policy Analysis: Research regulations, compliance requirements, and policy implications
- Competitive Analysis: Compare products, services, and company strategies across industries
Shandu 2.0 introduces a major redesign of the report generation pipeline to produce more coherent, reliable reports:
- Modular Report Generation: Process reports in self-contained sections, enhancing overall system reliability
- Robust Error Recovery: Automatic retry mechanisms with intelligent fallbacks prevent the system from getting stuck
- Section-By-Section Processing: Each section is processed independently, allowing for better error isolation
- Progress Tracking: Detailed progress tracking shows exactly where the pipeline stands at each stage
- Enhanced Citation Management: More reliable citation handling ensures proper attribution throughout reports
- Intelligent Parallelization: Key processes run in parallel where possible for improved performance
- Comprehensive Fallback Mechanisms: If any step fails, the system gracefully degrades rather than halting
- Intelligent State-based Workflow: Leverages LangGraph for a structured, step-by-step research process
- Iterative Deep Exploration: Recursively explores topics with dynamic depth and breadth parameters
- Multi-source Information Synthesis: Analyzes data from search engines, web content, and knowledge bases
- Enhanced Web Scraping: Features dynamic JS rendering, content extraction, and ethical scraping practices
- Smart Source Evaluation: Automatically assesses source credibility, relevance, and information value
- Content Analysis Pipeline: Uses advanced NLP to extract key information, identify patterns, and synthesize findings
- Sectional Report Generation: Creates detailed reports by processing individual sections for maximum reliability
- Parallel Processing Architecture: Implements concurrent operations for efficient multi-query execution
- Adaptive Search Strategy: Dynamically adjusts search queries based on discovered information
- Full Citation Management: Properly attributes all sources with formatted citations in multiple styles
```python
from shandu.agents import ResearchGraph
from langchain_openai import ChatOpenAI

# Initialize with a custom LLM if desired
llm = ChatOpenAI(model="gpt-4")

# Initialize the research graph
researcher = ResearchGraph(
    llm=llm,
    temperature=0.5
)

# Perform deep research
results = researcher.research_sync(
    query="Your research query",
    depth=3,              # How deep to go with recursive research
    breadth=4,            # How many parallel queries to explore
    detail_level="high"
)

# Print or save results
print(results.to_markdown())
```
Advanced Architecture
Shandu's research pipeline consists of these key stages:
- Query Clarification: Interactive questions to understand research needs
- Research Planning: Strategic planning for comprehensive topic coverage
- Iterative Exploration:
  - Smart query generation based on knowledge gaps
  - Multi-engine search with parallelized execution
  - Relevance filtering of search results
  - Intelligent web scraping with content extraction
  - Source credibility assessment
  - Information analysis and synthesis
  - Reflection on findings to identify gaps
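The iterative loop above — search, record findings, reflect on gaps, then recurse with reduced depth — can be sketched in a few lines. The `ResearchState` and `explore` names here are illustrative stand-ins, not Shandu's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """Minimal stand-in for LangGraph state: findings plus open questions."""
    findings: list = field(default_factory=list)
    gaps: list = field(default_factory=list)

def explore(state, query, depth, breadth, search_fn):
    """Recursively research: search, record findings, then pursue gaps one level deeper."""
    if depth == 0:
        return state
    results = search_fn(query)[:breadth]   # multi-engine search, capped at breadth
    for r in results:
        state.findings.append(r)
        # reflection step: each result may surface a follow-up question
        state.gaps.append(f"follow-up on {r}")
    for gap in state.gaps[:breadth]:       # recurse on the top gaps with depth - 1
        explore(state, gap, depth - 1, breadth, search_fn)
    return state
```

This mirrors how the `depth` and `breadth` parameters in the usage example bound the recursion: `depth` limits how many rounds of follow-up occur, `breadth` caps the fan-out at each round.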
Shandu 2.0 introduces a robust, modular report generation pipeline:
- Data Preparation: Registration of all sources and their metadata for proper citation
- Title Generation: Creating a concise, professional title (with retry mechanisms)
- Theme Extraction: Identifying key themes to organize the report structure
- Citation Formatting: Properly formatting all citations for reference
- Initial Report Generation: Creating a comprehensive draft report
- Section Enhancement: Individually processing each section to add detail and depth
- Key Section Expansion: Identifying and expanding the most important sections
- Report Finalization: Final processing and validation of the complete report
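The section-by-section design above is what gives the pipeline its error isolation: each section is enhanced independently, and a failure in one section degrades only that section. A minimal sketch (the `build_report` and `enhance` names are hypothetical, not Shandu's API):

```python
def build_report(sections, enhance):
    """Enhance each section independently so one failure cannot sink the whole report."""
    out = {}
    for name, draft in sections.items():
        try:
            out[name] = enhance(draft)   # e.g. an LLM call that adds detail and depth
        except Exception:
            out[name] = draft            # fallback: keep the plain draft for this section
    return out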
Each step includes:
- Comprehensive error handling
- Automatic retries with exponential backoff
- Intelligent fallbacks when issues occur
- Progress tracking for transparency
- Validation to ensure quality output
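The retry-with-backoff and graceful-fallback behavior listed above can be captured in a small helper. This is a generic sketch of the pattern, not Shandu's internal implementation; `with_retries` and its parameters are illustrative:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, fallback=None):
    """Call fn, retrying with exponential backoff; return a fallback instead of halting."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                return fallback                    # graceful degradation, not a crash
            time.sleep(base_delay * 2 ** attempt)  # waits base_delay, 2x, 4x, ...
    return fallback
```

Wrapping each pipeline step (title generation, theme extraction, section enhancement) this way is what prevents a single flaky LLM call from stalling the whole report.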
Supported search engines:
- Google Search
- DuckDuckGo
- Wikipedia
- ArXiv (academic papers)
- Custom search engines can be added
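Fanning one query out to several engines in parallel and merging the results is straightforward with a thread pool. A minimal sketch, assuming each engine is a callable returning a list of URLs (`multi_search` is a hypothetical name, not Shandu's API):

```python
from concurrent.futures import ThreadPoolExecutor

def multi_search(query, engines):
    """Send a query to several engine callables in parallel and merge their results."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda engine: engine(query), engines)
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:      # de-duplicate hits returned by multiple engines
                seen.add(url)
                merged.append(url)
    return merged
```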
Web scraping features:
- Dynamic JS Rendering: Handles JavaScript-heavy websites
- Content Extraction: Identifies and extracts main content from web pages
- Parallel Processing: Concurrent execution of searches and scraping
- Caching: Efficient caching of search results and scraped content
- Rate Limiting: Respectful access to web resources
- Robots.txt Compliance: Ethical web scraping practices
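The two ethics features above — robots.txt compliance and rate limiting — are both available in the Python standard library. A minimal sketch using `urllib.robotparser` (the `make_fetch_checker` helper and the `ShanduBot` user-agent string are illustrative assumptions, not Shandu's actual scraper):

```python
import time
from urllib import robotparser

def make_fetch_checker(robots_txt, min_interval=1.0):
    """Return a checker that honors robots.txt and enforces a delay between fetches."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    last = [0.0]  # time of the previous allowed fetch

    def allowed(url, user_agent="ShanduBot"):
        if not rp.can_fetch(user_agent, url):
            return False                           # robots.txt forbids this path
        wait = min_interval - (time.monotonic() - last[0])
        if wait > 0:
            time.sleep(wait)                       # simple per-site rate limiting
        last[0] = time.monotonic()
        return True

    return allowed
```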
- Flexible Output Formats: Markdown, JSON, plain text
Improved version:
https://github.com/semukhin/deepresearch_shandu
```python
from shandu.agents import ResearchAgent
from langchain_openai import ChatOpenAI

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4")

# Initialize the research agent
agent = ResearchAgent(
    llm=llm,
    max_depth=3,   # How deep to go with recursive research
    breadth=4      # How many parallel queries to explore
)

# Perform deep research
results = agent.research_sync(
    query="Your research query",
    engines=["google", "duckduckgo"]
)

# Print results in markdown format
print(results.to_markdown())
```
https://zhuanlan.zhihu.com/p/27726648728