Architecture Overview
System Architecture
DealAI.lt is built on a multi-tier architecture that separates concerns into distinct layers: data collection, storage, processing, search, and presentation. Because the layers are decoupled, each can be scaled, maintained, and tuned for performance independently.
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│                       Presentation Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │  WordPress   │  │    Admin     │  │    Public    │           │
│  │    Theme     │  │  Dashboards  │  │  Search UI   │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                        Application Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │    Search    │  │   Product    │  │  Analytics   │           │
│  │    Engine    │  │  Management  │  │  & Reports   │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                           Data Layer                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │  PostgreSQL  │  │ Elasticsearch│  │   Scrapyd    │           │
│  │   Database   │  │    Search    │  │   Crawler    │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                        External Sources                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │  E-commerce  │  │  E-commerce  │  │  E-commerce  │           │
│  │    Site A    │  │    Site B    │  │    Site C    │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└─────────────────────────────────────────────────────────────────┘
Core Components
1. WordPress Application Layer
Purpose: Main application framework and user interface
Components:
- Custom Theme (/wp-content/themes/products/)
- Page Templates for different functionalities
- AJAX Handlers for dynamic interactions
- Authentication & Authorization
Key Files:
- functions.php - Core theme functions and hooks
- page-*.php - Specialized page templates
- /inc/ - Modular PHP includes
2. Database Layer (PostgreSQL)
Purpose: Primary data storage for products and metadata
Server: 162.55.174.116:5432
Key Tables:
- product - Main product catalog (60K+ records)
- core_category - Hierarchical category structure
- product_crawl_history - Historical price/availability tracking
- product_screenshot - Screenshot metadata
- category_product_crawl - Scraping queue management
Features:
- ACID compliance for data integrity
- Advanced indexing for performance
- Full support for JSON/JSONB data types
- Efficient time-series data handling
3. Search Layer (Elasticsearch)
Purpose: Fast full-text search with Lithuanian language support
Server: 91.99.113.45:9200
Index Configuration:
- Lithuanian Analyzer with snowball stemming
- Multi-field mapping (title, brand, description, SKU)
- Fuzzy matching for typo tolerance
- Aggregations for faceted filtering
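As a rough illustration, the index configuration above could be expressed as a mapping body like the one below. This is a sketch under assumptions: the field names come from the list, but the `keyword` sub-fields and the choice to store the SKU as a keyword are hypothetical, and the real index may differ. It relies on the built-in Elasticsearch `lithuanian` language analyzer.

```python
import json


def _lithuanian_text(with_keyword: bool = False) -> dict:
    """Text field analyzed with Elasticsearch's built-in 'lithuanian' analyzer."""
    field = {"type": "text", "analyzer": "lithuanian"}
    if with_keyword:
        # Hypothetical exact-match sub-field, useful for aggregations (facets).
        field["fields"] = {"keyword": {"type": "keyword"}}
    return field


def build_index_body() -> dict:
    """Return a mapping for the four searchable fields named in the list above."""
    return {
        "mappings": {
            "properties": {
                "title": _lithuanian_text(with_keyword=True),
                "brand": _lithuanian_text(with_keyword=True),
                "description": _lithuanian_text(),
                "sku": {"type": "keyword"},  # assumption: SKUs are matched exactly
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```

The resulting dictionary is what would be sent as the JSON body when creating the index.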
Synchronization:
- Three-phase sync process
- Batch processing (500 products/batch)
- State persistence for resumable operations
- Real-time monitoring dashboard
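The batch and resume behaviour described above can be sketched as follows. This is a minimal illustration, not the actual sync script: `run_sync`, `index_batch`, and the `offset` key are hypothetical names, and the state is kept in a plain dictionary rather than persisted.

```python
from typing import Callable, Dict, Iterator, List

BATCH_SIZE = 500  # batch size quoted in the list above


def batches(ids: List[int], start: int = 0, size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids`, beginning at position `start`."""
    for offset in range(start, len(ids), size):
        yield ids[offset:offset + size]


def run_sync(ids: List[int], index_batch: Callable[[List[int]], None],
             state: Dict[str, int]) -> Dict[str, int]:
    """Index every batch after state['offset'], advancing the offset only
    once a batch has been indexed successfully."""
    for batch in batches(ids, state.get("offset", 0)):
        index_batch(batch)  # may raise; the offset then stays at the last good batch
        state["offset"] = state.get("offset", 0) + len(batch)
    return state
```

If `index_batch` raises partway through, calling `run_sync` again with the same `state` continues from the first batch that was not indexed.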
4. Web Scraping Infrastructure
Purpose: Automated data collection from e-commerce sites
Server: 78.56.0.236:6800 (Scrapyd)
Components:
- Scrapyd Daemon - Job scheduling and execution
- Spider Collection - Python-based scrapers
- Job Queue - Database-backed queue
- Status Monitoring - Real-time job tracking
Process Flow:
- Queue products for scraping
- Schedule jobs with Scrapyd
- Execute spiders on remote server
- Collect and normalize data
- Update database with new information
5. Automation Layer
Purpose: Scheduled tasks for maintenance and updates
Cron Jobs:
- Elasticsearch Sync (every 5 minutes)
- Crawler Management (every 15 minutes)
- Screenshot Capture (daily)
- Category Updates (daily)
State Management:
- JSON-based state persistence
- Failed job tracking
- Progress monitoring
- Error recovery mechanisms
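One way JSON-based state with failed-job tracking might look is shown below. The file layout (`last_id`, `failed`) and the function names are assumptions made for illustration. The write goes through a temporary file followed by `os.replace`, so a crash mid-write should not leave a half-written state file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the state file, or return a fresh state if none exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"last_id": 0, "failed": []}


def save_state(path: Path, state: dict) -> None:
    """Write the state to a temp file in the same directory, then swap it in."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def record_failure(state: dict, job_id: str, error: str) -> None:
    """Remember a failed job so a later run can retry it."""
    state["failed"].append({"job": job_id, "error": error})
```

A cron-driven task would call `load_state` at start-up, update the dictionary as it works, and call `save_state` after each unit of progress.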
Data Flow Architecture
Product Discovery Flow
External Site → Scrapyd Spider → PostgreSQL → Elasticsearch → Search UI
Steps:
- Scraping: Spider extracts product data
- Storage: Raw data stored in PostgreSQL
- Indexing: Product indexed in Elasticsearch
- Search: User queries via search interface
- Results: Relevant products displayed
Price History Flow
Scheduled Scan → Compare Prices → Store History → Generate Charts
Steps:
- Scheduled: Cron job triggers product rescan
- Comparison: New price compared to historical data
- Storage: Changes stored in product_crawl_history
- Visualization: Chart.js renders price trends
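The compare-and-store steps can be sketched as below. An in-memory list stands in for the `product_crawl_history` table, and the `PricePoint` fields are hypothetical; the real columns are not specified in this page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import List, Optional


@dataclass
class PricePoint:
    product_id: int
    price: Decimal
    seen_at: datetime


def record_if_changed(history: List[PricePoint], product_id: int,
                      new_price: Decimal,
                      now: Optional[datetime] = None) -> bool:
    """Append a history row only when the price differs from the most
    recent row for this product. Returns True if a row was added."""
    last = next((p for p in reversed(history) if p.product_id == product_id), None)
    if last is not None and last.price == new_price:
        return False  # unchanged price: nothing to store
    history.append(PricePoint(product_id, new_price, now or datetime.now(timezone.utc)))
    return True
```

Storing only changes keeps the history compact, at the cost of not recording that an unchanged price was re-confirmed on a given day.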
Search Query Flow
User Query → WordPress Handler → Elasticsearch → Results Processing → Display
Steps:
- Input: User enters search query
- Processing: Query sanitized and enhanced
- Execution: Elasticsearch performs search
- Filtering: Apply category, price, brand filters
- Rendering: Results formatted and displayed
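The processing and filtering steps could produce an Elasticsearch request body along these lines. This is a sketch: the field names `category`, `price`, and `brand.keyword`, the boost values, and the 200-character limit are all assumptions. It uses the standard `bool`, `multi_match`, `term`, `range`, and `terms` aggregation constructs.

```python
import re
from typing import Optional


def sanitize(query: str) -> str:
    """Replace control characters, collapse whitespace, and cap the length."""
    cleaned = re.sub(r"[\x00-\x1f\x7f]", " ", query)
    return re.sub(r"\s+", " ", cleaned).strip()[:200]


def build_search(query: str, category: Optional[str] = None,
                 brand: Optional[str] = None,
                 price_min: Optional[float] = None,
                 price_max: Optional[float] = None) -> dict:
    """Build a bool query: fuzzy full-text match plus exact filters."""
    filters = []
    if category:
        filters.append({"term": {"category": category}})
    if brand:
        filters.append({"term": {"brand.keyword": brand}})
    if price_min is not None or price_max is not None:
        bounds = {}
        if price_min is not None:
            bounds["gte"] = price_min
        if price_max is not None:
            bounds["lte"] = price_max
        filters.append({"range": {"price": bounds}})
    return {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": sanitize(query),
                        "fields": ["title^3", "brand^2", "description", "sku"],
                        "fuzziness": "AUTO",  # typo tolerance
                    }
                }],
                "filter": filters,  # filters do not affect relevance scoring
            }
        },
        "aggs": {"brands": {"terms": {"field": "brand.keyword"}}},
    }
```

Placing category, brand, and price conditions under `filter` rather than `must` keeps them out of relevance scoring and lets Elasticsearch cache them.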
Integration Points
PostgreSQL ↔ Elasticsearch
Purpose: Keep search index synchronized with database
Method: Automated synchronization script
Process:
- Query products from PostgreSQL
- Transform data for Elasticsearch
- Batch index documents
- Track sync status in database
- Handle errors and retries
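The transform and batch-index steps can be illustrated with the newline-delimited body that the Elasticsearch Bulk API expects: one action line followed by one document line per product, with a trailing newline. The row keys and the `products` index name below are hypothetical.

```python
import json
from typing import Iterable


def to_document(row: dict) -> dict:
    """Map a database row (assumed keys) to the document that gets indexed."""
    price = row.get("price")
    return {
        "title": row["title"],
        "brand": row.get("brand"),
        "sku": row.get("sku"),
        "price": float(price) if price is not None else None,
    }


def bulk_body(rows: Iterable[dict], index: str = "products") -> str:
    """Serialize rows as NDJSON for a _bulk request."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": row["id"]}}))
        lines.append(json.dumps(to_document(row), ensure_ascii=False))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Using the PostgreSQL primary key as `_id` makes re-indexing idempotent: syncing the same product twice overwrites the document instead of duplicating it.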
WordPress ↔ Scrapyd
Purpose: Manage and monitor scraping jobs
Method: HTTP API integration
Endpoints:
- daemonstatus.json - Server health
- listjobs.json - Job queue status
- schedule.json - Schedule new jobs
- cancel.json - Cancel running jobs
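For reference, the four Scrapyd endpoints are called as follows: `daemonstatus.json` and `listjobs.json` are GET requests, while `schedule.json` and `cancel.json` are form-encoded POST requests. The sketch below only builds the requests without sending them; the `ScrapydClient` class and the project and spider names in the usage note are hypothetical, and the production integration described here is done from WordPress rather than Python.

```python
from urllib.parse import urlencode
from urllib.request import Request


class ScrapydClient:
    """Builds urllib Request objects for the Scrapyd HTTP API."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def _get(self, endpoint: str, **params: str) -> Request:
        query = f"?{urlencode(params)}" if params else ""
        return Request(f"{self.base_url}/{endpoint}{query}", method="GET")

    def _post(self, endpoint: str, **params: str) -> Request:
        data = urlencode(params).encode("utf-8")
        return Request(f"{self.base_url}/{endpoint}", data=data, method="POST")

    def daemon_status(self) -> Request:
        return self._get("daemonstatus.json")

    def list_jobs(self, project: str) -> Request:
        return self._get("listjobs.json", project=project)

    def schedule(self, project: str, spider: str, **spider_args: str) -> Request:
        # Extra keyword arguments are forwarded to the spider as arguments.
        return self._post("schedule.json", project=project, spider=spider, **spider_args)

    def cancel(self, project: str, job: str) -> Request:
        return self._post("cancel.json", project=project, job=job)
```

For example, `ScrapydClient("http://localhost:6800").schedule("dealai", "shop_a")` returns a request that can be passed to `urllib.request.urlopen`; the JSON response to `schedule.json` includes the new job's ID.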
Frontend ↔ Backend
Purpose: Dynamic user interactions
Method: AJAX with WordPress admin-ajax.php
Endpoints:
- get_scrapyd_stats - Dashboard statistics
- search_products - Product search
- get_price_history - Historical pricing
- update_product - Product modifications
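Requests to `admin-ajax.php` select their handler through the `action` form field, so each endpoint name above would be sent as that field's value. The sketch below builds such a request from a test client's point of view. The `_ajax_nonce` field name is one that WordPress's `check_ajax_referer` accepts by default, but the field this theme actually uses is not stated here, and `product_id` is a hypothetical parameter.

```python
from urllib.parse import urlencode
from urllib.request import Request


def ajax_request(site_url: str, action: str, nonce: str, **fields: object) -> Request:
    """Build a form-encoded POST to WordPress's admin-ajax.php."""
    payload = {"action": action, "_ajax_nonce": nonce, **fields}
    return Request(
        f"{site_url.rstrip('/')}/wp-admin/admin-ajax.php",
        data=urlencode(payload).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

A call such as `ajax_request("https://example.test", "get_price_history", nonce, product_id=42)` yields a request that the server would dispatch to the handler registered for `get_price_history`.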
Security Architecture
Authentication & Authorization
- WordPress Users - Built-in user management
- Role-Based Access - Admin-only console access
- Nonce Verification - AJAX request validation
- Session Management - WordPress session handling
Data Security
- SQL Injection Prevention - Parameterized queries
- XSS Protection - Output escaping
- Input Validation - Server-side validation
- HTTPS - Encrypted data transmission
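To illustrate what parameterized queries protect against, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server. The `product` table name comes from this page, but its columns here are hypothetical. With a PostgreSQL driver such as psycopg the placeholder is `%s` instead of `?`, but the principle is the same: the value is bound by the driver and never spliced into the SQL text.

```python
import sqlite3
from typing import List, Tuple


def find_by_brand(conn: sqlite3.Connection, brand: str) -> List[Tuple[int, str]]:
    """Look up products by brand using a bound parameter."""
    cursor = conn.execute(
        "SELECT id, title FROM product WHERE brand = ? ORDER BY id",
        (brand,),  # bound value, treated strictly as data
    )
    return cursor.fetchall()


# In-memory demo database with two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO product (title, brand) VALUES (?, ?)",
    [("Phone", "Acme"), ("Laptop", "Globex")],
)
```

An input such as `Acme' OR '1'='1` is compared literally against the `brand` column and matches nothing, whereas string concatenation would have turned it into a condition that returns every row.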
Network Security
- Firewall Rules - Port restrictions
- Private Networks - Database isolation
- API Authentication - Service credentials
- Rate Limiting - Abuse prevention
Scalability Considerations
Horizontal Scaling
Database:
- PostgreSQL replication (master-slave)
- Read replicas for reporting
- Connection pooling (PgBouncer)
Elasticsearch:
- Multi-node cluster
- Shard distribution
- Replica configuration
Scrapyd:
- Multiple Scrapyd servers
- Load distribution
- Spider deployment automation
Vertical Scaling
Application Server:
- Increased memory for PHP
- More CPU cores for processing
- SSD storage for faster I/O
Database Server:
- Larger RAM for caching
- NVMe storage for performance
- CPU upgrade for query processing
Caching Strategies
Application Level:
- WordPress object cache (Redis/Memcached)
- Database query results
- API responses
CDN Level:
- Static assets (CSS, JS, images)
- Cacheable HTML pages
- Geographic distribution
Monitoring & Observability
Application Monitoring
- Error Logs - PHP error logging
- Debug Logs - WordPress debug mode
- Access Logs - Web server logs
Database Monitoring
- Query Performance - Slow query log
- Connection Pool - Active connections
- Storage Usage - Disk space monitoring
Search Monitoring
- Index Health - Cluster status
- Query Performance - Search latency
- Sync Status - Index freshness
Scraping Monitoring
- Job Success Rate - Completion percentage
- Error Tracking - Failed jobs
- Performance Metrics - Items per second
Deployment Architecture
Development Environment
- Local WordPress installation
- Local PostgreSQL database
- Elasticsearch via Docker
- Mock Scrapyd for testing
Staging Environment
- Dedicated staging server
- Database snapshot from production
- Separate Elasticsearch index
- Limited Scrapyd access
Production Environment
- Load-balanced web servers
- High-availability PostgreSQL
- Multi-node Elasticsearch cluster
- Distributed Scrapyd servers
Performance Characteristics
Response Times
- Search Queries: < 100ms (p95)
- Product Details: < 200ms (p95)
- Dashboard Load: < 500ms (p95)
- API Endpoints: < 150ms (p95)
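The p95 figures above mean that 95% of requests complete within the stated time. A minimal way to check a set of measured latencies against such a target is the nearest-rank percentile sketched below; monitoring tools may use a different interpolation method, and the function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(samples_ms: Sequence[float], target_ms: float, pct: int = 95) -> bool:
    """True when the pct-th percentile latency is under the target."""
    return percentile(samples_ms, pct) < target_ms
```

With only a handful of samples a single slow request dominates the p95, so targets like these are only meaningful over a reasonably large measurement window.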
Throughput
- Search: 100+ queries/second
- Indexing: 500 products/minute
- Scraping: 1000+ products/hour
- API: 50+ requests/second
Data Processing
- Bulk Sync: 60K products in ~2 hours
- Incremental: Near real-time (< 5 min delay)
- Screenshot: 100 products/hour
- Price History: Full catalog daily
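The bulk-sync figure follows from the indexing throughput and batch size quoted earlier on this page, as the short calculation below shows. The last function is an added observation rather than a documented figure: rescanning the full catalog every day implies a sustained scraping rate of about 2,500 products per hour, which sits above the 1,000 products/hour floor listed under Throughput.

```python
import math


def batch_count(catalog_size: int, batch_size: int) -> int:
    """Number of batches needed to cover the catalog."""
    return math.ceil(catalog_size / batch_size)


def bulk_sync_hours(catalog_size: int, products_per_minute: int) -> float:
    """Duration of a full sync at a constant indexing rate."""
    return catalog_size / products_per_minute / 60


def required_hourly_rate(catalog_size: int, hours: int) -> float:
    """Products per hour needed to cover the catalog in the given time."""
    return catalog_size / hours


if __name__ == "__main__":
    print(batch_count(60_000, 500))          # batches of 500 products
    print(bulk_sync_hours(60_000, 500))      # hours at 500 products/minute
    print(required_hourly_rate(60_000, 24))  # products/hour for a daily rescan
```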
Next Steps
- Technology Stack - Detailed technology overview
- Installation - Set up your environment
- Database System - PostgreSQL deep dive
- Elasticsearch - Search implementation