Architecture Overview

DealAI.lt is built on a multi-tier architecture that separates concerns into distinct layers: data collection, storage, processing, search, and presentation. Because the layers are decoupled, each one can be scaled, maintained, and tuned independently of the others.

┌─────────────────────────────────────────────────────────────────┐
│ Presentation Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ WordPress │ │ Admin │ │ Public │ │
│ │ Theme │ │ Dashboards │ │ Search UI │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Search │ │ Product │ │ Analytics │ │
│ │ Engine │ │ Management │ │ & Reports │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Data Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ Elasticsearch│ │ Scrapyd │ │
│ │ Database │ │ Search │ │ Crawler │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ External Sources │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ E-commerce │ │ E-commerce │ │ E-commerce │ │
│ │ Site A │ │ Site B │ │ Site C │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘

WordPress

Purpose: Main application framework and user interface

Components:

  • Custom Theme (/wp-content/themes/products/)
  • Page Templates for different functionalities
  • AJAX Handlers for dynamic interactions
  • Authentication & Authorization

Key Files:

  • functions.php - Core theme functions and hooks
  • page-*.php - Specialized page templates
  • /inc/ - Modular PHP includes

PostgreSQL Database

Purpose: Primary data storage for products and metadata

Server: 162.55.174.116:5432

Key Tables:

  • product - Main product catalog (60K+ records)
  • core_category - Hierarchical category structure
  • product_crawl_history - Historical price/availability tracking
  • product_screenshot - Screenshot metadata
  • category_product_crawl - Scraping queue management

Features:

  • ACID compliance for data integrity
  • Advanced indexing for performance
  • Full support for JSON/JSONB data types
  • Efficient time-series data handling

Elasticsearch

Purpose: Fast full-text search with Lithuanian language support

Server: 91.99.113.45:9200

Index Configuration:

  • Lithuanian Analyzer with snowball stemming
  • Multi-field mapping (title, brand, description, SKU)
  • Fuzzy matching for typo tolerance
  • Aggregations for faceted filtering
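
As an illustration of the index configuration listed above, the following is a minimal sketch of an index body with a Lithuanian analyzer and multi-field mappings. The analyzer name `lt_text`, the filter name `lt_stemmer`, and the `raw` sub-field are assumptions made for this example; only the four searchable fields come from the list above, and the real index definition may differ.

```python
def build_index_body() -> dict:
    """Return settings and mappings for a hypothetical product index."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Elasticsearch's built-in stemmer token filter,
                    # configured for Lithuanian.
                    "lt_stemmer": {"type": "stemmer", "language": "lithuanian"},
                },
                "analyzer": {
                    # Hypothetical analyzer name; lowercases, then stems.
                    "lt_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "lt_stemmer"],
                    },
                },
            },
        },
        "mappings": {
            "properties": {
                # Multi-field mapping: analyzed text for search plus an
                # exact-value keyword sub-field for aggregations.
                "title": {
                    "type": "text",
                    "analyzer": "lt_text",
                    "fields": {"raw": {"type": "keyword"}},
                },
                "brand": {
                    "type": "text",
                    "analyzer": "lt_text",
                    "fields": {"raw": {"type": "keyword"}},
                },
                "description": {"type": "text", "analyzer": "lt_text"},
                # SKUs are matched exactly, so they are not analyzed.
                "sku": {"type": "keyword"},
            },
        },
    }
```

The returned dictionary could be sent as the JSON body when creating the index. The keyword sub-fields are what make faceted aggregations on brand possible without analyzing the values.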

Synchronization:

  • Three-phase sync process
  • Batch processing (500 products/batch)
  • State persistence for resumable operations
  • Real-time monitoring dashboard
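
Batching with a resumable offset could look roughly like the sketch below. The function names, the `offset` state key, and the `index_batch` callback are hypothetical; only the batch size of 500 comes from the list above.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

BATCH_SIZE = 500  # batch size stated in the synchronization notes


def iter_batches(
    ids: Sequence[int], start: int = 0, size: int = BATCH_SIZE
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (next_offset, batch) pairs, beginning at a saved offset."""
    for begin in range(start, len(ids), size):
        batch = list(ids[begin:begin + size])
        yield begin + len(batch), batch


def run_sync(
    ids: Sequence[int], state: dict, index_batch: Callable[[List[int]], None]
) -> dict:
    """Index the remaining batches.

    The offset in `state` advances only after a batch succeeds, so an
    interrupted run resumes at the first batch that was not indexed.
    """
    for next_offset, batch in iter_batches(ids, state.get("offset", 0)):
        index_batch(batch)  # hypothetical callback that indexes one batch
        state["offset"] = next_offset
    return state
```

For example, calling `run_sync` with 1200 product IDs and a saved offset of 500 processes one batch of 500 and one of 200, then leaves the offset at 1200.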

Scrapyd Crawler

Purpose: Automated data collection from e-commerce sites

Server: 78.56.0.236:6800 (Scrapyd)

Components:

  • Scrapyd Daemon - Job scheduling and execution
  • Spider Collection - Python-based scrapers
  • Job Queue - Database-backed queue
  • Status Monitoring - Real-time job tracking

Process Flow:

  1. Queue products for scraping
  2. Schedule jobs with Scrapyd
  3. Execute spiders on remote server
  4. Collect and normalize data
  5. Update database with new information
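
Step 4 (collect and normalize) is where scraped strings become typed values. The sketch below assumes Lithuanian-style price formatting (comma as decimal separator, space or dot as thousands separator) and hypothetical item keys `title`, `price`, and `in_stock`; the actual spiders may emit different fields.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: str) -> Optional[Decimal]:
    """Parse a price such as '1 299,99 €' into Decimal('1299.99')."""
    # Drop currency symbols, letters, and all whitespace (including NBSP).
    cleaned = re.sub(r"[^\d,.]", "", raw)
    if not cleaned:
        return None
    if "," in cleaned:
        # Assumption: a comma is the decimal separator, dots group thousands.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def normalize_item(item: dict) -> dict:
    """Collapse whitespace in the title and convert the price to Decimal."""
    return {
        "title": " ".join(item.get("title", "").split()),
        "price": parse_price(item.get("price", "")),
        "in_stock": bool(item.get("in_stock", False)),
    }
```

Returning `None` for an unparseable price, rather than raising, lets the caller decide whether to skip the item or record it as a failed crawl.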

Purpose: Scheduled tasks for maintenance and updates

Cron Jobs:

  • Elasticsearch Sync (every 5 minutes)
  • Crawler Management (every 15 minutes)
  • Screenshot Capture (daily)
  • Category Updates (daily)

State Management:

  • JSON-based state persistence
  • Failed job tracking
  • Progress monitoring
  • Error recovery mechanisms
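
JSON-based state persistence with failed-job tracking might be sketched as follows. The state layout (`offset`, `failed_jobs`) and the function names are assumptions for illustration. Writing to a temporary file and then replacing the target is one common way to avoid a half-written state file after a crash.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Load saved state, or return an empty default on first run."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {"offset": 0, "failed_jobs": []}


def save_state(path: Path, state: dict) -> None:
    """Write state to a temp file in the same directory, then swap it in."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_name, path)


def record_failure(path: Path, job_id: str, error: str) -> dict:
    """Append a failed job to the persisted state and return the new state."""
    state = load_state(path)
    state["failed_jobs"].append({"job_id": job_id, "error": error})
    save_state(path, state)
    return state
```

A later cron run can read `failed_jobs` back with `load_state` and retry those entries before starting new work.
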

External Site → Scrapyd Spider → PostgreSQL → Elasticsearch → Search UI

Steps:

  1. Scraping: Spider extracts product data
  2. Storage: Raw data stored in PostgreSQL
  3. Indexing: Product indexed in Elasticsearch
  4. Search: User queries via search interface
  5. Results: Relevant products displayed

Scheduled Scan → Compare Prices → Store History → Generate Charts

Steps:

  1. Scheduled: Cron job triggers product rescan
  2. Comparison: New price compared to historical data
  3. Storage: Changes stored in product_crawl_history
  4. Visualization: Chart.js renders price trends
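
The comparison and storage steps can be sketched as below. An in-memory list stands in for the product_crawl_history table, and the `PricePoint` fields are hypothetical; the point is only that a row is stored when the price differs from the most recent one.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import List, Optional


@dataclass
class PricePoint:
    product_id: int
    price: Decimal
    crawled_at: datetime


def record_if_changed(
    history: List[PricePoint],
    product_id: int,
    new_price: Decimal,
    now: Optional[datetime] = None,
) -> bool:
    """Append a history row only when the price changed; return True if stored."""
    # Most recent entry for this product, if any (history is append-only).
    last = next((p for p in reversed(history) if p.product_id == product_id), None)
    if last is not None and last.price == new_price:
        return False
    history.append(
        PricePoint(product_id, new_price, now or datetime.now(timezone.utc))
    )
    return True
```

Storing only changes keeps the history compact, and a chart can still draw a step line between consecutive points.
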

User Query → WordPress Handler → Elasticsearch → Results Processing → Display

Steps:

  1. Input: User enters search query
  2. Processing: Query sanitized and enhanced
  3. Execution: Elasticsearch performs search
  4. Filtering: Apply category, price, brand filters
  5. Rendering: Results formatted and displayed
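
A request body covering steps 2 to 4 might look like the following sketch. It is written in Python for illustration, although the handler described here runs inside WordPress. The field names `category_id`, `price`, and `brand.raw`, and the boost values, are assumptions that do not appear in this document.

```python
from typing import Optional


def build_search_body(
    query: str,
    category_id: Optional[int] = None,
    brand: Optional[str] = None,
    min_price: Optional[float] = None,
    max_price: Optional[float] = None,
    size: int = 20,
) -> dict:
    """Build an Elasticsearch search body with fuzzy matching and filters."""
    filters = []
    if category_id is not None:
        filters.append({"term": {"category_id": category_id}})
    if brand:
        filters.append({"term": {"brand.raw": brand}})
    price_range = {}
    if min_price is not None:
        price_range["gte"] = min_price
    if max_price is not None:
        price_range["lte"] = max_price
    if price_range:
        filters.append({"range": {"price": price_range}})

    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            # Collapse repeated whitespace in the user input.
                            "query": " ".join(query.split()),
                            # Hypothetical boosts: title matches count most.
                            "fields": ["title^3", "brand^2", "description", "sku"],
                            # Typo tolerance scaled to term length.
                            "fuzziness": "AUTO",
                        }
                    }
                ],
                # Filters narrow results without affecting relevance scores.
                "filter": filters,
            }
        },
        # Bucket counts for a brand facet.
        "aggs": {"brands": {"terms": {"field": "brand.raw"}}},
    }
```

Placing the category, price, and brand conditions in the `filter` clause rather than `must` means they restrict the result set but do not change ranking.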

Purpose: Keep search index synchronized with database

Method: Automated synchronization script

Process:

  • Query products from PostgreSQL
  • Transform data for Elasticsearch
  • Batch index documents
  • Track sync status in database
  • Handle errors and retries
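
The transform, batch, and retry steps above could be sketched like this. The column names, the index name `products`, and the `send` callback are hypothetical. The payload format (one action line followed by one source line per document, newline-terminated) is the one the Elasticsearch bulk API expects.

```python
import json
import time
from typing import Callable, Iterable


def to_document(row: dict) -> dict:
    """Map a database row to the searchable fields (hypothetical columns)."""
    return {
        "title": row["title"],
        "brand": row.get("brand"),
        "description": row.get("description"),
        "sku": row.get("sku"),
    }


def build_bulk_body(rows: Iterable[dict], index: str = "products") -> str:
    """Build a newline-delimited JSON payload for the _bulk endpoint."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row["id"])}}))
        lines.append(json.dumps(to_document(row), ensure_ascii=False))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


def with_retries(
    send: Callable[[str], object],
    payload: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send(payload), retrying on OSError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except OSError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Using the database ID as the document `_id` makes re-indexing idempotent: sending the same product twice overwrites the document instead of duplicating it.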

Purpose: Manage and monitor scraping jobs

Method: HTTP API integration

Endpoints:

  • daemonstatus.json - Server health
  • listjobs.json - Job queue status
  • schedule.json - Schedule new jobs
  • cancel.json - Cancel running jobs
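
The four endpoints map onto simple HTTP requests. The sketch below only builds the requests, using the standard library; `SCRAPYD_URL` is a placeholder and the project, spider, and argument names in the usage note are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

SCRAPYD_URL = "http://localhost:6800"  # placeholder, not the production host


def daemonstatus_request() -> Request:
    """GET daemonstatus.json: server health."""
    return Request(f"{SCRAPYD_URL}/daemonstatus.json")


def listjobs_request(project: str) -> Request:
    """GET listjobs.json: pending, running, and finished jobs for a project."""
    return Request(f"{SCRAPYD_URL}/listjobs.json?{urlencode({'project': project})}")


def schedule_request(project: str, spider: str, **spider_args: str) -> Request:
    """POST schedule.json: extra keyword arguments are passed to the spider."""
    data = urlencode({"project": project, "spider": spider, **spider_args}).encode()
    return Request(f"{SCRAPYD_URL}/schedule.json", data=data, method="POST")


def cancel_request(project: str, job_id: str) -> Request:
    """POST cancel.json: cancel a pending or running job."""
    data = urlencode({"project": project, "job": job_id}).encode()
    return Request(f"{SCRAPYD_URL}/cancel.json", data=data, method="POST")
```

A request built this way, for example `schedule_request("shop", "products", category="phones")`, can be sent with `urllib.request.urlopen` and the JSON response decoded with `json.load`.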

Purpose: Dynamic user interactions

Method: AJAX with WordPress admin-ajax.php

Endpoints:

  • get_scrapyd_stats - Dashboard statistics
  • search_products - Product search
  • get_price_history - Historical pricing
  • update_product - Product modifications

Authentication & Authorization:

  • WordPress Users - Built-in user management
  • Role-Based Access - Admin-only console access
  • Nonce Verification - AJAX request validation
  • Session Management - WordPress session handling

Data Protection:

  • SQL Injection Prevention - Parameterized queries
  • XSS Protection - Output escaping
  • Input Validation - Server-side validation
  • HTTPS - Encrypted data transmission

Network Security:

  • Firewall Rules - Port restrictions
  • Private Networks - Database isolation
  • API Authentication - Service credentials
  • Rate Limiting - Abuse prevention
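
To make the parameterized-query and output-escaping items concrete, here is a small sketch of both ideas. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the `brand` column is an assumption; the application itself is PHP, so this illustrates the principle rather than the project's code.

```python
import html
import sqlite3
from typing import List


def find_titles_by_brand(conn: sqlite3.Connection, brand: str) -> List[str]:
    """Look up product titles with the user input bound as a parameter."""
    # The placeholder keeps user input out of the SQL text entirely.
    # sqlite3 uses "?"; PostgreSQL drivers use their own placeholder syntax.
    cursor = conn.execute("SELECT title FROM product WHERE brand = ?", (brand,))
    return [row[0] for row in cursor.fetchall()]


def render_title(title: str) -> str:
    """Escape a value before placing it in HTML output."""
    return f"<li>{html.escape(title)}</li>"
```

With binding, an input such as `x' OR '1'='1` is compared as a literal string and matches nothing, and a title containing markup is rendered as text instead of being executed.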

Horizontal Scaling

Database:

  • PostgreSQL replication (master-slave)
  • Read replicas for reporting
  • Connection pooling (PgBouncer)

Elasticsearch:

  • Multi-node cluster
  • Shard distribution
  • Replica configuration

Scrapyd:

  • Multiple Scrapyd servers
  • Load distribution
  • Spider deployment automation

Vertical Scaling

Application Server:

  • Increased memory for PHP
  • More CPU cores for processing
  • SSD storage for faster I/O

Database Server:

  • Larger RAM for caching
  • NVMe storage for performance
  • CPU upgrade for query processing

Caching

Application Level:

  • WordPress object cache (Redis/Memcached)
  • Database query results
  • API responses

CDN Level:

  • Static assets (CSS, JS, images)
  • Cacheable HTML pages
  • Geographic distribution

Monitoring

Application:

  • Error Logs - PHP error logging
  • Debug Logs - WordPress debug mode
  • Access Logs - Web server logs

PostgreSQL:

  • Query Performance - Slow query log
  • Connection Pool - Active connections
  • Storage Usage - Disk space monitoring

Elasticsearch:

  • Index Health - Cluster status
  • Query Performance - Search latency
  • Sync Status - Index freshness

Scrapyd:

  • Job Success Rate - Completion percentage
  • Error Tracking - Failed jobs
  • Performance Metrics - Items per second

Environments

Development:

  • Local WordPress installation
  • Local PostgreSQL database
  • Elasticsearch via Docker
  • Mock Scrapyd for testing

Staging:

  • Dedicated staging server
  • Database snapshot from production
  • Separate Elasticsearch index
  • Limited Scrapyd access

Production:

  • Load-balanced web servers
  • High-availability PostgreSQL
  • Multi-node Elasticsearch cluster
  • Distributed Scrapyd servers

Performance

Response Times:

  • Search Queries: < 100ms (p95)
  • Product Details: < 200ms (p95)
  • Dashboard Load: < 500ms (p95)
  • API Endpoints: < 150ms (p95)

Throughput:

  • Search: 100+ queries/second
  • Indexing: 500 products/minute
  • Scraping: 1000+ products/hour
  • API: 50+ requests/second

Processing:

  • Bulk Sync: 60K products in ~2 hours
  • Incremental: Near real-time (< 5 min delay)
  • Screenshot: 100 products/hour
  • Price History: Full catalog daily
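
The bulk sync figure follows from the indexing rate, as this quick check shows. It uses only numbers quoted in this document (60K products, 500 products per batch, 500 products per minute); the function name is made up for the example.

```python
import math
from typing import Tuple


def bulk_sync_estimate(
    total_products: int, batch_size: int, products_per_minute: int
) -> Tuple[int, float]:
    """Return (number of batches, estimated minutes) for a full re-index."""
    batches = math.ceil(total_products / batch_size)
    minutes = total_products / products_per_minute
    return batches, minutes


batches, minutes = bulk_sync_estimate(60_000, 500, 500)
print(batches, minutes / 60)  # 120 batches, 2.0 hours
```

At these rates a full sync is 120 batches of 500 products, or about one batch per minute for two hours.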