Architecture Overview

System Architecture

DealAI.lt is built on a multi-tier architecture that separates concerns into distinct layers: data collection, storage, processing, search, and presentation. Because each layer has a single responsibility, it can be scaled, maintained, and tuned independently of the others.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Presentation Layer                        │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│   │   WordPress  │  │  Admin       │  │  Public      │         │
│   │   Theme      │  │  Dashboards  │  │  Search UI   │         │
│   └──────────────┘  └──────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                      Application Layer                           │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│   │  Search      │  │  Product     │  │  Analytics   │         │
│   │  Engine      │  │  Management  │  │  & Reports   │         │
│   └──────────────┘  └──────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                         Data Layer                               │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│   │  PostgreSQL  │  │ Elasticsearch│  │   Scrapyd    │         │
│   │   Database   │  │  Search      │  │   Crawler    │         │
│   └──────────────┘  └──────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                      External Sources                            │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│   │  E-commerce  │  │  E-commerce  │  │  E-commerce  │         │
│   │   Site A     │  │   Site B     │  │   Site C     │         │
│   └──────────────┘  └──────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────────┘

Core Components

1. WordPress Application Layer

Purpose: Main application framework and user interface

Components:

Custom Theme (/wp-content/themes/products/)
Page Templates for different functionalities
AJAX Handlers for dynamic interactions
Authentication & Authorization

Key Files:

functions.php - Core theme functions and hooks
page-*.php - Specialized page templates
/inc/ - Modular PHP includes

2. Database Layer (PostgreSQL)

Purpose: Primary data storage for products and metadata

Server: 162.55.174.116:5432

Key Tables:

product - Main product catalog (60K+ records)
core_category - Hierarchical category structure
product_crawl_history - Historical price/availability tracking
product_screenshot - Screenshot metadata
category_product_crawl - Scraping queue management

Features:

ACID compliance for data integrity
Advanced indexing for performance
Full support for JSON/JSONB data types
Efficient time-series data handling

3. Search Layer (Elasticsearch)

Purpose: Fast full-text search with Lithuanian language support

Server: 91.99.113.45:9200

Index Configuration:

Lithuanian analyzer with Snowball stemming
Multi-field mapping (title, brand, description, SKU)
Fuzzy matching for typo tolerance
Aggregations for faceted filtering
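
As a rough illustration of the index configuration listed above, the index body might resemble the sketch below. It relies on Elasticsearch's built-in `lithuanian` language analyzer. The four field names come from the list; the `raw` keyword sub-fields, the shard and replica counts, and the helper name `build_index_body` are assumptions made for the example, not the project's actual mapping.

```python
def build_index_body(shards: int = 1, replicas: int = 0) -> dict:
    """Return a hypothetical index body for the product index."""

    def lithuanian_text(with_keyword: bool = False) -> dict:
        # Full-text field analyzed with the built-in "lithuanian" analyzer.
        field = {"type": "text", "analyzer": "lithuanian"}
        if with_keyword:
            # Assumed keyword sub-field, useful for exact filters and aggregations.
            field["fields"] = {"raw": {"type": "keyword"}}
        return field

    return {
        "settings": {
            "number_of_shards": shards,      # assumed value
            "number_of_replicas": replicas,  # assumed value
        },
        "mappings": {
            "properties": {
                "title": lithuanian_text(with_keyword=True),
                "brand": lithuanian_text(with_keyword=True),
                "description": lithuanian_text(),
                "sku": {"type": "keyword"},  # SKUs are matched as exact tokens
            }
        },
    }
```

The resulting dictionary would be sent as the request body when the index is created. Fuzzy matching and aggregations are query-time features, so they do not appear in the mapping itself.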

Synchronization:

Three-phase sync process
Batch processing (500 products/batch)
State persistence for resumable operations
Real-time monitoring dashboard
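
The batching and resume behaviour described above can be sketched as follows. The batch size of 500 is taken from the text; the function names and the idea of resuming from the last synced product ID are assumptions for illustration, since the actual sync script is not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator, List

BATCH_SIZE = 500  # batch size stated in the text


def in_batches(ids: Iterable[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` product IDs."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def resume_batches(ids: Iterable[int], last_synced_id: int) -> Iterator[List[int]]:
    """Skip IDs at or below the last synced ID, then batch the rest.

    Assumes IDs are processed in ascending order (a hypothetical convention).
    """
    return in_batches(i for i in ids if i > last_synced_id)
```

For example, 1,200 pending products would be processed as batches of 500, 500, and 200, and a run interrupted after ID 1000 would pick up again at ID 1001.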

4. Web Scraping Infrastructure

Purpose: Automated data collection from e-commerce sites

Server: 78.56.0.236:6800 (Scrapyd)

Components:

Scrapyd Daemon - Job scheduling and execution
Spider Collection - Python-based scrapers
Job Queue - Database-backed queue
Status Monitoring - Real-time job tracking

Process Flow:

Queue products for scraping
Schedule jobs with Scrapyd
Execute spiders on remote server
Collect and normalize data
Update database with new information

5. Automation Layer

Purpose: Scheduled tasks for maintenance and updates

Cron Jobs:

Elasticsearch Sync (every 5 minutes)
Crawler Management (every 15 minutes)
Screenshot Capture (daily)
Category Updates (daily)

State Management:

JSON-based state persistence
Failed job tracking
Progress monitoring
Error recovery mechanisms
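
A minimal sketch of JSON-based state persistence with failed-job tracking is shown below. The state keys (`last_synced_id`, `failed_jobs`) and the function names are hypothetical; the write-then-rename step is one common way to keep a state file intact if a cron job is interrupted mid-write, not necessarily what the project does.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the state file, or return a fresh state if it does not exist."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"last_synced_id": 0, "failed_jobs": []}  # assumed state layout


def save_state(path: Path, state: dict) -> None:
    """Write to a temporary file, then rename it over the real one.

    os.replace is atomic on the same filesystem, so readers never see
    a half-written state file.
    """
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)


def record_failure(state: dict, job_id: str, error: str) -> dict:
    """Return a new state with the failed job appended for a later retry."""
    failed = state["failed_jobs"] + [{"job_id": job_id, "error": error}]
    return {**state, "failed_jobs": failed}
```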

Data Flow Architecture

Product Discovery Flow

External Site → Scrapyd Spider → PostgreSQL → Elasticsearch → Search UI

Steps:

Scraping: Spider extracts product data
Storage: Raw data stored in PostgreSQL
Indexing: Product indexed in Elasticsearch
Search: User queries via search interface
Results: Relevant products displayed

Price History Flow

Scheduled Scan → Compare Prices → Store History → Generate Charts

Steps:

Scheduled: Cron job triggers product rescan
Comparison: New price compared to historical data
Storage: Changes stored in product_crawl_history
Visualization: Chart.js renders price trends
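
The comparison step above might look like the sketch below: a history row is produced only when the newly scraped price differs from the last stored one. The `PricePoint` fields are hypothetical and do not reflect the real columns of `product_crawl_history`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class PricePoint:
    # Hypothetical row shape; the real table columns may differ.
    product_id: int
    price: Decimal
    crawled_at: datetime


def detect_change(
    product_id: int,
    last_price: Optional[Decimal],
    new_price: Decimal,
    now: Optional[datetime] = None,
) -> Optional[PricePoint]:
    """Return a history row when the price changed, or None when it did not.

    A product with no stored price yet always produces a row.
    """
    if last_price is not None and new_price == last_price:
        return None
    return PricePoint(product_id, new_price, now or datetime.now(timezone.utc))
```

Prices are held as `Decimal` so that equal prices compare equal without floating-point rounding noise.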

Search Query Flow

User Query → WordPress Handler → Elasticsearch → Results Processing → Display

Steps:

Input: User enters search query
Processing: Query sanitized and enhanced
Execution: Elasticsearch performs search
Filtering: Category, price, and brand filters applied
Rendering: Results formatted and displayed
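
To make the processing, execution, and filtering steps concrete, here is a sketch of how a search request body could be assembled using the Elasticsearch query DSL. The field boosts, the `category_id` and `price` field names, and the `brand.raw` keyword sub-field are assumptions; only the searched fields (title, brand, description, SKU), fuzzy matching, and faceted aggregations are taken from this document.

```python
from typing import Optional


def build_search_query(
    text: str,
    category_id: Optional[int] = None,
    brand: Optional[str] = None,
    min_price: Optional[float] = None,
    max_price: Optional[float] = None,
    size: int = 20,
) -> dict:
    """Build a hypothetical search body: fuzzy full-text match plus exact filters."""
    filters = []
    if category_id is not None:
        filters.append({"term": {"category_id": category_id}})  # assumed field
    if brand:
        filters.append({"term": {"brand.raw": brand}})  # assumed keyword sub-field
    price_range = {}
    if min_price is not None:
        price_range["gte"] = min_price
    if max_price is not None:
        price_range["lte"] = max_price
    if price_range:
        filters.append({"range": {"price": price_range}})  # assumed field

    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text.strip(),
                            # Boost values are illustrative only.
                            "fields": ["title^3", "brand^2", "description", "sku"],
                            "fuzziness": "AUTO",  # typo tolerance
                        }
                    }
                ],
                # Filters narrow results without affecting relevance scores.
                "filter": filters,
            }
        },
        # Brand counts for faceted filtering in the UI.
        "aggs": {"brands": {"terms": {"field": "brand.raw", "size": 20}}},
    }
```

Placing category, price, and brand conditions in the `filter` clause rather than `must` keeps them out of scoring and lets Elasticsearch cache them.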

Integration Points

PostgreSQL ↔ Elasticsearch

Purpose: Keep search index synchronized with database

Method: Automated synchronization script

Process:

Query products from PostgreSQL
Transform data for Elasticsearch
Batch index documents
Track sync status in database
Handle errors and retries
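
The transform and batch-index steps can be illustrated with the sketch below, which turns database rows into the newline-delimited JSON payload that the Elasticsearch `_bulk` endpoint accepts. The column names (`id`, `title`, `brand`, `price`) and the index name used in the example are assumptions.

```python
import json
from typing import Iterable, Mapping


def to_document(row: Mapping) -> dict:
    """Map a database row to a search document (column names are assumptions)."""
    price = row.get("price")
    return {
        "title": row["title"],
        "brand": row.get("brand"),
        "price": float(price) if price is not None else None,
    }


def build_bulk_body(rows: Iterable[Mapping], index: str) -> str:
    """Build a _bulk payload: one action line, then one source line, per row."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row["id"])}}))
        lines.append(json.dumps(to_document(row), ensure_ascii=False))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline
```

Using the database primary key as the document `_id` makes re-indexing idempotent: a retried batch overwrites the same documents instead of creating duplicates.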

WordPress ↔ Scrapyd

Purpose: Manage and monitor scraping jobs

Method: HTTP API integration

Endpoints:

daemonstatus.json - Server health
listjobs.json - Job queue status
schedule.json - Schedule new jobs
cancel.json - Cancel running jobs
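
The four endpoints above map onto plain HTTP requests. The sketch below only builds the requests, using the parameters Scrapyd documents for each endpoint (`project`, `spider`, `job`); the base URL is a placeholder, and the project and spider names in the usage note are made up.

```python
from urllib.parse import urlencode
from urllib.request import Request

SCRAPYD_URL = "http://localhost:6800"  # placeholder; point this at the real Scrapyd host


def daemon_status() -> Request:
    """GET daemonstatus.json: overall server health."""
    return Request(f"{SCRAPYD_URL}/daemonstatus.json", method="GET")


def list_jobs(project: str) -> Request:
    """GET listjobs.json: pending, running, and finished jobs for a project."""
    query = urlencode({"project": project})
    return Request(f"{SCRAPYD_URL}/listjobs.json?{query}", method="GET")


def schedule(project: str, spider: str, **spider_args: str) -> Request:
    """POST schedule.json: extra keyword arguments are passed to the spider."""
    data = urlencode({"project": project, "spider": spider, **spider_args}).encode()
    return Request(f"{SCRAPYD_URL}/schedule.json", data=data, method="POST")


def cancel(project: str, job_id: str) -> Request:
    """POST cancel.json: stop a pending or running job."""
    data = urlencode({"project": project, "job": job_id}).encode()
    return Request(f"{SCRAPYD_URL}/cancel.json", data=data, method="POST")
```

A request such as `schedule("dealai", "shop_a", category="phones")` can then be sent with `urllib.request.urlopen`, and the JSON response parsed to read the job ID.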

Frontend ↔ Backend

Purpose: Dynamic user interactions

Method: AJAX with WordPress admin-ajax.php

Endpoints:

get_scrapyd_stats - Dashboard statistics
search_products - Product search
get_price_history - Historical pricing
update_product - Product modifications
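
WordPress routes every `admin-ajax.php` call by its `action` field, so each endpoint above is reached by posting that field along with the handler's own parameters. The sketch below builds such a request from a script, for example in an integration test. The `nonce` field name and the `product_id` parameter are assumptions; the real handlers may expect different names.

```python
from urllib.parse import urlencode
from urllib.request import Request


def ajax_request(site_url: str, action: str, nonce: str, **params: str) -> Request:
    """Build a POST to admin-ajax.php for the given action.

    `nonce` is sent under a hypothetical field name; check the handler
    for the name it actually verifies.
    """
    fields = {"action": action, "nonce": nonce, **params}
    return Request(
        f"{site_url.rstrip('/')}/wp-admin/admin-ajax.php",
        data=urlencode(fields).encode(),
        method="POST",
    )
```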

Security Architecture

Authentication & Authorization

WordPress Users - Built-in user management
Role-Based Access - Admin-only console access
Nonce Verification - AJAX request validation
Session Management - WordPress session handling

Data Security

SQL Injection Prevention - Parameterized queries
XSS Protection - Output escaping
Input Validation - Server-side validation
HTTPS - Encrypted data transmission
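
The first two measures above can be demonstrated in a few lines. This sketch uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for PostgreSQL so that it runs anywhere; the table layout is simplified and hypothetical, and PostgreSQL drivers use a different placeholder syntax.

```python
import sqlite3
from html import escape


def find_by_title(conn: sqlite3.Connection, title: str) -> list:
    """Bind user input as a parameter instead of formatting it into the SQL string."""
    cursor = conn.execute("SELECT id, title FROM product WHERE title = ?", (title,))
    return cursor.fetchall()


def render_title(title: str) -> str:
    """Escape HTML metacharacters before placing text in a page."""
    return f"<h2>{escape(title)}</h2>"


# In-memory stand-in database with a single, simplified table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO product (title) VALUES (?)", ("Laptop",))
```

Because the input is bound as data, a classic injection string such as `' OR '1'='1` is compared literally against the title column and matches nothing.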

Network Security

Firewall Rules - Port restrictions
Private Networks - Database isolation
API Authentication - Service credentials
Rate Limiting - Abuse prevention

Scalability Considerations

Horizontal Scaling

Database:

PostgreSQL replication (primary-standby)
Read replicas for reporting
Connection pooling (PgBouncer)

Elasticsearch:

Multi-node cluster
Shard distribution
Replica configuration

Scrapyd:

Multiple Scrapyd servers
Load distribution
Spider deployment automation

Vertical Scaling

Application Server:

Increased memory for PHP
More CPU cores for processing
SSD storage for faster I/O

Database Server:

Larger RAM for caching
NVMe storage for performance
CPU upgrade for query processing

Caching Strategies

Application Level:

WordPress object cache (Redis/Memcached)
Database query results
API responses

CDN Level:

Static assets (CSS, JS, images)
Cacheable HTML pages
Geographic distribution

Monitoring & Observability

Application Monitoring

Error Logs - PHP error logging
Debug Logs - WordPress debug mode
Access Logs - Web server logs

Database Monitoring

Query Performance - Slow query log
Connection Pool - Active connections
Storage Usage - Disk space monitoring

Search Monitoring

Index Health - Cluster status
Query Performance - Search latency
Sync Status - Index freshness

Scraping Monitoring

Job Success Rate - Completion percentage
Error Tracking - Failed jobs
Performance Metrics - Items per second
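
The scraping metrics above reduce to simple ratios over finished jobs. The following sketch assumes a hypothetical `JobResult` record; how the real dashboard derives these values from Scrapyd job data is not described in this document.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JobResult:
    # Hypothetical summary of one finished scraping job.
    succeeded: bool
    items: int
    seconds: float


def success_rate(jobs: Sequence[JobResult]) -> float:
    """Percentage of jobs that completed successfully (0.0 for no jobs)."""
    if not jobs:
        return 0.0
    return 100.0 * sum(job.succeeded for job in jobs) / len(jobs)


def items_per_second(jobs: Sequence[JobResult]) -> float:
    """Total items scraped divided by total runtime across all jobs."""
    total_seconds = sum(job.seconds for job in jobs)
    if not total_seconds:
        return 0.0
    return sum(job.items for job in jobs) / total_seconds
```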

Deployment Architecture

Development Environment

Local WordPress installation
Local PostgreSQL database
Elasticsearch via Docker
Mock Scrapyd for testing

Staging Environment

Dedicated staging server
Database snapshot from production
Separate Elasticsearch index
Limited Scrapyd access

Production Environment

Load-balanced web servers
High-availability PostgreSQL
Multi-node Elasticsearch cluster
Distributed Scrapyd servers

Performance Characteristics

Response Times

Search Queries: < 100ms (p95)
Product Details: < 200ms (p95)
Dashboard Load: < 500ms (p95)
API Endpoints: < 150ms (p95)
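
The targets above are 95th-percentile (p95) values, meaning that 95% of requests should finish within the stated time. One simple way to check a set of measured latencies against such a target is the nearest-rank method sketched below; this is an illustrative calculation, not a description of how the figures in this document were measured.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile.

    Returns the smallest sample such that at least `pct` percent of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(samples: Sequence[float], target_ms: float) -> bool:
    """True when the p95 latency is strictly below the target."""
    return percentile(samples, 95) < target_ms
```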

Throughput

Search: 100+ queries/second
Indexing: 500 products/minute
Scraping: 1000+ products/hour
API: 50+ requests/second

Data Processing

Bulk Sync: 60K products in ~2 hours
Incremental: Near real-time (< 5 min delay)
Screenshot: 100 products/hour
Price History: Full catalog daily
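
The bulk-sync figure can be cross-checked against the indexing throughput and batch size quoted elsewhere in this document. The short calculation below assumes a constant indexing rate and a catalog of exactly 60,000 products, which simplifies the "60K+" figure.

```python
import math

CATALOG_SIZE = 60_000      # "60K+ records", rounded down for the estimate
BATCH_SIZE = 500           # products per batch
INDEX_RATE_PER_MIN = 500   # products per minute

# Number of batches needed to cover the whole catalog.
batches = math.ceil(CATALOG_SIZE / BATCH_SIZE)

# Time for a full sync at the stated indexing rate.
bulk_sync_hours = CATALOG_SIZE / INDEX_RATE_PER_MIN / 60
```

This gives 120 batches and 2.0 hours, which matches the "~2 hours" stated for a bulk sync and works out to roughly one batch per minute.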

Next Steps

Technology Stack - Detailed technology overview
Installation - Set up your environment
Database System - PostgreSQL deep dive
Elasticsearch - Search implementation