Automation & Cron Jobs
Automation Overview
DealAI.lt relies on automated scheduled tasks (cron jobs) to keep data fresh, maintain search indexes, and manage scraping operations without manual intervention.
Cron Job Schedule
Production Cron Configuration
# Elasticsearch Auto-Sync (every 5 minutes)
*/5 * * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/elasticsearch-auto-sync.php >> /var/log/dealai/elasticsearch-sync.log 2>&1

# Product Crawler Manager (every 15 minutes)
*/15 * * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/product-crawler-manager.php >> /var/log/dealai/crawler-manager.log 2>&1

# Screenshot Manager (daily at 2 AM)
0 2 * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/product-screenshot-manager.php >> /var/log/dealai/screenshot-manager.log 2>&1

# Category Update (daily at 3 AM)
0 3 * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/cron/update-core-categories.php >> /var/log/dealai/category-update.log 2>&1

# Database Maintenance (weekly, Sunday 4 AM)
0 4 * * 0 /usr/bin/psql -h 162.55.174.116 -U dealai_user -d dealai_products -c "VACUUM ANALYZE;" >> /var/log/dealai/db-maintenance.log 2>&1

# Log Rotation (daily at 5 AM)
0 5 * * * /usr/sbin/logrotate /etc/logrotate.d/dealai >> /var/log/dealai/logrotate.log 2>&1
Automation Scripts
Elasticsearch Auto-Sync
File: /scripts/elasticsearch-auto-sync.php
Frequency: Every 5 minutes
Purpose: Maintain search index freshness
Features:
- Three-phase sync (Initialize → Bulk → Incremental)
- State persistence for resumable operations
- Batch processing (500 products/batch)
- Failed batch tracking
- Memory and time limit management
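The batching, failed-batch tracking, and state persistence listed above could fit together roughly as follows. This is a minimal sketch, not the script's actual code: the function names, the state keys (`offset`, `failed_batches`), and the `$index_batch` callable are all assumptions made for illustration.

```php
<?php
/** Load the saved state, or start fresh in the 'initialize' phase. */
function load_state_sketch(string $file): array
{
    $default = ['phase' => 'initialize', 'offset' => 0, 'failed_batches' => [], 'last_updated' => 0];
    if (!is_file($file)) {
        return $default;
    }
    $decoded = json_decode((string) file_get_contents($file), true);
    return is_array($decoded) ? array_merge($default, $decoded) : $default;
}

/** Save the state by writing a temp file and renaming it over the target. */
function save_state_sketch(string $file, array $state): void
{
    $state['last_updated'] = time();
    $tmp = $file . '.tmp';
    file_put_contents($tmp, json_encode($state, JSON_PRETTY_PRINT));
    rename($tmp, $file);
}

/**
 * Process IDs in fixed-size batches, starting from the saved offset.
 * $index_batch is a hypothetical callable that returns true on success.
 * Progress is saved after every batch so a later run can resume.
 */
function run_batches_sketch(array $product_ids, callable $index_batch, string $file, int $batch_size = 500): array
{
    $state = load_state_sketch($file);
    $remaining = array_slice($product_ids, $state['offset']);

    foreach (array_chunk($remaining, $batch_size) as $batch) {
        if (!$index_batch($batch)) {
            // Record where the failed batch started so it can be retried later.
            $state['failed_batches'][] = $state['offset'];
        }
        $state['offset'] += count($batch);
        save_state_sketch($file, $state);
    }

    return $state;
}
```

Saving after each batch costs one small file write per 500 products, which is usually a fair price for being able to resume after a timeout or crash.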
Execution Flow:
// Check if already running
if (is_sync_running()) {
    exit("Sync already in progress\n");
}

// Load state
$state = load_sync_state();

// Execute appropriate phase
switch ($state['phase']) {
    case 'initialize':
        initialize_sync();
        break;

    case 'bulk':
        bulk_sync_phase();
        break;

    case 'incremental':
        incremental_sync();
        break;
}
Monitoring:
# View real-time logs
tail -f /var/log/dealai/elasticsearch-sync.log

# Check sync state
cat /var/www/html/wp-content/themes/products/scripts/elasticsearch-sync-state.json
Product Crawler Manager
File: /scripts/product-crawler-manager.php
Frequency: Every 15 minutes
Purpose: Schedule product rescans
Process:
- Select 30 oldest products (not scanned in 24h)
- Add to crawl queue
- Schedule Scrapyd job
- Track job status
- Update database on completion
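The selection rule in the first step (oldest products first, skipping anything scanned within the last 24 hours) can be illustrated in memory. In production this would most likely be a SQL query; the sketch below only shows the rule itself. The function name and the `last_scanned_at` field (a Unix timestamp, or null if never scanned) are assumptions.

```php
<?php
/**
 * Pick up to $limit products whose last scan is older than $max_age seconds,
 * oldest first. Products that were never scanned (null) sort first.
 */
function select_products_for_rescan_sketch(array $products, int $limit = 30, int $max_age = 86400, ?int $now = null): array
{
    $now = $now ?? time();
    $cutoff = $now - $max_age;

    // Keep only products that are stale or have never been scanned.
    $stale = array_filter($products, function (array $p) use ($cutoff): bool {
        return $p['last_scanned_at'] === null || $p['last_scanned_at'] < $cutoff;
    });

    // Oldest scan time first; null counts as timestamp 0.
    usort($stale, function (array $a, array $b): int {
        return ($a['last_scanned_at'] ?? 0) <=> ($b['last_scanned_at'] ?? 0);
    });

    return array_slice($stale, 0, $limit);
}
```

At 30 products every 15 minutes, one day of runs covers at most 30 × 96 = 2,880 rescans, so the batch size bounds how large a catalog can stay within the 24-hour freshness window.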
Implementation:
function main() {
    echo "[" . date('Y-m-d H:i:s') . "] Starting product crawler manager\n";

    // Get products needing rescan
    $products = get_products_for_rescan(30);

    if (empty($products)) {
        echo "No products need rescanning\n";
        return;
    }

    echo "Found " . count($products) . " products to rescan\n";

    // Add to queue
    $queue_id = add_products_to_queue($products);

    // Schedule spider
    $job_id = schedule_product_spider($queue_id, $products);

    if ($job_id) {
        echo "Scheduled job: $job_id\n";
    } else {
        echo "Failed to schedule spider\n";
    }
}
Screenshot Manager
File: /scripts/product-screenshot-manager.php
Frequency: Daily at 2 AM
Purpose: Capture product page screenshots
Features:
- Automated screenshot capture
- Image optimization
- Thumbnail generation
- Database storage
Implementation:
function capture_screenshots() {
    $products = get_products_for_screenshots(100);

    foreach ($products as $product) {
        try {
            $screenshot_path = capture_product_screenshot($product['product_url']);
            $thumbnail_path = create_thumbnail($screenshot_path);

            store_screenshot_metadata($product['id'], [
                'screenshot_url' => $screenshot_path,
                'thumbnail_url' => $thumbnail_path,
                'captured_at' => date('Y-m-d H:i:s')
            ]);

            echo "Captured screenshot for product {$product['id']}\n";
        } catch (Exception $e) {
            error_log("Screenshot failed for {$product['id']}: " . $e->getMessage());
        }
    }
}
Category Update
File: /scripts/cron/update-core-categories.php
Frequency: Daily at 3 AM
Purpose: Synchronize category data
Tasks:
- Update category hierarchy
- Recalculate product counts
- Refresh category paths
- Remove empty categories
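As a rough illustration of the tasks above, the sketch below rolls product counts up the hierarchy, rebuilds each category's path, and drops categories with no products at or below them. The input shape (`parent_id`, `name`, `direct_count`), the function name, and the ` > ` path separator are assumptions, not the script's actual schema.

```php
<?php
/**
 * $categories: id => ['parent_id' => ?int, 'name' => string, 'direct_count' => int]
 * Returns id => ['path' => string, 'total_count' => int] for non-empty categories.
 */
function rebuild_categories_sketch(array $categories): array
{
    // Add each category's direct product count to itself and all ancestors.
    $totals = array_fill_keys(array_keys($categories), 0);
    foreach ($categories as $id => $cat) {
        $cursor = $id;
        $hops = 0; // Depth cap guards against accidental parent cycles.
        while ($cursor !== null && isset($categories[$cursor]) && $hops++ < 32) {
            $totals[$cursor] += $cat['direct_count'];
            $cursor = $categories[$cursor]['parent_id'];
        }
    }

    $result = [];
    foreach ($categories as $id => $cat) {
        if ($totals[$id] === 0) {
            continue; // No products here or below: treat as empty and drop.
        }
        // Build the path by walking from this category up to the root.
        $names = [];
        $cursor = $id;
        $hops = 0;
        while ($cursor !== null && isset($categories[$cursor]) && $hops++ < 32) {
            array_unshift($names, $categories[$cursor]['name']);
            $cursor = $categories[$cursor]['parent_id'];
        }
        $result[$id] = ['path' => implode(' > ', $names), 'total_count' => $totals[$id]];
    }

    return $result;
}
```

Counting descendants before pruning matters: a parent with no direct products should survive as long as any child still has some.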
Lock Files
Prevent concurrent execution:
class CronLock {
    private $lockFile;
    private $lockHandle;

    public function __construct($script_name) {
        $this->lockFile = sys_get_temp_dir() . "/$script_name.lock";
    }

    public function acquire() {
        // 'c' opens (or creates) the file without truncating it
        $this->lockHandle = fopen($this->lockFile, 'c');
        if ($this->lockHandle === false) {
            return false;
        }

        // Non-blocking exclusive lock: fails immediately if another run holds it
        if (!flock($this->lockHandle, LOCK_EX | LOCK_NB)) {
            fclose($this->lockHandle);
            $this->lockHandle = null;
            return false;
        }

        return true;
    }

    public function release() {
        if ($this->lockHandle) {
            // The lock file is left in place: deleting it here could let
            // two later runs lock two different files at the same time
            flock($this->lockHandle, LOCK_UN);
            fclose($this->lockHandle);
            $this->lockHandle = null;
        }
    }
}

// Usage
$lock = new CronLock('elasticsearch-sync');

if (!$lock->acquire()) {
    exit("Script already running\n");
}

// Your code here

$lock->release();
Error Handling
Logging
function log_cron_message($message, $level = 'INFO') {
    $timestamp = date('Y-m-d H:i:s');
    $log_message = "[$timestamp] [$level] $message\n";

    // Console output
    echo $log_message;

    // File logging (message type 3 appends to the given file)
    error_log($log_message, 3, '/var/log/dealai/cron.log');
}
Notifications
function send_error_notification($script, $error) {
    $subject = "Cron Job Error: $script";
    $message = "Error occurred in $script at " . date('Y-m-d H:i:s') . "\n\n";
    $message .= "Error: $error\n";

    // Deliver $subject and $message through your notification channel (for example, mail())
}
Monitoring
Check Cron Status
# List running cron jobs
ps aux | grep php | grep scripts

# Check last execution times
ls -lt /var/log/dealai/

# View recent logs
tail -n 100 /var/log/dealai/elasticsearch-sync.log
Health Check Script
function check_cron_health() {
    $checks = [
        'elasticsearch_sync' => check_last_sync_time(),
        'crawler_manager' => check_last_crawl_time(),
        'screenshot_manager' => check_last_screenshot_time()
    ];

    foreach ($checks as $job => $status) {
        if (!$status['healthy']) {
            send_alert("Cron job $job is unhealthy: " . $status['message']);
        }
    }
}

function check_last_sync_time() {
    $state = load_sync_state();
    $last_update = $state['last_updated'] ?? 0;
    $elapsed = time() - $last_update;

    return [
        'healthy' => $elapsed < 600, // 10 minutes
        'last_run' => $last_update,
        'elapsed' => $elapsed,
        'message' => $elapsed >= 600 ? "Sync hasn't run in $elapsed seconds" : "OK"
    ];
}
Performance Tuning
Memory Management
// Increase memory limit for large operations
ini_set('memory_limit', '512M');

// Increase execution time
set_time_limit(3600); // 1 hour

// Garbage collection
gc_enable();
Batch Size Optimization
// Adjust based on available resources
$batch_size = 500; // Products per batch

// For low-memory environments
if (get_available_memory() < 1024 * 1024 * 512) { // 512MB
    $batch_size = 100;
}
Log Management
Log Rotation Configuration
File: /etc/logrotate.d/dealai
/var/log/dealai/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data www-data
    sharedscripts
    postrotate
        systemctl reload rsyslog > /dev/null 2>&1 || true
    endscript
}
Log Analysis
# Count errors
grep -c ERROR /var/log/dealai/elasticsearch-sync.log

# Find recent failures
grep -A 5 "Failed" /var/log/dealai/*.log | tail -n 20

# Success rate
total=$(grep -c "Processing batch" /var/log/dealai/elasticsearch-sync.log)
failed=$(grep -c "Batch failed" /var/log/dealai/elasticsearch-sync.log)
success=$((total - failed))
# Guard against division by zero when no batches were logged
if [ "${total:-0}" -gt 0 ]; then
    echo "Success rate: $((success * 100 / total))%"
else
    echo "No batches processed"
fi
Best Practices
Resource Management
- Schedule intensive tasks during off-peak hours
- Limit concurrent operations
- Monitor resource usage
- Implement rate limiting
Error Recovery
- Implement retry logic
- Log all errors with context
- Send notifications for critical failures
- Maintain resumable state
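Retry logic for transient failures (a dropped Elasticsearch connection, a busy Scrapyd instance) is often implemented with exponential backoff. The helper below is one possible shape, not code from the project: the function name, the default attempt count, and the base delay are assumptions.

```php
<?php
/**
 * Call $operation until it succeeds or $max_attempts is reached,
 * doubling the wait after each failure (200 ms, 400 ms, 800 ms, ...).
 * The last exception is rethrown so the caller can log or notify.
 */
function retry_with_backoff_sketch(callable $operation, int $max_attempts = 3, int $base_delay_ms = 200)
{
    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return $operation($attempt);
        } catch (Exception $e) {
            if ($attempt >= $max_attempts) {
                throw $e; // Out of attempts: give up.
            }
            error_log("Attempt $attempt failed: " . $e->getMessage());
            // usleep() takes microseconds; the delay doubles on each retry.
            usleep($base_delay_ms * 1000 * (2 ** ($attempt - 1)));
        }
    }
}
```

A call might wrap a single batch, for example `retry_with_backoff_sketch(fn () => bulk_sync_phase())`, so that one flaky request does not mark the whole batch as failed.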
Monitoring
- Track execution times
- Monitor success/failure rates
- Set up alerts for failures
- Regular log review
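One lightweight way to track execution times and success or failure is to wrap each job in a timing helper that emits a single structured log line. This is a sketch under assumptions: the function name and the record fields are illustrative, and the output goes to stdout on the expectation that cron redirects it into the job's log file.

```php
<?php
/**
 * Run $task, measure its duration, and print one JSON line describing the run.
 * Exceptions are caught and recorded so that a record is always emitted.
 */
function run_timed_sketch(string $job, callable $task): array
{
    $start = microtime(true);
    $ok = true;
    $error = null;

    try {
        $task();
    } catch (Throwable $e) {
        $ok = false;
        $error = $e->getMessage();
    }

    $record = [
        'job' => $job,
        'ok' => $ok,
        'duration_ms' => (int) round((microtime(true) - $start) * 1000),
        'peak_memory_bytes' => memory_get_peak_usage(true),
        'error' => $error,
    ];

    echo json_encode($record) . "\n";
    return $record;
}
```

Because each run produces exactly one line, success rates and slow runs can be pulled out of the log with `grep` rather than by parsing free-form messages.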
Troubleshooting
Common Issues
Cron Not Running:
# Check cron service
sudo systemctl status cron

# Verify crontab
crontab -l

# Check logs
grep CRON /var/log/syslog
Script Fails:
# Test manually
php /var/www/html/wp-content/themes/products/scripts/elasticsearch-auto-sync.php

# Check permissions
ls -l /var/www/html/wp-content/themes/products/scripts/

# Verify PHP CLI
which php
php -v
Next Steps
- Elasticsearch Console - Monitor sync progress
- Job Monitoring - Track scraping jobs
- Troubleshooting - Debug issues