Skip to content

Automation & Cron Jobs

DealAI.lt relies on automated scheduled tasks (cron jobs) to keep data fresh, maintain search indexes, and manage scraping operations without manual intervention.

# Elasticsearch Auto-Sync (every 5 minutes)
*/5 * * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/elasticsearch-auto-sync.php >> /var/log/dealai/elasticsearch-sync.log 2>&1
# Product Crawler Manager (every 15 minutes)
*/15 * * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/product-crawler-manager.php >> /var/log/dealai/crawler-manager.log 2>&1
# Screenshot Manager (daily at 2 AM)
0 2 * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/product-screenshot-manager.php >> /var/log/dealai/screenshot-manager.log 2>&1
# Category Update (daily at 3 AM)
0 3 * * * /usr/bin/php /var/www/html/wp-content/themes/products/scripts/cron/update-core-categories.php >> /var/log/dealai/category-update.log 2>&1
# Database Maintenance (weekly, Sunday 4 AM)
0 4 * * 0 /usr/bin/psql -h 162.55.174.116 -U dealai_user -d dealai_products -c "VACUUM ANALYZE;" >> /var/log/dealai/db-maintenance.log 2>&1
# Log Rotation (daily at 5 AM)
0 5 * * * /usr/sbin/logrotate /etc/logrotate.d/dealai >> /var/log/dealai/logrotate.log 2>&1

File: /scripts/elasticsearch-auto-sync.php
Frequency: Every 5 minutes
Purpose: Maintain search index freshness

Features:

  • Three-phase sync (Initialize → Bulk → Incremental)
  • State persistence for resumable operations
  • Batch processing (500 products/batch)
  • Failed batch tracking
  • Memory and time limit management

Execution Flow:

// Check if already running
if (is_sync_running()) {
exit("Sync already in progress\n");
}
// Load state
$state = load_sync_state();
// Execute appropriate phase
switch ($state['phase']) {
case 'initialize':
initialize_sync();
break;
case 'bulk':
bulk_sync_phase();
break;
case 'incremental':
incremental_sync();
break;
}

Monitoring:

Terminal window
# View real-time logs
tail -f /var/log/dealai/elasticsearch-sync.log
# Check sync state
cat /var/www/html/wp-content/themes/products/scripts/elasticsearch-sync-state.json

File: /scripts/product-crawler-manager.php
Frequency: Every 15 minutes
Purpose: Schedule product rescans

Process:

  1. Select 30 oldest products (not scanned in 24h)
  2. Add to crawl queue
  3. Schedule Scrapyd job
  4. Track job status
  5. Update database on completion

Implementation:

function main() {
echo "[" . date('Y-m-d H:i:s') . "] Starting product crawler manager\n";
// Get products needing rescan
$products = get_products_for_rescan(30);
if (empty($products)) {
echo "No products need rescanning\n";
return;
}
echo "Found " . count($products) . " products to rescan\n";
// Add to queue
$queue_id = add_products_to_queue($products);
// Schedule spider
$job_id = schedule_product_spider($queue_id, $products);
if ($job_id) {
echo "Scheduled job: $job_id\n";
} else {
echo "Failed to schedule spider\n";
}
}

File: /scripts/product-screenshot-manager.php
Frequency: Daily at 2 AM
Purpose: Capture product page screenshots

Features:

  • Automated screenshot capture
  • Image optimization
  • Thumbnail generation
  • Database storage

Implementation:

function capture_screenshots() {
$products = get_products_for_screenshots(100);
foreach ($products as $product) {
try {
$screenshot_path = capture_product_screenshot($product['product_url']);
$thumbnail_path = create_thumbnail($screenshot_path);
store_screenshot_metadata($product['id'], [
'screenshot_url' => $screenshot_path,
'thumbnail_url' => $thumbnail_path,
'captured_at' => date('Y-m-d H:i:s')
]);
echo "Captured screenshot for product {$product['id']}\n";
} catch (Exception $e) {
error_log("Screenshot failed for {$product['id']}: " . $e->getMessage());
}
}
}

File: /scripts/cron/update-core-categories.php
Frequency: Daily at 3 AM
Purpose: Synchronize category data

Tasks:

  • Update category hierarchy
  • Recalculate product counts
  • Refresh category paths
  • Remove empty categories

Prevent concurrent execution:

class CronLock {
private $lockFile;
private $lockHandle;
public function __construct($script_name) {
$this->lockFile = sys_get_temp_dir() . "/$script_name.lock";
}
public function acquire() {
$this->lockHandle = fopen($this->lockFile, 'w');
if (!flock($this->lockHandle, LOCK_EX | LOCK_NB)) {
return false;
}
return true;
}
public function release() {
if ($this->lockHandle) {
flock($this->lockHandle, LOCK_UN);
fclose($this->lockHandle);
unlink($this->lockFile);
}
}
}
// Usage
$lock = new CronLock('elasticsearch-sync');
if (!$lock->acquire()) {
exit("Script already running\n");
}
// Your code here
$lock->release();
function log_cron_message($message, $level = 'INFO') {
$timestamp = date('Y-m-d H:i:s');
$log_message = "[$timestamp] [$level] $message\n";
// Console output
echo $log_message;
// File logging
error_log($log_message, 3, '/var/log/dealai/cron.log');
}
function send_error_notification($script, $error) {
$subject = "Cron Job Error: $script";
$message = "Error occurred in $script at " . date('Y-m-d H:i:s') . "\n\n";
$message .= "Error: $error\n";
wp_mail('[email protected]', $subject, $message);
}
Terminal window
# List running cron jobs
ps aux | grep php | grep scripts
# Check last execution times
ls -lt /var/log/dealai/
# View recent logs
tail -100 /var/log/dealai/elasticsearch-sync.log
function check_cron_health() {
$checks = [
'elasticsearch_sync' => check_last_sync_time(),
'crawler_manager' => check_last_crawl_time(),
'screenshot_manager' => check_last_screenshot_time()
];
foreach ($checks as $job => $status) {
if (!$status['healthy']) {
send_alert("Cron job $job is unhealthy: " . $status['message']);
}
}
}
function check_last_sync_time() {
$state = load_sync_state();
$last_update = $state['last_updated'] ?? 0;
$elapsed = time() - $last_update;
return [
'healthy' => $elapsed < 600, // 10 minutes
'last_run' => $last_update,
'elapsed' => $elapsed,
'message' => $elapsed >= 600 ? "Sync hasn't run in $elapsed seconds" : "OK"
];
}
// Increase memory limit for large operations
ini_set('memory_limit', '512M');
// Increase execution time
set_time_limit(3600); // 1 hour
// Garbage collection
gc_enable();
// Adjust based on available resources
$batch_size = 500; // Products per batch
// For low-memory environments
if (get_available_memory() < 1024 * 1024 * 512) { // 512MB
$batch_size = 100;
}

File: /etc/logrotate.d/dealai

/var/log/dealai/*.log {
daily
missingok
rotate 14
compress
delaycompress
notifempty
create 0640 www-data www-data
sharedscripts
postrotate
systemctl reload rsyslog > /dev/null 2>&1 || true
endscript
}
Terminal window
# Count errors
grep ERROR /var/log/dealai/elasticsearch-sync.log | wc -l
# Find recent failures
grep -A 5 "Failed" /var/log/dealai/*.log | tail -20
# Success rate
total=$(grep "Processing batch" /var/log/dealai/elasticsearch-sync.log | wc -l)
failed=$(grep "Batch failed" /var/log/dealai/elasticsearch-sync.log | wc -l)
success=$((total - failed))
echo "Success rate: $(($success * 100 / $total))%"
  1. Schedule intensive tasks during off-peak hours
  2. Limit concurrent operations
  3. Monitor resource usage
  4. Implement rate limiting
  1. Implement retry logic
  2. Log all errors with context
  3. Send notifications for critical failures
  4. Maintain resumable state
  1. Track execution times
  2. Monitor success/failure rates
  3. Set up alerts for failures
  4. Regular log review

Cron Not Running:

Terminal window
# Check cron service
sudo systemctl status cron
# Verify crontab
crontab -l
# Check logs
grep CRON /var/log/syslog

Script Fails:

Terminal window
# Test manually
php /var/www/html/wp-content/themes/products/scripts/elasticsearch-auto-sync.php
# Check permissions
ls -l /var/www/html/wp-content/themes/products/scripts/
# Verify PHP CLI
which php
php -v