SpiderCloud

The SpiderCloud gem provides a lightweight Ruby interface to the Spider Cloud API for web scraping, crawling, screenshots, and link extraction.

Installation

Add this line to your application's Gemfile:

gem 'spidercloud'

Or install directly:

gem install spidercloud

Quick Start

require 'spider_cloud'

# Configure your API key
SpiderCloud.api_key 'your-api-key'

# Scrape a single page
response = SpiderCloud.scrape( 'https://example.com' )
puts response.result.content

# Crawl a website (limited to 5 pages)
response = SpiderCloud.crawl( 'https://example.com', limit: 5 )
response.result.each { | page | puts page.url }

# Take a screenshot
response = SpiderCloud.screenshot( 'https://example.com' )
response.result.save_to( 'screenshot.png' )

# Extract links
response = SpiderCloud.links( 'https://example.com', limit: 5 )
puts response.result.urls

Configuration

Set your API key globally:

SpiderCloud.api_key 'your-api-key'

Or pass it per-request:

request = SpiderCloud::ScrapeRequest.new( api_key: 'your-api-key' )
response = request.submit( 'https://example.com' )

Endpoints

SpiderCloud supports four main endpoints:

  • Scrape - Extract content from a single URL
  • Crawl - Crawl multiple pages from a starting URL
  • Screenshot - Capture screenshots of web pages
  • Links - Discover and extract links from a website
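
Each endpoint is also reachable through a request class for per-request configuration. Only ScrapeRequest appears elsewhere in this README; the other class names below are assumptions mirroring that naming convention:

# A minimal sketch; CrawlRequest is an assumed name following the
# SpiderCloud::ScrapeRequest convention, and passing options through
# submit is likewise an assumption mirroring SpiderCloud.crawl.
scrape = SpiderCloud::ScrapeRequest.new( api_key: 'your-api-key' )
crawl  = SpiderCloud::CrawlRequest.new( api_key: 'your-api-key' )

response = scrape.submit( 'https://example.com' )
response = crawl.submit( 'https://example.com', limit: 5 )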

Using Options

Each endpoint accepts options, which can be built with a block-based options builder:

options = SpiderCloud::ScrapeOptions.build do
  return_format :markdown
  readability true
  stealth true
  wait_for do
    selector '#content'
  end
end

response = SpiderCloud.scrape( 'https://example.com', options )

Or pass options as a hash:

response = SpiderCloud.scrape( 'https://example.com', {
  return_format: :markdown,
  readability: true
} )

Response Handling

All endpoints return a Faraday response with an attached result object:

response = SpiderCloud.scrape( 'https://example.com' )

# Check if the HTTP request succeeded
response.success?       # => true/false

# Access the parsed result
response.result.success?  # => true/false
response.result.content   # => "# Page Title\n\nContent..."
response.result.url       # => "https://example.com"
response.result.status    # => 200

Error Handling

When a request fails, the result will be an ErrorResult:

response = SpiderCloud.scrape( 'https://example.com' )

unless response.result.success?
  puts response.result.error_type        # => :authentication_error
  puts response.result.error_description # => "The API key is invalid."
end
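
A fuller pattern combines the transport-level check from Response Handling with the ErrorResult accessors. This is a minimal sketch, assuming a failed request always carries an ErrorResult as described above:

response = SpiderCloud.scrape( 'https://example.com' )

if response.success? && response.result.success?
  puts response.result.content
else
  # either the HTTP request or the API call failed; per this README,
  # a failed request yields an ErrorResult with a type and description
  warn "#{ response.result.error_type }: #{ response.result.error_description }"
end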

Content Formats

The return_format option controls the output format:

  • :markdown - Markdown format
  • :commonmark - CommonMark format
  • :raw - Raw HTML (default)
  • :text - Plain text
  • :html2text - HTML converted to text
  • :xml - XML format
  • :bytes - Raw bytes
  • :empty - No content (useful for links-only)
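
For example, to request plain text instead of the default raw HTML, a minimal sketch reusing the hash form shown under Using Options:

response = SpiderCloud.scrape( 'https://example.com', { return_format: :text } )
puts response.result.content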

Proxy Support

Spider Cloud supports multiple proxy types:

options = SpiderCloud::ScrapeOptions.build do
  proxy :residential
  proxy_enabled true
  country_code 'US'
end

Proxy types: :residential, :mobile, :isp

Wait Conditions

Wait for specific conditions before extracting content:

options = SpiderCloud::ScrapeOptions.build do
  wait_for do
    # Wait for a CSS selector
    selector '#loaded'

    # Or wait for network idle
    idle_network do
      timeout do
        seconds 5
        nanoseconds 0
      end
    end

    # Or wait for a delay
    delay do
      timeout do
        seconds 2
        nanoseconds 0
      end
    end
  end
end

AI/LLM Integration

Configure GPT to process scraped content:

options = SpiderCloud::ScrapeOptions.build do
  gpt_config do
    prompt 'Summarize this page in 3 sentences'
    model 'gpt-4'
    max_tokens 500
  end
end

Browser Configuration

Control browser behavior:

options = SpiderCloud::ScrapeOptions.build do
  stealth true
  fingerprint true
  block_ads true
  block_analytics true
  viewport do
    width 1920
    height 1080
  end
  device :desktop  # :mobile, :tablet, :desktop
end

Automation Scripts

Execute actions before scraping:

options = SpiderCloud::ScrapeOptions.build do
  automation_scripts( {
    '/login' => [
      { 'Fill' => { 'selector' => '#email', 'value' => 'user@example.com' } },
      { 'Fill' => { 'selector' => '#password', 'value' => 'secret' } },
      { 'Click' => 'button[type=submit]' },
      { 'WaitForNavigation' => true }
    ]
  } )
end

License

The gem is available under the MIT License. See LICENSE for details.
