SpiderCloud

The SpiderCloud gem provides a lightweight Ruby interface to the Spider Cloud API for web scraping, crawling, screenshots, and link extraction.

Installation

Add this line to your application's Gemfile:

gem 'spidercloud'

Or install directly:

gem install spidercloud

Quick Start

require 'spider_cloud'

# Configure your API key
SpiderCloud.api_key 'your-api-key'

# Scrape a single page
response = SpiderCloud.scrape( 'https://example.com' )
puts response.result.content

# Crawl a website (limited to 5 pages)
response = SpiderCloud.crawl( 'https://example.com', limit: 5 )
response.result.each { | page | puts page.url }

# Take a screenshot
response = SpiderCloud.screenshot( 'https://example.com' )
response.result.save_to( 'screenshot.png' )

# Extract links
response = SpiderCloud.links( 'https://example.com', limit: 5 )
puts response.result.urls

Configuration

Set your API key globally:

SpiderCloud.api_key 'your-api-key'

Or pass it per-request:

request = SpiderCloud::ScrapeRequest.new( api_key: 'your-api-key' )
response = request.submit( 'https://example.com' )

Endpoints

SpiderCloud supports four main endpoints:

  • Scrape - Extract content from a single URL
  • Crawl - Crawl multiple pages from a starting URL
  • Screenshot - Capture screenshots of web pages
  • Links - Discover and extract links from a website
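
Each endpoint is also reachable through a request class for per-request configuration. Only ScrapeRequest appears elsewhere in this README; the other class names below are assumptions mirroring that naming convention:

# A minimal sketch; CrawlRequest is an assumed name following the
# SpiderCloud::ScrapeRequest convention, and passing options through
# submit is likewise an assumption mirroring SpiderCloud.crawl.
scrape = SpiderCloud::ScrapeRequest.new( api_key: 'your-api-key' )
crawl  = SpiderCloud::CrawlRequest.new( api_key: 'your-api-key' )

response = scrape.submit( 'https://example.com' )
response = crawl.submit( 'https://example.com', limit: 5 )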

Using Options

Each endpoint accepts options, which can be built with a block-based options builder:

options = SpiderCloud::ScrapeOptions.build do
  return_format :markdown
  readability true
  stealth true
  wait_for do
    selector '#content'
  end
end

response = SpiderCloud.scrape( 'https://example.com', options )

Or pass options as a hash:

response = SpiderCloud.scrape( 'https://example.com', {
  return_format: :markdown,
  readability: true
} )

Response Handling

All endpoints return a Faraday response with an attached result object:

response = SpiderCloud.scrape( 'https://example.com' )

# Check if the HTTP request succeeded
response.success?       # => true/false

# Access the parsed result
response.result.success?  # => true/false
response.result.content   # => "# Page Title\n\nContent..."
response.result.url       # => "https://example.com"
response.result.status    # => 200

Error Handling

When a request fails, the result will be an ErrorResult:

response = SpiderCloud.scrape( 'https://example.com' )

unless response.result.success?
  puts response.result.error_type        # => :authentication_error
  puts response.result.error_description # => "The API key is invalid."
end
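
A fuller pattern combines the transport-level check from Response Handling with the ErrorResult accessors. This is a minimal sketch, assuming a failed request always carries an ErrorResult as described above:

response = SpiderCloud.scrape( 'https://example.com' )

if response.success? && response.result.success?
  puts response.result.content
else
  # either the HTTP request or the API call failed; per this README,
  # a failed request yields an ErrorResult with a type and description
  warn "#{ response.result.error_type }: #{ response.result.error_description }"
end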

Content Formats

The return_format option controls the output format:

  • :markdown - Markdown format
  • :commonmark - CommonMark format
  • :raw - Raw HTML (default)
  • :text - Plain text
  • :html2text - HTML converted to text
  • :xml - XML format
  • :bytes - Raw bytes
  • :empty - No content (useful for links-only)
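
For example, to request plain text instead of the default raw HTML, a minimal sketch reusing the hash form shown under Using Options:

response = SpiderCloud.scrape( 'https://example.com', { return_format: :text } )
puts response.result.content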

Proxy Support

Spider Cloud supports multiple proxy types:

options = SpiderCloud::ScrapeOptions.build do
  proxy :residential
  proxy_enabled true
  country_code 'US'
end

Proxy types: :residential, :mobile, :isp

Wait Conditions

Wait for specific conditions before extracting content:

options = SpiderCloud::ScrapeOptions.build do
  wait_for do
    # Wait for a CSS selector
    selector '#loaded'

    # Or wait for network idle
    idle_network do
      timeout do
        seconds 5
        nanoseconds 0
      end
    end

    # Or wait for a delay
    delay do
      timeout do
        seconds 2
        nanoseconds 0
      end
    end
  end
end

AI/LLM Integration

Configure GPT to process scraped content:

options = SpiderCloud::ScrapeOptions.build do
  gpt_config do
    prompt 'Summarize this page in 3 sentences'
    model 'gpt-4'
    max_tokens 500
  end
end

Browser Configuration

Control browser behavior:

options = SpiderCloud::ScrapeOptions.build do
  stealth true
  fingerprint true
  block_ads true
  block_analytics true
  viewport do
    width 1920
    height 1080
  end
  device :desktop  # :mobile, :tablet, :desktop
end

Automation Scripts

Execute actions before scraping:

options = SpiderCloud::ScrapeOptions.build do
  automation_scripts( {
    '/login' => [
      { 'Fill' => { 'selector' => '#email', 'value' => 'user@example.com' } },
      { 'Fill' => { 'selector' => '#password', 'value' => 'secret' } },
      { 'Click' => 'button[type=submit]' },
      { 'WaitForNavigation' => true }
    ]
  } )
end

License

The gem is available under the MIT License. See LICENSE for details.
