The SpiderCloud gem provides a lightweight Ruby interface to the Spider Cloud API for web scraping, crawling, screenshots, and link extraction.
Add this line to your application's Gemfile:

```ruby
gem 'spidercloud'
```

Or install it directly:

```sh
gem install spidercloud
```

Then require the gem and start making requests:

```ruby
require 'spider_cloud'

# Configure your API key
SpiderCloud.api_key 'your-api-key'
# Scrape a single page
response = SpiderCloud.scrape( 'https://example.com' )
puts response.result.content
# Crawl a website (limited to 5 pages)
response = SpiderCloud.crawl( 'https://example.com', limit: 5 )
response.result.each { | page | puts page.url }
# Take a screenshot
response = SpiderCloud.screenshot( 'https://example.com' )
response.result.save_to( 'screenshot.png' )
# Extract links
response = SpiderCloud.links( 'https://example.com', limit: 5 )
puts response.result.urls
```

Set your API key globally:

```ruby
SpiderCloud.api_key 'your-api-key'
```

Or pass it per-request:

```ruby
request = SpiderCloud::ScrapeRequest.new( api_key: 'your-api-key' )
response = request.submit( 'https://example.com' )
```

SpiderCloud supports four main endpoints:
- Scrape - Extract content from a single URL
- Crawl - Crawl multiple pages from a starting URL
- Screenshot - Capture screenshots of web pages
- Links - Discover and extract links from a website
Each endpoint accepts options that can be built using the options builder:

```ruby
options = SpiderCloud::ScrapeOptions.build do
  return_format :markdown
  readability true
  stealth true

  wait_for do
    selector '#content'
  end
end

response = SpiderCloud.scrape( 'https://example.com', options )
```

Or pass options as a hash:
```ruby
response = SpiderCloud.scrape( 'https://example.com', {
  return_format: :markdown,
  readability: true
} )
```

All endpoints return a Faraday response with an attached result object:

```ruby
response = SpiderCloud.scrape( 'https://example.com' )
# Check if the HTTP request succeeded
response.success? # => true/false
# Access the parsed result
response.result.success? # => true/false
response.result.content # => "# Page Title\n\nContent..."
response.result.url # => "https://example.com"
response.result.status # => 200
```
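For example, a successful scrape can be persisted straight to disk. A quick sketch using only the accessors above (File.write is plain Ruby; the return_format option is covered below):

```ruby
response = SpiderCloud.scrape( 'https://example.com', { return_format: :markdown } )
File.write( 'page.md', response.result.content ) if response.result.success?
```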
When a request fails, the result will be an ErrorResult:

```ruby
response = SpiderCloud.scrape( 'https://example.com' )

unless response.result.success?
  puts response.result.error_type        # => :authentication_error
  puts response.result.error_description # => "The API key is invalid."
end
```
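If you prefer exceptions over checking each result, a small guard is easy to layer on top of these accessors. A minimal sketch (scrape! is our own helper, not part of the gem):

```ruby
def scrape!( url, options = {} )
  response = SpiderCloud.scrape( url, options )
  result = response.result
  # Surface the error details the gem exposes on ErrorResult
  raise "#{result.error_type}: #{result.error_description}" unless result.success?
  result
end

puts scrape!( 'https://example.com' ).content
```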
The return_format option controls the output format:

- :markdown - Markdown format
- :commonmark - CommonMark format
- :raw - Raw HTML (default)
- :text - Plain text
- :html2text - HTML converted to text
- :xml - XML format
- :bytes - Raw bytes
- :empty - No content (useful for links-only)
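For instance, :empty pairs naturally with the links endpoint when you only need URLs. A sketch assuming the links endpoint accepts the same option keys as scrape:

```ruby
# Skip content extraction entirely; we only want the discovered URLs
response = SpiderCloud.links( 'https://example.com', limit: 10, return_format: :empty )
puts response.result.urls
```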
Spider Cloud supports multiple proxy types:

```ruby
options = SpiderCloud::ScrapeOptions.build do
  proxy :residential
  proxy_enabled true
  country_code 'US'
end
```

Proxy types: :residential, :mobile, and :isp.
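As with return_format, the proxy settings should also work in hash form; a sketch assuming the hash keys mirror the builder methods:

```ruby
response = SpiderCloud.scrape( 'https://example.com', {
  proxy: :mobile,
  proxy_enabled: true,
  country_code: 'US'
} )
```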
Wait for specific conditions before extracting content:

```ruby
options = SpiderCloud::ScrapeOptions.build do
  wait_for do
    # Wait for a CSS selector
    selector '#loaded'

    # Or wait for network idle
    idle_network do
      timeout do
        seconds 5
        nanoseconds 0
      end
    end

    # Or wait for a delay
    delay do
      timeout do
        seconds 2
        nanoseconds 0
      end
    end
  end
end
```
Configure GPT to process scraped content:

```ruby
options = SpiderCloud::ScrapeOptions.build do
  gpt_config do
    prompt 'Summarize this page in 3 sentences'
    model 'gpt-4'
    max_tokens 500
  end
end
```
Control browser behavior:

```ruby
options = SpiderCloud::ScrapeOptions.build do
  stealth true
  fingerprint true
  block_ads true
  block_analytics true

  viewport do
    width 1920
    height 1080
  end

  device :desktop # :mobile, :tablet, :desktop
end
```
Execute actions before scraping:

```ruby
options = SpiderCloud::ScrapeOptions.build do
  automation_scripts( {
    '/login' => [
      { 'Fill' => { 'selector' => '#email', 'value' => 'user@example.com' } },
      { 'Fill' => { 'selector' => '#password', 'value' => 'secret' } },
      { 'Click' => 'button[type=submit]' },
      { 'WaitForNavigation' => true }
    ]
  } )
end
```

The gem is available under the MIT License. See LICENSE for details.