{"kind":"Skill","metadata":{"namespace":"community","name":"apify-automation","version":"0.1.0"},"spec":{"description":"Automate web scraping and data extraction with Apify -- run Actors, manage datasets, create reusable tasks, and retrieve crawl results through the Composio Apify integration.","files":{"SKILL.md":"---\nname: Apify Automation\ndescription: \"Automate web scraping and data extraction with Apify -- run Actors, manage datasets, create reusable tasks, and retrieve crawl results through the Composio Apify integration.\"\nrequires:\n  mcp:\n    - rube\n---\n\n# Apify Automation\n\nRun **Apify** web scraping Actors and manage datasets directly from Claude Code. Execute crawlers synchronously or asynchronously, retrieve structured data, create reusable tasks, and inspect run logs without leaving your terminal.\n\n**Toolkit docs:** [composio.dev/toolkits/apify](https://composio.dev/toolkits/apify)\n\n---\n\n## Setup\n\n1. Add the Composio MCP server to your configuration:\n   ```\n   https://rube.app/mcp\n   ```\n2. Connect your Apify account when prompted. The agent will provide an authentication link.\n3. Browse available Actors at [apify.com/store](https://apify.com/store). Each Actor has its own unique input schema -- always check the Actor's documentation before running.\n\n---\n\n## Core Workflows\n\n### 1. Run an Actor Synchronously and Get Results\n\nExecute an Actor and immediately retrieve its dataset items in a single call. Best for quick scraping jobs.\n\n**Tool:** `APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS`\n\nKey parameters:\n- `actorId` (required) -- Actor ID in format `username/actor-name` (e.g., `compass/crawler-google-places`)\n- `input` -- JSON input object matching the Actor's schema. Each Actor has unique field names -- check [apify.com/store](https://apify.com/store) for the exact schema.\n- `limit` -- max items to return\n- `offset` -- skip items for pagination\n- `format` -- `json` (default), `csv`, `jsonl`, `html`, `xlsx`, `xml`\n- `timeout` -- run timeout in seconds\n- `waitForFinish` -- max wait time (0-300 seconds)\n- `fields` -- comma-separated list of fields to include\n- `omit` -- comma-separated list of fields to exclude\n\nExample prompt: *\"Run the Google Places scraper for 'restaurants in New York' and return the first 50 results\"*\n\n---\n\n### 2. Run an Actor Asynchronously\n\nTrigger an Actor run without waiting for completion. Use for long-running scraping jobs.\n\n**Tool:** `APIFY_RUN_ACTOR`\n\nKey parameters:\n- `actorId` (required) -- Actor slug or ID\n- `body` -- JSON input object for the Actor\n- `memory` -- memory limit in MB (must be power of 2, minimum 128)\n- `timeout` -- run timeout in seconds\n- `maxItems` -- cap on returned items\n- `build` -- specific build tag (e.g., `latest`, `beta`)\n\nFollow up with `APIFY_GET_DATASET_ITEMS` to retrieve results using the run's `datasetId`.\n\nExample prompt: *\"Start the web scraper Actor for example.com asynchronously with 1024MB memory\"*\n\n---\n\n### 3. Retrieve Dataset Items\n\nFetch data from a specific dataset with pagination, field selection, and filtering.\n\n**Tool:** `APIFY_GET_DATASET_ITEMS`\n\nKey parameters:\n- `datasetId` (required) -- dataset identifier\n- `limit` (default/max 1000) -- items per page\n- `offset` (default 0) -- pagination offset\n- `format` -- `json` (recommended), `csv`, `xlsx`\n- `fields` -- include only specific fields\n- `omit` -- exclude specific fields\n- `clean` -- remove Apify-specific metadata\n- `desc` -- reverse order (newest first)\n\nExample prompt: *\"Get the first 500 items from dataset myDatasetId in JSON format\"*\n\n---\n\n### 4. Inspect Actor Details\n\nView Actor metadata, input schema, and configuration before running it.\n\n**Tool:** `APIFY_GET_ACTOR`\n\nKey parameters:\n- `actorId` (required) -- Actor ID in format `username/actor-name` or hex ID\n\nExample prompt: *\"Show me the details and input schema for the apify/web-scraper Actor\"*\n\n---\n\n### 5. Create Reusable Tasks\n\nConfigure reusable Actor tasks with preset inputs for recurring scraping jobs.\n\n**Tool:** `APIFY_CREATE_TASK`\n\nConfigure a task once, then trigger it repeatedly with consistent input parameters. Useful for scheduled or recurring data collection workflows.\n\nExample prompt: *\"Create an Apify task for the Google Search scraper with default query 'AI startups' and US location\"*\n\n---\n\n### 6. Manage Runs and Datasets\n\nList Actor runs, browse datasets, and inspect run details for monitoring and debugging.\n\n**Tools:** `APIFY_GET_LIST_OF_RUNS`, `APIFY_DATASETS_GET`, `APIFY_DATASET_GET`, `APIFY_GET_LOG`\n\nFor listing runs:\n- Filter by Actor and optionally by status\n- Get `datasetId` from run details for data retrieval\n\nFor dataset management:\n- `APIFY_DATASETS_GET` -- list all your datasets with pagination\n- `APIFY_DATASET_GET` -- get metadata for a specific dataset\n\nFor debugging:\n- `APIFY_GET_LOG` -- retrieve execution logs for a run or build\n\nExample prompt: *\"List the last 10 runs for the web scraper Actor and show logs for the most recent one\"*\n\n---\n\n## Known Pitfalls\n\n- **Actor input schemas vary wildly:** Every Actor has its own unique input fields. Generic field names like `queries` or `search_terms` will be rejected. Always check the Actor's page on [apify.com/store](https://apify.com/store) for exact field names (e.g., `searchStringsArray` for Google Maps, `startUrls` for web scrapers).\n- **URL format requirements:** Always include the full protocol (`https://` or `http://`) in URLs. Many Actors require URLs as objects with a `url` property: `{\"startUrls\": [{\"url\": \"https://example.com\"}]}`.\n- **Dataset pagination cap:** `APIFY_GET_DATASET_ITEMS` has a max `limit` of 1000 per call. For large datasets, loop with `offset` to collect all items.\n- **Enum values are lowercase:** Most Actors expect lowercase enum values (e.g., `relevance` not `RELEVANCE`, `all` not `ALL`).\n- **Sync timeout at 5 minutes:** `APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS` has a maximum `waitForFinish` of 300 seconds. For longer runs, use `APIFY_RUN_ACTOR` (async) and poll with `APIFY_GET_DATASET_ITEMS`.\n- **Data volume costs:** Large datasets can be expensive to fetch. Prefer moderate limits and incremental processing to avoid timeouts or memory pressure.\n- **JSON format recommended:** While CSV/XLSX formats are available, JSON is the most reliable for automated processing. Avoid CSV/XLSX for downstream automation.\n\n---\n\n## Quick Reference\n\n| Tool Slug | Description |\n|---|---|\n| `APIFY_RUN_ACTOR_SYNC_GET_DATASET_ITEMS` | Run Actor synchronously and get results immediately |\n| `APIFY_RUN_ACTOR` | Run Actor asynchronously (trigger and return) |\n| `APIFY_RUN_ACTOR_SYNC` | Run Actor synchronously, return output record |\n| `APIFY_GET_ACTOR` | Get Actor metadata and input schema |\n| `APIFY_GET_DATASET_ITEMS` | Retrieve items from a dataset (paginated) |\n| `APIFY_DATASET_GET` | Get dataset metadata (item count, etc.) |\n| `APIFY_DATASETS_GET` | List all user datasets |\n| `APIFY_CREATE_TASK` | Create a reusable Actor task |\n| `APIFY_GET_TASK_INPUT` | Inspect a task's stored input |\n| `APIFY_GET_LIST_OF_RUNS` | List runs for an Actor |\n| `APIFY_GET_LOG` | Get execution logs for a run |\n\n---\n\n*Powered by [Composio](https://composio.dev)*\n"},"import":{"commit_sha":"f2b5e29bc315f04c8e09591ba275f4c4f7d4b8fe","imported_at":"2026-05-18T20:07:47Z","license_text":"","owner":"ComposioHQ","repo":"ComposioHQ/awesome-claude-skills","source_url":"https://github.com/ComposioHQ/awesome-claude-skills/tree/f2b5e29bc315f04c8e09591ba275f4c4f7d4b8fe/composio-skills/apify-automation"}},"content_hash":[188,67,123,184,88,158,241,232,223,14,252,214,57,244,37,117,111,7,128,92,235,168,173,233,176,189,83,140,194,169,246,78],"trust_level":"unsigned","yanked":false}
