Data Extraction
rtrvr.ai excels at extracting structured data from websites, whether from a single page, a sequence of pages, or multiple tabs, and exporting it to Google Sheets. This page explains how to use the 'Export', 'Explore', and 'Extract' features.
Core Concepts
Data extraction in rtrvr.ai is primarily handled through the 'Export', 'Explore', and 'Extract' functions within the Web Agent. Here's how they differ:
- →Export: Performs actions and extracts data from the current active tab. The output will have multiple rows per tab, suitable for capturing all relevant information from a single page.
- →Explore: Designed for navigating and extracting data from paginated listings (e.g., Amazon product search results). You have two options:
  - →Sequential Extraction: Extract data from each page in the sequence, resulting in multiple rows per page.
  - →Linked Page Extraction: Open each link from the listing as a new tab and extract data, producing one row per linked page (tab).
- →Extract: Focuses on extracting data from multiple tabs that you select in the 'Across Tabs' panel. The Web Agent will perform actions on each selected tab and extract data, resulting in one row per tab.
Using the 'Export' Function
The 'Export' function is used to perform actions and extract data from the currently active tab. It's ideal when you need to capture all relevant information from a single webpage, potentially involving interactions like button clicks or form submissions.
Examples
- →"Click the 'Add to Cart' button and then extract the product name, price, and quantity": Performs an action (clicking a button) and then extracts data from the modified page. The output would have multiple rows representing the extracted information.
- →"Fill out the contact form with the provided details and extract the confirmation message": Fills a form and extracts the resulting message. The output would contain rows with the extracted message.
Using the 'Explore' Function
The 'Explore' function is your go-to for handling paginated listings, such as search results, product listings, or any website where content is spread across multiple pages.
Explore Modes
1. Sequential Extraction
In this mode, 'Explore' will navigate through each page of the listing sequentially and extract data. The output will contain multiple rows for each page, capturing all relevant information from each page in the sequence.
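The sequential flow described above can be sketched in a few lines of Python. This is only an illustration of the row shape, not rtrvr.ai's implementation: the `PAGES` dictionary, its keys, and the `sequential_extract` helper are hypothetical stand-ins for a real paginated site.

```python
from typing import Optional

# Hypothetical in-memory stand-in for a paginated listing: each page holds
# several items plus the key of the next page (None on the last page).
PAGES = {
    "page-1": {"items": [("Widget A", "$10"), ("Widget B", "$12")], "next": "page-2"},
    "page-2": {"items": [("Widget C", "$15")], "next": None},
}


def sequential_extract(start: str) -> list[dict]:
    """Walk the listing page by page, emitting one row per item found."""
    rows = []
    current: Optional[str] = start
    while current is not None:
        page = PAGES[current]
        for name, price in page["items"]:
            rows.append({"page": current, "product name": name, "price": price})
        current = page["next"]  # stands in for following the "Next Page" link
    return rows


rows = sequential_extract("page-1")
for row in rows:
    print(row)
```

Each page contributes as many rows as it has items, which is why this mode yields multiple rows per page.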
Examples
- →"Extract all product names and prices from each page of the search results": Navigates through each page of the search results and extracts product information. The output will contain multiple rows for each page.
2. Linked Page Extraction
In this mode, 'Explore' will identify links on each page of the listing (e.g., links to individual product pages), open each link as a new tab, and then extract data from that newly opened tab. The result is one row per linked page (tab).
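As a rough sketch of how this differs from sequential extraction, the Python below models the listing as lists of links and each linked page as a dictionary. The `LISTING` and `DETAIL` data and the `linked_page_extract` helper are hypothetical and exist only to show the one-row-per-linked-page shape; they are not part of rtrvr.ai.

```python
# Hypothetical listing: each inner list holds the links found on one listing page.
LISTING = [
    ["/product/1", "/product/2"],  # listing page 1
    ["/product/3"],                # listing page 2
]

# Hypothetical linked pages, keyed by link.
DETAIL = {
    "/product/1": {"name": "Widget A", "price": "$10", "description": "Small widget"},
    "/product/2": {"name": "Widget B", "price": "$12", "description": "Medium widget"},
    "/product/3": {"name": "Widget C", "price": "$15", "description": "Large widget"},
}


def linked_page_extract(listing: list[list[str]], detail: dict[str, dict]) -> list[dict]:
    """Visit every link on every listing page and emit one row per linked page."""
    rows = []
    for links in listing:
        for link in links:
            page = detail[link]  # stands in for opening the link in a new tab
            rows.append({"url": link, **page})
    return rows


rows = linked_page_extract(LISTING, DETAIL)
for row in rows:
    print(row)
```

Three links produce three rows, regardless of how many listing pages they were spread across.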
Examples
- →"For each product, extract the name, price, and description": Crawls through a product listing, opens each product link in a new tab, and extracts information from each product page. The output will have one row per product tab.
- →"For every PDF file linked on this page, extract the paper title and authors": Opens each linked PDF in a new tab and extracts data. The output contains one row per PDF tab.
Using the 'Extract' Function
The 'Extract' function lets you work with multiple tabs simultaneously. You select the tabs you want to process in the 'Across Tabs' panel, and the Web Agent will perform actions and extract data from each of them. The result is one row per tab.
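To make the one-row-per-tab shape concrete, here is a minimal Python sketch. The `Tab` class and `extract_across_tabs` helper are hypothetical; they only model the idea that the selected tabs, and nothing else, each yield a single row.

```python
from dataclasses import dataclass


@dataclass
class Tab:
    """Hypothetical stand-in for a browser tab."""
    title: str
    url: str


def extract_across_tabs(selected_tabs: list[Tab]) -> list[dict]:
    """Return exactly one row per selected tab."""
    return [{"title": tab.title, "url": tab.url} for tab in selected_tabs]


open_tabs = [
    Tab("Docs", "https://example.com/docs"),
    Tab("Pricing", "https://example.com/pricing"),
    Tab("Blog", "https://example.com/blog"),
]
# Only the tabs chosen for processing are included; the rest are ignored.
selected = [open_tabs[0], open_tabs[2]]
rows = extract_across_tabs(selected)
for row in rows:
    print(row)
```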
Examples
- →"Extract the title and URL of each open tab": Extracts data from all tabs you've selected. The output has one row per tab with the title and URL.
- →"For each tab, click the 'Download' button and extract the filename": Performs an action on each selected tab and extracts data. The output would have one row per tab with the extracted filename.
Guiding Extraction with Recordings
You can provide recordings to guide the AI Web Agent in performing specific actions before data extraction. This is particularly useful for complex interactions or when the agent needs to follow a specific sequence of steps.
How to Use Recordings
- →Record Your Actions: Use the recording feature to capture the steps you want the agent to perform. This could involve clicking buttons, filling forms, navigating menus, etc.
- →Supply Recording for Extraction: When using 'Export' or 'Extract', select a recording under Advanced Options along with your extraction instructions. The agent first carries out the actions described in the prompt, guided by the recorded actions, and then proceeds with the data extraction.
- →Special Case: 'Explore' - Next Page Button: For the 'Explore' function, the recording is specifically used to guide the agent in finding the "Next Page" button or link. This ensures accurate navigation through paginated content.
Example: Using a Recording with 'Explore'
- →Scenario: You want to extract product data from an e-commerce site with a uniquely designed pagination system.
- →Recording: Create a recording that demonstrates how to find and click the "Next Page" button on the site.
- →Explore Command: Use the 'Explore' function with the selected recording. For example:
  "Explore (Sequential): Extract product name and price on main page"
  along with the recording selection for the "Next Page" action. The agent will use this recording to navigate through the pages and extract the specified data.
Automatic Schema Detection
For both 'Export' and 'Extract', you can leave the prompt empty. In this case, rtrvr.ai will automatically determine the most relevant data to extract based on the structure of the web pages. This makes it even easier to quickly gather data without needing to specify precise extraction instructions.
Special Image Handling
When you extract image source URLs and use the column name 'image', rtrvr.ai will automatically wrap the URLs with the '=IMAGE()' function when exporting to Google Sheets. This will make the images render directly within the spreadsheet.
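The wrapping step can be pictured with the short Python sketch below. It assumes the simplest form of the Google Sheets formula, `=IMAGE("url")`; the exact text rtrvr.ai writes to the sheet is not specified here, and the `wrap_image_cells` helper is hypothetical.

```python
def wrap_image_cells(rows: list[dict], column: str = "image") -> list[dict]:
    """Return copies of the rows with values in `column` wrapped in =IMAGE("...")."""
    wrapped = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        url = new_row.get(column)
        if url:
            # Double any quotes so the URL stays a valid Sheets string literal.
            escaped = url.replace('"', '""')
            new_row[column] = f'=IMAGE("{escaped}")'
        wrapped.append(new_row)
    return wrapped


rows = [
    {"product name": "Widget A", "image": "https://example.com/a.png"},
    {"product name": "Widget B", "image": ""},  # no URL, left as-is
]
wrapped = wrap_image_cells(rows)
for row in wrapped:
    print(row)
```

Only the column named 'image' is touched; other columns, and rows without a URL, pass through unchanged.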
Tips for Effective Data Extraction
- →Choose the Right Function: Use Export for the single active tab, Explore for paginated listings (with sequential or linked page extraction), and Extract for processing multiple selected tabs.
- →Be Specific (When Needed): If you need particular data, clearly state what you want to extract. For instance, instead of "Extract the info", say "Extract the product name, price, and description". You can also leave the prompt empty to rely on automatic schema detection (see Automatic Schema Detection).
- →Use Clear Labels: When specifying data elements, use labels that are easy for rtrvr.ai to understand (e.g., "product name," "author," "price"). Remember that using 'image' as a column name will enable special image rendering in Google Sheets.
- →Test and Refine: Start with a small test set of data to confirm the extraction is working as you expect, and refine your command if needed.