Content Enrichment API
Overview
The Content Enrichment API analyzes URLs, text, or JSON payloads to extract topics, entities, sentiment, and other metadata. It uses a pipeline of enrichers -- including Google Natural Language, Diffbot, TextRazor, and built-in sentiment analysis -- to produce a comprehensive content profile used for audience targeting and content recommendations.
Content enrichment must be enabled on your account (enrich_content setting). Contact your account representative to enable this feature.
API Reference
Enrich Content
POST /v2/content/enrich
Enriches a URL, text, or JSON payload with extracted topics and metadata.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | URL to fetch and enrich |
| text | string | Plain text to enrich |
| json | string | JSON blob to enrich (text is auto-extracted) |
| entry_uid | string | Existing content entity UID to re-enrich |
| entry_fields | string | Specific fields to use from the entry |
At least one of url, text, or json must be provided.
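As a minimal sketch of how a client might assemble this call, the snippet below builds the request URL and enforces the "at least one of url, text, or json" rule before anything is sent. The base URL `https://api.example.com` and the helper name `build_enrich_url` are assumptions for illustration; authentication is not covered in this section and is left out.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the API host for your account.
BASE_URL = "https://api.example.com"


def build_enrich_url(url=None, text=None, json_blob=None,
                     entry_uid=None, entry_fields=None):
    """Return the POST URL for /v2/content/enrich with its query parameters."""
    # The endpoint requires at least one of url, text, or json.
    if not any((url, text, json_blob)):
        raise ValueError("provide at least one of url, text, or json")
    candidates = {
        "url": url,
        "text": text,
        "json": json_blob,
        "entry_uid": entry_uid,
        "entry_fields": entry_fields,
    }
    # Drop unset parameters so only the supplied ones are sent.
    params = {key: value for key, value in candidates.items() if value is not None}
    return f"{BASE_URL}/v2/content/enrich?{urlencode(params)}"


enrich_url = build_enrich_url(url="https://example.com/article")
print(enrich_url)
```

The resulting URL can be passed to any HTTP client as the target of a POST request.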
Response
{
"input": "example.com/article",
"topics": {
"Technology": 0.85,
"Artificial Intelligence": 0.72,
"Cloud Computing": 0.65
},
"inferred_topics": {
"Computer Science": 0.68,
"Software Engineering": 0.55
}
}
| Field | Type | Description |
|---|---|---|
| input | string | The original URL or text that was enriched |
| topics | map | Topic labels mapped to relevance scores (0-1) |
| inferred_topics | map | Derived topics with relevance scores |
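One way a client might consume this response is to merge both topic maps and rank them by score. The sketch below uses the sample response from this section; the helper name `top_topics` and the 0.6 cutoff are illustrative assumptions, not part of the API.

```python
import json

# Sample response body, copied from the example in this section.
response_body = """
{
  "input": "example.com/article",
  "topics": {
    "Technology": 0.85,
    "Artificial Intelligence": 0.72,
    "Cloud Computing": 0.65
  },
  "inferred_topics": {
    "Computer Science": 0.68,
    "Software Engineering": 0.55
  }
}
"""


def top_topics(payload, min_score=0.6):
    """Merge topics and inferred_topics, then rank by descending score."""
    # If a label appears in both maps, the directly extracted score wins.
    merged = {**payload.get("inferred_topics", {}), **payload.get("topics", {})}
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return [(label, score) for label, score in ranked if score >= min_score]


ranked_topics = top_topics(json.loads(response_body))
for label, score in ranked_topics:
    print(f"{score:.2f}  {label}")
```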
Classify Content
POST /v2/content/classify
Returns detailed enrichment steps and results, showing how content was processed through the enrichment pipeline. Useful for debugging classification results.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | URL to classify |
Response
{
"content": {
"url": "https://example.com/article",
"description": "Article Title",
"sentiment": 0.8,
"tags": [...]
},
"steps": [
{
"source": "meta",
"status": "success",
"fields_extracted": ["title", "description", "author"]
},
{
"source": "google_nlp",
"status": "success",
"entities_found": 5
}
]
}
Get Content Entity
GET /v2/content/entity
Retrieves a stored content entity by URL or hashed URL.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | URL of the content entity |
| hashedurl | string | Pre-hashed URL identifier |
Align Content with Audiences
POST /v2/content/align
Aligns content topics with audience segments to find the best-matching audiences for a piece of content.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| method | string | Similarity method: jaccard, cosine, or embed (default: embed) |
| config | string | Affinity config ID for taxonomy |
| limit | int | Maximum results to return (default: 10) |
| entry_uid | string | Entry to use for topic extraction |
Request Body
{
"topics": {
"Technology": 0.8,
"Cloud": 0.6
}
}
Response
[
{
"segment_id": "seg_123",
"segment_name": "Tech Enthusiasts",
"segment_size": 5000,
"alignment": 0.87,
"segment_topics": {
"Cloud": 0.65,
"Technology": 0.72
}
}
]
Enrichment Pipeline
Content passes through an ordered pipeline of enrichers. Each enricher extracts different types of information:
| Enricher | Source Key | What It Extracts |
|---|---|---|
| Meta | meta | HTML meta tags, Open Graph, title, description, images, author, published date |
| Diffbot | diffbot, diffbot_meta | Article extraction, content type classification, structured data |
| Google NLP | google_nlp | Named entities with salience scores (filtered by salience > 0.05) |
| Google NLP Entity | google_nlp_entity | Typed entities: PERSON, LOCATION, ORGANIZATION, EVENT (requires knowledge graph metadata) |
| Google Categories | google_category | Content categories (IPTC taxonomy) |
| TextRazor | textrazor | Topics with relevance scores |
| Google Vision | google_vision | Image analysis: labels, safe search |
| Sentiment | sentiment | Sentiment score (-1.0 to 1.0) |
| LLM | llm | AI-generated topics via OpenAI or Vertex AI |
| Custom Topics | custom_topic | Matches against account-defined topic filters |
| Embeddings | embedding | Vector embeddings for semantic search |
Not all enrichers run on every piece of content. The pipeline selects enrichers based on your account's configured content sources and the type of input provided.
Text Processing
- Maximum text size: 4,000 characters for Google NLP analysis
- Text is extracted from the main content area of the HTML page
- Falls back to heading and paragraph tags if no main content is detected
- Language detection is automatic; enrichers that don't support the detected language are skipped
Entity Extraction
Google NLP entity extraction filters results by:
- Salience threshold: Entities must have salience > 0.05 (or > 0.01 with knowledge graph metadata)
- Entity types: PERSON, LOCATION, ORGANIZATION, EVENT (for the entity-specific enricher)
- Name constraints: Maximum 50 characters, maximum 3 spaces
- Knowledge graph: Entities with Wikipedia/Freebase metadata are prioritized
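The filtering rules above can be sketched as a single predicate. The entity shape used here (a dict with `name`, `type`, `salience`, and an optional `metadata` map) is an assumption for illustration, as are the function and parameter names. The sketch covers only the filtering rules, not the prioritization of knowledge graph entities.

```python
# Types accepted by the entity-specific enricher, per the list above.
TYPED_ENTITY_TYPES = {"PERSON", "LOCATION", "ORGANIZATION", "EVENT"}


def keep_entity(entity, typed_only=False):
    """Return True if the entity passes the documented filters."""
    name = entity["name"]
    # Assumption: a non-empty metadata map means knowledge graph data exists.
    has_knowledge_graph = bool(entity.get("metadata"))
    threshold = 0.01 if has_knowledge_graph else 0.05
    if entity["salience"] <= threshold:
        return False
    # Name constraints: at most 50 characters and at most 3 spaces.
    if len(name) > 50 or name.count(" ") > 3:
        return False
    if typed_only and entity["type"] not in TYPED_ENTITY_TYPES:
        return False
    return True


entities = [
    {"name": "Ada Lovelace", "type": "PERSON", "salience": 0.03,
     "metadata": {"wikipedia_url": "https://en.wikipedia.org/wiki/Ada_Lovelace"}},
    {"name": "computing", "type": "OTHER", "salience": 0.03},
    {"name": "a very long multi word phrase here", "type": "OTHER", "salience": 0.4},
    {"name": "cloud computing", "type": "OTHER", "salience": 0.2},
]
kept = [entity["name"] for entity in entities if keep_entity(entity)]
print(kept)
```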
Configuration
Content enrichment behavior can be configured through account settings:
| Setting | Type | Description |
|---|---|---|
| enrich_content | boolean | Enable/disable content enrichment |
| content_crawl_delay | int | Delay between HTTP requests in milliseconds (minimum 1000, i.e. 1 second) |
| content_respect_directives | string | How to handle robots.txt: always, robots_txt, or never |
| content_domain_allowlist | string[] | Only enrich content from these domains |
| content_domain_blocklist | string[] | Never enrich content from these domains |
| content_path_allowlist | string[] | Only enrich URLs matching these paths |
| content_path_blocklist | string[] | Exclude URLs matching these paths |
| content_topic_allowlist | string[] | Only include these topics in results |
| content_topic_blocklist | string[] | Exclude these topics from results |
| content_max_topics | int | Maximum number of topics per document |
| content_boosted_attributes | string[] | HTML fields to prioritize in text extraction |
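To illustrate how the domain and path lists might combine, the sketch below checks a URL against a settings dict. Several details are assumptions not stated in this document: blocklists take precedence over allowlists, an empty allowlist allows everything, domains are compared by exact hostname, and paths are matched by prefix. The function name `is_url_allowed` is hypothetical.

```python
from urllib.parse import urlsplit


def is_url_allowed(url, settings):
    """Apply domain and path allow/block lists to a URL (assumed semantics)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"

    # Assumption: a blocklist match always wins.
    if host in settings.get("content_domain_blocklist", []):
        return False
    allowed_domains = settings.get("content_domain_allowlist", [])
    if allowed_domains and host not in allowed_domains:
        return False

    # Assumption: path entries are matched as prefixes.
    blocked_paths = settings.get("content_path_blocklist", [])
    if any(path.startswith(prefix) for prefix in blocked_paths):
        return False
    allowed_paths = settings.get("content_path_allowlist", [])
    if allowed_paths and not any(path.startswith(prefix) for prefix in allowed_paths):
        return False
    return True


settings = {
    "content_domain_allowlist": ["example.com"],
    "content_path_blocklist": ["/drafts/"],
}
print(is_url_allowed("https://example.com/blog/post", settings))
print(is_url_allowed("https://example.com/drafts/wip", settings))
```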
Input Sources
The enrichment pipeline processes content from various input streams:
- Web URLs: Standard web page crawling that respects robots.txt
- Email content: From Mailchimp, Dotdigital, SendGrid, Campaign Monitor, or MessageCentral streams
- Contentful webhooks: CMS content with rich text support
- Spotify: Music track metadata from Spotify URIs
- Direct text/JSON: Via the API with the text or json parameters
URL Processing
- URLs are normalized (scheme, trailing slashes)
- Redirects and canonical URLs are tracked
- Duplicate content is detected via bloom filter (2M entries, 1% false positive rate)
- Crawl delays from robots.txt are respected (minimum 1 second)
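For a sense of scale, the standard Bloom filter sizing formulas can be applied to the figures quoted above (2M entries, 1% false positive rate). This is a back-of-the-envelope sketch using the textbook formulas; the actual bit-array size and hash count used by the service are not stated in this document.

```python
import math


def bloom_parameters(capacity, false_positive_rate):
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-capacity * math.log(false_positive_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    hash_count = max(1, round(bits / capacity * math.log(2)))
    return bits, hash_count


bits, hash_count = bloom_parameters(2_000_000, 0.01)
print(f"{bits} bits (~{bits / 8 / 1_000_000:.1f} MB), {hash_count} hash functions")
```

Under these formulas the filter needs roughly 19.2 million bits (about 2.4 MB) and 7 hash functions, which keeps duplicate detection cheap at the cost of occasionally treating a new URL as already seen.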