Content Enrichment API
Enrich content with topics, entities, sentiment, and classifications using multiple analysis engines.
Overview
The Content Enrichment API analyzes URLs, text, or JSON payloads to extract topics, entities, sentiment, and other metadata. It runs content through a pipeline of enrichers, including Google Natural Language, Diffbot, TextRazor, and built-in sentiment analysis, to produce a comprehensive content profile used for audience targeting and content recommendations.
Content enrichment must be enabled on your account (the `enrich_content` setting). Contact your account representative to enable this feature.
API Reference
Enrich Content
POST /v2/content/enrich
Enriches a URL, text, or JSON payload with extracted topics and metadata.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| `url` | string | URL to fetch and enrich |
| `text` | string | Plain text to enrich |
| `json` | string | JSON blob to enrich (text is auto-extracted) |
| `entry_uid` | string | Existing content entity UID to re-enrich |
| `entry_fields` | string | Specific fields to use from the entry |
At least one of `url`, `text`, or `json` must be provided.
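A minimal request sketch in Python. The base URL and bearer-token auth header are assumptions (neither is specified in this document); substitute your account's actual values:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.example.com"  # assumption: replace with your API host


def enrich(url=None, text=None, json_blob=None, api_key="YOUR_API_KEY"):
    """Call POST /v2/content/enrich with whichever input you have."""
    params = {k: v for k, v in
              {"url": url, "text": text, "json": json_blob}.items()
              if v is not None}
    if not params:
        raise ValueError("at least one of url, text, or json must be provided")
    req = urllib.request.Request(
        f"{API_BASE}/v2/content/enrich?{urllib.parse.urlencode(params)}",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```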
Response
```json
{
  "input": "example.com/article",
  "topics": {
    "Technology": 0.85,
    "Artificial Intelligence": 0.72,
    "Cloud Computing": 0.65
  },
  "inferred_topics": {
    "Computer Science": 0.68,
    "Software Engineering": 0.55
  }
}
```

| Field | Type | Description |
|---|---|---|
| `input` | string | The original URL or text that was enriched |
| `topics` | map | Topic labels mapped to relevance scores (0-1) |
| `inferred_topics` | map | Derived topics with relevance scores |
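Since explicit and inferred topics arrive in separate maps, a small helper can rank them together. This is a client-side convenience sketch, not part of the API; it prefers the explicit score when a label appears in both maps:

```python
def top_topics(response, limit=3):
    """Merge topics and inferred_topics from an enrich response and
    return the highest-scoring (label, score) pairs."""
    merged = {**response.get("inferred_topics", {}),
              **response.get("topics", {})}  # explicit scores win on overlap
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```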
Classify Content
POST /v2/content/classify
Returns detailed enrichment steps and results, showing how content was processed through the enrichment pipeline. Useful for debugging classification results.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| `url` | string | URL to classify |
Response
```json
{
  "content": {
    "url": "https://example.com/article",
    "description": "Article Title",
    "sentiment": 0.8,
    "tags": [...]
  },
  "steps": [
    {
      "source": "meta",
      "status": "success",
      "fields_extracted": ["title", "description", "author"]
    },
    {
      "source": "google_nlp",
      "status": "success",
      "entities_found": 5
    }
  ]
}
```

Get Content Entity
GET /v2/content/entity
Retrieves a stored content entity by URL or hashed URL.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| `url` | string | URL of the content entity |
| `hashedurl` | string | Pre-hashed URL identifier |
Align Content with Audiences
POST /v2/content/align
Aligns content topics with audience segments to find the best-matching audiences for a piece of content.
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| `method` | string | Similarity method: `jaccard`, `cosine`, or `embed` (default: `embed`) |
| `config` | string | Affinity config ID for taxonomy |
| `limit` | int | Maximum results to return (default: 10) |
| `entry_uid` | string | Entry to use for topic extraction |
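The `jaccard` and `cosine` methods can be sketched as follows. This is a simplification for intuition only; the service's exact weighting, and the `embed` method, are not specified in this document:

```python
import math


def jaccard_alignment(content_topics, segment_topics):
    """Jaccard similarity over topic label sets (ignores scores)."""
    a, b = set(content_topics), set(segment_topics)
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine_alignment(content_topics, segment_topics):
    """Cosine similarity over score vectors built on the union of labels."""
    keys = set(content_topics) | set(segment_topics)
    a = [content_topics.get(k, 0.0) for k in keys]
    b = [segment_topics.get(k, 0.0) for k in keys]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Note the difference: `jaccard` only asks whether the same labels appear on both sides, while `cosine` also rewards agreement in relevance scores.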
Request Body
```json
{
  "topics": {
    "Technology": 0.8,
    "Cloud": 0.6
  }
}
```

Response
```json
[
  {
    "segment_id": "seg_123",
    "segment_name": "Tech Enthusiasts",
    "segment_size": 5000,
    "alignment": 0.87,
    "segment_topics": {
      "Cloud": 0.65,
      "Technology": 0.72
    }
  }
]
```

Enrichment Pipeline
Content passes through an ordered pipeline of enrichers. Each enricher extracts different types of information:
| Enricher | Source Key | What It Extracts |
|---|---|---|
| Meta | meta | HTML meta tags, Open Graph, title, description, images, author, published date |
| Diffbot | diffbot, diffbot_meta | Article extraction, content type classification, structured data |
| Google NLP | google_nlp | Named entities with salience scores (filtered by salience > 0.05) |
| Google NLP Entity | google_nlp_entity | Typed entities: PERSON, LOCATION, ORGANIZATION, EVENT (requires knowledge graph metadata) |
| Google Categories | google_category | Content categories (IPTC taxonomy) |
| TextRazor | textrazor | Topics with relevance scores |
| Google Vision | google_vision | Image analysis: labels, safe search |
| Sentiment | sentiment | Sentiment score (-1.0 to 1.0) |
| LLM | llm | AI-generated topics via OpenAI or Vertex AI |
| Custom Topics | custom_topic | Matches against account-defined topic filters |
| Embeddings | embedding | Vector embeddings for semantic search |
Not all enrichers run on every piece of content. The pipeline selects enrichers based on your account's configured content sources and the type of input provided.
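Structurally, the pipeline behaves like an ordered fold over per-source enrichers, with each result recorded under its source key (mirroring the `steps` array in the classify response). This is a shape sketch only; the real selection and error-handling logic is account-specific and not documented here:

```python
def run_pipeline(content, enrichers):
    """Run (source_key, enricher) pairs in order, collecting results
    per source key and recording failures instead of aborting."""
    results = {}
    for source_key, enricher in enrichers:
        try:
            results[source_key] = enricher(content)
        except Exception as exc:
            # A failed enricher does not stop the rest of the pipeline.
            results[source_key] = {"status": "error", "reason": str(exc)}
    return results
```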
Text Processing
- Maximum text size: 4,000 characters for Google NLP analysis
- Text is intelligently extracted from HTML using the main content area
- Falls back to heading and paragraph tags if no main content is detected
- Language detection is automatic; enrichers that don't support the detected language are skipped
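The 4,000-character limit means long documents must be trimmed before Google NLP analysis. A plausible truncation helper, cutting at a word boundary (the service's exact truncation strategy is not specified here):

```python
MAX_NLP_CHARS = 4000  # documented limit for Google NLP analysis


def truncate_for_nlp(text, limit=MAX_NLP_CHARS):
    """Truncate text at the last word boundary within the limit,
    falling back to a hard cut when there is no whitespace."""
    if len(text) <= limit:
        return text
    cut = text.rfind(" ", 0, limit)
    return text[:cut] if cut > 0 else text[:limit]
```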
Entity Extraction
Google NLP entity extraction filters results by:
- Salience threshold: Entities must have salience > 0.05 (or > 0.01 with knowledge graph metadata)
- Entity types: PERSON, LOCATION, ORGANIZATION, EVENT (for the entity-specific enricher)
- Name constraints: Maximum 50 characters, maximum 3 spaces
- Knowledge graph: Entities with Wikipedia/Freebase metadata are prioritized
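Taken together, the filters above amount to a predicate like the following. The function name is illustrative; the thresholds and constraints are the ones documented in this section:

```python
def keep_entity(name, salience, has_kg_metadata=False):
    """Return True if an extracted entity passes the documented filters:
    a salience threshold (lowered when knowledge-graph metadata is
    present) plus name length and word-count constraints."""
    threshold = 0.01 if has_kg_metadata else 0.05
    if salience <= threshold:
        return False
    if len(name) > 50 or name.count(" ") > 3:
        return False
    return True
```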
Configuration
Content enrichment behavior can be configured through account settings:
| Setting | Type | Description |
|---|---|---|
| `enrich_content` | boolean | Enable/disable content enrichment |
| `content_crawl_delay` | int | Delay between HTTP requests in milliseconds (minimum 1 second) |
| `content_respect_directives` | string | How to handle robots.txt: `always`, `robots_txt`, or `never` |
| `content_domain_allowlist` | string[] | Only enrich content from these domains |
| `content_domain_blocklist` | string[] | Never enrich content from these domains |
| `content_path_allowlist` | string[] | Only enrich URLs matching these paths |
| `content_path_blocklist` | string[] | Exclude URLs matching these paths |
| `content_topic_allowlist` | string[] | Only include these topics in results |
| `content_topic_blocklist` | string[] | Exclude these topics from results |
| `content_max_topics` | int | Maximum number of topics per document |
| `content_boosted_attributes` | string[] | HTML fields to prioritize in text extraction |
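An illustrative combination of these settings (the values and the surrounding format are examples only; how settings are applied depends on your account tooling):

```json
{
  "enrich_content": true,
  "content_crawl_delay": 2000,
  "content_respect_directives": "robots_txt",
  "content_domain_allowlist": ["example.com", "blog.example.com"],
  "content_topic_blocklist": ["Politics"],
  "content_max_topics": 20
}
```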
Input Sources
The enrichment pipeline processes content from various input streams:
- Web URLs: Standard web page crawling with robots.txt respect
- Email content: From Mailchimp, Dotdigital, SendGrid, Campaign Monitor, or MessageCentral streams
- Contentful webhooks: CMS content with rich text support
- Spotify: Music track metadata from Spotify URIs
- Direct text/JSON: Via the API with the `text` or `json` parameters
URL Processing
- URLs are normalized (scheme, trailing slashes)
- Redirects and canonical URLs are tracked
- Duplicate content is detected via bloom filter (2M entries, 1% false positive rate)
- Crawl delays from robots.txt are respected (minimum 1 second)
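URL normalization along the lines described above can be sketched like this. The exact rules (e.g. fragment handling, query ordering) are not specified in this document, so treat this as an approximation:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Approximate the documented normalization: default the scheme,
    lowercase scheme and host, and strip the trailing slash.
    Fragments are dropped (an assumption, not documented)."""
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))
```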
