Content Enrichment API

Enrich content with topics, entities, sentiment, and classifications using multiple analysis engines.

Overview

The Content Enrichment API analyzes URLs, text, or JSON payloads to extract topics, entities, sentiment, and other metadata. It uses a pipeline of enrichers -- including Google Natural Language, Diffbot, TextRazor, and built-in sentiment analysis -- to produce a comprehensive content profile used for audience targeting and content recommendations.

📘

Content enrichment must be enabled on your account (enrich_content setting). Contact your account representative to enable this feature.

API Reference

Enrich Content

POST /v2/content/enrich

Enriches a URL, text, or JSON payload with extracted topics and metadata.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | URL to fetch and enrich |
| text | string | Plain text to enrich |
| json | string | JSON blob to enrich (text is auto-extracted) |
| entry_uid | string | Existing content entity UID to re-enrich |
| entry_fields | string | Specific fields to use from the entry |

At least one of url, text, or json must be provided.
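
As an illustration, the enrich call takes its inputs as query parameters. Here is a minimal Python sketch for building the request URL; the API host shown is a placeholder, not part of this reference:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder; substitute your API host

def enrich_request_url(url=None, text=None, json_blob=None):
    """Build the query URL for POST /v2/content/enrich.

    The endpoint takes its inputs as query parameters; at least one of
    url, text, or json must be provided.
    """
    params = {k: v for k, v in
              {"url": url, "text": text, "json": json_blob}.items()
              if v is not None}
    if not params:
        raise ValueError("provide at least one of url, text, or json")
    return f"{BASE_URL}/v2/content/enrich?{urlencode(params)}"
```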

Response

{
  "input": "example.com/article",
  "topics": {
    "Technology": 0.85,
    "Artificial Intelligence": 0.72,
    "Cloud Computing": 0.65
  },
  "inferred_topics": {
    "Computer Science": 0.68,
    "Software Engineering": 0.55
  }
}
| Field | Type | Description |
| --- | --- | --- |
| input | string | The original URL or text that was enriched |
| topics | map | Topic labels mapped to relevance scores (0-1) |
| inferred_topics | map | Derived topics with relevance scores |
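
The two topic maps can be combined client-side when ranking results. A small sketch; the merge strategy here (keeping the higher score for duplicate labels) is an illustration, not documented API behavior:

```python
def top_topics(response, n=3):
    """Merge explicit and inferred topics from an enrich response,
    keeping the higher score when a label appears in both maps,
    and return the top-n labels by relevance."""
    merged = dict(response.get("inferred_topics", {}))
    for label, score in response.get("topics", {}).items():
        merged[label] = max(score, merged.get(label, 0.0))
    return sorted(merged, key=merged.get, reverse=True)[:n]
```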

Classify Content

POST /v2/content/classify

Returns detailed enrichment steps and results, showing how content was processed through the enrichment pipeline. Useful for debugging classification results.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | URL to classify |

Response

{
  "content": {
    "url": "https://example.com/article",
    "description": "Article Title",
    "sentiment": 0.8,
    "tags": [...]
  },
  "steps": [
    {
      "source": "meta",
      "status": "success",
      "fields_extracted": ["title", "description", "author"]
    },
    {
      "source": "google_nlp",
      "status": "success",
      "entities_found": 5
    }
  ]
}
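
Because each step reports a source and a status, the response is handy for spotting which enrichers failed. A minimal sketch:

```python
def failed_steps(classify_response):
    """List the enrichment-pipeline steps that did not report success,
    useful when debugging classification results."""
    return [step["source"]
            for step in classify_response.get("steps", [])
            if step.get("status") != "success"]
```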

Get Content Entity

GET /v2/content/entity

Retrieves a stored content entity by URL or hashed URL.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | URL of the content entity |
| hashedurl | string | Pre-hashed URL identifier |

Align Content with Audiences

POST /v2/content/align

Aligns content topics with audience segments to find the best-matching audiences for a piece of content.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| method | string | Similarity method: jaccard, cosine, or embed (default: embed) |
| config | string | Affinity config ID for taxonomy |
| limit | int | Maximum results to return (default: 10) |
| entry_uid | string | Entry to use for topic extraction |

Request Body

{
  "topics": {
    "Technology": 0.8,
    "Cloud": 0.6
  }
}

Response

[
  {
    "segment_id": "seg_123",
    "segment_name": "Tech Enthusiasts",
    "segment_size": 5000,
    "alignment": 0.87,
    "segment_topics": {
      "Cloud": 0.65,
      "Technology": 0.72
    }
  }
]
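
For intuition on the jaccard and cosine methods, here is how those similarity measures apply to topic maps like the ones above. This is an illustrative re-implementation, not the service's exact scoring:

```python
import math

def cosine_alignment(content_topics, segment_topics):
    """Cosine similarity between two topic->score maps, treating each
    map as a sparse vector over topic labels."""
    labels = set(content_topics) | set(segment_topics)
    dot = sum(content_topics.get(k, 0.0) * segment_topics.get(k, 0.0)
              for k in labels)
    norm_a = math.sqrt(sum(v * v for v in content_topics.values()))
    norm_b = math.sqrt(sum(v * v for v in segment_topics.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_alignment(content_topics, segment_topics):
    """Jaccard overlap of the topic label sets (ignores scores)."""
    a, b = set(content_topics), set(segment_topics)
    return len(a & b) / len(a | b) if a | b else 0.0
```

With the request body and segment shown above, cosine rewards agreement on scores while jaccard only checks label overlap.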

Enrichment Pipeline

Content passes through an ordered pipeline of enrichers. Each enricher extracts different types of information:

| Enricher | Source Key | What It Extracts |
| --- | --- | --- |
| Meta | meta | HTML meta tags, Open Graph, title, description, images, author, published date |
| Diffbot | diffbot, diffbot_meta | Article extraction, content type classification, structured data |
| Google NLP | google_nlp | Named entities with salience scores (filtered by relevance > 0.05) |
| Google NLP Entity | google_nlp_entity | Typed entities: PERSON, LOCATION, ORGANIZATION, EVENT (requires knowledge graph metadata) |
| Google Categories | google_category | Content categories (IPTC taxonomy) |
| TextRazor | textrazor | Topics with relevance scores |
| Google Vision | google_vision | Image analysis: labels, safe search |
| Sentiment | sentiment | Sentiment score (-1.0 to 1.0) |
| LLM | llm | AI-generated topics via OpenAI or Vertex AI |
| Custom Topics | custom_topic | Matches against account-defined topic filters |
| Embeddings | embedding | Vector embeddings for semantic search |
📘

Not all enrichers run on every piece of content. The pipeline selects enrichers based on your account's configured content sources and the type of input provided.

Text Processing

  • Maximum text size: 4,000 characters for Google NLP analysis
  • Text is extracted from the main content area of the HTML page
  • Falls back to heading and paragraph tags if no main content is detected
  • Language detection is automatic; enrichers that don't support the detected language are skipped
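
If you pre-process text client-side, here is a sketch of trimming input to the 4,000-character NLP cap while cutting at a word boundary; the boundary heuristic is illustrative, not the pipeline's exact rule:

```python
MAX_NLP_CHARS = 4000  # Google NLP input cap noted above

def truncate_for_nlp(text, limit=MAX_NLP_CHARS):
    """Trim text to the NLP size cap, cutting at the last word
    boundary before the limit so a token isn't split mid-word."""
    if len(text) <= limit:
        return text
    cut = text.rfind(" ", 0, limit)
    return text[:cut] if cut > 0 else text[:limit]
```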

Entity Extraction

Google NLP entity extraction filters results by:

  • Salience threshold: Entities must have salience > 0.05 (or > 0.01 with knowledge graph metadata)
  • Entity types: PERSON, LOCATION, ORGANIZATION, EVENT (for the entity-specific enricher)
  • Name constraints: Maximum 50 characters, maximum 3 spaces
  • Knowledge graph: Entities with Wikipedia/Freebase metadata are prioritized
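
These filters can be expressed directly in code. A sketch of the documented thresholds and name constraints (the exact comparison operators are assumptions):

```python
def passes_entity_filter(name, salience, has_kg_metadata=False):
    """Apply the documented Google NLP entity filters: a salience
    threshold of 0.05, relaxed to 0.01 when the entity carries
    knowledge-graph metadata, plus name constraints of at most
    50 characters and at most 3 spaces."""
    threshold = 0.01 if has_kg_metadata else 0.05
    if salience <= threshold:
        return False
    if len(name) > 50 or name.count(" ") > 3:
        return False
    return True
```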

Configuration

Content enrichment behavior can be configured through account settings:

| Setting | Type | Description |
| --- | --- | --- |
| enrich_content | boolean | Enable/disable content enrichment |
| content_crawl_delay | int | Delay between HTTP requests in milliseconds (minimum 1 second) |
| content_respect_directives | string | How to handle robots.txt: always, robots_txt, or never |
| content_domain_allowlist | string[] | Only enrich content from these domains |
| content_domain_blocklist | string[] | Never enrich content from these domains |
| content_path_allowlist | string[] | Only enrich URLs matching these paths |
| content_path_blocklist | string[] | Exclude URLs matching these paths |
| content_topic_allowlist | string[] | Only include these topics in results |
| content_topic_blocklist | string[] | Exclude these topics from results |
| content_max_topics | int | Maximum number of topics per document |
| content_boosted_attributes | string[] | HTML fields to prioritize in text extraction |
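
As an illustration of how the topic settings interact, here is a sketch applying allowlist, blocklist, and max-topics filtering to a topic map; the ordering of the filters is an assumption:

```python
def filter_topics(topics, allowlist=None, blocklist=None, max_topics=None):
    """Apply content_topic_allowlist, content_topic_blocklist, and
    content_max_topics to a topic->score map, keeping the highest-
    scoring topics when the cap is applied."""
    items = list(topics.items())
    if allowlist is not None:
        items = [(t, s) for t, s in items if t in allowlist]
    if blocklist:
        items = [(t, s) for t, s in items if t not in blocklist]
    ranked = sorted(items, key=lambda ts: ts[1], reverse=True)
    if max_topics is not None:
        ranked = ranked[:max_topics]
    return dict(ranked)
```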

Input Sources

The enrichment pipeline processes content from various input streams:

  • Web URLs: Standard web page crawling with robots.txt respect
  • Email content: From Mailchimp, Dotdigital, SendGrid, Campaign Monitor, or MessageCentral streams
  • Contentful webhooks: CMS content with rich text support
  • Spotify: Music track metadata from Spotify URIs
  • Direct text/JSON: Via the API with text or json parameters

URL Processing

  • URLs are normalized (scheme, trailing slashes)
  • Redirects and canonical URLs are tracked
  • Duplicate content is detected via bloom filter (2M entries, 1% false positive rate)
  • Crawl delays from robots.txt are respected (minimum 1 second)
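
A sketch of the normalization step (defaulting the scheme, lowercasing the host, trimming the trailing slash, dropping fragments); the pipeline's exact rules may differ:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Illustrative URL normalization: default to https when no scheme
    is given, lowercase the scheme and host, strip a trailing slash
    (keeping a bare root path), and drop any fragment."""
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )
```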