Content Enrichment API

Overview

The Content Enrichment API analyzes URLs, text, or JSON payloads to extract topics, entities, sentiment, and other metadata. It runs the input through a pipeline of enrichers (including Google Natural Language, Diffbot, TextRazor, and built-in sentiment analysis) to produce a content profile used for audience targeting and content recommendations.

Note: Content enrichment must be enabled on your account (enrich_content setting). Contact your account representative to enable this feature.

API Reference

Enrich Content

POST /v2/content/enrich

Enriches a URL, text, or JSON payload with extracted topics and metadata.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | URL to fetch and enrich |
| text | string | Plain text to enrich |
| json | string | JSON blob to enrich (text is auto-extracted) |
| entry_uid | string | Existing content entity UID to re-enrich |
| entry_fields | string | Specific fields to use from the entry |

At least one of url, text, or json must be provided.
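
As a rough illustration of the parameters above, the following sketch builds (but does not send) an enrich request with Python's standard library. The host `https://api.example.com` is a placeholder that does not come from this reference, and authentication is omitted because it is not described here.

```python
import urllib.parse
import urllib.request

# Hypothetical base URL; replace with the host for your account.
BASE_URL = "https://api.example.com"


def build_enrich_request(url=None, text=None, json_blob=None,
                         entry_uid=None, entry_fields=None):
    """Build a POST /v2/content/enrich request without sending it."""
    # The reference requires at least one of url, text, or json.
    if not any([url, text, json_blob]):
        raise ValueError("provide at least one of url, text, or json")

    candidates = {
        "url": url,
        "text": text,
        "json": json_blob,
        "entry_uid": entry_uid,
        "entry_fields": entry_fields,
    }
    # Drop parameters that were not supplied, then URL-encode the rest.
    params = {name: value for name, value in candidates.items() if value is not None}
    query = urllib.parse.urlencode(params)

    return urllib.request.Request(
        f"{BASE_URL}/v2/content/enrich?{query}",
        method="POST",
    )


request = build_enrich_request(url="https://example.com/article")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it, once the real host and whatever authentication headers your account requires have been added.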

Response

{
  "input": "example.com/article",
  "topics": {
    "Technology": 0.85,
    "Artificial Intelligence": 0.72,
    "Cloud Computing": 0.65
  },
  "inferred_topics": {
    "Computer Science": 0.68,
    "Software Engineering": 0.55
  }
}

| Field | Type | Description |
| --- | --- | --- |
| input | string | The original URL or text that was enriched |
| topics | map | Topic labels mapped to relevance scores (0-1) |
| inferred_topics | map | Derived topics with relevance scores |
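
One way to consume this response is to merge both topic maps into a single ranked list. The sketch below does that for the sample response shown above; the `ranked_topics` helper and its `min_score` cutoff are illustrative names, not part of the API.

```python
import json

# Sample response copied from the example above.
response_body = """
{
  "input": "example.com/article",
  "topics": {
    "Technology": 0.85,
    "Artificial Intelligence": 0.72,
    "Cloud Computing": 0.65
  },
  "inferred_topics": {
    "Computer Science": 0.68,
    "Software Engineering": 0.55
  }
}
"""


def ranked_topics(result, min_score=0.0):
    """Return (label, score, kind) tuples sorted by descending score."""
    rows = []
    for kind in ("topics", "inferred_topics"):
        # .get() tolerates a response that omits one of the maps.
        for label, score in result.get(kind, {}).items():
            if score >= min_score:
                rows.append((label, score, kind))
    return sorted(rows, key=lambda row: row[1], reverse=True)


result = json.loads(response_body)
for label, score, kind in ranked_topics(result, min_score=0.6):
    print(f"{score:.2f}  {label}  ({kind})")
```

Keeping the `kind` column makes it easy to treat inferred topics differently from directly extracted ones, for example by applying a stricter cutoff to them.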

Classify Content

POST /v2/content/classify

Returns detailed enrichment steps and results, showing how content was processed through the enrichment pipeline. Useful for debugging classification results.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | URL to classify |

Response

{
  "content": {
    "url": "https://example.com/article",
    "description": "Article Title",
    "sentiment": 0.8,
    "tags": [...]
  },
  "steps": [
    {
      "source": "meta",
      "status": "success",
      "fields_extracted": ["title", "description", "author"]
    },
    {
      "source": "google_nlp",
      "status": "success",
      "entities_found": 5
    }
  ]
}
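
When debugging, it can help to reduce the steps array to a list of sources that succeeded and a list that did not. The sketch below uses the two steps from the sample response plus one hypothetical extra step; only the "success" status appears in this reference, so the "failed" value is an assumption made for illustration.

```python
# Steps copied from the sample response above, plus one hypothetical
# unsuccessful step (the "failed" status value is an assumption).
steps = [
    {"source": "meta", "status": "success",
     "fields_extracted": ["title", "description", "author"]},
    {"source": "google_nlp", "status": "success", "entities_found": 5},
    {"source": "textrazor", "status": "failed"},
]


def summarize_steps(steps):
    """Split pipeline steps into successful sources and everything else."""
    succeeded = [s["source"] for s in steps if s.get("status") == "success"]
    other = [(s["source"], s.get("status")) for s in steps
             if s.get("status") != "success"]
    return succeeded, other


succeeded, other = summarize_steps(steps)
print("succeeded:", succeeded)
print("needs attention:", other)
```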

Get Content Entity

GET /v2/content/entity

Retrieves a stored content entity by URL or hashed URL.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | URL of the content entity |
| hashedurl | string | Pre-hashed URL identifier |

Align Content with Audiences

POST /v2/content/align

Aligns content topics with audience segments to find the best-matching audiences for a piece of content.

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| method | string | Similarity method: jaccard, cosine, or embed (default: embed) |
| config | string | Affinity config ID for taxonomy |
| limit | int | Maximum results to return (default: 10) |
| entry_uid | string | Entry to use for topic extraction |

Request Body

{
  "topics": {
    "Technology": 0.8,
    "Cloud": 0.6
  }
}
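
The align endpoint combines query parameters with a JSON body. The sketch below assembles such a request without sending it. The host is a placeholder, the `Content-Type` header is an assumption (it is conventional for JSON bodies but not stated in this reference), and authentication is omitted.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL; replace with the host for your account.
BASE_URL = "https://api.example.com"
METHODS = ("jaccard", "cosine", "embed")


def build_align_request(topics, method="embed", limit=10,
                        config=None, entry_uid=None):
    """Build a POST /v2/content/align request without sending it."""
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}")

    params = {"method": method, "limit": limit}
    if config is not None:
        params["config"] = config
    if entry_uid is not None:
        params["entry_uid"] = entry_uid

    # The topic map travels in the JSON request body.
    body = json.dumps({"topics": topics}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v2/content/align?{urllib.parse.urlencode(params)}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed
    )


request = build_align_request({"Technology": 0.8, "Cloud": 0.6},
                              method="cosine", limit=5)
print(request.full_url)
print(request.data.decode("utf-8"))
```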

Response

[
  {
    "segment_id": "seg_123",
    "segment_name": "Tech Enthusiasts",
    "segment_size": 5000,
    "alignment": 0.87,
    "segment_topics": {
      "Cloud": 0.65,
      "Technology": 0.72
    }
  }
]
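
To give a feel for what the jaccard and cosine methods measure, the sketch below applies the textbook definitions to the topic maps from the examples above. This is not necessarily how the service computes alignment (weighting, taxonomy mapping, and the embed method are not covered), so the numbers will not match the sample response.

```python
import math


def jaccard(a, b):
    """Jaccard similarity over topic labels, ignoring the scores."""
    labels_a, labels_b = set(a), set(b)
    union = labels_a | labels_b
    return len(labels_a & labels_b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity treating each topic map as a sparse vector."""
    dot = sum(score * b[label] for label, score in a.items() if label in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


content_topics = {"Technology": 0.8, "Cloud": 0.6}
segment_topics = {"Cloud": 0.65, "Technology": 0.72}

print(jaccard(content_topics, segment_topics))           # 1.0
print(round(cosine(content_topics, segment_topics), 4))  # 0.9959
```

Jaccard only asks whether the two maps share labels, so it saturates at 1.0 here; cosine also accounts for how similar the relative weights are.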

Enrichment Pipeline

Content passes through an ordered pipeline of enrichers. Each enricher extracts different types of information:

| Enricher | Source Key | What It Extracts |
| --- | --- | --- |
| Meta | meta | HTML meta tags, Open Graph, title, description, images, author, published date |
| Diffbot | diffbot, diffbot_meta | Article extraction, content type classification, structured data |
| Google NLP | google_nlp | Named entities with salience scores (filtered by salience > 0.05) |
| Google NLP Entity | google_nlp_entity | Typed entities: PERSON, LOCATION, ORGANIZATION, EVENT (requires knowledge graph metadata) |
| Google Categories | google_category | Content categories (IPTC taxonomy) |
| TextRazor | textrazor | Topics with relevance scores |
| Google Vision | google_vision | Image analysis: labels, safe search |
| Sentiment | sentiment | Sentiment score (-1.0 to 1.0) |
| LLM | llm | AI-generated topics via OpenAI or Vertex AI |
| Custom Topics | custom_topic | Matches against account-defined topic filters |
| Embeddings | embedding | Vector embeddings for semantic search |

Note: Not all enrichers run on every piece of content. The pipeline selects enrichers based on your account's configured content sources and the type of input provided.
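
The idea of an ordered pipeline with conditional enrichers can be modeled in a few lines. The sketch below is a conceptual toy, not the service's implementation: the `Enricher` class, the `applies` predicates, the stub outputs, and the "skipped" status are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Enricher:
    source: str                      # source key, e.g. "meta"
    applies: Callable[[dict], bool]  # should this enricher run on the input?
    run: Callable[[dict], dict]      # returns fields to merge into the profile


def run_pipeline(doc, enrichers, enabled_sources):
    """Run enrichers in order, recording one step per enricher."""
    profile, steps = {}, []
    for enricher in enrichers:
        if enricher.source not in enabled_sources or not enricher.applies(doc):
            steps.append({"source": enricher.source, "status": "skipped"})
            continue
        profile.update(enricher.run(doc))
        steps.append({"source": enricher.source, "status": "success"})
    return profile, steps


# Stub enrichers; the outputs are placeholders, not real analysis.
enrichers = [
    Enricher("meta", applies=lambda d: "url" in d,
             run=lambda d: {"url": d["url"]}),
    Enricher("sentiment", applies=lambda d: bool(d.get("text")),
             run=lambda d: {"sentiment": 0.0}),
]

profile, steps = run_pipeline({"text": "hello world"}, enrichers,
                              enabled_sources={"meta", "sentiment"})
print(profile)
print(steps)
```

Because the input here is plain text with no URL, the toy meta enricher is skipped while the sentiment stub runs, mirroring how input type can determine which enrichers apply.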

Text Processing

  • Maximum text size: 4,000 characters for Google NLP analysis
  • Text is extracted from the main content area of the HTML
  • Falls back to heading and paragraph tags if no main content is detected
  • Language detection is automatic; enrichers that don't support the detected language are skipped
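
The extraction rules above can be approximated with Python's built-in HTML parser. In this minimal sketch, the "main content area" is assumed to mean `<main>` or `<article>` elements, which this reference does not confirm; the fallback and the 4,000-character limit follow the list above. Language detection is not modeled.

```python
from html.parser import HTMLParser

MAX_NLP_CHARS = 4000                 # limit stated above for Google NLP
MAIN_TAGS = {"main", "article"}      # assumed markers of the main content area
FALLBACK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


class TextCollector(HTMLParser):
    """Collect text inside main-content tags and inside headings/paragraphs."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0
        self.fallback_depth = 0
        self.main_text = []
        self.fallback_text = []

    def handle_starttag(self, tag, attrs):
        if tag in MAIN_TAGS:
            self.main_depth += 1
        elif tag in FALLBACK_TAGS:
            self.fallback_depth += 1

    def handle_endtag(self, tag):
        if tag in MAIN_TAGS and self.main_depth:
            self.main_depth -= 1
        elif tag in FALLBACK_TAGS and self.fallback_depth:
            self.fallback_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.main_depth:
            self.main_text.append(text)
        if self.fallback_depth:
            self.fallback_text.append(text)


def extract_text(html):
    """Prefer main-content text, fall back to headings and paragraphs."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    chunks = collector.main_text or collector.fallback_text
    return " ".join(chunks)[:MAX_NLP_CHARS]


page = "<body><nav>Menu</nav><main><h1>Title</h1><p>Body text.</p></main></body>"
print(extract_text(page))
```

A production extractor would also need to handle unclosed tags and skip script and style content, which this sketch does not attempt.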

Entity Extraction

Google NLP entity extraction filters results by:

  • Salience threshold: Entities must have salience > 0.05 (or > 0.01 with knowledge graph metadata)
  • Entity types: PERSON, LOCATION, ORGANIZATION, EVENT (for the entity-specific enricher)
  • Name constraints: Maximum 50 characters, maximum 3 spaces
  • Knowledge graph: Entities with Wikipedia/Freebase metadata are prioritized
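
The filtering rules above translate into a short predicate. The sketch below assumes each entity is a dict with `name`, `type`, `salience`, and `metadata` keys, loosely modeled on Google Natural Language entity fields; the exact shape used inside the pipeline is not documented here. Treating a non-empty `metadata` map as "has knowledge graph metadata" and sorting such entities first are also interpretations of the list above.

```python
ENTITY_TYPES = {"PERSON", "LOCATION", "ORGANIZATION", "EVENT"}


def keep_entity(entity, typed_only=False):
    """Apply the salience, name, and (optionally) type constraints."""
    has_kg = bool(entity.get("metadata"))
    # Lower threshold when knowledge graph metadata is present.
    threshold = 0.01 if has_kg else 0.05
    if entity["salience"] <= threshold:
        return False

    name = entity["name"]
    if len(name) > 50 or name.count(" ") > 3:
        return False

    # The entity-specific enricher keeps only typed entities with metadata.
    if typed_only and (entity.get("type") not in ENTITY_TYPES or not has_kg):
        return False
    return True


def filter_entities(entities, typed_only=False):
    """Filter entities, then list those with metadata first, by salience."""
    kept = [e for e in entities if keep_entity(e, typed_only)]
    return sorted(kept,
                  key=lambda e: (bool(e.get("metadata")), e["salience"]),
                  reverse=True)


entities = [
    {"name": "Ada Lovelace", "type": "PERSON", "salience": 0.03,
     "metadata": {"wikipedia_url": "https://en.wikipedia.org/wiki/Ada_Lovelace"}},
    {"name": "engine", "type": "OTHER", "salience": 0.20, "metadata": {}},
    {"name": "footnote", "type": "OTHER", "salience": 0.03, "metadata": {}},
    {"name": "a very long phrase that has too many spaces", "type": "OTHER",
     "salience": 0.50, "metadata": {}},
]

print([e["name"] for e in filter_entities(entities)])
print([e["name"] for e in filter_entities(entities, typed_only=True)])
```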

Configuration

Content enrichment behavior can be configured through account settings:

| Setting | Type | Description |
| --- | --- | --- |
| enrich_content | boolean | Enable/disable content enrichment |
| content_crawl_delay | int | Delay between HTTP requests, in milliseconds (minimum 1000, i.e. 1 second) |
| content_respect_directives | string | How to handle robots.txt: always, robots_txt, or never |
| content_domain_allowlist | string[] | Only enrich content from these domains |
| content_domain_blocklist | string[] | Never enrich content from these domains |
| content_path_allowlist | string[] | Only enrich URLs matching these paths |
| content_path_blocklist | string[] | Exclude URLs matching these paths |
| content_topic_allowlist | string[] | Only include these topics in results |
| content_topic_blocklist | string[] | Exclude these topics from results |
| content_max_topics | int | Maximum number of topics per document |
| content_boosted_attributes | string[] | HTML fields to prioritize in text extraction |
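
To make the allowlist and blocklist settings concrete, the sketch below shows one plausible way they could combine. Several details are assumptions rather than documented behavior: a blocklist entry wins over an allowlist entry, a domain entry also matches its subdomains, an empty allowlist means "allow everything", and topics are ranked by score before content_max_topics is applied.

```python
from urllib.parse import urlsplit


def host_matches(host, domain):
    """True when host equals the domain or is one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def is_url_allowed(url, settings):
    """Check a URL against the domain allowlist and blocklist."""
    host = (urlsplit(url).hostname or "").lower()
    blocklist = settings.get("content_domain_blocklist", [])
    allowlist = settings.get("content_domain_allowlist", [])
    if any(host_matches(host, d) for d in blocklist):
        return False
    return not allowlist or any(host_matches(host, d) for d in allowlist)


def filter_topics(topics, settings):
    """Apply topic allow/block lists, then keep the top-scoring topics."""
    allow = set(settings.get("content_topic_allowlist", []))
    block = set(settings.get("content_topic_blocklist", []))
    kept = {t: s for t, s in topics.items()
            if t not in block and (not allow or t in allow)}
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    limit = settings.get("content_max_topics")
    return dict(ranked[:limit] if limit else ranked)


settings = {
    "content_domain_allowlist": ["example.com"],
    "content_domain_blocklist": ["ads.example.com"],
    "content_topic_blocklist": ["Cloud Computing"],
    "content_max_topics": 2,
}

print(is_url_allowed("https://blog.example.com/post", settings))
print(filter_topics({"Technology": 0.85, "Artificial Intelligence": 0.72,
                     "Cloud Computing": 0.65, "Robotics": 0.4}, settings))
```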

Input Sources

The enrichment pipeline processes content from various input streams:

  • Web URLs: Standard web page crawling that respects robots.txt
  • Email content: From Mailchimp, Dotdigital, SendGrid, Campaign Monitor, or MessageCentral streams
  • Contentful webhooks: CMS content with rich text support
  • Spotify: Music track metadata from Spotify URIs
  • Direct text/JSON: Via the API with text or json parameters

URL Processing

  • URLs are normalized (scheme, trailing slashes)
  • Redirects and canonical URLs are tracked
  • Duplicate content is detected via bloom filter (2M entries, 1% false positive rate)
  • Crawl delays from robots.txt are respected (minimum 1 second)
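
The normalization and duplicate-detection steps above can be sketched as follows. The exact normalization rules are not spelled out in this reference, so defaulting to https, lowercasing the host, and dropping fragments are assumptions. The Bloom filter uses the standard sizing formulas; for 2M entries at a 1% false positive rate they give roughly 19.2 million bits (about 2.4 MB) and 7 hash functions. The SHA-256 double-hashing scheme is an illustrative choice, not the service's.

```python
import hashlib
import math
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop trailing slashes and fragments."""
    if "://" not in url:
        url = "https://" + url  # assumed default scheme
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def bloom_parameters(capacity, error_rate):
    """Standard optimal bit count m and hash count k for a Bloom filter."""
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


class BloomFilter:
    def __init__(self, capacity, error_rate):
        self.size, self.hash_count = bloom_parameters(capacity, error_rate)
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(capacity=2_000_000, error_rate=0.01)
seen.add(normalize_url("HTTPS://Example.com/Article/"))
print(normalize_url("example.com/Article") in seen)
print(bloom_parameters(2_000_000, 0.01))
```

Normalizing before insertion is what lets two spellings of the same address count as one entry. A Bloom filter never reports a previously added URL as unseen, but it can occasionally report an unseen URL as a duplicate, which is the 1% rate quoted above.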