
OpenAI's DevDay 2024 unveiled four major API updates poised to reshape development with generative AI: the Realtime API, vision fine-tuning, Prompt Caching, and Model Distillation. Each opens new frontiers for AI development and applications.

Realtime API

The new Realtime API marks a significant leap forward in natural language processing, supporting seamless speech-to-speech conversations. Developers can now build applications similar to ChatGPT's advanced voice mode. Beyond fast speech-to-speech conversations, the release also brings audio input/output to the Chat Completions API, letting developers pass text, audio, or both as input.

Key features:

  • Seamless speech-to-speech conversations
  • Audio input/output for the Chat Completions API
  • Support for text, audio, or both as input

The new Realtime API works over a persistent WebSocket. Previously, you first had to transcribe audio with a model like OpenAI's Whisper and then pass the transcript to a model for inference. That approach had many drawbacks: tonality was lost, along with emphasis on certain words or a speaker's accent. Worse, it was terribly slow and required additional pipelining to build a proper, voice-enabled end-to-end system.

Additionally, we're excited to see that the Realtime API supports function calling, opening up a world of possibilities for more dynamic and responsive applications across industries. By going beyond simple voice interactions to invoking functions and changing application behavior, this feature can power more intuitive, efficient, and powerful business tools.
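As a sketch of what registering a function might look like, here is a hypothetical session.update event for a Realtime session. The event type and tool fields follow OpenAI's Realtime API documentation; the check_inventory function and its parameters are an invented example of ours, not part of the API.

```python
import json

# A hypothetical tool definition for a Realtime session. The event type
# ("session.update") comes from OpenAI's Realtime API; the tool itself
# ("check_inventory") is our own invented example.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "function",
                "name": "check_inventory",
                "description": "Look up current stock for a product SKU.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "sku": {"type": "string", "description": "Product SKU"}
                    },
                    "required": ["sku"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

# The event would be sent over the WebSocket as a JSON string.
payload = json.dumps(session_update)
```

When the model decides to call the function, the server streams back a function-call event; your application runs the function and returns the result over the same socket.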

How it works:

  • Utilizes a persistent WebSocket connection
  • Eliminates the need for separate transcription and inference steps
  • Preserves tonality, emphasis, and accent
  • Reduces latency and simplifies the development process

Initial pricing seems quite high (approximately $0.06 per minute of audio input and $0.24 per minute of audio output), although it's expected to decrease over time, as has happened with many other OpenAI products.

Pricing (as of 2024-10-04):

  • Text: $5 per million input tokens, $20 per million output tokens
  • Audio: $100 per million input tokens, $200 per million output tokens

Despite the costs, early adopters can gain a competitive edge by integrating this technology into their products. If you're eager to get started, use the gpt-4o-realtime-preview model if you're building with WebSockets. If you're testing out the new audio capabilities in the Chat Completions API, use gpt-4o-audio-preview.
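For the Chat Completions path, a request body might look like the sketch below. The field names (modalities, audio, input_audio) follow OpenAI's documentation for gpt-4o-audio-preview at the time of writing; we build the payload only, without making a network call, and the WAV bytes are a placeholder rather than a real recording.

```python
import base64
import json

# Placeholder bytes standing in for a real WAV recording.
wav_bytes = b"RIFF....WAVEfmt "
encoded = base64.b64encode(wav_bytes).decode("ascii")

# Request body for the Chat Completions API with audio in and out.
# Field names follow OpenAI's docs for gpt-4o-audio-preview.
payload = {
    "model": "gpt-4o-audio-preview",
    "modalities": ["text", "audio"],
    "audio": {"voice": "alloy", "format": "wav"},
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is being said in this clip?"},
                {
                    "type": "input_audio",
                    "input_audio": {"data": encoded, "format": "wav"},
                },
            ],
        }
    ],
}

body = json.dumps(payload)
```

The response can then include both a text transcript and synthesized audio, depending on the modalities requested.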

The OpenAI dev team also released a repository for developers to quickly set up a demonstration of the Realtime API: openai-realtime-console

Vision Fine-tuning

The ability to fine-tune a model with images empowers developers to create specialized visual AI models, unlocking innovative multi-modal applications across industries.

Best of all, you can keep using the same methods you already use to fine-tune text-only models. In the same JSONL files, images can be provided as HTTP URLs or as data URLs containing base64-encoded images. Just make sure your images are 10 MB or less, in JPEG, PNG, or WEBP format, and in RGB or RGBA mode. Don't use images featuring people, faces, children, or CAPTCHAs; OpenAI's moderation will remove those from your training dataset.

Each example can include up to 10 images:

{
  "messages": [
    { "role": "system", "content": "You are an assistant that identifies common and uncommon garden weeds." },
    { "role": "user", "content": "Can you help me identify these weeds?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/garden_weed1.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
          }
        }
      ]
    },
    { "role": "assistant", "content": "These are dandelions (Taraxacum officinale), a common weed with bright yellow flowers." }
  ]
}

Key points:

  • Uses the same methods as text-based fine-tuning
  • Supports images provided as HTTP URLs or base64-encoded data URLs
  • Image requirements: 10 MB or less; JPEG, PNG, or WEBP format; RGB or RGBA mode
  • Excludes images featuring people, faces, children, or CAPTCHAs
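A small helper along these lines could prepare local images for the JSONL file, enforcing the size limit and producing a base64 data URL. This is a sketch of our own; the function name and checks are not part of OpenAI's tooling.

```python
import base64
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # OpenAI's stated 10 MB limit per image
MIME_TYPES = {
    ".jpeg": "image/jpeg",
    ".jpg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
}

def to_data_url(path: Path) -> str:
    """Encode an image file as a base64 data URL for a fine-tuning JSONL."""
    suffix = path.suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"Unsupported format: {suffix}")
    data = path.read_bytes()
    if len(data) > MAX_BYTES:
        raise ValueError(f"{path} exceeds the 10 MB limit")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{MIME_TYPES[suffix]};base64,{encoded}"
```

The returned string can be dropped directly into the "url" field of an image_url content part in your training examples.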

Prompt Caching

Prompt caching was first introduced by Google and later added by Anthropic, so we're happy to see OpenAI introducing this, too. Prompt Caching addresses a common pain point in AI application development: the cost and latency associated with repetitive API calls. At a basic level, Prompt Caching involves storing and reusing the results of previous computations for similar or identical prompts. Instead of processing the same or similar prompts from scratch each time, the system can retrieve pre-computed results, saving time, money, and computational resources.

Caching is available on GPT-4o, GPT-4o mini, o1-preview, and o1-mini, as well as fine-tuned versions of those models. Cached input tokens are billed at a 50% discount compared to regular input tokens.

The best part? Caching is automatic on prompts with 1,024 tokens or more! The API caches the shared prefix of a prompt, including messages (system, user, and assistant), images, tools, and structured outputs.

Key benefits:

  • Reduces processing time for repeated prompts
  • Lowers costs associated with API calls
  • Conserves computational resources

How it works:

  • Automatic caching for prompts with 1024 tokens or more
  • Caches entire messages (system, user, and assistant), images, tools, and structured outputs

You can monitor caching in the responses of the Chat Completion API under usage.prompt_tokens_details.cached_tokens.
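For example, given a response body from the Chat Completions API, the cached-token count can be read as below. The response is fabricated for illustration, but the field path matches the one named above.

```python
# A fabricated Chat Completions response, trimmed to the usage section;
# the field path follows usage.prompt_tokens_details.cached_tokens.
response = {
    "usage": {
        "prompt_tokens": 2006,
        "completion_tokens": 300,
        "prompt_tokens_details": {"cached_tokens": 1920},
    }
}

usage = response["usage"]
cached = usage["prompt_tokens_details"]["cached_tokens"]
fresh = usage["prompt_tokens"] - cached

# Cached input tokens are billed at half price, so the effective
# input cost scales with fresh + 0.5 * cached tokens.
effective_input_tokens = fresh + 0.5 * cached
```

Since caching keys on prompt prefixes, putting static content (system prompt, tool definitions) first and variable content last maximizes cache hits.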

Model Distillation

The new Model Distillation feature lets you create more cost-effective models by taking the outputs of a powerful (and expensive) model like o1-preview and fine-tuning a version of GPT-4o mini with them. The entire process can be done within the OpenAI platform, giving developers even more options for balancing performance and cost with the less capable models in the OpenAI arsenal.
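The core of the workflow can be sketched as collecting teacher outputs and writing them into a fine-tuning JSONL for the student model. The prompts and answers below are fabricated; in the platform workflow, completions are captured with OpenAI's stored-completions tooling rather than assembled by hand like this.

```python
import json

# Fabricated (prompt, teacher-output) pairs standing in for completions
# collected from a strong model such as o1-preview.
teacher_outputs = [
    ("Summarize our refund policy in one sentence.",
     "Refunds are available within 30 days of purchase with proof of receipt."),
    ("Classify this ticket: 'App crashes on login.'",
     "bug"),
]

# Each pair becomes one fine-tuning example for the student model
# (e.g. GPT-4o mini), in the standard chat-format JSONL.
jsonl_lines = [
    json.dumps({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    })
    for prompt, answer in teacher_outputs
]

training_file = "\n".join(jsonl_lines)
```

The resulting file is what you would upload for fine-tuning; evals then let you check whether the distilled student holds up against the teacher on your task.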

Key advantages:

  • Creates more affordable, specialized models
  • Balances performance and cost effectively
  • Integrates evaluation tools for quality assurance

Something we like most about this is the introduction of evals to the distillation process. Finally, OpenAI is giving us all the tooling necessary to train and fine-tune models and perform important quality checks on their outputs.

Looking Ahead

These updates represent significant advancements in OpenAI's development tooling. As the technologies mature and prices eventually decrease, we can expect to see a new wave of AI applications that are more responsive, efficient, and capable than ever before.
