Data Crossroads #68
Your curator is a creature of habit and routine. Part of his routine is curating this newsletter, of course. But when travel, a change of location, and a change of the favorite keyboard this newsletter is written on throw a spanner into said routine, much gets disrupted for a few weeks. We should be back on the usual schedule now.
Anthropic introduces computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
Today, we’re announcing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet delivers across-the-board improvements over its predecessor, with particularly significant gains in coding—an area where it already led the field. Claude 3.5 Haiku matches the performance of Claude 3 Opus, our prior largest model, on many evaluations for the same cost and similar speed to the previous generation of Haiku.
We’re also introducing a groundbreaking new capability in public beta: computer use. Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental—at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.
Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun to explore these possibilities, carrying out tasks that require dozens, and sometimes even hundreds, of steps to complete. For example, Replit is using Claude 3.5 Sonnet's capabilities with computer use and UI navigation to develop a key feature that evaluates apps as they’re being built for their Replit Agent product.
[…]
With computer use, we're trying something fundamentally new. Instead of making specific tools to help Claude complete individual tasks, we're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people. Developers can use this nascent capability to automate repetitive processes, build and test software, and conduct open-ended tasks like research.
To make these general skills possible, we've built an API that allows Claude to perceive and interact with computer interfaces. Developers can integrate this API to enable Claude to translate instructions (e.g., “use data from my computer and online to fill out this form”) into computer commands (e.g. check a spreadsheet; move the cursor to open a web browser; navigate to the relevant web pages; fill out a form with the data from those pages; and so on). On OSWorld, which evaluates AI models' ability to use computers like people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category—notably better than the next-best AI system's score of 7.8%. When afforded more steps to complete the task, Claude scored 22.0%.
While we expect this capability to improve rapidly in the coming months, Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks. Because computer use may provide a new vector for more familiar threats such as spam, misinformation, or fraud, we're taking a proactive approach to promote its safe deployment. We've developed new classifiers that can identify when computer use is being used and whether harm is occurring. You can read more about the research process behind this new skill, along with further discussion of safety measures, in our post on developing computer use.
Big news. Anthropic continues to impress.
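For the curious, here is roughly what a call to the computer use beta looks like with the Anthropic Python SDK. This is a minimal sketch based on the launch documentation: the tool type, display parameters, and beta flag are as announced but may change, and the actual screenshotting, clicking, and typing happens in a loop your own code has to run.

```python
# Minimal sketch of the computer use beta via the Anthropic Python SDK.
# Tool types and the beta flag follow the launch docs and may change;
# executing the returned actions (screenshots, clicks, typing) is up to your code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",   # virtual screen/mouse/keyboard tool
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        }
    ],
    messages=[{"role": "user", "content": "Open a browser and look up today's weather."}],
    betas=["computer-use-2024-10-22"],
)

# Claude replies with tool_use blocks (e.g. take a screenshot, click at x,y);
# an agent loop executes them and feeds the results back as tool_result messages.
for block in response.content:
    print(block)
```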
Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent
The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.
I didn’t particularly feel like copying and pasting all of the numbers out one at a time, so I decided to try something different: could I record a screen capture while browsing around my Gmail account and then extract the numbers from that video using Google Gemini?
This turned out to work incredibly well.
AI Studio and QuickTime #
I recorded the video using QuickTime Player on my Mac: File -> New Screen Recording. I dragged a box around a portion of my screen containing my Gmail account, then clicked on each of the emails in turn, pausing for a couple of seconds on each one.
I uploaded the resulting file directly into Google’s AI Studio tool and prompted the following:
Turn this into a JSON array where each item has a yyyy-mm-dd date and a floating point dollar amount for that date
... and it worked. It spat out a JSON array like this:
[ { "date": "2023-01-01", "amount": 2... }, ... ]
See also: Everything I built with Claude Artifacts this week
Meta releases quantized Llama models with increased speed and a reduced memory footprint
Today, we’re releasing our first lightweight quantized Llama models that are small and performant enough to run on many popular mobile devices. At Meta, we’re uniquely positioned to provide quantized models because of access to compute resources, training data, full evaluations, and safety.
As our first quantized models in this Llama category, these instruction-tuned models apply the same quality and safety requirements as the original 1B and 3B models, while achieving 2-4x speedup. We also achieve an average reduction of 56% in model size and a 41% average reduction in memory usage compared to the original BF16 format.
We used two techniques for quantizing Llama 3.2 1B and 3B models: Quantization-Aware Training with LoRA adaptors, which prioritize accuracy, and SpinQuant, a state-of-the-art post-training quantization method that prioritizes portability.
Inferences using both quantization techniques are supported in the Llama Stack reference implementation via PyTorch’s ExecuTorch framework.
We built these quantized models in close collaboration with our industry-leading partners and are making them available on Qualcomm and MediaTek SoCs with Arm CPUs.
Small models can be super useful. Welcome, small models.
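Meta's exact pipeline (QAT with LoRA adaptors, SpinQuant, ExecuTorch export) is not reproduced here, but if quantization-aware training is new to you, the basic flow in plain PyTorch eager mode looks roughly like the sketch below. The toy model and training loop are made up purely for illustration.

```python
# Toy illustration of quantization-aware training in PyTorch eager mode.
# This is not Meta's Llama pipeline; it only shows the prepare -> train -> convert flow.
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert,
)


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()        # fake-quantizes activations during training
        self.fc1 = nn.Linear(64, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)


model = TinyNet().train()
model.qconfig = get_default_qat_qconfig("qnnpack")   # qnnpack targets Arm CPUs
qat_model = prepare_qat(model)                       # swap in fake-quant-aware modules

# Fine-tune so the weights adapt to quantization noise (random data here).
opt = torch.optim.SGD(qat_model.parameters(), lr=1e-3)
for _ in range(100):
    x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(qat_model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

qat_model.eval()
int8_model = convert(qat_model)                      # real int8 weights and kernels
```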
How we saved hundreds of engineering hours by writing tests with LLMs
John Wang writing for Assembled’s blog:
At Assembled, engineering velocity is our competitive edge. We pride ourselves on delivering new features at a fast pace. But how do we maintain quality without slowing down? The answer lies in robust testing. As Martin Fowler aptly puts it:
[Testing] can drastically reduce the number of bugs that get into production… But the biggest benefit isn't about merely avoiding production bugs, it's about the confidence that you get to make changes to the system.
Despite this, writing comprehensive tests is often overlooked due to time constraints or the complexity involved. Large Language Models (LLMs) have shifted this dynamic by making it significantly easier and faster to generate robust tests. Tasks that previously required hours can now be completed in just 5–10 minutes.
We've observed tangible benefits within our team:
An engineer who previously wrote few tests began consistently writing them after utilizing LLMs for test generation.
Another engineer, known for writing thorough tests, saved weeks of time by using LLMs to streamline the process.
Collectively, our engineers have saved hundreds of hours, reallocating that time to developing new features and refining existing ones.
In this blog post, we'll explore how we’ve used LLMs to enhance our testing practices.
Many engineers I know consider writing good tests a boring activity compared to developing code. Good thing that LLMs can’t get bored.
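The post goes into Assembled's prompts and review workflow in detail; as a flavor of the idea, here is a minimal sketch of the generate-then-review loop using the Anthropic API. The module name, prompt wording, and pytest target are assumptions rather than Assembled's actual tooling, and the generated tests still need a human read and a test run before anyone trusts them.

```python
# Sketch of LLM-assisted test generation (not Assembled's actual workflow).
# Drafts a pytest file for a module; review and run it before committing.
import pathlib
import anthropic

client = anthropic.Anthropic()

source = pathlib.Path("billing.py").read_text()   # hypothetical module under test

prompt = (
    "Write pytest unit tests for the module below. Cover edge cases and error paths, "
    "use pytest.mark.parametrize for table-driven cases, and mock any external calls.\n\n"
    f"```python\n{source}\n```"
)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}],
)

pathlib.Path("test_billing.py").write_text(message.content[0].text)
print("Draft tests written to test_billing.py; review and run `pytest` before merging.")
```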
For the extra curious
Neural Networks (MNIST inference) on the “3-cent” Microcontroller
Perplexity AI's new tool makes researching the stock market 'delightful'. Here's how
Math Is Still Catching Up to the Mysterious Genius of Srinivasa Ramanujan
Adobe’s study exploring AI’s impact on the future of art, film and wider culture.