Data Crossroads #33
Meta releases Code Llama 70B model
Today we’re releasing Code Llama 70B: a new, more performant version of our LLM for code generation — available under the same license as previous Code Llama models. Download the models: https://bit.ly/3Oil6bQ
• CodeLlama-70B
• CodeLlama-70B-Python
• CodeLlama-70B-Instruct
Models that generate code are especially interesting because they can help automate complex workflows. As a first step, they should be able to read a human-maintained spec and build complex software systems of every kind from it. Eventually, they might even become self-modifying, evolving into better and better models without human intervention.
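For readers who want to poke at the release, here is a minimal sketch of loading the instruct variant with Hugging Face transformers. The checkpoint name, prompt, and generation settings are my assumptions rather than details from the announcement, and a 70B model needs multiple GPUs or quantization to run at all.

```python
# Minimal sketch, not from the announcement: the checkpoint name below is an
# assumption, and a 70B model requires several GPUs or quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-70b-Instruct-hf"  # assumed Hub name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory
    device_map="auto",          # shard across available GPUs
)

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The instruct model also expects a specific chat template for best results; the plain prompt above is just the simplest possible smoke test.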
Lumiere: A Space-Time Diffusion Model for Video Generation
From the folks at Google Research and others:
We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
From logos to pamphlets to intro videos and eventually to movies, it’s all coming.
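To make the space-time idea concrete, here is a toy PyTorch sketch of joint temporal and spatial down- and up-sampling, which is what distinguishes this design from keyframe-plus-interpolation pipelines. It illustrates the concept only; it is not Lumiere's architecture, and the layer shapes are made up for the example.

```python
# Toy illustration of joint space-time down/up-sampling (NOT Lumiere's code):
# a single 3D conv block that halves, then restores, both the temporal and
# spatial resolution of a clip, so the whole video is processed in one pass.
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # stride 2 in time, height and width: the clip is compressed jointly
        # in space *and* time instead of synthesizing sparse keyframes and
        # interpolating frames afterwards.
        self.down = nn.Conv3d(channels, channels * 2, kernel_size=3, stride=2, padding=1)
        self.up = nn.ConvTranspose3d(channels * 2, channels, kernel_size=4, stride=2, padding=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        h = self.down(video)  # coarse space-time representation
        return self.up(h)     # back to full frame rate and resolution

clip = torch.randn(1, 8, 16, 64, 64)  # 16 frames of 64x64 features
print(SpaceTimeBlock(8)(clip).shape)  # torch.Size([1, 8, 16, 64, 64])
```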
Taking AI beyond chat conversations: Google puts AI to work with Search ads
Shashi Thakur writing on the Google blog:
Generative AI can empower advertisers from streamlining campaign creation to increasing the effectiveness of ads as the consumer Search experience evolves. Last year, we introduced a new era of AI-powered ads along with a commitment to ensuring advertisers have the opportunity to reach potential customers along their search journeys. Today, we’re sharing an update on our progress.
Such deep integrations should become more and more common this year. For another example of real-world investment in AI applications, also read: Publicis to invest 300 mln euros in AI plan over next three years
DeepSeek AI releases DeepSeek Coder
DeepSeek Coder comprises a series of code language models trained from scratch on 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2T tokens. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a repo-level code corpus with a window size of 16K and an extra fill-in-the-blank task, resulting in foundational models (DeepSeek-Coder-Base). We further fine-tune the base model with 2B tokens of instruction data to get instruction-tuned models, named DeepSeek-Coder-Instruct.
[…]
We evaluate DeepSeek Coder on various coding-related benchmarks. The results show that DeepSeek-Coder-Base-33B significantly outperforms existing open-source code LLMs. Compared with CodeLlama-34B, it leads by 7.9%, 9.3%, 10.8% and 5.9% respectively on HumanEval Python, HumanEval Multilingual, MBPP and DS-1000. Surprisingly, our DeepSeek-Coder-Base-7B reaches the performance of CodeLlama-34B. And after instruction tuning, the DeepSeek-Coder-Instruct-33B model outperforms GPT-3.5-turbo on HumanEval and achieves comparable results to GPT-3.5-turbo on MBPP.
One more model that’s good at code generation.
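The "fill-in-the-blank" objective mentioned above is usually implemented as fill-in-the-middle training: a file is split into prefix, middle, and suffix, and the model learns to produce the missing middle given the rest. The sketch below illustrates the idea with placeholder sentinel strings; they are not DeepSeek Coder's actual special tokens.

```python
# Generic illustration of a fill-in-the-middle training example.
# The sentinel strings are placeholders, not DeepSeek Coder's real tokens.
PREFIX_TOK, SUFFIX_TOK, MIDDLE_TOK = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(code: str, hole_start: int, hole_end: int) -> tuple[str, str]:
    """Split a source file into prefix/middle/suffix and build the model
    input; the model is trained to generate the missing middle."""
    prefix, middle, suffix = code[:hole_start], code[hole_start:hole_end], code[hole_end:]
    model_input = f"{PREFIX_TOK}{prefix}{SUFFIX_TOK}{suffix}{MIDDLE_TOK}"
    return model_input, middle  # (input, target)

src = "def add(a, b):\n    return a + b\n"
inp, target = make_fim_example(src, hole_start=15, hole_end=32)
print(inp)
print("target:", target)
```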
MobileDiffusion: Rapid text-to-image generation on-device
Yang Zhao and Tingbo Hou on the Google Research blog:
Text-to-image diffusion models have shown exceptional capabilities in generating high-quality images from text prompts. However, leading models feature billions of parameters and are consequently expensive to run, requiring powerful desktops or servers (e.g., Stable Diffusion, DALL·E, and Imagen). While recent advancements in inference solutions on Android via MediaPipe and iOS via Core ML have been made in the past year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.
On-device models are a big win for privacy.
LLaVA-1.6: Improved reasoning, OCR, and world knowledge
Today, we are thrilled to present LLaVA-1.6, with improved reasoning, OCR, and world knowledge. LLaVA-1.6 even exceeds Gemini Pro on several benchmarks.
Compared with LLaVA-1.5, LLaVA-1.6 has several improvements:
• Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, and 1344x336 resolution.
• Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
• Better visual conversation for more scenarios, covering different applications.
• Better world knowledge and logical reasoning.
• Efficient deployment and inference with SGLang.
LLaVA is important for keeping multi-modal models open source.
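For anyone who wants to try it, here is a minimal sketch using the LlavaNext classes in recent Hugging Face transformers releases. The checkpoint name, prompt template, and example image are assumptions on my part, not details from the LLaVA-1.6 release notes.

```python
# Minimal sketch, assuming the community "llava-hf" conversion on the Hub and
# a recent transformers release with the LlavaNext classes; the checkpoint
# name, prompt format and image file are illustrative assumptions.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint name
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("receipt.png")  # any local image, e.g. a scanned receipt
prompt = "[INST] <image>\nWhat is the total amount on this receipt? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```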

