A selection of the most important recent news, articles, and papers about AI.
General News, Articles, and Analyses
NVLM: Open Frontier-Class Multimodal LLMs – NVIDIA ADLR
https://research.nvidia.com/labs/adlr/NVLM-1/
Authors: Wenliang Dai; Nayeon Lee; Boxin Wang; Zhuolin Yang; Zihan Liu; Jon Barker; Tuomas Rintamaki; Mohammad Shoeybi; Bryan Catanzaro; and Wei Ping
(Tuesday, September 17, 2024) “We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, after multimodal training, NVLM 1.0 shows improved accuracy on text-only tasks over its LLM backbone. We are open-sourcing the model weights and training code in Megatron-Core for the community.”
Managing the Two Faces of Generative AI
https://cisr.mit.edu/publication/2024_0901_GenAI_VanderMeulenWixom
Authors: Nick van der Meulen and Barbara H. Wixom
(Thursday, September 19, 2024) “As generative AI (GenAI) becomes more prevalent, organizations are implementing it in two distinct ways: as broadly applicable tools to enhance individual productivity, and as tailored solutions to achieve strategic business objectives. Based on a series of three consecutive virtual roundtable discussions with data and technology executives on the MIT CISR Data Research Advisory Board, this briefing describes both approaches and highlights their unique challenges and management principles for success.”
Hacking Generative AI for Fun and Profit | WIRED
https://www.wired.com/story/sundai-club-generative-ai-hackathon-group/
Author: Will Knight
(Wednesday, October 2, 2024) “The Sundai Club meets once a month with a goal of pushing the limits of generative AI. Earlier this year, its members built me a handy tool for journalists.”
New algorithms open possibilities for training AI models on analog chips
https://research.ibm.com/blog/analog-in-memory-training-algorithms
(Thursday, October 3, 2024) “When analog chips are used for language models, their physical properties limit them to inference. But IBM Research scientists are working on several new algorithms that equip these energy efficient processors to train models.”
OpenAI has secured a $157 billion valuation. Now comes the hard part.
(Thursday, October 3, 2024) “OpenAI just closed the most lucrative funding round in Silicon Valley history. Now comes the hard part: emerging victorious in a fiercely competitive AI industry. Though Sam Altman’s company cemented its status as a frontrunner in the generative AI boom this week, securing a new $157 billion valuation after raising $6.6 billion of fresh capital from marquee investors, its leading position is hardly guaranteed.”
Technical Papers, Articles, and Preprints
Fast and robust analog in-memory deep neural network training | Nature Communications
Authors: Rasch, Malte J.; Carta, Fabio; Fagbohungbe, Omobayode; and Gokmen, Tayfun
(Tuesday, August 20, 2024) “Analog in-memory computing is a promising future technology for efficiently accelerating deep learning networks. While using in-memory computing to accelerate the inference phase has been studied extensively, accelerating the training phase has received less attention, despite its arguably much larger compute demand to accelerate. While some analog in-memory training algorithms have been suggested, they either invoke significant amount of auxiliary digital compute—accumulating the gradient in digital floating point precision, limiting the potential speed-up—or suffer from the need for near perfectly programming reference conductance values to establish an algorithmic zero point. Here, we propose two improved algorithms for in-memory training, that retain the same fast runtime complexity while resolving the requirement of a precise zero point. We further investigate the limits of the algorithms in terms of conductance noise, symmetry, retention, and endurance which narrow down possible device material choices adequate for fast and robust in-memory deep neural network training. Analog in-memory computing recent hardware implementations focused mainly on accelerating inference deployment. In this work, to improve the training process, the authors propose algorithms for supervised training of deep neural networks on analog in-memory AI accelerator hardware.”
[2409.11402] NVLM: Open Frontier-Class Multimodal LLMs
https://arxiv.org/abs/2409.11402
Authors: Dai, Wenliang; Lee, Nayeon; Wang, Boxin; Yang, Zhuolin; Liu, Zihan; Barker, Jon; Rintamaki, Tuomas; Shoeybi, Mohammad; Catanzaro, Bryan; and Ping, Wei
(Tuesday, September 17, 2024) “We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.”