Rotem Israeli - Research Engineer

Hugging Face · GitHub · LinkedIn
ControlNet for Diffusion Transformers šŸŽØ

ā€¢ Built a ControlNet-like module for fine-grained control over text-to-image diffusion models, extending the ControlNet-XS feedback system.

ā€¢ Evaluated against Sana's ControlNet architecture, achieving better performance across all metrics.

ā€¢ Used zero convolution layers to inject conditioning without disrupting pretrained features.

ā€¢ Implemented the architecture with efficient training, lazy data loading, and reduced memory overhead.

ControlNet Architecture
ControlNet Evaluation
Visual Question Answering šŸ”

ā€¢ Developed a Visual Question Answering (VQA) system by combining vision models, a connector for visual-text alignment, and a language model, inspired by LLaVA.

ā€¢ First trained the connector and then fine-tuned the language model using LoRA.

ā€¢ Optimized feature extraction by experimenting with and combining multiple vision models, including SigLIP, MobileCLIP, DINOv2, and EfficientSAM.

ā€¢ Enhanced visual representations through dynamic high-resolution processing with LLaVA-NeXT and the sĀ² wrapper.

ā€¢ Evaluated multiple language models (Gemma, Qwen, SmolLM, OpenELM) to improve response accuracy and system performance.

LLaVA Next
World Model Inspired by Google's Genie šŸ§ž

ā€¢ Built an efficient world model with three components: Frame Tokenizer for visual feature extraction, Latent Action Model for inferring actions, and Dynamics Model for predicting future frames.

ā€¢ Used EfficientVit for tokenizing images into discrete latents, then decoded them into continuous features with MobileStyleGAN.

ā€¢ Replaced Genie's ST-Transformer with a lightweight MLP to infer actions between frame pairs and applied quantization to latent frames.

ā€¢ Experimented with and replaced various components to enable real-time simulation, finding that a lightweight MLP performs similarly to large transformers, and working on the image level with EfficientVit and MobileStyleGAN exponentially increased speed.

Mobile Face Transformation and Manipulation App šŸ“±

ā€¢ Developed a real-time face transformation app using MobileStyleGAN, EfficientFormer, CLIP, and StyleGAN2.

ā€¢ Trained an encoder to inject facial features at various stages of the StyleGAN decoder, creating a detailed transformation pipeline optimized for CoreML, achieving 30fps on mobile devices.

ā€¢ Contributed to the app's success at the MobileXGenAI Hackathon hosted by Samsung Next.

ā€¢ Combined multiple losses from foundation models and facial feature extractors to ensure high-quality transformations.

• Implemented the training pipeline described in arxiv.org/abs/2406.10601, which reconstructs fine image details while preserving editability by using both w-latents and F-latents, enabling effective manipulation of real image attributes even in challenging cases.
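The multi-loss objective mentioned above can be sketched as a weighted sum of a pixel term and feature-similarity terms from foundation-model and identity extractors. The weights, names, and stand-in extractors below are illustrative assumptions, not the app's actual configuration:

```python
import torch
import torch.nn.functional as F

def combined_inversion_loss(pred, target, clip_model, id_model,
                            w_pix=1.0, w_clip=0.5, w_id=0.3):
    """Weighted sum of losses for face transformation: pixel reconstruction,
    CLIP-style feature similarity, and identity similarity.
    clip_model / id_model are assumed feature extractors (image -> vector)."""
    pix = F.mse_loss(pred, target)
    clip_sim = F.cosine_similarity(clip_model(pred), clip_model(target), dim=-1).mean()
    id_sim = F.cosine_similarity(id_model(pred), id_model(target), dim=-1).mean()
    return w_pix * pix + w_clip * (1 - clip_sim) + w_id * (1 - id_sim)

# Sanity check with a trivial stand-in extractor: identical images -> ~zero loss.
feat = lambda x: x.flatten(1)
img = torch.randn(2, 3, 32, 32)
loss = combined_inversion_loss(img, img, feat, feat)
```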
StyleGAN Inversion
Research Engineer at nlpearl.ai

ā€¢ Developed real-time systems to detect conversational pauses and suggest optimal starter sentences for AI agents using fine-tuned LLMs with specialized prediction heads.

ā€¢ Experimented with various architectures, including encoder-based and decoder-pretrained models, applying LoRA and multi-stage training to enhance prediction accuracy.

ā€¢ Designed a small language model (SLM) to generate task-specific tokens, enabling multi-task outputs from a single fine-tuned model for efficient real-time inference.

Research Engineer at Israeli Navy

ā€¢ Led long-term research initiatives focused on adapting foundation models, such as EnCodec and WavTokenizer, to sonar and audio data, employing multi-stage training, freezing layers, and fine-tuning with LoRA for task-specific optimizations.

ā€¢ Prioritized large-scale research and development efforts while collaborating on additional projects across the department.

ā€¢ Trained self-supervised models, including masked autoencoders, on large amounts of unlabeled audio data and spectrograms, with a focus on scaling solutions for real-world sonar applications.

ā€¢ Applied semi-supervised learning, pseudo-labeling, and mixup techniques to improve model generalization, especially with limited labeled data.

ā€¢ Developed expert ensembles and distilled them into student models, significantly improving robustness and inference efficiency in production environments.

ā€¢ Spearheaded extensive data cleaning and preprocessing workflows to address noise and inconsistencies, ensuring high data quality for critical sonar operations.

ā€¢ Utilized neural architecture search to optimize models for specific sonar and audio tasks, with a focus on performance improvements through RBF-KAN for final layers and linear layers elsewhere.

ā€¢ Integrated state-of-the-art techniques from leading research papers and Kaggle competition winners to tackle complex sonar challenges, contributing to strategic advancements in military research.