I am currently a lead research scientist at NAVER AI Lab, where my focus lies in machine learning, multi-modal learning (e.g., vision-language, language-audio, and audio-visual), and computer vision. At NAVER, my primary research goal is to develop machine learning models that generalize to challenging yet practical scenarios. Prior to joining NAVER, I was a research engineer at KAKAO Corp from 2016 to 2018, where my work focused on recommendation systems and machine learning applications.
A primary challenge in ensuring the real-world applicability of machine learning (ML) models is the ability to generalize effectively to unseen scenarios beyond the training phase. Three scenarios are frequently encountered in practice: (1) when input data significantly differs from the training data; (2) when the model must handle target behaviors beyond the scope of its training targets, such as unexplored labels; and (3) when the application requires human opinions or subjective value judgments. Addressing all three scenarios relies on more than massive datasets; it demands the inclusion of human knowledge that extends beyond web-crawled content. Yet the question remains: how can we effectively integrate large-scale training and human knowledge guidance? To answer this question, my research aims to develop large-scale ML models with greater controllability and interpretability, thereby enabling human intervention to guide model behavior, even beyond the training phase. My work revolves around three main research themes towards this goal: Language-combined Representation Learning, Machine learning reliability, and Optimization techniques for large-scale ML.
A more detailed statement can be found in my research statement.
Language-combined Representation Learning. Language is the most natural medium for encoding human knowledge. If an ML model can comprehend human language alongside the target modality, we can understand the model better by intervening in its representation space with human language. However, since language descriptions are the product of conscious choices about which key concepts to report from the input data, language-combined representation learning methods often suffer from multiplicity (the many-to-many problem) between modalities. My recent works address this problem through probabilistic representation learning: an input is mapped to a probability distribution rather than a deterministic vector. This approach enhances the interpretability of datasets and user controllability. Furthermore, I am keen on establishing a robust evaluation framework for vision-language models in terms of their multiplicity and robustness.
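The probabilistic-embedding idea above can be sketched in a few lines. This is a minimal illustration, not the exact formulation of any specific paper: each input is embedded as a diagonal Gaussian, and the match probability between two inputs is estimated by Monte Carlo sampling; the function name, the sigmoid-of-distance score, and the parameters `a`, `b` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def match_probability(mu_a, sigma_a, mu_b, sigma_b, n_samples=64, a=1.0, b=0.0):
    """Monte Carlo estimate of the match probability between two inputs
    embedded as Gaussians N(mu, diag(sigma^2)).  A smaller expected sample
    distance yields a higher match probability."""
    za = mu_a + sigma_a * rng.standard_normal((n_samples, mu_a.size))
    zb = mu_b + sigma_b * rng.standard_normal((n_samples, mu_b.size))
    d = np.linalg.norm(za - zb, axis=-1)               # per-sample distances
    return float(np.mean(1.0 / (1.0 + np.exp(a * d + b))))  # mean sigmoid(-(a*d+b))
```

Besides a match score, the learned variance `sigma` acts as an uncertainty estimate: ambiguous inputs (e.g., a caption matching many images) can be mapped to wider distributions.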
How can we make a model comprehend human language alongside the target modality? To answer this question, I have recently worked on text-conditioned diffusion models. In particular, I am interested in harnessing recent diffusion models for text-conditioned feature transforms and data augmentation. However, adapting diffusion models to the desired tasks requires more versatility and controllability, e.g., localized conditions via region masks. My recent works have focused on the versatility and controllability of diffusion models, and on applying them to non-generative downstream tasks, such as composed image retrieval (CIR).
Machine learning reliability. Existing machine learning models cannot understand the problem itself [Shortcut learning tutorial]. This causes many real-world problems, such as discrimination by machines and poor generalizability to unseen (or minority) corruptions, environments, or groups. Current state-of-the-art machines only "predict" rather than perform logical reasoning. As models prefer to learn shortcuts [WCST-ML], training models as usual will lead to biased models. One of my research interests is to investigate these phenomena with various tools.
If it is difficult to make machines understand the problem itself, what can we do? Our models should not learn undesirable shortcut features [ReBias] [StyleAugment], and should be robust to unseen corruptions [CutMix] [RegEval] [ReLabel] [PiT] and significant distribution shifts [SWAD] [MIRO]. We also need machines that do not discriminate against certain demographic groups [CGL] [FairDRO]. We expect a model to say "I don't know" when it receives unexpected inputs [PCME] [PCME++]. At the very least, we expect a model to explain why it makes a given decision [MTSA] [MTSA WS] [WSOL eval] [WSOL Eval journal], how different model design choices change its decisions [NetSim], and how it can be fixed (e.g., more data collection? more annotations? filtering?). My research focuses on expanding machine knowledge from "just prediction" to "logical reasoning". In particular, my recent research has concentrated on various generalization downstream tasks, such as de-biasing, domain generalization, algorithmic fairness, and adversarial robustness.
Correct and fair evaluation is crucial for research progress. However, existing evaluation protocols and metrics often lack the reliability to measure whether machines learn proper knowledge. I have also been actively engaged in addressing this issue by developing fair evaluation benchmarks and metrics.
Optimization techniques for large-scale ML. Last but not least, I have actively worked on developing general optimization techniques for large-scale machine learning models, including data augmentation, optimizers, network architectures, and objective functions. My research emphasizes two key objectives: empirical impact and theoretical soundness. In particular, I aim to develop easy-to-use techniques that function seamlessly as plug-and-play solutions.
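As an example of such a plug-and-play augmentation, here is a minimal NumPy sketch in the spirit of CutMix (cited above): a random patch from one image is pasted into another, and the labels are mixed in proportion to the patch area. This is a simplified illustration under my own assumptions (function signature, `(H, W, C)` image layout, one-hot labels), not a reproduction of any official implementation.

```python
import numpy as np

def cutmix(x_a, y_a, x_b, y_b, alpha=1.0, rng=None):
    """CutMix-style augmentation: paste a random patch from image B into
    image A and mix the one-hot labels by the actual patch-area ratio.
    Images are (H, W, C) arrays; labels are one-hot vectors."""
    rng = rng or np.random.default_rng()
    h, w = x_a.shape[:2]
    lam = rng.beta(alpha, alpha)                     # target mixing ratio
    cut_h = int(h * np.sqrt(1 - lam))                # patch size so that
    cut_w = int(w * np.sqrt(1 - lam))                # area ratio ~= 1 - lam
    cy, cx = rng.integers(h), rng.integers(w)        # random patch centre
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = x_a.copy()
    mixed[y1:y2, x1:x2] = x_b[y1:y2, x1:x2]          # paste the patch
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)        # adjust to clipped area
    return mixed, lam * y_a + (1 - lam) * y_b
```

The appeal of this family of techniques is exactly the plug-and-play property emphasized above: it touches only the input batch and labels, so it drops into any training loop without changing the model or the loss.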
Finally, I have also worked on domain-specific optimization techniques that exploit properties of the given data, e.g., the compositionality of Korean/Chinese characters, low- and high-frequency information for better audio understanding, or harmonic information for multi-source audio understanding.
(C: peer-reviewed conference, W: peer-reviewed workshop, A: arxiv preprint, O: others)
(❋authors contributed equally)
See also my Google Scholar.
Topics: Reliable ML · Vision-Language · Modality-specific tasks · Generative models · Other topics
Distributed on Hangul Day (한글날) 2019, [Full font list]
Deployed in Jan. 2019
Feb. 2016 - Feb. 2018
Deployed in 2017
Deployed in 2017
Aug. 2015 - Dec. 2015
Jun. 2012 - Jan. 2013