AI-assisted Forensics
Multimodal Data Analysis
SERENA
SERENA is an AI-driven forensic system designed to reconstruct digital traces from A2P (application-to-person) messages. It combines large language models with prompt engineering to extract and organize legally meaningful information automatically. SERENA identifies and visualizes user behaviors with high accuracy, overcoming the limits of traditional forensic tools. Experiments with real-world data show 94–100% precision and recall, improving the efficiency and reliability of digital investigations.
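The extraction step can be pictured with a minimal sketch along the following lines; the prompt, model name, and output schema are illustrative assumptions for an OpenAI-compatible chat API, not SERENA's actual design.

```python
# Minimal sketch: prompt-driven extraction of structured fields from an A2P message.
# Prompt wording, model name, and schema are illustrative only, not SERENA's prompts.
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

EXTRACTION_PROMPT = """You are a digital-forensics assistant.
From the A2P message below, extract a JSON object with the keys:
sender, service, event_type, timestamp, amount. Use null for missing fields.

Message:
{message}
"""

def extract_trace(message: str) -> dict:
    """Ask the LLM to turn one A2P message into a structured forensic record."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model name
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(message=message)}],
        response_format={"type": "json_object"},  # request machine-parsable output
    )
    return json.loads(response.choices[0].message.content)
```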
Suspicious Conversation Detection
This study develops an AI-assisted forensic framework that detects illicit drug and cryptocurrency transactions hidden within messenger conversations. By combining named entity recognition (NER), semantic retrieval (ChromaDB), and retrieval-augmented generation (RAG) with LLM reasoning, the system interprets slang and contextual intent beyond simple keyword matching. The framework enhances the accuracy, transparency, and efficiency of digital investigations through context-aware evidence analysis.
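A rough sketch of the retrieval step is shown below, using ChromaDB's Python client with its default embedder; the collection name, sample messages, and query are illustrative, and the downstream RAG prompt and LLM call are omitted.

```python
# Minimal sketch of the retrieval step: index chat messages in ChromaDB and pull
# semantically similar context for a RAG prompt. Names and messages are illustrative.
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("chat_messages")

# Index messenger messages (embeddings are computed by Chroma's default embedder).
collection.add(
    ids=["msg-001", "msg-002"],
    documents=["got the ice, same spot tonight", "send 0.05 to the usual wallet"],
    metadatas=[{"speaker": "A"}, {"speaker": "B"}],
)

# Retrieve context for a suspected drug/crypto transaction before LLM reasoning.
hits = collection.query(query_texts=["drug sale paid in cryptocurrency"], n_results=2)
context = "\n".join(hits["documents"][0])
# `context` would then be inserted into the RAG prompt for the LLM to assess intent.
```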
Source Identification
Video
Video source identification using machine learning: A case study of 16 instant messaging applications
Due to the rise in cybercrimes involving illegal video sharing, identifying the source application of video files has become crucial. Traditional methods focused on device identification, but information is often lost when videos are shared via Instant Messaging Applications (IMAs) like Telegram, which re-encode the files.
This paper proposes a machine learning-based methodology to classify the source application by extracting features from the storage format and internal metadata of video files. Analyzing 16 widely used IMAs, the researchers achieved an identification accuracy of approximately 99.96% using the ExtraTrees model. The team also developed and released an open-source tool based on this method.
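The classification stage can be sketched roughly as follows, assuming a feature matrix has already been extracted from each file's container structure and internal metadata; the file names, feature layout, and hyperparameters are placeholders, not the paper's exact setup.

```python
# Minimal sketch of the classification stage for video source identification.
# Feature extraction is assumed to have happened already; inputs are placeholders.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: one row per video file (e.g., container atom counts, ordering flags, codec fields)
# y: label = source instant messaging application
X, y = np.load("video_features.npy"), np.load("video_labels.npy")  # placeholder files

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

clf = ExtraTreesClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```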
Audio
Audio source identification using machine learning: An empirical study of voice messages generated by 12 instant messaging applications
Smartphones and instant messaging applications (IMAs) have made voice messages a fast and intuitive alternative to text. In criminal cases, these messages can be key digital evidence. To be admissible in court, authenticity is typically established by identifying the originating device or software. Earlier studies on audio source identification focus on built‑in recorders, VoIP calls, or dedicated audio devices, and give little attention to voice messages created in real IMA environments. This research presents a new method to identify the source IMA of a voice message. The method builds one feature vector by combining statistical features from three layers: (1) the structure of the ISOBMFF container, (2) stream properties of the AAC codec, and (3) compressed‑data statistics. A voting ensemble classifier reached 98.6% accuracy.
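A minimal sketch of the voting-ensemble stage might look like the following, assuming the three feature groups are already concatenated into one vector per voice message; the base learners and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a voting ensemble over the combined feature vector
# (ISOBMFF container + AAC stream properties + compressed-data statistics).
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.load("voice_features.npy")   # placeholder: concatenated three-layer features
y = np.load("voice_labels.npy")     # placeholder: source IMA labels

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=200, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities across base learners
)
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```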
Forensics on AI-related Systems
Deepfake Apps and Services
Deepfake technology, while advancing rapidly, has been increasingly exploited in crimes such as digital sex offenses, fraud, and political manipulation. Existing research has mainly focused on detecting fake content but struggles to prove who created it or how. To address these limitations, this study proposes an analysis methodology that traces user-specific artifacts generated during the operation of deepfake creation tools. The proposed framework automatically searches for newly emerging deepfake tools and enables systematic differential analysis under standardized criteria, despite the diverse manipulation methods and execution characteristics of each tool. Case studies on web-based and local deepfake creation tools in the Windows environment demonstrate the framework's capability for structured and reproducible analysis. As a result, the proposed analytical framework can strengthen legal accountability for deepfake misuse and provide practical support for digital forensic investigations.
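The differential-analysis idea can be illustrated with a minimal sketch that snapshots a directory tree before and after a tool runs and reports created, deleted, and modified files; the paths and hash-based comparison are assumptions for illustration, not the framework's exact procedure.

```python
# Minimal sketch of differential artifact analysis: snapshot a directory tree before
# and after running a deepfake tool, then report files that changed. Illustrative only.
import hashlib
from pathlib import Path

def snapshot(root: Path) -> dict[str, str]:
    """Map every file under `root` to a SHA-256 digest of its contents."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

def diff(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as created, deleted, or modified between two snapshots."""
    return {
        "created":  sorted(set(after) - set(before)),
        "deleted":  sorted(set(before) - set(after)),
        "modified": sorted(p for p in before.keys() & after.keys() if before[p] != after[p]),
    }

# Usage: take snapshot(Path("C:/Users/target/AppData")) before and after tool execution,
# then diff the two states to isolate tool- and user-specific artifacts.
```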
Local LLM Applications
A wide variety of applications have been developed to simplify the use of Large Language Models (LLMs), raising the importance of systematically analyzing their forensic artifacts. This study proposes a structured framework for LLM application environments, categorizing applications into backend runtime, client interface, and integrated platform components. Through experimental analysis of representative applications, we identify and classify artifacts such as chat records, uploaded files, generated files, and model setup histories. These artifacts provide valuable insight into user behavior and intent. For instance, LLM-generated files can serve as direct evidence in criminal investigations, particularly in cases involving the creation or distribution of illicit media, such as CSAM. The structured environment model further enables investigators to anticipate artifacts even in applications not directly analyzed. This study lays a foundational methodology for LLM application forensics, offering practical guidance for forensic investigations. To support practical adoption and reproducibility, we also release LangurTrace, an open-source tool that automates the collection and analysis of these artifacts.
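For illustration, artifact collection from such an application might be sketched as follows, assuming chat history is stored as JSON files under a per-user directory; the directory layout, file names, and fields are hypothetical and do not describe LangurTrace's actual targets.

```python
# Minimal sketch of chat-record collection for a local LLM application.
# The install path, file layout, and JSON fields are hypothetical assumptions.
import json
from pathlib import Path

APP_DATA = Path.home() / ".local-llm-app"   # hypothetical install location

def collect_chat_records() -> list[dict]:
    """Gather chat records that may reveal user prompts and generated content."""
    records = []
    for chat_file in (APP_DATA / "chats").glob("*.json"):
        data = json.loads(chat_file.read_text(encoding="utf-8"))
        records.append({
            "file": str(chat_file),
            "messages": data.get("messages", []),
            "model": data.get("model"),          # which local model produced the output
        })
    return records

if __name__ == "__main__":
    for record in collect_chat_records():
        print(record["file"], len(record["messages"]), "messages")
```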