Repair of Corrupted Document Files

Integrated Framework for Damaged Document Reconstruction

PDF and Microsoft Office document formats are widely used as digital records across administrative, legal, academic, and business domains owing to their ubiquity and reliability. They are increasingly collected as evidence in digital forensic investigations. Therefore, the ability to repair corrupted documents is crucial in forensic investigations, as losing data contained in them could result in the loss of critical evidence.

In this study, we propose a novel PDF (Portable Document Format) repair framework that automatically reconstructs object relationships and draws on a pre-constructed font database, enabling effective repair even when embedded fonts or Unicode mappings are missing. We evaluated the framework on 1,000 multilingual PDF files covering ten real-world corruption scenarios, and it consistently outperformed existing tools. Our dataset and proof-of-concept tool are available at: repdf.site.

Microsoft Office documents are stored in two primary formats: the legacy CFBF (Compound File Binary Format), which organizes data hierarchically as storages and streams, and the modern OOXML (Office Open XML), a ZIP-based container that holds multiple XML and media components. For damaged CFBF files, our technique reconstructs the FAT (File Allocation Table) chain to recover data streams; for OOXML files, it extracts valid XML and media components, rebuilds their relationships, and repackages them into valid Office documents.
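
As a rough illustration of the FAT-chain step for CFBF, the sketch below follows a sector chain and stops gracefully when the chain is damaged. It assumes the FAT has already been parsed into a list of integers. The helper names `follow_fat_chain` and `read_stream` are hypothetical and do not come from the tool described above. The end-of-chain marker and the rule that sector n starts at offset (n + 1) × sector size are taken from the CFBF format itself.

```python
ENDOFCHAIN = 0xFFFFFFFE  # CFBF marker: last sector of a chain
FREESECT = 0xFFFFFFFF    # CFBF marker: unallocated sector


def follow_fat_chain(fat, start):
    """Return the sector numbers reachable from `start`.

    Stops early (keeping what was collected) if the chain points
    outside the FAT or loops back on itself, both signs of damage.
    """
    chain = []
    seen = set()
    cur = start
    while cur != ENDOFCHAIN:
        if cur >= len(fat) or cur in seen:
            break  # damaged chain: out-of-range entry or cycle
        seen.add(cur)
        chain.append(cur)
        cur = fat[cur]
    return chain


def read_stream(data, fat, start, sector_size=512):
    """Concatenate the sectors of one chain (hypothetical helper).

    Sector n begins at (n + 1) * sector_size because the file header
    occupies the first sector-sized block of the file.
    """
    out = bytearray()
    for sec in follow_fat_chain(fat, start):
        off = (sec + 1) * sector_size
        out += data[off:off + sector_size]
    return bytes(out)
```

A real repair would also have to rebuild the FAT itself when it is missing; this sketch only shows the chain walk once a FAT is available.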

#Corrupted File Repair #Electronic Document #PDF #MS Office #Reference Data
TBD

Detection of Malicious Document Files

Structural and Component-Level Analysis for Advanced Detection of Document-Based Malware

In the digital era, document files—especially Microsoft Office and PDF formats—have become indispensable tools for information exchange. However, their widespread use and complex structures also make them major vectors for malware attacks, where embedded macros, scripts, and URLs are frequently exploited to deliver malicious payloads.
This study aims to develop advanced detection techniques for document-based malware through an in-depth analysis of document authoring tools, internal file structures, and the relationships among their components. We examine how major authoring software such as Microsoft Office, LibreOffice, Hancom Office, and Adobe Acrobat generate and convert documents, and construct a large-scale benign dataset comprising diverse elements including text, images, OLE objects, and macros. Based on the structural patterns observed in benign files, we identify and validate potential malicious indicators—such as obfuscated macros, encoded scripts, and suspicious URLs—using both static and dynamic analysis. Furthermore, we analyze inter-component relationships, such as image embedding structures and script execution flows, to distinguish normal from malicious behaviors and propose new detection rules accordingly.
The findings of this research are expected to enhance the precision and reliability of document malware detection, minimize false positives and negatives, and contribute to strengthening overall document security and digital forensics research.

#Document-based Malware #MS Office #PDF #Malicious Pattern Analysis

Analysis and Reconstruction of Data Fragments

Multimedia Data

An in-depth forensic examination of video files edited by Apple Photos

Uncover the hidden frames!
With the widespread availability of mobile and desktop video-editing tools, it has become increasingly feasible for individuals to alter digital evidence in ways that serve their interests. On Apple iOS and macOS platforms, the native Photos application stands out for its ability to edit videos without re-encoding them, leaving behind traces of manipulation such as metadata changes and unreferenced frames. Although many video players and commercial forensic tools overlook these meaningful artifacts, they can be crucial for revealing malicious editing behavior by a suspect. In this paper, we explore how the Photos application can be used to manipulate video files for potentially adversarial purposes and examine its impact on the underlying file structure. We then propose and implement detection methods that cover operations such as trimming, cropping, and rotation to identify these manipulations and recover any residual unreferenced frames. By testing various devices and operating system versions, we demonstrate the broad applicability of our approach, showing that between 1 and 245 unreferenced frames can be recovered. As a result, our research provides the forensic community with robust methods for classifying suspicious video files, identifying their editing techniques, and extracting residual data that can be valuable as evidence.
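
To make the notion of "underlying file structure" concrete, here is a minimal sketch that lists the top-level boxes of an ISO base media file (MP4/MOV-style container). It is not the detection method described above. It only shows the size/type header layout that any such analysis starts from. The function name `iter_boxes` is hypothetical.

```python
import struct


def iter_boxes(data: bytes, start=0, end=None):
    """Yield (type, offset, size) for each box between start and end.

    Each box begins with a 4-byte big-endian size and a 4-byte type.
    size == 1 means a 64-bit size follows; size == 0 means the box
    extends to the end of the enclosing range.
    """
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            if pos + 16 > end:
                break  # truncated 64-bit size field
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:
            size = end - pos
        if size < header or pos + size > end:
            break  # malformed or truncated box: stop walking
        yield box_type.decode("latin-1"), pos, size
        pos += size
```

With the box offsets in hand, one could compare the byte ranges referenced by the sample tables against the media data box to look for payload that no sample points to. That comparison is left out here.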

#Multimedia forensics #Video tampering #Apple devices #AVC #HEVC

SQLCipher-Encrypted Data

The widespread use of SQLite for sensitive data storage has introduced page-level encryption to ensure confidentiality, but this mechanism complicates forensic analysis by concealing structural patterns essential for data recovery. Traditional carving and metadata-based approaches fail to restore encrypted or deleted records, and prior studies have largely focused on key acquisition rather than data reconstruction. To address this limitation, this study proposes a systematic recovery framework that identifies high-entropy clusters containing encrypted Write-Ahead Logging (WAL) fragments, merges distributed chunks, and reconstructs lost content from unallocated space. Experimental validation demonstrates the successful decryption and restoration of thousands of valid records, which are automatically extracted into analyzable CSV files, bridging the gap between encryption-enabled confidentiality and practical forensic accessibility.
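
The first step, identifying high-entropy clusters, can be sketched as follows. This is a simplified illustration, assuming Shannon entropy over byte values, a 4096-byte cluster size, and a threshold of 7.9 bits per byte. The text above does not specify any of these choices, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not block:
        return 0.0
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in Counter(block).values())


def high_entropy_clusters(image: bytes, cluster_size=4096, threshold=7.9):
    """Return offsets of full clusters whose entropy meets the threshold.

    cluster_size and threshold are illustrative assumptions; encrypted
    data tends to score close to 8 bits per byte.
    """
    hits = []
    for off in range(0, len(image), cluster_size):
        cluster = image[off:off + cluster_size]
        if len(cluster) == cluster_size and shannon_entropy(cluster) >= threshold:
            hits.append(off)
    return hits
```

Entropy alone cannot tell encrypted WAL fragments apart from compressed media, so a filter like this only narrows the candidate set for later stages.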

#Unallocated Area Data Carving #Encrypted SQLite #SQLite WAL journal #Data Recovery
TBD

Multimedia Forensics

Structure-based Forgery Detection

Metadata-based audio file authenticity analysis framework: Galaxy ecosystem as a study

We talked about this! Don’t you remember?!
Verifying the authenticity of audio recordings is difficult when files are lightly edited, re-encoded, or moved across devices in ways that keep waveforms plausible while altering provenance. This paper presents a metadata-centered framework that complements signal-level detectors. We catalog on-device media artifacts, focusing on Android MediaStore records such as creation and modification times, acquisition times, application package provenance, and bitrates, and relate them to ISOBMFF fields. We combine these sources with application and filesystem traces to reconstruct timelines through preparation, acquisition, and examination phases. Controlled case studies cover genuine Android recordings and two tampered scenarios involving smartphone to smartwatch and smartphone to Windows PC transfers with trimming and copy-back insertion. Patterns such as synchronized timestamp resets, unexpected package provenance, and bitrate shifts reveal edits even when audio sounds natural. The approach is limited by device-specific schemas and access constraints but offers a reproducible, low-overhead basis for authenticating everyday audio evidence.
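
As one example of an ISOBMFF field that can be compared against MediaStore timestamps, the sketch below decodes the creation and modification times from a movie header (`mvhd`) box payload. It assumes the payload has already been located and extracted. The function name `parse_mvhd_times` is hypothetical and is not part of the framework described above. The 1904-01-01 UTC epoch and the version-dependent field widths follow the ISO base media file format.

```python
import struct
from datetime import datetime, timedelta, timezone

# ISOBMFF timestamps count seconds since 1904-01-01 00:00:00 UTC.
MP4_EPOCH = datetime(1904, 1, 1, tzinfo=timezone.utc)


def parse_mvhd_times(payload: bytes):
    """Return (creation, modification) datetimes from an mvhd payload.

    The payload starts with 1 byte of version and 3 bytes of flags.
    Version 1 stores 64-bit times; version 0 stores 32-bit times.
    """
    version = payload[0]
    if version == 1:
        creation, modification = struct.unpack_from(">QQ", payload, 4)
    else:
        creation, modification = struct.unpack_from(">II", payload, 4)
    return (MP4_EPOCH + timedelta(seconds=creation),
            MP4_EPOCH + timedelta(seconds=modification))
```

A mismatch between these container times and the database record for the same file would be one candidate signal of the kind described above, subject to time-zone and device-specific behavior.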

#Audio authenticity #Android MediaStore #Multimedia forensics #Galaxy devices

Development of DFIR Infrastructures

Software Hash Database

In digital forensics, irrelevant information refers to files that are not related to the case, such as executable files and libraries used by the operating system and applications, resource files, and additional resource files generated during software operation. Since these files do not serve as meaningful evidence in investigations, they are classified as irrelevant information. Among traditional approaches to excluding irrelevant information, the most widely used technique is hash-based filtering. In this study, we propose a Software Reference Data Warehouse that expands upon the conventional software reference database concept. While existing software reference databases primarily store hash values of known files, the proposed Software Reference Data Warehouse also collects additional metadata from Windows, Android, iOS, and Linux systems, including file attribute information, software-related information, and behavioral information of software.
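
Conventional hash-based filtering, the baseline that the warehouse extends, can be sketched in a few lines. The example assumes SHA-256 and an in-memory set of known hashes. The function names `sha256_of` and `filter_unknown` are hypothetical, and a production reference set would normally live in a database rather than a Python set.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def filter_unknown(paths, known_hashes):
    """Keep only files whose hash is NOT in the known-software set."""
    return [p for p in paths if sha256_of(p) not in known_hashes]
```

Files that survive the filter are the ones an examiner still has to review. Files whose content changes during software operation never match a fixed hash, which is one motivation for collecting metadata beyond hashes.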

#Irrelevant Information #Hash #Metadata #Windows #Android #iOS #Linux

Forensic Tool Testing

TBD

Multi-purpose Synthetic Datasets

Data recovery is an essential aspect of digital forensic science and practice. It includes methods for restoring partitioning schemes or file systems (Metadata Recovery), recovering files using various residual metadata within the file system (File Recovery), and reassembling data based on file format characteristics (File Carving). Among these, file carving presents the most challenges and limitations. Moreover, there is a lack of studies examining the applicability of data recovery techniques in real-world environments and whether similar recovery performance can be consistently achieved.
To overcome these issues, this project proposes a methodology for designing new datasets to support the advancement of file carving techniques. By statistically analyzing the characteristics of the files stored on a number of storage devices used in real-world environments, we identify the factors necessary for an objective performance evaluation of file carving algorithms and reflect them in the dataset design process.

#Data Recovery #File Carving #Datasets #Tool Testing

Similarity Comparison

Design Drawing Files

Guarding Engineering Design Drawings: Detecting Data Leaks via Structural Similarity in AutoCAD and OrCAD Files

As industrial environments become increasingly digitalized, engineering drawings and related technical files have emerged as key assets representing an organization’s core technologies. However, leaks of such high-value design data are growing more frequent, threatening both corporate competitiveness and national technological security.
In this study, we aim to detect technology leaks by analyzing the structural characteristics of engineering design files. Targeting AutoCAD and OrCAD drawings, we examine their internal file structures to identify distinctive features and develop a similarity-based comparison algorithm. Using a custom dataset built from real and generated drawings, the proposed method demonstrates strong effectiveness and practicality. The results highlight its potential for protecting industrial technology and supporting digital forensic investigations.

#Design Drawing File Forensics #Structural Similarity Analysis #Data Leak Detection

MS Office

Layout-based similar document search through representative image creation: MS PowerPoint as a case study

Find my twin.
As technology advances, work environments have become fully digitized, and most work now relies on digital documents. For digital forensic investigators who must quickly select the documents relevant to a case, the sheer volume of digital documents makes investigations difficult. In eDiscovery in particular, it is important to find meaningful digital evidence by analyzing associations among many documents and files within a limited time. In a digital forensic investigation, identifying document types and selecting documents of a similar type from a large collection makes it possible, for example, to group only the documents created by a specific organization.
In this paper, we present a method for generating a single representative image for a document from its per-page images (one image is stored for each page), and a method for searching for similar documents by comparing image hashes of these representative images. Experiments on about 50,000 Microsoft PowerPoint files from the Govdocs1 dataset and about 6,000 Microsoft PowerPoint files from the NapierOne dataset demonstrate the practicality of the method.
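
The comparison step can be illustrated with a simple average hash and Hamming distance. This is a sketch under stated assumptions: the text does not say which image hash is used, so average hash is a stand-in, and the input is assumed to be a representative image already reduced to a small grayscale grid. Both function names are hypothetical.

```python
def average_hash(pixels):
    """Return an integer hash: one bit per pixel, set if above the mean.

    `pixels` is a 2D list of grayscale values, e.g. an 8x8 thumbnail.
    """
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes (0 means identical)."""
    return bin(a ^ b).count("1")
```

Two slide decks built from the same template would be expected to yield representative images with a small Hamming distance, while unrelated layouts would differ in many bits. The cut-off that separates the two has to be chosen empirically.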

#Document File Forensics #Document Layout Analysis #Image Hash #Image Similarity