Automated Metadata Extraction and Analysis
Metadata is an often overlooked yet highly valuable source of information in OSINT investigations. It is the hidden information embedded in documents, images, and other files, including details such as the file creator, the software used, geolocation, and timestamps. Extracting and analyzing this metadata can provide crucial insights about a target, revealing sensitive or hidden data that can lead to larger discoveries.
In this guide, we'll walk through the process of using ExifTool, FOCA, and Metagoofil to extract metadata from publicly available documents and media files. We’ll also show how to automate this process, enabling you to analyze large datasets more efficiently.
Step 1: Installing the Tools
Before we begin, install the necessary tools to extract metadata.
ExifTool:
- Purpose: ExifTool is a powerful command-line application for reading, writing, and editing metadata in a wide range of file formats.
sudo apt-get install exiftool
FOCA:
- Purpose: FOCA (Fingerprinting Organizations with Collected Archives) extracts metadata and hidden information from public documents, especially formats like DOC, XLS, PDF, and JPG.
Download FOCA from the official website; note that it is a Windows application.
Metagoofil:
- Purpose: Metagoofil is a tool for extracting metadata from public documents found through search engines. It can scrape files from a target domain and retrieve the metadata embedded within.
sudo apt-get install metagoofil
Step 2: Extracting Metadata with ExifTool
ExifTool is versatile and can be used to extract metadata from images, PDFs, Word documents, and more. It works across various file types, making it a go-to tool for handling metadata extraction in bulk.
Basic Usage:
Here’s how to extract metadata from a single file using ExifTool:
exiftool example.jpg
This command displays a comprehensive list of metadata fields for example.jpg, including information such as:
- Camera make/model (for photos)
- GPS location (if embedded)
- Software used to edit the file
- Timestamps and file history
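To see not just each field but which metadata group it comes from (EXIF, GPS, IPTC, XMP, and so on), add the -a, -G1, and -s options, which allow duplicate tags and print group names and short tag names:
exiftool -a -G1 -s example.jpg
Knowing the group matters when two standards store conflicting values, such as an edited XMP date sitting alongside the original EXIF date.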
Automating Bulk Metadata Extraction:
To extract metadata from an entire directory of files, use this command:
exiftool -r /path/to/directory
The -r option makes ExifTool recurse through the directory and all of its subdirectories. You can save the output to a log file for further analysis:
exiftool -r /path/to/directory > metadata_log.txt
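Plain-text logs are easy to skim, but for downstream parsing ExifTool can also emit structured output directly via its built-in -csv and -json options:
exiftool -r -csv /path/to/directory > metadata.csv
The resulting CSV has one row per file and one column per tag, which loads cleanly into a spreadsheet or pandas.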
Filtering Specific Metadata Fields:
You may not need every metadata field for your investigation. To extract only specific fields, pass each tag name as an option:
exiftool -model -gpslatitude -gpslongitude example.jpg
This will extract only the model of the camera and the GPS coordinates.
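If you plan to feed these values into a script, the -T option prints the requested tags as tab-separated values, one line per file:
exiftool -T -filename -model -gpslatitude -gpslongitude -r /path/to/directory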
Automation Tip:
Create a cron job or scheduled task to periodically extract metadata from newly added files in a directory. This is particularly useful for continuously monitoring a target’s uploaded media or documents.
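A minimal sketch of such a job, assuming a watched directory and a timestamp file used to remember what has already been processed (all paths here are placeholders):
# Hourly: extract metadata only from files added or changed since the last run
0 * * * * find /path/to/watch -type f -newer /var/tmp/.meta_stamp -exec exiftool {} + >> /path/to/metadata_log.txt 2>&1 && touch /var/tmp/.meta_stamp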
Step 3: Using FOCA for Document Metadata Extraction
FOCA excels at extracting metadata from documents such as Word, Excel, PDFs, and presentations. It can also analyze the results to find potentially sensitive information, like server paths or document history.
How to Use FOCA:
- Open FOCA and enter the target domain (e.g., example.com).
- FOCA will automatically search for publicly available documents linked to the domain.
- After the documents are found, FOCA will download them and extract metadata.
Key Metadata Fields Extracted by FOCA:
- Author names
- Software versions
- Pathnames used during file creation
- Last modified dates
- Hidden text or comments
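FOCA itself is a Windows GUI, but you can spot-check many of the same fields from the command line with ExifTool; exact tag names vary by format, with Author, Creator, and LastModifiedBy being common in Office and PDF files:
exiftool -author -creator -lastmodifiedby -createdate -modifydate document.docx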
Automating FOCA:
FOCA doesn’t have built-in command-line automation, but you can script around it: download the target documents automatically, then feed them into FOCA for metadata analysis.
Here’s a basic Python sketch that queries Google for documents on the target domain, pulls the direct file links out of the results page, and downloads each one (Google may throttle or CAPTCHA automated queries, so a proper search API is more reliable for anything beyond light use):
import os
import re
import requests

# Target domain and document types to search for
domain = "example.com"
extensions = ['pdf', 'docx', 'xlsx']

# Create a directory to store downloaded files
os.makedirs("documents", exist_ok=True)

# Function to search for and download matching documents
def download_documents(domain, extensions):
    # A browser-like User-Agent; search engines often reject the default one
    headers = {'User-Agent': 'Mozilla/5.0'}
    for ext in extensions:
        url = f"https://www.google.com/search?q=site:{domain}+filetype:{ext}"
        response = requests.get(url, headers=headers)
        # Pull direct document links out of the results page
        links = set(re.findall(rf'https?://[^\s"&<>]+\.{ext}', response.text))
        for i, link in enumerate(links):
            doc = requests.get(link, headers=headers)
            # Save each downloaded file
            filename = f"documents/{domain}_{i}.{ext}"
            with open(filename, 'wb') as file:
                file.write(doc.content)
            print(f"Downloaded: {filename}")

# Run the download function
download_documents(domain, extensions)
Once files are downloaded, you can manually analyze them in FOCA for metadata extraction.
Step 4: Using Metagoofil for Public Document Scraping and Metadata Extraction
Metagoofil is a fantastic tool for scraping public documents from a target domain and extracting metadata automatically. It can search for documents on the web, download them, and extract metadata, making it perfect for automating large-scale metadata collection from an organization’s website.
Metagoofil Usage:
Let’s say you want to scrape and analyze all the PDF documents available on a company’s domain:
metagoofil -d example.com -t pdf -o results -f found_files.html
In this example:
- -d specifies the domain you’re targeting.
- -t defines the file type (PDF in this case).
- -o specifies the output directory.
- -f generates an HTML file listing the found files.
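Depending on the version installed, Metagoofil typically also accepts -l to limit the number of search results and -n to cap how many files are downloaded; check metagoofil -h for your build, since flags differ between releases. A broader sweep might look like:
metagoofil -d example.com -t pdf,doc,xls -l 100 -n 25 -o results -f found_files.html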
Once Metagoofil completes its task, it will generate a report with metadata such as:
- Author names
- Software versions
- Last modification dates
- Email addresses (if available)
This can provide a wealth of intelligence, especially when investigating organizational structures or identifying potential internal weaknesses.
Automating Metagoofil:
To automate the process of scraping and analyzing metadata on a scheduled basis, create a cron job that runs the Metagoofil command at regular intervals. For example, to run it daily:
0 0 * * * /usr/bin/metagoofil -d example.com -t pdf -o /path/to/output -f found_files.html
This will ensure that you are constantly gathering new files and metadata, which can be useful for tracking changes in public documents over time.
Step 5: Analyzing the Metadata for Hidden Details
Once you've extracted the metadata from documents and media files, it's time to analyze the data. Look for hidden details such as:
- Geolocation Data: GPS coordinates embedded in image files can reveal where the photo was taken.
- Document Author Names: This can reveal organizational roles or expose individuals responsible for creating sensitive documents.
- Software Version Information: Knowing the software used to create a file (e.g., Adobe Acrobat version) might give you clues about potential vulnerabilities.
- Timestamps and History: Dates of creation, last modification, and access can help you understand the timeline of events related to the document.
Automation Tip:
You can combine tools like ExifTool and FOCA with data analysis scripts to filter out the important metadata and visualize patterns, such as plotting GPS coordinates on a map or charting trends in file creation and modification times.
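As a concrete example, here is a minimal Python sketch that uses ExifTool’s -json and -n options to pull decimal GPS coordinates from every image under a directory; the resulting points list can then be handed to any mapping library (the images path is a placeholder):
import json
import subprocess

# Ask ExifTool for GPS tags as JSON; -n emits numeric (decimal) values, -r recurses
result = subprocess.run(
    ['exiftool', '-json', '-n', '-r', '-gpslatitude', '-gpslongitude', 'images'],
    capture_output=True, text=True
)
records = json.loads(result.stdout) if result.stdout.strip() else []

# Keep only files that actually carry coordinates
points = [
    (r['GPSLatitude'], r['GPSLongitude'], r['SourceFile'])
    for r in records
    if 'GPSLatitude' in r and 'GPSLongitude' in r
]

# Print the points; from here they can be plotted with folium, matplotlib, etc.
for lat, lon, src in points:
    print(f"{src}: {lat}, {lon}")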
Step 6: Reporting and Logging the Results
Once you've automated the metadata extraction process, it's important to log and report the results for further analysis. Here's an example Python script that logs metadata extracted from images using ExifTool:
import os
import subprocess
def extract_metadata(directory):
    # Create or open a log file to store the metadata
    with open('metadata_report.txt', 'a') as logfile:
        # Loop through all entries in the directory
        for filename in os.listdir(directory):
            filepath = os.path.join(directory, filename)
            # Skip subdirectories and anything else that isn't a regular file
            if not os.path.isfile(filepath):
                continue
            # Run ExifTool on each file and capture the output
            result = subprocess.run(['exiftool', filepath],
                                    capture_output=True, text=True)
            # Write the metadata to the log file
            logfile.write(f"Metadata for {filename}:\n")
            logfile.write(result.stdout)
            logfile.write("\n" + "=" * 50 + "\n")

# Run the metadata extraction on the 'images' folder
extract_metadata('images')
This script generates a report containing the metadata for each file in the images folder, appending the results to metadata_report.txt. You can modify it to handle different file types or directories as your investigation requires.
Conclusion:
Metadata can often reveal more about a file than its contents. By using tools like ExifTool, FOCA, and Metagoofil, you can extract metadata from various file formats to uncover hidden information such as geolocation, authorship, and timestamps. Automating this process using scripts and cron jobs enables you to efficiently handle large datasets and continuously monitor a target for new information.
Summary of Tools and Their Usage:
- ExifTool – A versatile tool for extracting metadata from images, PDFs, and documents.
- FOCA – Extracts metadata from public documents (DOC, PDF, XLS) linked to a specific domain.
- Metagoofil – Scrapes public documents from a target domain and extracts metadata automatically.
- Automation – Use Python scripts and cron jobs to schedule and automate metadata extraction.
By automating this process, you can stay one step ahead in your OSINT investigations, consistently gathering hidden details that could provide critical insights.