JoanC

DLV file format

4 posts in this topic

Hi there,

I have a few years worth of .DLV files and I would like to do some serious data mining on them. I want to extract all the data into a database and do all sorts of analysis. One way to do it would be opening each file, one by one, exporting the data to CSV (twice per file, actually, because there are more than ten variables) and then import all the .csv files into my database. But since I have more than one thousand .dlv files, I'd rather take a more direct approach and parse the files myself with a custom-made program.

I have done some analysis, and I have seen that the timestamps are stored in Unix format (4 bytes), and the actual data I need is 2-byte integers. Still, the file has some oddities and I can't interpret the headers at all. I would like to get my hands on some documentation about the file format. I don't need a user manual, it would be more of a programmer's guide. Does this exist?

Joan

Share this post


Link to post
Share on other sites

Hi Joan,

No the DLV file is binary format, and also a compound file stream which is not documented anywhere and not normally easy to reverse engineer.

Can you use CX-Supervisor itself !  There is a script command called "ExportLog" which can save the CSV file,  and all items not just 10 at a time!  You could even use OpenFile to select the filename you want, or maybe loop round your thousands of files automatically.

I think you have to specify the file in series by number insead of date so you would have to fiddle a bit, but its a starting clue and sounds like you're clever enough to run with it !

Take a look but give an update if you need more help.

Regards,

BB

Share this post


Link to post
Share on other sites

Well, it is indeed binary, but I have done some research and I just found out it's readable with the right tools.

The format is actually Composite Document File V2, and there are some tools that can read that. The file has several streams, one per item, Then each stream is a binary file with a fixed record format, easily readable with a C program.

I am using oletools to extract the streams but exploring oledump.py, I'll keep you posted.

I will look into ExportLog, too, thank you very much, Berti.

Share this post


Link to post
Share on other sites

OK, I have a winner. As Berti says, maybe I could do it with scripts from CX-Supervisor itself. The problem is that I would need a complete installation of CX-Supervisor to do that, and then I should learn the scripting language. Plus, the data extraction is just the start. I need to do a lot more processing once I have got the data, and I want to do it in Python. So, the integration will be easier if the whole thing is done in Python.

As I said, DLVs are in the "Composite Document File V2" format. The file contains several so-called streams, each of which is one of the variables you can plot in Data Log Viewer. I first used oletools to examine the data. It's a set of small programs that can read these files. One of them is olebrowse, which gives you a simple GUI that allows you to read the file, see its structure and extract streams from it. However, oletools is oriented to computer security analysis, so it's not the best tool for me.

Then you have oledump.py, which fits the job perfectly. It's a tool written in Python by Didier Stevens and it seems to be in the public domain. Thanks, Didier, if you ever read this.

You really need only two of the many features of this tool:

python3 oledump.py -i file.dlv

gives you a listing of the streams (variables) contained in the file, with the index number, stream size, and name.

python3 oledump.py -s<stream #> -d file.dlv

dumps stream #7 as a binary file to standard output, so if you want to dump stream #7, e.g., you have to do:

python3 oledump.py -s7 -d file.dlv > variable7.bin

Now, this is a binary file, with big-endian integers. Your mileage may vary here. In my case, I have found that each record has:

  1. a single byte
  2. a long integer (4 bytes), which is the date and time in POSIX format
  3. a short integer (2 bytes), meaning unknown
  4. another long integer, which is the value measured.

I have written a small piece of Python code that reads the file.

import struct
import binascii
from datetime import datetime, timezone, timedelta

filename = "s7.bin"

# This is the record format
values = '< c L H L' # < big-endian
                     # c one char (unknown)
                     # L long integer (date)
                     # H short integer (unknown)
                     # L long integer (value)

struct_len = struct.calcsize(values)    # record size
struct_unpack = struct.Struct(values).unpack_from # shorthand, really

with open(filename, "rb") as f:
    while True:
        data = f.read(struct_len)   # read each record
        if not data:
            break
        record = struct_unpack(data)    # unpack the record
        # Now record[1] is a Unix timestamp with the date
        # and record[3] is the value
        # record[0] and record[2] may be padding, or timezone related (?)

        # this is one way to print the time with timezone=UTC+1
        print(datetime.fromtimestamp(record[1],tz=timezone(timedelta(hours=1))).strftime('%d-%m-%Y %H:%M:%S0'))
        print(record[3])

The whole thing is really fast.

Of course, use at your own risk, I'm not implying that this may be of use to anybody, and so on.

 

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now