Binary file compare engine?

Postby kimmov » Tue Jun 09, 2009 10:41 am

This was inspired by a discussion with Matthias in patch item #2802436, "switch to quick content for binarys".

Matthias's original idea was to switch to the quick compare engine for binary files. I don't see much sense in it: as I said in the patch item, the full contents engine actually compares binary files faster than quick compare (it was news to me too; I only realized it after looking at the code more closely). The second fact is that we compare bigger files with the quick compare engine anyway. So the switch from full contents to quick compare would have happened only for small binary files, where it has no speed advantage.

But this got me thinking about how we compare binary files. In the quick compare engine we waste time handling EOL bytes, whitespace ignoring and some other data, none of which has any meaning when comparing binary files. So I think the best way to speed up comparing binary files is to add a new binary file compare engine.

Features / behavior of the new engine are:
  • tight byte-by-byte compare: no EOL bytes, no whitespace, no text statistics or whatever
  • compare data in small blocks (at first; we can tune block size and strategy later)
  • optimized only for binary data
  • stop the compare when the first different byte is found (this could be a huge time saver; we may need to compare only a few bytes out of gigabytes of data)
  • no filtering (naturally)
  • no upper limit on file size (64-bit indexes etc.)
To be able to use this engine we'd need a pre-compare binary file check. If the file is determined to be a binary file, we use this new binary file compare engine. Otherwise we use the selected compare engine.
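A possible shape for that pre-compare check, again only as a C++17 sketch. The post does not say how a file is "determined to be binary", so the NUL-byte heuristic here is an assumption, as are the names `LooksBinary`, `ChooseEngine` and `CompareEngine` and the choice to treat the pair as binary when either side looks binary.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical engine identifiers for this sketch.
enum class CompareEngine { FullContents, QuickCompare, Binary };

// Assumed heuristic: a NUL byte within the first probeSize bytes means "binary".
// Note that UTF-16 text also contains NUL bytes, so a real check would have to
// look at the detected encoding first.
inline bool LooksBinary(std::string_view head, std::size_t probeSize = 4096)
{
    const std::string_view probe = head.substr(0, probeSize);
    return probe.find('\0') != std::string_view::npos;
}

// Pick the binary engine when either side looks binary (an assumption);
// otherwise keep the engine the user selected.
inline CompareEngine ChooseEngine(std::string_view leftHead,
                                  std::string_view rightHead,
                                  CompareEngine selected)
{
    if (LooksBinary(leftHead) || LooksBinary(rightHead))
        return CompareEngine::Binary;
    return selected;
}
```

Here `leftHead` and `rightHead` stand for the first bytes read from each file; only a small prefix is inspected, so the check stays cheap compared to the compare itself.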
kimmov
 
Posts: 562
Joined: Thu Sep 11, 2008 8:51 pm
Location: Finland

Re: Binary file compare engine?

Postby matthias1955 » Tue Jun 09, 2009 9:04 pm

Yes, I fully agree. That's the correct way to speed it up.
The original diffutils_compare_files() read the full file into RAM, so what happens if we are short of memory?
Windows swaps data to the hard disk, and that takes time.

We could use the same starting point as CompareFiles(FileLocation *location), i.e. CompareBinaryFiles(FileLocation *location).
So the files are opened once and compared fast.
The data block should be something like 512 x 2 x 2..., so about 256-512 kB could be good.
That way we always compare full hard disk data records, and the block is not too big, so we have no memory problem if we run several instances at the same time.
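The "512 x 2 x 2..." arithmetic can be written out as a small C++ sketch. The 512-byte record size comes from the post; `kSectorSize` and `BlockSize` are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Disk record size assumed in the post.
constexpr std::size_t kSectorSize = 512;

// Block size after doubling the record size `doublings` times:
// 512 x 2 x 2 x ... = 512 * 2^doublings.
constexpr std::size_t BlockSize(unsigned doublings)
{
    return kSectorSize << doublings;
}

// 9 doublings give 256 kB and 10 doublings give 512 kB,
// the range suggested above.
static_assert(BlockSize(9) == 256 * 1024, "512 * 2^9 is 256 kB");
static_assert(BlockSize(10) == 512 * 1024, "512 * 2^10 is 512 kB");
```

Any block size built this way is a whole multiple of the record size, so every read covers complete records.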
matthias1955
 
Posts: 162
Joined: Wed Dec 17, 2008 1:55 pm

Re: Binary file compare engine?

Postby kimmov » Tue Jun 09, 2009 9:14 pm

matthias1955 wrote: Yes, I fully agree. That's the correct way to speed it up.
The original diffutils_compare_files() read the full file into RAM, so what happens if we are short of memory?

The diffutils compare does not compare files larger than 4 MB (unless you change the limit in the registry).

Re: Binary file compare engine?

Postby matthias1955 » Thu Jun 11, 2009 9:08 am

kimmov wrote:
files larger than 4 MB (unless you change the limit in the registry).

One more reason to start :)

As discussed in bug item ID 2802248, we should start with the FileInfo class.
Same item or a new thread?

Re: Binary file compare engine?

Postby kimmov » Thu Jun 11, 2009 10:16 am

matthias1955 wrote:
kimmov wrote:
files larger than 4 MB (unless you change the limit in the registry).

One more reason to start :)

I don't understand?

matthias1955 wrote: As discussed in bug item ID 2802248, we should start with the FileInfo class.
Same item or a new thread?

I think refactoring fileinfo and the new binary file compare engine are two independent tasks, so the order does not matter. What matters is that it gets done right this time, so I'd like to see some plans before any code gets submitted as a patch, as I may require rewriting the whole thing if I don't like it. And I'm serious: we must get the fileinfo structs correct this time, not just "works" once again. A lot of thought should go into how we structure the file info, since many kinds of information are needed in different places/stages.

Designing this is not a trivial task. But the current structs can at least be used as a starting point for the info we need. And while we are designing, the future must be considered too: what info might we need later? So it must be possible to expand the structs later.

And absolutely new tracker items.

Re: Binary file compare engine?

Postby kimmov » Fri Jun 12, 2009 5:58 pm

Hmm. Looking at the current code, perhaps the first step should be refactoring the FileLocation structure. It is used in a somewhat odd way in the compare engines, so refactoring it first means we don't need to refactor this new engine again.

Re: Binary file compare engine?

Postby matthias1955 » Fri Jun 12, 2009 8:37 pm

That's really important. We can use it everywhere.
For example, in multiformatText we ask for the encoding twice, which means we open and close the file each time.
The same goes for unicode. So in future we can replace some code by using one class, which also makes it easier to maintain.

I think we should have

pathname /* original */
encoding /* FileTextEncoding, including binaries */
stat /* File status from fstat() */
desc /* File descriptor */
handle /* File handle */

Re: Binary file compare engine?

Postby kimmov » Fri Jun 12, 2009 9:13 pm

matthias1955 wrote: That's really important. We can use it everywhere.

That is not a good idea. It would mean we handle information we don't need in several places. Different parts of the code need different file information. It would be interesting to list current file info structures and see if the info needed can be organized into some nice class hierarchy.

matthias1955 wrote: So in future we can replace some code by using one class, which also makes it easier to maintain.

See above. Easy maintenance is a good thing, but if it means structure sizes double and we have inconsistent/duplicated information, it is not good.

matthias1955 wrote: pathname /* original */
encoding /* FileTextEncoding, including binaries */
stat /* File status from fstat() */
desc /* File descriptor */
handle /* File handle */

We don't use stat in folder compare, so a stat struct would be a waste of space there. A file descriptor and a file handle (of the same file) in the same struct sounds like a bad idea. Which one should be used, and where? Can both be open at the same time? For reading? You would need pretty complex rules for handling that kind of structure. So it won't help us.

Based on the above I already doubt it can be one struct. But perhaps a hierarchy with a nice base class and a few derived classes could fit better. But again, we need to see what info is needed and where.
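To make the base class idea concrete, here is a deliberately small C++ sketch. Every type and member name in it is hypothetical; which fields really belong where is exactly the open question above, so treat this as an illustration of the shape, not a proposal for the actual structs.

```cpp
#include <cstdint>
#include <string>

// All type and member names below are hypothetical.

// Information every user needs: just the file's identity.
struct FileInfoBase
{
    std::string path;
};

// Folder compare item: carries no stat data, since folder compare
// does not use it.
struct FolderCompareFileInfo : FileInfoBase
{
    bool isDirectory = false; // illustrative member only
};

// Stand-in for the fstat() data a content compare might want.
struct FileStatus
{
    std::uint64_t size = 0;
    std::int64_t modifiedTime = 0;
};

enum class Encoding { Unknown, Ansi, Utf8, Ucs2, Binary };

// Content compare item: adds encoding and status on top of the base.
struct ContentCompareFileInfo : FileInfoBase
{
    Encoding encoding = Encoding::Unknown;
    FileStatus status;
};

// Code that only needs the path can accept any of the derived types.
inline const std::string& PathOf(const FileInfoBase& info)
{
    return info.path;
}
```

With this split the folder compare pays only for the members it uses, while code that needs nothing but the path works on the base type.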

Re: Binary file compare engine?

Postby matthias1955 » Sat Jun 13, 2009 7:31 pm

inconsistent/doubled information it is not good.

That's clear. We take desc /* File descriptor */, as the handle can easily be obtained from it.

I thought we should first just collect what we need in the different places, then create a good hierarchy ...

Re: Binary file compare engine?

Postby kimmov » Sat Jun 13, 2009 9:13 pm

Different parts of the code use different APIs and different kinds of handles and/or descriptors. You simply cannot convert all code to use just a single type, as not everything is possible with a single API. So it is best to leave the whole descriptor/handle out of the struct. We should avoid keeping files open any longer than we must. So usually keeping a handle to an open file is just the wrong thing to do. A handle's lifetime is (should be) very short, while the filename and other file info often won't change during the WinMerge process lifetime.
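A short C++ sketch of that lifetime split, using standard streams as a stand-in for whichever API a given part of the code needs. `FileRef` and `FileSizeOrMinusOne` are hypothetical names; the point is only that the long-lived struct holds no handle and the file is open just for the duration of one call.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Long-lived information: no descriptor or handle is stored here.
struct FileRef
{
    std::string path;
};

// The file is opened only inside this function. The std::ifstream
// destructor closes it when the function returns, on every path.
inline std::int64_t FileSizeOrMinusOne(const FileRef& file)
{
    std::ifstream in(file.path, std::ios::binary | std::ios::ate);
    if (!in)
        return -1; // could not open the file
    const std::streamoff end = in.tellg(); // opened at end, so this is the size
    return static_cast<std::int64_t>(end);
}
```

Each caller can open the file with the API it needs, and nothing has to decide which of several stored handles is currently valid.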
