KF Math Thread - Discuss Math

My problem is defining the fragmentation function. I've considered using the entropy of the free and used regions of the tape, a measure involving the minimum and maximum free sizes on the tape, and so on.
Well, for linear storage devices, fragmentation makes you seek over unrelated data, and this impacts performance. So, a function to minimize could be the sum of non-d_i bytes in the span of data item d_i, taken over all data items on the device, since this is the total "fragmentation" you would have to seek over when reading every file from, e.g., a disc.
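For concreteness, here's a minimal sketch of that cost in Python, assuming each data item is just the set of byte offsets it occupies (the names here are made up for illustration):

```python
def seek_cost(items):
    """Sum, over all data items, of the non-item bytes inside each
    item's span (first to last occupied offset, inclusive).

    `items` maps an item name to the set of byte offsets it occupies;
    a perfectly contiguous item contributes 0.
    """
    total = 0
    for offsets in items.values():
        span = max(offsets) - min(offsets) + 1  # extent of the item on the device
        total += span - len(offsets)            # foreign bytes you seek over
    return total

# Two items interleaved on a 6-byte device: each has 2 foreign bytes
# inside its span, so the total cost is 4.
print(seek_cost({"A": {0, 2, 4}, "B": {1, 3, 5}}))  # 4
```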

jemalloc does essentially what you've outlined, though it allocates bins of fixed power-of-two sizes in advance (and expands each such region dynamically).

The transformation you mentioned, if left arbitrary, would just be whatever function makes φ(x) = 0; but since the idea started by modeling a physical device, there should probably be a cost to each transformation mimicking real disc copies. For example, you could count changed bytes or minimal block sizes.

The mathematical way to solve this would probably be to define a functional like F(τ) = Σ[d(t) − (τd)(t) + φ(τd)] and find where its Gâteaux/Fréchet derivative is zero (or its differential; the two are mostly equivalent here). This looks like a discrete (second-order) Euler–Lagrange problem with no derivative terms. Maybe the bit-difference part could be written as Bolza-type boundary conditions.
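If it helps, here's my guess at a cleaner writing of that functional; the absolute value and the weight λ are my additions, so that both terms are genuine costs:

```latex
F(\tau) \;=\; \sum_{t=1}^{n} \bigl|\, d(t) - (\tau d)(t) \,\bigr| \;+\; \lambda\, \varphi(\tau d)
```

The first term counts how much data the transformation actually rewrites, the second the fragmentation left afterwards, and λ trades one against the other.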

The CS way to solve it is probably to run something like A* with heuristics favoring denser packing, or any other optimizing algorithm (GA, PSO, ant colony).
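Not A* proper, but as a stand-in for "any other optimizing algorithm", a greedy pairwise-swap hill climb is about the simplest sketch (helper names are hypothetical, toy scale only):

```python
import itertools

def frag(layout):
    """Total gap blocks between consecutive blocks of each item,
    given a layout as a list of item labels (one per block)."""
    positions = {}
    for idx, label in enumerate(layout):
        positions.setdefault(label, []).append(idx)
    return sum(b - a - 1 for p in positions.values() for a, b in zip(p, p[1:]))

def hill_climb(layout, cost, max_rounds=100):
    """Greedy local search: in each round, apply the single pairwise swap
    that lowers `cost` the most; stop at a local optimum."""
    layout = list(layout)
    current = cost(layout)
    for _ in range(max_rounds):
        best = None
        for i, j in itertools.combinations(range(len(layout)), 2):
            layout[i], layout[j] = layout[j], layout[i]
            c = cost(layout)
            layout[i], layout[j] = layout[j], layout[i]  # undo the trial swap
            if c < current and (best is None or c < best[0]):
                best = (c, i, j)
        if best is None:
            break  # no swap improves the layout
        current, i, j = best
        layout[i], layout[j] = layout[j], layout[i]
    return layout

print(hill_climb(list("ABABAB"), frag))  # ['A', 'A', 'A', 'B', 'B', 'B']
```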

Hope it helps stir some thoughts :D
 
I've given the problem some more thought (enough thought to lose edit permission on my previous post), and I've managed to reformulate it much more nicely.

Firstly, the defragmentation transform τ really only shuffles around the data blocks (otherwise, we'd end up losing existing data). If we write the data on the disc/tape as a function:
d: [1, n]∩ℕ → B
Then τ is a bijection (a permutation) on Dom(d), which means it's an element of the symmetric group S_n.
This lets us write the cost of the defragmentation more simply, as either the total distance we move the blocks (e.g. Σ_{k=1}^{n} |τ(k) − k|) or the number of blocks that change place (e.g. |{k ∈ {1, ..., n} : τ(k) ≠ k}|).
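Both costs are one-liners if we represent τ as a list where tau[k] is the new index of block k (0-based here, purely for illustration):

```python
def move_distance(tau):
    """Total distance blocks travel: sum of |tau(k) - k|."""
    return sum(abs(t - k) for k, t in enumerate(tau))

def moved_count(tau):
    """Number of blocks that end up somewhere new: |{k : tau(k) != k}|."""
    return sum(1 for k, t in enumerate(tau) if t != k)

tau = [2, 0, 1, 3]         # a permutation of {0, 1, 2, 3}
print(move_distance(tau))  # |2-0| + |0-1| + |1-2| + |3-3| = 4
print(moved_count(tau))    # 3
```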

Lastly, if we have a set of data items written to disc/tape, D = { D_i }, with each data item being a family of indices corresponding to the locations of its data blocks in storage (i.e. D_i = { i_1, i_2, ..., i_m } ⊆ Dom(d), with the D_i pairwise disjoint), then we can write the fragmentation of a single data item D_i as:

For the indices of D_i listed in ascending order, i_1 < i_2 < ... < i_m:
Σ_{k=1}^{m-1} ( i_{k+1} − i_k − 1 )

This counts all the foreign blocks sitting in the gaps between consecutive blocks of data item D_i.
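In code, with `indices` standing in for one D_i (a hypothetical name), that's just:

```python
def item_fragmentation(indices):
    """Fragmentation of one data item D_i: the number of foreign blocks
    between its consecutive blocks, i.e. the sum of i_{k+1} - i_k - 1
    over its indices in ascending order."""
    idx = sorted(indices)
    return sum(b - a - 1 for a, b in zip(idx, idx[1:]))

print(item_fragmentation({3, 7, 8, 12}))  # (7-3-1) + (8-7-1) + (12-8-1) = 6
```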

There are a couple of cool things about this approach. For one, S_n is pretty well researched, so there's probably some cool combinatorial/algebraic method I'm unaware of. Second, we've completely avoided evaluating d(k), replacing it with just the size of the disc/tape. And lastly, if we make our defrag cost look at data item indices, we only have to work with one type of object (which may or may not be easier).

This sounds like combinatorial optimisation, something like minimum linear arrangement, but I'm no expert in optimisation, much less discrete optimisation.
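For what it's worth, the whole reformulation does fit in a brute force over S_n, feasible only for toy sizes but handy for sanity-checking heuristics (the helper names and the 0.1 weight are mine):

```python
from itertools import permutations

def gaps(indices):
    """Sum of (i_{k+1} - i_k - 1) over one item's ascending indices."""
    idx = sorted(indices)
    return sum(b - a - 1 for a, b in zip(idx, idx[1:]))

def best_defrag(items, n, move_weight=0.1):
    """Exhaustive search over S_n (only feasible for toy n) for the
    permutation minimizing total fragmentation plus a weighted
    move-distance cost."""
    def objective(tau):
        frag = sum(gaps({tau[j] for j in blocks}) for blocks in items.values())
        move = sum(abs(t - k) for k, t in enumerate(tau))
        return frag + move_weight * move
    return min(permutations(range(n)), key=objective)

items = {"A": {0, 2, 4}, "B": {1, 3, 5}}
print(best_defrag(items, 6))  # a tau that packs each item contiguously
```

The move_weight knob is the same trade-off as the cost term above: how much shuffling you're willing to pay for a denser layout.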
 
This is the one thread I've seen on here after all these years where I find myself wishing the Farms had embedded LaTeX support.
 