How do you structure files found from a database?


stares at error messages

Let's say I have a hard drive of data. This data changes a little bit and its directory structure mostly stays the same. This is a question about where to draw the line between database and directory structure. I have a program and database that make assumptions about the layout of the data's directory structure. How do I keep the program's representation consistent with what's actually on the hard drive, so that I don't end up in a situation where the programs that use that database and those directories (a library for the data structure and a few executables that use it) become disconnected? For example, if I need to find the data for a key in the database, how must the directory structure be enforced to keep this consistent?

So far I've guessed that I'll need to write custom programs that manage operations on the data and the database precisely, to keep the two from becoming inconsistent. The thing I find most unclear is this: if you have custom programs for managing this data system, how should access control be done? How do I ensure that only the right programs work on the data? I've read into using Apache Jackrabbit for other projects, but not being able to use that data from the file system is a deal breaker. I think it would ultimately become too inconvenient to have to use a program to view the data instead of the filesystem. Because it would be inconvenient, something like Jackrabbit wouldn't get used and would be pointless. For this reason, I'd like to stick with the filesystem, but keeping the structure enforced seems tricky.

Part of me thinks that this data should be treated like program assets, because it is consumed by the database and the programs. But since it's dynamic (the content of files stored in the directories might be changed by programs that aren't part of the system), it seems more like data that isn't strictly part of the system (the programs and database) but that the system merely operates on. Like something in a scratch directory that might get deleted or changed by the user. Where the data's root lives also matters. This is on a Unix-like system, and the data's root will need to be on a consistent mount point for production, so probably on the LAN and mounted over NFS by clients.

This must seem like a really complicated question. If you have any thoughts on how to store filesystem locations in a database and keep the paths from being changed out from under it, I'd like to read what you think. Thanks.
 
Excessive use of symbolic links or their equivalent; start banging those out like it was porn for the IT department. Don't use protection/documentation.

how do I ensure that only the right programs work on the data?
What? Identifiers, surely; if it's a closed environment that should be easy enough. Unless you mean concurrent work on the same data set, which is trickier.

"so that I don't end up in a situation where the programs (a library for the data-structure and a few executable use it) that use that database and directories become dis-connected?"
To me that sounds like HDDs not being fast enough and requests timing out. Access times can be enormous if the wrong data is read or written. If it's a database and "small changes" means scattered changes to a large database here and there, it will most likely lead to disk thrashing. I know Exchange, at one point, could wreck computers because it tried to index everything.
 
This problem is not well-stated. I'm going to take a stab at describing the problem and you tell me if I have it wrong:

"I have a file system that can be written to by operations outside of my control. I need to be able to query the file system on behalf of services (programs) that need data from it and respond to that query in a structured format".​

Necessary additional information that would inform the solution is:
  • "Frequency of changes to the file system."
  • "Acceptable response times to the services that want to query it"
  • "Amount of data (both files and directory complexity)"
  • "How is a file identified by the service that will query it? E.g. file name, hash, all files modified within a certain period?"
The above is kind of a bare minimum to answering your question.

So one of the golden principles of solid database design is not to duplicate information. But what you are describing makes duplication inherent, because the database seems to be just a proxy/cache layer for the services to get the information they need. In which case, maybe a database isn't even the right tool for the job. Maybe you'd want some sort of NoSQL store, a DynamoDB, or a Redis cache that is kept up to date through continuous polling? How are you identifying a file? One of your programs wants a file; it looks it up in the database, and the database returns a path like "//myserver/dir1/dir2/filename.txt", which the program then uses for whatever operation it has planned. Is that the sort of thing we're talking about? And what does the program say to the database when it asks for a file: a name, some kind of UUID generated for the file, a selection of files meeting certain search criteria?

If you don't need complex search criteria, I'd be tempted to ditch the idea of a full DB and just have Redis or an equivalent with a lookup. I really would try to find a way to make all operations on the file system go through a service you provide. That could be anything from GraphQL down to some micro-service in Python or PHP. Failing that, you'll need some kind of event that triggers when someone makes a change on the file system, or continuous polling for changes.

I think more information is needed, but my first instinct would be to build some kind of service with a cache like Redis for performance. It would look up the path of the file on request from that cache, but could also do a quick file stat on the path to check it's still there. If so, return the path; if not, invalidate that cache entry, perform whatever function you have to find the new location of the file, then add it to the cache and return it to the client.
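As a rough sketch of what I mean, assuming Redis through the redis-py client and a placeholder find_path_in_db() standing in for your real database lookup:
Code:
import os
from typing import Optional

import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def find_path_in_db(key: str) -> Optional[str]:
    """Placeholder for the real database lookup."""
    return None  # replace with your SQL query

def lookup_path(key: str) -> Optional[str]:
    path = r.get(f"path:{key}")
    if path and os.path.exists(path):
        return path                    # cache hit and the path still exists
    r.delete(f"path:{key}")            # stale or missing: drop the cache entry
    path = find_path_in_db(key)
    if path and os.path.exists(path):
        r.set(f"path:{key}", path)     # re-populate the cache
        return path
    return None                        # not found anywhere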
 
Sorry, you're right, I should explain more about how this system is going to be used.
  1. At most there will be between 5 and 10 people connecting to the system as clients, and much of their work will not really be concurrent. Data stored in the directory structure will change, but the paths to the data and what's in the database won't need to be written to dynamically. Only the data in the files that the paths in the database point to is going to be mutated; the directories and layout are not normally going to be changed by clients.
  2. In the database there are business IDs which are mapped to paths to find files stored in the system. The IDs might look like "NLZ55698". The codes are unique, but the rules for how numbers are assigned are more flexible; usually they are just auto-incremented as new items are added over time. The letters have a special business meaning and are user defined. You'll have multiple entries with the same letters to separate the records into categories, but the letters are only symbolic and the full codes are kept unique. So you can have "AB1", "AB2" and so forth, but you can't have "AB1" ... "AD1". (A sketch of how I'd parse these is after this list.)
  3. The business IDs are not used for indexing. Each table has its own row ID. Row IDs are used to perform joins on other tables. These row IDs have no meaning and are not used outside the database.
  4. The database is not only used for this directory storage system; it is also used for company operations data, like scheduling. So, for example, you can easily grab the path you need by joining with the business ID table to get the paths for the projects in the schedule.
  5. The exact history of changes to data stored in the paths managed by the system is actually not a priority. The company doesn't care about having a history of changes made. My knee-jerk reaction was to start thinking about git when the project was being negotiated, but they were adamant they didn't want it.
  6. The database is very big. At minimum there are at least 60000 records for the directory structure, with corresponding metadata stored in different tables. Each directory record will have multiple joins to other tables. If there were fewer records and less need for arbitrary joins, then I would agree that NoSQL would probably be a good fit. Sadly, the types of operations are just not going to make that a good fit. There are times when a record needs to be selected based on a bool in one table, joined to find a value in another table, and then a single value in the second table either returned or altered. Fine, not too bad for NoSQL, but then this needs to be done on thousands of items at a time, where "items" means directory paths saved in the database that have multiple entries across different tables and need to be handled in bulk. The company also didn't care about manually editing records. Most NoSQL databases turn into JSON at some point, or are JSON, and the trade-offs there just aren't meaningful for this project. There are probably comparable ways to set up NoSQL, but since the data is already organised into tables, I don't see how NoSQL could get away without joins, or without having to rewrite large documents just to change one value.
  7. You are right that the directory system acts like a cache. The items (directory paths held in the database) only grow by a few thousand each year. After new paths are created in the system there is nothing special about them; like the rest of the paths, they become permanent. Actually, once an item is created it becomes permanent even if the exact data changes over time. Items are also not repurposed, so there is no need to change the meaning of where an item row ID has been used in the past.
I hope this helps you answer my question.
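To make the ID format concrete, this is roughly how I picture parsing them (the function name and pattern are just illustration, not anything that exists yet):
Code:
import re

# Assumed shape: one or more uppercase letters (user-defined category)
# followed by an auto-incremented integer, e.g. "NLZ55698" or "AB1".
BUSINESS_ID_RE = re.compile(r"^([A-Z]+)(\d+)$")

def parse_business_id(business_id):
    """Split a business ID into its letter prefix and numeric part."""
    match = BUSINESS_ID_RE.match(business_id)
    if not match:
        raise ValueError(f"not a valid business ID: {business_id!r}")
    letters, number = match.groups()
    return letters, int(number)

# Example: parse_business_id("NLZ55698") -> ("NLZ", 55698)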

Thinking about it, the closest thing to the system I'm working on is how websites have directories where they store images and documents uploaded to the site. The images sometimes change or get deleted, but the directories are always there. It seems like the paths to the files and the directory structure are part of the program and database, but the contents of the paths are separate from the system. Paths are created from the business ID, so you get something like "root/AB/1/5/8/1588", "root/NSD/5/50" or "root/NSD/0/1/". But the business ID alone is not very meaningful, so the actual directory that holds the files could be "1588-some title-158520" or just "1588". My indecision comes from how people will use this: if you always need to query the database to know what the directory you're working with means, it might be frustrating to keep things in directories with these weird business ID names that aren't really descriptive. This problem could be solved with heavy reliance on programs, like having a cd that takes business IDs instead of file paths (a sketch of such a resolver is just below). If systemcd is the way I go, instead of trying to make things a little human- or self-describing, you end up in a situation where the user is very dependent on a number of programs working correctly with a special directory root for the system. Since the directories aren't part of the program data, there isn't going to be a standard that says you always have to mount X on "/var/custom-system", so it might be a good idea to have an /etc file for the program suite that dictates the root of the directory structure.
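Roughly the kind of resolver I mean, assuming an /etc config file for the root and the path scheme from the examples above (the config path, and whether the leaf is just the number or the full ID, are still up in the air); a real systemcd would have to be a shell function wrapping something like this, since a child process can't change its parent shell's directory:
Code:
import configparser
import os
import re

CONFIG_FILE = "/etc/custom-system.conf"   # assumed, e.g. "[paths]\nroot = /var/opt/custom-system"

def load_root():
    cfg = configparser.ConfigParser()
    cfg.read(CONFIG_FILE)
    return cfg["paths"]["root"]

def business_id_to_path(business_id):
    # Mirrors the examples above: letters, then one directory per digit
    # except the last, then a leaf directory for the item itself.
    m = re.match(r"^([A-Z]+)(\d+)$", business_id)
    if not m:
        raise ValueError("not a business ID: " + business_id)
    letters, digits = m.groups()
    return os.path.join(load_root(), letters, *digits[:-1], digits)

# e.g. business_id_to_path("AB1588") -> "<root>/AB/1/5/8/1588"
# a shell wrapper could then do:  cd "$(resolve-path AB1588)"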

It's weird because on the one hand you have something that is a content-addressed database that uses business IDs for the keys, but at the same time...

Is it kosher to use /var/lib/custom-system for data that is going to be edited by the user, rather than solely by an interpreter or a program? To compare it to MySQL: the directories are controlled by the SQL system, but the data is user defined, even if the file format isn't. Is there a more appropriate mount point for the custom-system root, given that the structure is program controlled and expected to follow a standard, but the actual data is opened from the file system by the user? I would guess that it's not correct to use /var/lib for anything that is not completely managed by a program like an interpreter, and that any state preserved in /var/lib is considered part of the SQL system's environment and would not be appropriate to manage outside of it. My program and directory structure would not pass the test set by the SQL example: only the directory paths are really managed by the program, while the actual contents in the deepest directories are just user edited. Since the data stored in the deepest directories is not managed by the program, the system can't be directly compared to SQL. There must be a directory where the paths are considered part of the program's persistent state, but the contents of the directories are not. Where should that be mounted on a Unix-like system?
Code:
This is the problem


If "AB654654-some title" is the deepest directory, then

- It will be easier for the user to recognise. But not
  necessarily better for the system, because users can't be
  expected to select their directories by "some title" due to
  the large volume of items.
- Globbing will need to be added to the program to allow
  searching by title, which will necessitate a closest-match
  system (a sketch of one is at the end of this post). A
  matching system is complicated and I'm not looking forward
  to adding it.
- I don't know where directories that are both
  program-persistent-state and user-defined data should be
  mounted.
Else If "AB654654" is the deepest directory, then

- There is less human readability, as the business IDs
  really only make sense when looking at them in the
  database. This inevitably places a greater load on
  programs that support this directory structure; custom-cd
  is a program that will change directory to the correct path
  for any business ID provided. custom-cd has this exclusive
  function, alongside other programs that can add new
  items and perform other operations on the structure.
- There will not be as much human readability, because
  business IDs don't have automatic meaning to the user.
  The user would need to look up what a business ID
  represents. This will require the existence of a program
  that tells the user what they should be working on next.
  (Mind you, this is all command line.)

End
#include <https://upload.wikimedia.org/wikipedia/commons/f/f3/Standard-unix-filesystem-hierarchy.svg>

According to the diagram, /var/opt/directory-system-name would be the correct mount path. /var/lib is for data modified as programs run, which would align with my supposition that because the data in my custom structure is not intended to be modified by programs, it would be inappropriate to place it in /var/lib.
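For the closest-match problem in the first branch above, my starting point would probably be difflib from the standard library; something like this, where the candidates are the leaf directory names (all names here are made up):
Code:
import difflib
import os

def find_by_title(query, candidates):
    """Return the directory names that most closely match a human-entered title."""
    # candidates would be leaf directory names like "AB654654-some title"
    return difflib.get_close_matches(query, candidates, n=5, cutoff=0.4)

# e.g. find_by_title("some title", os.listdir("/var/opt/custom-system/AB/6/5/4/6/5"))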
 

This is a big help. I'm still not 100% certain I've got it, I'll be honest. But I'm willing to give it a go.

So to pick a few things out of order: yes, traditionally something like this would go under /var. But don't overthink it. You will likely have some sort of SAN or NAS that is the actual file system, and it will be mounted as a directory, perhaps somewhere under /var with its own name. A lot of this stuff is convention rather than rules. So long as you aren't doing something really wacky like sticking the user files in /usr/bin, it's not a big deal. Permissions are the biggest issue, but I can't tell if you're the actual sysadmin running this system. I fear that you are, as these days companies often don't pay for specialists who know what they're doing but expect any smart person to know all areas of IT because "they're our computer person". Anyway, that's a tangent. I just wanted to tell you that if you created a /var/businessfiles directory nobody is going to hunt you down and chastise you for it. (Well, Richard Stallman might, but he can be distracted by engaging him in a discussion of versions 2 and 3 of the GNU licence.)

So, to the main point: what you have is a large, multi-purpose database, but the one aspect of it you're concerned with is linking the result rows to the file system. They might execute any number of different queries on the database, using all sorts of criteria, but part of the output is going to be this path or paths to directories. And that link is going to be based on joining to your table of directory paths.

And that has to be up to date with the latest files that were added.

So a valuable principle is "Single Source of Truth". Because outside parties can write to the file system outside of your control, it seems to me that the file system is the source of truth, and what you want is a process where information flows this way and this way only.

File System Change --------> Database Representation.

Which means to me what you likely want is something that polls the filesystem continuously and updates the database with any new directories or files that it finds.

I'd suggest writing a small script in your language of choice. I don't favour Python for large projects, but it would be very suitable for something like this, and it has modules to support whatever SQL database you are using (I think you mentioned MySQL?). Have a cron task or supervisord run that script every few minutes; for any files newer than the last time it completed, it executes INSERTs into your database. I am surmising from the fact that your directory hierarchy corresponds to the business IDs that it can work out what values to insert into the database.

You have a problem in that you are doing this on Linux, because Linux file systems traditionally haven't exposed the creation time of files, only the last modification time. There might be Linux file systems that do store it; I just believe that ext2/3/4 don't expose it through the usual APIs. If you had NTFS or ReFS (both Windows filesystems) you'd have that information. However, from the description of your problem it may not matter, as you only have to make sure you are doing INSERT ... IGNORE instead of INSERT, to prevent problems arising from finding old files newly modified that are already in your database.

So correct me if this doesn't work and I've misunderstood the problem again (again).

  1. Execute the following at the top directory of your file system: touch timecheckfile . This does nothing but create an empty file named "timecheckfile". This will become useful in a moment.
  2. Create a script, perhaps in Python (never Bash), which does something like the following (a sketch is below this list):
    1. Executes the following command: find . -type f -newer timecheckfile -ls . This returns a list of files that are newer than the comparison file (which is why you just made a file to use for this purpose).
    2. Reads the returned list of files (if any) and then, from the path returned for each, works out the business ID. You'll want to do some string work in Python to get that field, but it shouldn't be too hard.
    3. Iterates through that list of business IDs and paths and does an INSERT ... IGNORE for each into the relevant table in your database.
    4. Touches timecheckfile again. This sets a new modified time on the file, so that the next time this script runs it starts from the last time it ran instead of finding the same files over and over.
  3. Create a cron task that executes the script every ten minutes. Something like this: */10 * * * * /usr/bin/python3 dbupdate.py. Swap things around for whatever interval and script name you've picked, making sure the interval is acceptable for how quickly the DB must update.
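Here's a rough sketch of that script. The root path, credentials and the table/column names (directory_paths, business_id, path) are all placeholders; it also assumes the mysql-connector-python package:
Code:
#!/usr/bin/env python3
# Sketch of the polling script described above; adjust names to your real schema.
import os
import re
import subprocess
import mysql.connector  # assumes mysql-connector-python is installed

ROOT = "/var/opt/custom-system"            # top of the tracked tree (placeholder)
CHECK_FILE = os.path.join(ROOT, "timecheckfile")

def find_new_paths():
    """Return paths modified more recently than the check file."""
    out = subprocess.run(
        ["find", ROOT, "-type", "f", "-newer", CHECK_FILE],
        capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

def item_root_from_path(path):
    """Guess the business ID and its item-root directory from a full path."""
    parts = path.split(os.sep)
    for i, part in enumerate(parts):
        if re.fullmatch(r"[A-Z]+\d+", part):
            return part, os.sep.join(parts[:i + 1])
    return None, None

def main():
    conn = mysql.connector.connect(
        host="localhost", user="customsys", password="changeme", database="company")
    cur = conn.cursor()
    for path in find_new_paths():
        business_id, item_root = item_root_from_path(path)
        if business_id is None:
            continue                       # not under a recognisable item root
        cur.execute(
            "INSERT IGNORE INTO directory_paths (business_id, path) VALUES (%s, %s)",
            (business_id, item_root))
    conn.commit()
    os.utime(CHECK_FILE, None)             # reset the reference time for the next run

if __name__ == "__main__":
    main()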
You will likely also want to have your script create a lock file. That's just a file that is created when the script starts and deleted when the script ends, and you write the script so that checking for that file is the first thing it does. Then if a new instance of the script starts running before the old one has finished, it says "Oh, I'm out of here" and exits without doing anything. It's a safety measure, because your script might take longer to run than you think.
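Something like this is all the lock file needs to be (the lock path is arbitrary, and main() is the update routine from the sketch above):
Code:
import os
import sys
from dbupdate import main   # the update routine from the sketch above

LOCK = "/tmp/custom-system-update.lock"   # arbitrary lock location

if os.path.exists(LOCK):
    sys.exit(0)              # a previous run is still going; bail out quietly
open(LOCK, "w").close()      # take the lock
try:
    main()
finally:
    os.remove(LOCK)          # always release the lock, even if main() raised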

You should also have the script email you (or whoever) if it throws an exception, if the lock file is older than an hour, or whatever else makes sense to you. You want to know if it's not updating.
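The alerting can be plain standard library if the box can reach a mail relay (the host and addresses below are placeholders):
Code:
import smtplib
from email.message import EmailMessage

def send_alert(subject, body):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "custom-system@example.com"     # placeholder sender
    msg["To"] = "admin@example.com"               # placeholder recipient
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:       # assumes a local mail relay
        smtp.send_message(msg)

# e.g. wrap the update in try/except and call send_alert() on failure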

That might be a bit simplistic; you mention needing to grab titles for later searching, for example. But does that work for you as an outline approach? It's always going to have a bit of a lag, and if the directory system is very large maybe you want to break it up a bit, e.g. split the directory structures up according to some pattern so different hierarchies can sit on different disks for quicker search performance. I don't know what your acceptable standards are. But there's a large difference in effort between a perfect system and a system that's good enough, so keep that in mind.

EDIT: I'm still not 100% on whether we're talking about files or directories only here. You can swap find -type f for find -type d, or vice versa, to limit the results to only files or only directories. Python has a way to execute native shell commands from within it (the subprocess module), which is why I suggest this. But you could do the same in PHP, PowerShell or whatever. Yes, you can use PowerShell remotely against the system, but for high-volume file operations I suggest something running on the system itself. So actually I may have made a bad suggestion with a NAS or SAN; maybe you just want a bunch of RAIDed disks in the server this all runs from. Think about backup and failover in the future, though!
 
This is really helpful thanks.

It's funny that you bring up Linux and file creation metadata. I always thought the reason it wasn't fixed until really recently (around 2018, statx gained usable stx_btime support, but you need a new enough glibc and kernel for it to work) was that no one cared and never did anything about it. But it is a good point that with a system like this you may want to filter for only recently created files, to narrow down the paths that need to be checked for database fidelity.

I didn't think of using a cron job to keep the database and directories consistent; it's a good idea. I can get away with having the cron job glob over the custom system's directories to check that everything is in the database, and then add any newly discovered untracked paths. The script, instead of just replacing every item entry it finds, will just check whether a globbed path is also in the database, and if not, insert it; this can be done in chunks of 100 entries to reduce the cost of comparing two lists of thousands of entries. I'll run the cron job less often, like once a day, and then have the program that adds directories run the same check (run the script) before adding new entries, to make sure the database and directories are in a good state first. The daily cron job then just catches any attempts to bypass the database when adding directories manually.

It might seem like working through 100-item chunks would be slow if the database and directories have to be validated every time a new item is added. But for new items, all that needs to be consistent is the numbering, so the only directories that need to be checked for are ones that might exist past the current highest item number held in the database. Save the highest; glob, reverse the order, and check the glob results back down to the highest. With this setup, if the highest number stored in the database is also the first thing in the reversed glob, the function can return without doing any inserts on the database. You still have to do the glob, but I'm guessing it's the database work that is the most expensive, not holding an immutable list of paths in memory for the life of the function. A quick test shows it's only megabytes for the test data, so I'm not afraid of putting that into memory. There are countless other tweaks that could make this cheaper.
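Roughly the check I'm describing, assuming the mount point below and that leaf directories start with the letter prefix (all names are placeholders):
Code:
import glob
import os

ROOT = "/var/opt/custom-system"   # assumed mount point

def untracked_numbers(prefix, highest_in_db):
    """Item numbers on disk that are above the highest the database knows about."""
    pattern = os.path.join(ROOT, prefix, "**", prefix + "*")
    numbers = []
    for path in glob.glob(pattern, recursive=True):
        suffix = os.path.basename(path)[len(prefix):]
        if suffix.isdigit():
            numbers.append(int(suffix))
    new = []
    for n in sorted(numbers, reverse=True):   # walk down from the highest on disk
        if n <= highest_in_db:
            break                             # reached what the database already has
        new.append(n)
    return new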

It's good you bring up the mount point stuff; I probably get a little too concerned about doing things the "right way" in the file system hierarchy. I'm thinking /var/opt/custom-system should just be where it's mounted on clients. On the server, it should be /media/username-for-service/instance-name. Since programs like ClamAV have their own user account for freshclam, custom-system should have its own user for permissions on the server mount point and for database access.

You make a good point about how this is actually going to be stored. For the test data I'm just working with a slow eco HDD. Since this is going to be used concurrently and really hit all at the same time, for example when everyone starts work and needs to pull down their projects, the server might hit a bottleneck getting the data out to the clients. I'll probably tell the company that is putting this in their office to buy some SSDs for the drives. When they need more space, they should probably move away from individual drives and use RAID to expand the storage, not just for redundancy. I don't know what the company does for backup, but I'm going to recommend they just clone the volume while it's not mounted. With a clone of the volume, it's not hard-coupled to the drives or RAID system they might go with, and they can even use the clones to migrate to new hardware if they want. The company can clone their drives after employees go home, and it's very flexible in how much failover they want. Since most of the stuff in the system is mutable by users, I'm guessing they wouldn't need very accurate backups; having just a second copy of the drives in case something happens seems like it should be enough.
 
This is really helpful thanks.
I'm glad. Hopefully this is starting to come together for you.
It's funny that you bring up Linux and file creation metadata. I always thought the reason it wasn't fixed until really recently (around 2018, statx gained usable stx_btime support, but you need a new enough glibc and kernel for it to work) was that no one cared and never did anything about it. But it is a good point that with a system like this you may want to filter for only recently created files, to narrow down the paths that need to be checked for database fidelity.

You said that old files aren't removed, people just add to it. And if a file is changed, that's not changing where it is located. Therefore you only need to check for new files. Unless I've misread what you've said.

I didn't think of using a cron job to keep the database and directories consistent; it's a good idea. I can get away with having the cron job glob over the custom system's directories to check that everything is in the database, and then add any newly discovered untracked paths. The script, instead of just replacing every item entry it finds, will just check whether a globbed path is also in the database, and if not, insert it; this can be done in chunks of 100 entries to reduce the cost of comparing two lists of thousands of entries. I'll run the cron job less often, like once a day, and then have the program that adds directories run the same check (run the script) before adding new entries, to make sure the database and directories are in a good state first. The daily cron job then just catches any attempts to bypass the database when adding directories manually.

It might seem like working through 100-item chunks would be slow if the database and directories have to be validated every time a new item is added. But for new items, all that needs to be consistent is the numbering, so the only directories that need to be checked for are ones that might exist past the current highest item number held in the database. Save the highest; glob, reverse the order, and check the glob results back down to the highest. With this setup, if the highest number stored in the database is also the first thing in the reversed glob, the function can return without doing any inserts on the database. You still have to do the glob, but I'm guessing it's the database work that is the most expensive, not holding an immutable list of paths in memory for the life of the function. A quick test shows it's only megabytes for the test data, so I'm not afraid of putting that into memory. There are countless other tweaks that could make this cheaper.

You say "glob" a lot. I'm not sure why. You don't need to pattern match anything, do you? You just execute a find command like I gave you to get a list of files that are newer than the last time you had a successful search, and add them to the database.

Regarding "the script, instead of just replacing everything every item entry for item it finds, will just match if a globed path is also in the database, and if not the insert it; this can be done is chunks of 100 entries to reduce the resources of comparing 2 lists of thousands of entries to see if they both match" this doesn't sound right. Two reasons:

First, your script is the only thing adding file paths to the DB. Where did the idea that other people will be adding file paths to the DB come from?

Secondly and more importantly, it's pointless. Compare which of these takes the longest:
  1. Retrieve list of existing files from the DB, gather list of new files from file system. Compare both to remove any entries from file system list that are already in the DB, send insert requests to DB for each of the remaining new files.
  2. Retrieve list of new files from file system. Send INSERT... IGNORE command to DB for each of them.
There's next to no performance difference (in your scenario) between sending an INSERT and sending an INSERT ... IGNORE. The only difference is that with the first approach you ALSO have to do a check in the DB first, to see whether you should send the command at all based on whether the row already exists. So why do two commands (a SELECT followed by an INSERT) when you can just do one (INSERT ... IGNORE)?
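To make that concrete, a batched INSERT ... IGNORE is a single round trip; this assumes mysql-connector-python, a UNIQUE key on business_id, and made-up table and column names:
Code:
import mysql.connector  # assumes mysql-connector-python is installed

conn = mysql.connector.connect(
    host="localhost", user="customsys", password="changeme", database="company")
cur = conn.cursor()

# (business_id, path) pairs gathered from the file system scan.
rows = [("ABC2358", "ABC/2/3/5/ABC2358"), ("ABC2359", "ABC/2/3/5/ABC2359")]

# One statement per batch; rows whose unique key already exists are silently skipped.
cur.executemany(
    "INSERT IGNORE INTO directory_paths (business_id, path) VALUES (%s, %s)",
    rows)
conn.commit()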

Maybe I'm missing something, but I think you're way over-complicating this, and I feel like the captain in 21 Jump Street where he's repeating "Infiltrate the dealers, find the suppliers" over and over again. Scan for new files, send to DB.

You make a good point about how this is actually going to be stored. For the test data I'm just working with a slow eco HDD. Since this is going to be used concurrently and really hit all at the same time, for example when everyone starts work and needs to pull down their projects, the server might hit a bottleneck getting the data out to the clients. I'll probably tell the company that is putting this in their office to buy some SSDs for the drives. When they need more space, they should probably move away from individual drives and use RAID to expand the storage, not just for redundancy.

How much does the data come to currently and at what rate do you expect it to grow? For real - check this stuff, put in a forecast for rate of growth based on something sensible, add an explicit "margin for error" and put that in your recommendation to your bosses with your cost estimates. They'll appreciate the diligence. Unless the data is in the terabytes, get a couple of SSDs with spare capacity and RAID them for reliability.

And remember, RAID is not for backup, RAID is for reliability. It's so that the company isn't sitting there paying staff to do nothing while you copy files around because a single drive failed.

Also, fwiw, if there are bottlenecks in file delivery, your network connection is likely to be an order of magnitude slower than disk reading is my guess. So if you are concerned about this (and I'm not saying you should be) run some numbers on usage against the network connection first.

I don't know what the company does for backup, but I'm going to recommend they just clone the volume while it's not mounted.

If you do that, you're going to have to change your name to "stares at thousands of error messages". For real, this sounds terrible, not least because it requires downtime. There's no reason that I know of to go with such an approach. I can't give a good recommendation because I'm a Windows person, but maybe someone else here can. However, I DO know that bringing down a file system, carrying out some literal clone operation of the disk (like, for real, what are you going to do, `dd` the blocks?) and then re-mounting it is nuts. Even just having rsync copy file changes to a network-mounted backup server, simple and traditional as it is, would be infinitely better.

With a clone of the volume, it's not hard-coupled to the drives or RAID system they might go with, and they can even use the clones to migrate to new hardware if they want. The company can clone their drives after employees go home, and it's very flexible in how much failover they want. Since most of the stuff in the system is mutable by users, I'm guessing they wouldn't need very accurate backups; having just a second copy of the drives in case something happens seems like it should be enough.
You don't need literal copies of how the data is represented on the drive. You just need the data and its structure. For real, use some reputable backup process and copy the directory structure and files to your backup location routinely and without human intervention. Listen to me on this: I am saving your job right now.
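Even something as dumb as this, run nightly from cron, would beat cloning drives (the paths here are placeholders):
Code:
import subprocess

SRC = "/var/opt/custom-system/"                   # the tracked tree (placeholder)
DST = "backup-host:/srv/backups/custom-system/"   # network backup target (placeholder)

# -a preserves permissions and times; the trailing slash on SRC copies its contents.
subprocess.run(["rsync", "-a", SRC, DST], check=True)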
 
Sorry, I'll try to clear some things up.
You said that old files aren't removed, people just add to it. And if a file is changed, that's not changing where it is located. Therefore you only need to check for new files. Unless I've misread what you've said.
I must not have explained this well; I use "path" and "file" for so many things that it seems like they're the same. The path for a given business ID will look like "mount-point/ABC/2/3/5/ABC2358". Let's call "ABC2358" the Item Root. Everything inside the Item Root is not tracked by the database; it can change and can contain any files. All the paths in between mount-point and the Item Root can be derived from the item's business ID; the path must obey the algorithm used to create the in-between paths, and the in-between paths should not be created arbitrarily by the user. Business IDs are auto-incremented from 1, so there aren't going to be weird cases like US phone numbers, where xxx-555-xxxx (x being any digit) doesn't call any number and is a reserved section of the phone number address space. It's really simple how the business IDs are created.

There needs to be a way to prevent arbitrary additions at the in-between level, but allow insertions of new business IDs, where the database holds the business IDs that were added and the paths under mount-point are reflected by the business IDs in the database. Because of the algorithm used to create the in-between paths, the whole in-between path can be derived and does not need to be stored. But there is nothing stopping me from also storing the paths; it's just not necessary. I thought your suggestion was written with this in mind, but it seems like when I said "file" you thought I meant the actual file at the end of the path, and not just the path fragment. Sorry, this is probably what's so confusing.

You said that old files aren't removed, people just add to it. And if a file is changed, that's not changing where it is located. Therefore you only need to check for new files. Unless I've misread what you've said.
Yeah, when I said files I really meant business IDs, which are directories in an algorithmic path scheme. The problem here is that there is nothing stopping a user with read-write permissions on mount-point from deleting things or adding them arbitrarily, while also being able to correctly add new business IDs. I was thinking of only giving users access to the Item Roots, but then how would they be able to add new business IDs without some sort of server taking requests from users, making the database consistent with the directories, and then creating a new business ID? I was thinking of just having custom programs that maintain these rules, but I'm not sure how to enforce the rules against users who don't abide by them if they are going to have write permissions on mount-point.

You say "glob" a lot. I'm not sure why. You don't need to pattern match anything, do you? You just execute a find command like I gave you to get a list of files that are newer than the last time you had a successful search, and add them to the database.

Regarding "the script, instead of just replacing everything every item entry for item it finds, will just match if a globed path is also in the database, and if not the insert it; this can be done is chunks of 100 entries to reduce the resources of comparing 2 lists of thousands of entries to see if they both match" this doesn't sound right. Two reasons:

First, your script is the only thing adding things to the DB. Where did this idea that other people will be adding file paths to the DB come from?

Secondly and more importantly, it's pointless. Compare which of these takes the longest:
  1. Retrieve list of existing files from the DB, gather list of new files from file system. Compare both to remove any entries from file system list that are already in the DB, send insert requests to DB for each of the remaining new files.
  2. Retrieve list of new files from file system. Send INSERT... IGNORE command to DB for each of them.
There's next to no performance difference (in your scenario) between sending a plain INSERT and sending an INSERT... IGNORE. The only difference is that with the first approach you ALSO have to do a check in the DB first to see whether you should send the command at all, based on whether the row already exists. So why do two commands (a SELECT followed by an INSERT) when you can just do one command (INSERT... IGNORE)?
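To spell it out, here's a rough sketch of the whole job - sqlite3 with made-up table/path names, and note that SQLite spells it INSERT OR IGNORE where MySQL says INSERT IGNORE:

    # Rough sketch of "scan for new files, send to DB", using sqlite3 and
    # hypothetical table/path names. On MySQL the statement would be
    # INSERT IGNORE; SQLite spells it INSERT OR IGNORE.
    import os
    import sqlite3

    MOUNT_POINT = "/mnt/mount-point"            # hypothetical data mount
    DB_PATH = "/var/lib/item-index/items.db"    # hypothetical database file

    def scan_and_insert():
        conn = sqlite3.connect(DB_PATH)
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS item_paths (path TEXT PRIMARY KEY)"
            )
            for dirpath, dirnames, filenames in os.walk(MOUNT_POINT):
                # One statement per directory; the PRIMARY KEY makes repeats a
                # no-op, so no SELECT-then-INSERT round trip is needed. Filter
                # dirpath here if only certain levels should be recorded.
                conn.execute(
                    "INSERT OR IGNORE INTO item_paths (path) VALUES (?)",
                    (dirpath,),
                )
        conn.close()

    if __name__ == "__main__":
        scan_and_insert()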

Maybe I'm missing something but I think you're way-overcomplicating this and I feel like the captain in 21 Jumpstreet where he's "Infiltrate the dealers, find the suppliers" over and over again. Scan for new files, send to DB.
  1. Glob is Unix-like slang for using wild-cards like * to select a range of files.
  2. I don't think I can have only the script keep the database in check, because there is a need to keep the scheme enforced so that the in-between paths, "ABC/2/3/5/", are functions of what the business id is.
  3. It might seem silly to want to diagnose the dirty state before cleaning it up, but I have no way of controlling what the bad users have done, and there are at least 60000 items. More items will be added, and they need to obey the scheme, be seen in the database, and have sequential business ids (see the audit sketch after this list). Maybe you were thinking there wasn't such a strict protocol around the business ids. The reason for the extra checking is to make sure that added items are created sequentially and that there are no collisions between things created in the database without knowing about directories that already exist. When I said at least 60000 items, the highest business id is 64552, and it's going to grow a lot more once the system is in service. I really like the brainstorming, but maybe a different solution would make more sense.
  4. I haven't seen 21 Jumpstreet.
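Here's roughly what I mean by diagnosing the dirty state - just a sketch under the same assumed digit-derived scheme, with a placeholder mount path:

    # Sketch of auditing the existing tree: find every directory that looks
    # like an Item Root and check it sits where the assumed scheme says it
    # should, flagging misplaced or duplicated business ids.
    import os
    import re

    MOUNT_POINT = "/mnt/mount-point"        # placeholder
    ITEM_RE = re.compile(r"([A-Z]+)(\d+)")  # e.g. "ABC2358"

    def expected_path(prefix, number):
        # Assumed scheme: prefix, then every digit except the last, then the id.
        return os.path.join(MOUNT_POINT, prefix, *number[:-1], prefix + number)

    def audit():
        problems, seen = [], set()
        for dirpath, dirnames, _ in os.walk(MOUNT_POINT):
            for name in list(dirnames):
                match = ITEM_RE.fullmatch(name)
                if not match:
                    continue                  # an in-between level, not an Item Root
                prefix, number = match.groups()
                actual = os.path.join(dirpath, name)
                if actual != expected_path(prefix, number):
                    problems.append("misplaced item root: " + actual)
                if name in seen:
                    problems.append("duplicate business id: " + name)
                seen.add(name)
                dirnames.remove(name)         # don't descend into Item Roots
        return problems

    if __name__ == "__main__":
        for problem in audit():
            print(problem)

Checking that the ids are sequential with no holes is then just a matter of comparing the numeric parts collected here against range(1, highest + 1).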
How much does the data come to currently and at what rate do you expect it to grow? For real - check this stuff, put in a forecast for rate of growth based on something sensible, add an explicit "margin for error" and put that in your recommendation to your bosses with your cost estimates. They'll appreciate the diligence. Unless the data is in the terabytes, get a couple of SSDs with spare capacity and RAID them for reliability.

And remember, RAID is not for backup, RAID is for reliability. It's so that the company isn't sitting there paying staff to do nothing while you copy files around because a single drive failed.

Also, fwiw, if there are bottlenecks in file delivery, my guess is your network connection is likely to be an order of magnitude slower than disk reads. So if you are concerned about this (and I'm not saying you should be) run some numbers on usage against the network connection first.
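As a very rough example of the kind of numbers I mean (ballpark figures only):

    # Ballpark comparison: a 1 Gb/s link is ~125 MB/s in theory, call it
    # ~110 MB/s in practice, shared by every client, while one SATA SSD can
    # read roughly 500 MB/s sequentially.
    GIGABIT_MB_S = 110
    SATA_SSD_MB_S = 500

    for clients in (1, 5, 10):
        per_client = GIGABIT_MB_S / clients
        print(f"{clients} concurrent clients: ~{per_client:.0f} MB/s each "
              f"(a single SSD could feed ~{SATA_SSD_MB_S} MB/s)")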
  1. You might think this is a joke, but the company I'm working for said they have 3 terabytes now and they expect that to increase by 2 TB every year. That's why they hired me to create this system. I'm a contractor, so I'm not in-house. I'm just doing their software.
  2. I've seen consumer 4TB SSD drives, so I was going to recommend that. Saying it out loud, it might be worth it to have them go with platter drives; last time I looked they didn't seem much cheaper. I think they're all Eco-drives now and don't spin fast, so I was just going to push the new SSD tech. I should probably double back on this and just go with platter, but the price hikes this year are bumming me out.
  3. I would usually agree that platter drives are the better choice. I was going to tell the company that they should go with SSDs for better performance, but since you have pointed out the network is going to be too slow anyway, you're probably right that this is ridiculous.
If you do that, you're going to have to change your name to "stares at thousands of error messages". For real, this sounds terrible. Not least of which because it requires downtime. There's no reason that I know of to go with such an approach. I can't give a good recommendation because I'm a Windows person but maybe someone else here can. However, I DO know that bringing down a file system, carrying out some literal clone operation of the disk (like for real, what are you going to do `dd` the blocks?) and then re-mounting it is nuts. Even just having rsync running to copy file changes to a network mounted backup server, simple and traditional as it is, would be infinitely better.
  1. I probably should have been clearer earlier: I'm just a contractor. For example, I don't know how the company backs up their workstations. I don't know what they do with the backups they make. I don't know if they have off-site backups or anything.
  2. It's really not crazy to do a clone. At other places I've done work for, the sys-admins would clone their workstations. You don't physically remove the drives, you just have a script unmount the volume and then copy it with dd. A problem with rsync is that it would just pull the files in and then recreate them on a separate drive. rsync alone does not have its own deduplication system, so you would still have to do the same copying, just slower, because you're doing it through a mounted volume, copying the same files and recreating folders on the destination drive; that is going to be slower than just cloning the drive. I know there are net-boot live images that companies keep on their networks so they can have their systems wake-on-LAN and then perform the cloning process from the net-booted image. The backups are sent off someplace safe and you don't even have to think when you need to recover, it's plug and play. You tell me: you've had a data loss, your computer is toast and you need to do your work but can't. With rsync or some other mounted-file-system backup, you have to reload the box, get your programs on it (yikes, you're on Windows. Maybe 10 is better, but those net installs to get all your MS Office stuff.) and maybe tomorrow you can get back to work with your method. With a clone of the drive, a disk image, you can load all the software for the installed system that was just lost in hours and get back to work. There's nothing to configure because you have an exact copy of what you just had. In a data-loss situation with a clone, you just need a new box and you're back in action. This is just kooky; why would you want to have to reconfigure your lost system to get it back when you could just pull out a copy, and in a pinch just use that copy? You don't even need to take the time to recover the backup. rsync is completely incommensurate with disk cloning. It's the same difference as between hand-copying notes from a notebook or xeroxing the notebook and walking off with the copy. 9 out of 10 data-recovery specialists asked say that drive cloning is the method they use personally on their systems. You're probably right about INSERT IGNORE, but you have drive cloning backward.
Listen to me on this - I am saving your job right now.
You seem to be very confident you know better.
 
need to have sequential business ids.
this shits me as much as anything else you've described, because it sounds like you might be in a position to be blamed if this 'integrity' is not maintained.

It sounds like this is some kind of composite key that is based on some business unit within the business and a per-business unit ID (maybe with an extra division or two in there, who knows).

How is this ID populated? I know that you're being told these are unique and must be sequential, but how does that work? If users can just create folders which are supposed to relate to keys and they sit under more than one layer of directories, you need to be concerned about that. If there aren't uniqueness checks on the DB side, either done properly or through some hack in a trigger or in the business application, there may even be duplicate keys in the DB.

Same deal with the 'sequential' part of things. Can users create a record in the DB and then delete it? If so, there's a gap in the sequence. When does the generation of IDs happen? If it happens before a record is actually inserted into the DB, then either an ID gets grabbed up front, in which case there will be a gap in the sequence if the user doesn't finish inserting the record, or there has to be some sort of lock held until they finish inserting it.

Or, just as likely, some retard has created a system where there isn't actually a uniqueness constraint and the 'id' to be assigned to the record is populated from the highest currently present ID at the time a user starts inserting a record, so two users creating records simultaneously could create records with the same ID.

If these aren't things you know, it would be a good idea to make a list of all the key-related directories you have on the existing file system and check those for uniqueness, and to query the existing database to check for uniqueness (HAVING COUNT(1) > 1 etc) and sequentialness (maybe just make a numbers table with numbers up to the highest numeric part of the identifier, cross-join it to the prefixes, then left-join that to the existing records on prefix+number = existingrecordid and look for rows with no match).

You're going to need to walk the directories anyway, so that all needs to be done at some point.
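If it helps, those checks don't need anything fancy. Here's a sketch with made-up table/column names that does the duplicate and gap checks with plain Python sets instead of a numbers-table join - same idea, different spelling:

    # Sketch of the duplicate and gap checks, using sqlite3 and a hypothetical
    # items(business_id) table. Plain set arithmetic stands in for the
    # numbers-table join; the idea is the same.
    import re
    import sqlite3

    DB_PATH = "/var/lib/item-index/items.db"    # hypothetical

    def check_ids():
        conn = sqlite3.connect(DB_PATH)
        ids = [row[0] for row in conn.execute("SELECT business_id FROM items")]
        conn.close()

        # Duplicates (what HAVING COUNT(1) > 1 would catch on the DB side).
        seen, dupes = set(), set()
        for business_id in ids:
            if business_id in seen:
                dupes.add(business_id)
            seen.add(business_id)

        # Gaps: compare the numeric parts against the full 1..highest range.
        numbers = {int(re.sub(r"\D", "", i)) for i in ids if i}
        gaps = set(range(1, max(numbers) + 1)) - numbers if numbers else set()
        return dupes, gaps

    if __name__ == "__main__":
        dupes, gaps = check_ids()
        print("duplicate ids:", sorted(dupes))
        print("missing numbers:", sorted(gaps))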

EDIT: Another note- it's unclear just what sort of client operating system your users are using, whether they're using custom business applications or Word, Excel etc, but it sounds like you are (reasonably) concerned about what they would do if they can rawdog your file system.

Something you might consider- retain a DB table which stores the directories, and preferably files, present on the file system. This can be updated by a nightly job. The nightly job can detect where existing files/directories have been renamed/deleted and alert relevant parties/flag that in a field against the former record/log it in an audit table.
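As a sketch of that nightly job (sqlite3 again, made-up names; a real version would write to an audit table and alert someone rather than print):

    # Sketch of the nightly job: keep a snapshot table of every path seen on
    # the file system, flag paths that have vanished (renamed or deleted), and
    # record new ones. sqlite3 and all names used here are placeholders.
    import os
    import sqlite3

    MOUNT_POINT = "/mnt/mount-point"            # hypothetical
    DB_PATH = "/var/lib/item-index/items.db"    # hypothetical

    def nightly_sync():
        conn = sqlite3.connect(DB_PATH)
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS fs_snapshot ("
                " path TEXT PRIMARY KEY, missing_since TEXT)"
            )
            current = set()
            for dirpath, dirnames, filenames in os.walk(MOUNT_POINT):
                current.add(dirpath)
                current.update(os.path.join(dirpath, f) for f in filenames)

            known = {row[0] for row in conn.execute("SELECT path FROM fs_snapshot")}

            for path in current - known:        # new since the last run
                conn.execute("INSERT INTO fs_snapshot (path) VALUES (?)", (path,))
            for path in known - current:        # renamed or deleted since then
                conn.execute(
                    "UPDATE fs_snapshot SET missing_since = datetime('now') "
                    "WHERE path = ? AND missing_since IS NULL",
                    (path,),
                )
                print("gone:", path)            # a real job would alert/log instead
        conn.close()

    if __name__ == "__main__":
        nightly_sync()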
 
You seem to be very confident you know better.
Well you said your backup strategy on these files was to take the filesystem offline on evenings or weekends and clone the disks. So... yes?

Frankly, I've put quite a bit of effort into helping you but I think you're out of your depth. You keep focusing on weird details, often getting them wrong or not getting their purpose. I tell you even basic rsync would be better and you tell me it's no good because rsync doesn't have de-duplication. Well no shit - it's not supposed to. De-duplication (in the traditional sense) doesn't happen at that level. You know what also doesn't have de-duplication? Cloning disks.
You keep mentally jumping around. I said use a backup solution for the files you keep talking about. Now you're talking about how you can't restore programs with rsync. I ask you why you keep talking about globs when I've seen nothing in your requirements where you need to use wildcard matching, and you start explaining to me what "globbing" is (i.e. wildcard matching). You come off as having a bit of an attitude for someone who I've had to explain basic things to.

@Overly Serious @3119967d0c
Thanks, but I figured it out on my own.
Frankly, I doubt that. In any case, I'm out. Pretty sure this is a case where a company with no in-house IT expertise has hired someone because they can't tell the difference between someone who is good with tech and someone who talks about tech.
 
[Attached image: Screenshot from 2021-06-24 00-15-50.png]

r8 the top layer of my database
 