any here into stylometrics? not a data scientist so i could be reading all this wrong
What a coincidence! I've done some semi-professional work with stylometry and yeah, you're reading this completely wrong.
Word length doesn't mean much, aside from a general indication that two people write at about the same level. A child will use shorter words; an academic paper will use longer ones. There's not a whole lot of meaningful information to be gleaned from word length, at least on its own. Sentence length is similarly meaningless. Paragraph length is a bit more useful, but not enough to draw a real conclusion from.
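Just to make concrete how little signal is in there, this is basically all those two numbers are (a toy sketch with crude tokenization, not real preprocessing):

```python
import re

def coarse_stats(text):
    # Crude tokenization; fine for a sketch, not for real work.
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

# Two made-up snippets; most adult native speakers land in a narrow band,
# which is why these numbers alone tell you so little.
print(coarse_stats("I went to the store. It was closed, so I walked home."))
print(coarse_stats("The committee reviewed the proposal and rejected it outright."))
```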
Letter frequency is totally meaningless, like absolutely a worthless metric. You'll find that basically every chunk of English text, if it's long enough, will share a pretty similar distribution.
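If you want to see that for yourself, the whole "metric" boils down to something like this (a sketch; compare any two long English texts and the distance comes out tiny):

```python
from collections import Counter

def letter_freqs(text):
    # Normalized a-z frequency distribution.
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    total = max(sum(counts.values()), 1)
    return {chr(i): counts.get(chr(i), 0) / total for i in range(ord("a"), ord("z") + 1)}

def total_variation(p, q):
    # Half the L1 distance between two distributions; 0 means identical.
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

# For any two reasonably long chunks of English,
# total_variation(letter_freqs(a), letter_freqs(b)) comes out tiny,
# regardless of who wrote them.
```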
Punctuation is, contrary to what may seem obvious, probably among the most useful metrics here, but it still isn't worth much on its own. POS distribution is also a useful metric, but again, not on its own.
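For reference, a punctuation profile is about this simple to pull out; the POS part is left as a comment because it needs a real tagger (e.g. NLTK with its data downloaded):

```python
import string
from collections import Counter

def punct_profile(text):
    # Occurrences of each punctuation mark per 1,000 characters.
    counts = Counter(c for c in text if c in string.punctuation)
    scale = 1000 / max(len(text), 1)
    return {mark: n * scale for mark, n in counts.most_common()}

# A POS distribution needs an actual tagger; with NLTK (tokenizer and tagger
# data downloaded) it would look roughly like:
#   import nltk
#   tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
#   pos_dist = Counter(tags)
```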
Unique word choice is a very useful metric in some scenarios - if I remember right, there was some pedophile who ultimately got caught because the initial lead on him was that he greeted people by saying "heya" instead of "hey" or "hi", and "heya" tends to only be used in a fairly small part of the world. However, in this case, it's almost certainly a red herring because we're in a fairly insular community which uses slang that doesn't appear elsewhere. Compared to any member of the public, people who say words like "boglim" or stuff like that will stand out when you analyze text this way. It points to a similarity with each other versus the general public, but it doesn't point to a similarity between the two writers within the context of this insular community.
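If you wanted to hunt for "heya"-style tells programmatically, it would look roughly like this - the reference corpus is a placeholder, and the catch is exactly what I described: inside an insular community, the community slang tops this list for everybody:

```python
from collections import Counter

def distinctive_words(text, reference_counts, reference_total, min_count=3):
    # Words the author uses far more often than a reference corpus does.
    # reference_counts / reference_total are placeholders for background counts
    # built from some large general-English corpus.
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    scores = {}
    for word, n in counts.items():
        if n < min_count:
            continue
        own_rate = n / total
        ref_rate = (reference_counts.get(word, 0) + 1) / (reference_total + 1)  # crude smoothing
        scores[word] = own_rate / ref_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```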
ChatGPT's analysis is totally meaningless. I've experimented with using LLMs for stylometry and it turns out they're awful at it because, often enough to make their output unreliable, they'll just tell you what they think you want to hear. They're biased towards saying two writers are the same because you've essentially started off by asking "are these two writers the same?". You can often get LLMs to completely 180 on an arbitrary analysis like this just by asking "hmm, are you sure about that?", because that phrasing tends (tended?) to produce a response of "Oh, you're right, I'm mistaken" in the training data, followed by an explanation made up after the fact about why the first explanation was wrong and the new one is right.
Using PCA on this data makes no sense. Maybe I'm missing context from what the lead-up to ChatGPT's response was, but PCA isn't an appropriate tool for comparing one piece of data against a second piece of data. It's useful if you're comparing a couple data points against hundreds or thousands of other data points, but comparing 2 things to one another using PCA is kinda baffling, which is why I wonder if I'm missing something from this.
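For contrast, the way PCA would normally get used here is to learn the space from a big background population and only then project the two disputed texts into it. Rough sketch with made-up feature vectors standing in for real stylometric features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder stylometric feature vectors for a few hundred *background* authors;
# this is the situation where PCA actually earns its keep.
background = rng.normal(size=(500, 40))

pca = PCA(n_components=2).fit(background)

# The two disputed texts get projected into the space learned from the background
# population, instead of fitting PCA on just the pair of them.
disputed = rng.normal(size=(2, 40))
print(pca.transform(disputed))

# Fitting PCA on only the two disputed samples gives you at most one meaningful
# component (the line between the two points), which tells you nothing.
```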
Above, when I said that these metrics are useful but not on their own, what I meant is that they're useful in aggregate, when compared against tons and tons of other data points. Statistically significant differences show up when you take arrays of word lengths, letter frequencies, punctuation frequencies, unique word occurrences, etc., and treat them all as input to more complex tools; when you're just looking at graphs of these pieces of information in isolation, even things which seem statistically significant (like letter frequency) aren't really. Even then, though, these very basic stylometric features aren't very information-rich.
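One classic example of that kind of aggregate comparison (not necessarily what ChatGPT was doing, and the word list / reference corpus here are placeholders) is Burrows' Delta: z-score each feature against a big reference corpus and average the differences between the two texts:

```python
import numpy as np

def burrows_delta(freqs_a, freqs_b, corpus_freqs):
    # freqs_a, freqs_b: relative frequencies (numpy arrays) of the N most frequent
    # words in the two disputed texts, in the same word order.
    # corpus_freqs: (n_texts, N) array of those frequencies across a large
    # reference corpus; a placeholder here, you'd build it from real data.
    mu = corpus_freqs.mean(axis=0)
    sigma = corpus_freqs.std(axis=0) + 1e-12  # avoid division by zero
    z_a = (freqs_a - mu) / sigma
    z_b = (freqs_b - mu) / sigma
    return np.abs(z_a - z_b).mean()
```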
Token sequence analysis or n-gram analysis is generally considered to be a better avenue for authorship attribution, at least to my knowledge (there's a rough sketch of what that looks like at the end of this comment). I recommend these papers if you're interested in learning a bit more:
Of particular note:
Using purely stylometric features produces really poor results. Barely above coin flip odds, even under good conditions.
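To give a rough idea of what the n-gram approach mentioned above looks like in practice, here's a bare-bones character n-gram classifier with scikit-learn - the texts are placeholders, and real work needs much more data plus proper validation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: texts of *known* authorship; the disputed text is
# then classified against those known authors.
train_texts = ["...known text by author A...", "...known text by author B..."]
train_labels = ["A", "B"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

print(model.predict(["...disputed text..."]))
```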