Creating a search index:
The documents put into the search index have to be processed and then indexed. The index itself is just a table (however you represent it) with three columns: the document (a file path or reference number), the word, and the number of times the word appears in that document (its popularity). The popularities are what get ranked, and the words are what get matched from the query.
1. Get text for all documents being indexed.
2. Delete all the words that are super common: all the prepositions and articles. This is done with a stop list of words to drop when counting words in the document.
3. Create a map (basically a table) of all the remaining words and their popularity for each document.
4. Store the document, word, and popularity in what becomes your search index. I did this with a key-value DB and JSON, but you could do it other ways.
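The counting steps above can be sketched roughly like this. The stop list, the fixed-size row table, and the function names here are all illustrative assumptions; the real implementation stored its rows in a key-value DB as JSON.

```c
#include <ctype.h>
#include <string.h>

/* Tiny stand-in for the real list of words to drop. */
static const char *STOP_WORDS[] = { "the", "a", "an", "of", "in", "to", NULL };

static int is_stop_word(const char *w) {
    for (size_t i = 0; STOP_WORDS[i] != NULL; i++)
        if (strcmp(w, STOP_WORDS[i]) == 0)
            return 1;
    return 0;
}

/* One row of the search index: (document, word, popularity). */
struct index_row {
    int  doc_id;
    char word[32];
    int  popularity;
};

/* Count each non-stop word of `text` into `rows`; returns rows used.
 * A real version would persist these rows instead of filling an array. */
static size_t index_document(int doc_id, const char *text,
                             struct index_row *rows, size_t max_rows) {
    char buf[1024];
    size_t used = 0;
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *p = buf; *p; p++)            /* lower-case so words match */
        *p = (char)tolower((unsigned char)*p);
    for (char *w = strtok(buf, " \t\n.,;:!?"); w != NULL;
         w = strtok(NULL, " \t\n.,;:!?")) {
        if (is_stop_word(w))
            continue;
        size_t i;
        for (i = 0; i < used; i++)          /* seen before: bump popularity */
            if (rows[i].doc_id == doc_id && strcmp(rows[i].word, w) == 0) {
                rows[i].popularity++;
                break;
            }
        if (i == used && used < max_rows) { /* new word: new index row */
            rows[used].doc_id = doc_id;
            strncpy(rows[used].word, w, sizeof rows[used].word - 1);
            rows[used].word[sizeof rows[used].word - 1] = '\0';
            rows[used].popularity = 1;
            used++;
        }
    }
    return used;
}
```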
After the above steps all you have is a search index. To search it, you need a program that compares the words in the query against the stored popularities and returns the documents ranked by those numbers. There are a lot of ways to do this, and a lot of different ranking algorithms, some more complex than others, with special cases for certain words and special ways of processing queries. What I did was the simplest: words in the query go through the same processing as the documents, the words that remain are matched against the index, the matches are ranked by popularity, and the documents (by reference ID) are returned. The interesting part is that I played around with different ranking algorithms I found online.
Processing the query:
1. Get query.
2. Delete all the words that are super common.
3. Create a matrix (a table) to send to the GPU. Each column is a word from the query, so a cell holds either 0 (zero) or the popularity number of that word; the rows are the documents by reference ID. When you send this to the GPU the column and row labels are not sent with it: the order is maintained throughout, and you keep the data used to create the table so you can turn the results back into something comprehensible to the rest of the program. This step is often poorly represented where I've seen it described online, with no explanation of how the matrix is created or what the data in it actually represents. Basically, it's just the natural sorted order of the documents and their popularity values streamed into a vector; you then know to divide the vector by the number of word popularities being compared to get the rows.
4. Send this vector to the GPU and tell the GPU to run the algorithm on it. I did this in OpenCL. Using OpenCL is not required; if this were all being done on the CPU, or single-threaded, the matrix step could be different and the algorithm implementation would look different.
5. After the GPU does its maths, you get back a single vector. Each number in it represents a row from the matrix; that row is matched back to the document it was associated with, and the results are sorted on the ranking number.
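The matrix construction in step 3 can be sketched like this; the toy index table and the `popularity` lookup are illustrative assumptions standing in for the real key-value store:

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the real search index (doc, word, popularity rows). */
struct entry { int doc; const char *word; float pop; };
static const struct entry TOY_INDEX[] = {
    {1, "cat", 3.0f}, {1, "mat", 1.0f}, {2, "cat", 1.0f}, {2, "dog", 5.0f},
};

/* Popularity of `word` in document `doc_id`, or 0 if it never appears. */
static float popularity(int doc_id, const char *word) {
    for (size_t i = 0; i < sizeof TOY_INDEX / sizeof TOY_INDEX[0]; i++)
        if (TOY_INDEX[i].doc == doc_id && strcmp(TOY_INDEX[i].word, word) == 0)
            return TOY_INDEX[i].pop;
    return 0.0f;
}

/* Flatten the (documents x query words) table into one row-major vector:
 * matrix[row * n_words + col] is the popularity of query word `col` in
 * document `row`. No labels travel with the data -- only the order -- so
 * doc_ids and words must be kept around to interpret the GPU's result. */
static void build_matrix(const int *doc_ids, size_t n_docs,
                         const char **words, size_t n_words, float *matrix) {
    for (size_t row = 0; row < n_docs; row++)
        for (size_t col = 0; col < n_words; col++)
            matrix[row * n_words + col] = popularity(doc_ids[row], words[col]);
}
```

Dividing the flat vector's length by the number of query words recovers the rows, which is exactly the bookkeeping described in step 3.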
(There are some data normalisation and query preparation steps that I left out. Basically, I made the query equal a document holding the sum of all popularities from the matched documents. I also wanted all the values to range between -1 and 1 during the calculations and then be represented from 0 to 1 when returned to the program. These steps are not essential and were really more a way for me to get comfortable with OpenCL.)
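Those range adjustments might look something like this minimal sketch; the function names are assumptions, since the original normalisation code is not shown:

```c
/* Scale a raw value v from [lo, hi] into [-1, 1] for the GPU-side maths. */
static float to_signed_unit(float v, float lo, float hi) {
    return (hi == lo) ? 0.0f : 2.0f * (v - lo) / (hi - lo) - 1.0f;
}

/* Scale a score from [-1, 1] back into [0, 1] before returning it. */
static float to_unit(float s) {
    return (s + 1.0f) / 2.0f;
}
```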
At the end you have the rankings and the documents, denoted by ID numbers, because it would obviously be ridiculous to keep the whole document string as the label. The matrix is no longer kept around at this point; you just sort the documents by the values the GPU returned. You can now decide how you want to display this information.
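That final sort can be as simple as a `qsort` over (ID, score) pairs; the struct and function names here are my own assumptions:

```c
#include <stdlib.h>

/* A result pairs a document reference ID with the score the GPU returned. */
struct result {
    int   doc_id;
    float score;
};

/* Ascending order: a lower distance means a better match, so it sorts first. */
static int by_score(const void *a, const void *b) {
    float sa = ((const struct result *)a)->score;
    float sb = ((const struct result *)b)->score;
    return (sa > sb) - (sa < sb);
}

static void rank_results(struct result *results, size_t n) {
    qsort(results, n, sizeof *results, by_score);
}
```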
I will post the source code for all the algorithms I tried. Some of the algorithms ranked in descending order instead of ascending, so I changed them all so that the rank is inversely proportional to the match strength: best (smallest number) first, worst (biggest number) last. Every algorithm had the same quagmire: it could be tricked by one search term having too high a popularity in some document, rather than finding the document that scored best across all of the search term popularities provided. I found Manhattan Distance gave the results I liked best.
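A CPU sketch makes that weakness concrete. With made-up data where one document has a huge popularity for a single term, an inner-product score is dominated by that term, while Manhattan distance still favours the document that matches the whole query:

```c
#include <math.h>
#include <stddef.h>

/* Manhattan distance: lower is better, matching the ascending convention. */
static float manhattan(const float *doc, const float *query, size_t n) {
    float d = 0.0f;
    for (size_t i = 0; i < n; i++)
        d += fabsf(doc[i] - query[i]);
    return d;
}

/* Inner product: higher is better, and one huge per-term popularity can
 * dominate the score even when the other query terms barely match. */
static float inner_product(const float *doc, const float *query, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += doc[i] * query[i];
    return s;
}
```

For a query of (5, 5), a lopsided document (100, 0) beats a balanced document (5, 5) on inner product but loses on Manhattan distance, which is the behaviour described above.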
What I have done does not use any ML. In the future I may look into using ML for indexing and processing queries. AI chatbots are trained to be chatbots, not matching and ranking algorithms, and I don't know what it would look like to train a model for document ranking. Might be something to look into in the future.
C:
// OpenCL kernel
// Hypothetical signature (the original listing began mid-kernel);
// matrix is row-major, one row of `elements` popularities per document.
__kernel void rank_documents(__global const float *matrix,
                             __global const float *query,
                             __global float *scores,
                             const uint elements)
{
/* // Cosine Similarity */
/* size_t i = get_global_id(0); */
/* float */
/* dp = 0.0f, */
/* n1 = 0.0f, */
/* n2 = 0.0f; */
/* for (size_t ii = 0; ii < elements; ii++) { */
/* float */
/* v1 = matrix[i * elements + ii], */
/* v2 = query[ii]; */
/* n1 += v1 * v1; */
/* n2 += v2 * v2; */
/* dp += v1 * v2; */
/* } */
/* scores[i] = 1 - (dp / (sqrt(n1) * sqrt(n2))); */
// Manhattan Distance
size_t i = get_global_id(0);
scores[i] = 0.0f;
for (size_t ii = 0; ii < elements; ii++) {
float
v1 = matrix[i * elements + ii],
v2 = query[ii];
scores[i] += fabs(v1 - v2);
}
/* // Reciprocal distance (a descending-order variant) */
/* size_t i = get_global_id(0); */
/* scores[i] = 0.0f; */
/* for (size_t ii = 0; ii < elements; ii++) { */
/* float */
/* v1 = matrix[i * elements + ii], */
/* v2 = query[ii]; */
/* scores[i] += 1 / fabs(v1 - v2); */
/* } */
/*
// Vector Inner Product
size_t i = get_global_id(0);
scores[i] = 0.0f;
for (size_t ii = 0; ii < elements; ii++) {
float
v1 = matrix[i * elements + ii],
v2 = query[ii];
scores[i] += v1 * v2;
}
// scores[i] = 1 / scores[i];
*/
/*
// Tanimoto Coefficient
size_t i = get_global_id(0);
float
dp = 0.0f,
n1 = 0.0f,
n2 = 0.0f;
for (size_t ii = 0; ii < elements; ii++) {
float
v1 = matrix[i * elements + ii],
v2 = query[ii];
n1 += v1 * v1;
n2 += v2 * v2;
dp += v1 * v2;
}
scores[i] = 1.0f - (dp / (sqrt(n1) + sqrt(n2) - dp));
*/
}