Create Embeddings

Consider an example robot that reads a car's instruction manual from a text file, splits it into fragments, and computes an embedding (a feature vector) for each fragment, so that the relevant parts of the document can later be found by similarity search. The text fragments and their corresponding embeddings are written to a CSV file. This project is a preparatory step for the SearchEmbeddings project and should be run only once.

The robot project consists of a single diagram. Step by step, the robot works as follows:

  1. Reads the content of the file specified in the properties.
  2. Splits this content into fragments.
  3. Computes an embedding for each fragment.
  4. Writes the obtained embeddings to Data Tables.
  5. Saves the text fragments with their corresponding embeddings to a CSV file.
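
The five steps above can be sketched in Python as follows. This is only an illustration of the data flow, not the robot's implementation: `fake_embedding` is a hypothetical stand-in that derives a short vector from a hash so the sketch runs offline, the function name `create_embeddings`, the `;` separator, and splitting by non-empty lines are all assumptions.

```python
import csv
import hashlib
from pathlib import Path


def fake_embedding(text: str, dims: int = 4) -> list[float]:
    """Hypothetical stand-in for a real embedding model (assumption):
    derives a small deterministic vector from a SHA-256 hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def create_embeddings(source: Path, target: Path) -> int:
    """Read a text file, embed each fragment, and save the result as CSV.
    Returns the number of fragments written."""
    # Steps 1-2: read the file and split it into fragments
    # (assumed here to be the non-empty lines).
    content = source.read_text(encoding="utf-8")
    fragments = [line.strip() for line in content.splitlines() if line.strip()]

    # Steps 3-4: compute an embedding per fragment and collect table rows.
    rows = [{"Text": f, "Embeddings": fake_embedding(f)} for f in fragments]

    # Step 5: save the fragments with their embeddings to a CSV file.
    with target.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["Text", "Embeddings"], delimiter=";")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```
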

The project diagram looks like this (for convenience, the blocks of the diagram are numbered):

  1. Start Block (any diagram begins with this block).
  2. Read Line from File Block reads the content of the file sequentially (line by line). The following properties are specified for this block:
  • File Name (the name of the file from which information will be read);
  • Encoding (the encoding of the file from which information will be read);
  • Skip Empty Lines (when the flag is set, empty lines are skipped while reading the file).
  3. Condition Block checks whether the specified condition is true, after which the script execution continues towards the "Yes" exit (if the condition is met) or towards the "No" exit (if the condition is not met).

The condition is written in the format: “variable” equals (==)/ greater than (>)/ less than (<) “value”.

For example: $a == “Hello”, that is, if the value of the variable $a is equal to “Hello”, then the exit is “Yes”, otherwise – the exit is “No”.

$Result > 5, that is, if the value of the variable $Result is greater than 5, then the exit is “Yes”, otherwise – the exit is “No”.

In this case, the condition is set as: $Lines.Count > 0, that is, if the number of items in the list $Lines is greater than zero, then the exit is “Yes”, otherwise – the exit is “No”.
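
The way a Condition block picks its exit can be mimicked with a small Python helper. The function `condition_exit` and the `OPERATORS` mapping are illustrative names, not part of the product; `len(lines)` plays the role of `$Lines.Count`.

```python
import operator

# The three comparisons named in the text; the mapping itself is illustrative.
OPERATORS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


def condition_exit(left, op: str, right) -> str:
    """Return the exit a Condition block would take: "Yes" or "No"."""
    return "Yes" if OPERATORS[op](left, right) else "No"


lines = ["Check the oil level."]
print(condition_exit("Hello", "==", "Hello"))  # Yes
print(condition_exit(3, ">", 5))               # No
print(condition_exit(len(lines), ">", 0))      # Yes
```
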

  4. Get Embeddings Block obtains embeddings for a string or a list of strings using the OpenAI service. The following properties are specified for this block:
  • Text (input text for which embeddings will be calculated);
  • Model (the neural network model used to compute the embeddings);
  • As Data Table (when the flag is set, the result is returned as a data table with two columns: “Text” and “Embeddings”);
  • Timeout (maximum waiting time for a response in seconds).
  5. Join Data Tables Block joins two Data Tables into one. The following properties are specified for this block:
  • First Data Table;
  • Second Data Table;
  • Join Type (how the tables are joined if their schemas differ. In this case, the type is “Add”).
  6. Save to CSV Block saves the Data Table to a CSV document. The following properties are specified for this block:
  • Data Table (the Data Table that needs to be saved in the document);
  • File Path (the path to the file where the Data Table needs to be saved);
  • Separator (the separator character);
  • Encoding (the file encoding).
  7. Log Block outputs arbitrary messages and/or variable values to the log during the robot's script execution. The “Value” property is specified for this block: a text constant is written in quotes, and a variable name starts with the $ symbol. In this project, the block logs a message that a line has been read from the file and includes that line via the variable.
  8. Add Item to List Block adds the specified item to the end of the list. The following properties are specified for this block:
  • List (the list to which the new item needs to be added);
  • Item (the variable that needs to be added to the list).
  9. Condition Block checks whether the specified condition is true, after which the script execution continues towards the "Yes" exit (if the condition is met) or towards the "No" exit (if the condition is not met).

The condition is written in the format: “variable” equals (==)/ greater than (>)/ less than (<) “value”.

For example: $a == “Hello”, that is, if the value of the variable $a is equal to “Hello”, then the exit is “Yes”, otherwise – the exit is “No”.

$Result > 5, that is, if the value of the variable $Result is greater than 5, then the exit is “Yes”, otherwise – the exit is “No”.

In this case, the condition is set as: $Lines.Count > 20, that is, if the number of items in the list $Lines is greater than 20, then the exit is “Yes”, otherwise – the exit is “No”.
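
The two conditions ($Lines.Count > 20 and $Lines.Count > 0) suggest that lines are accumulated and processed in batches. The sketch below shows one plausible reading of that control flow, inferred from the block list rather than from the diagram itself. The names `process_in_batches` and `flush` are hypothetical; `flush` stands for the sequence Get Embeddings, Join Data Tables, Save to CSV.

```python
from typing import Callable, Iterable

BATCH_THRESHOLD = 20  # taken from the condition $Lines.Count > 20


def process_in_batches(lines: Iterable[str], flush: Callable[[list[str]], None]) -> None:
    """Accumulate lines and hand them to `flush` in batches (assumed flow)."""
    batch: list[str] = []
    for line in lines:
        batch.append(line)                # Add Item to List
        if len(batch) > BATCH_THRESHOLD:  # Condition: $Lines.Count > 20
            flush(list(batch))            # Get Embeddings, Join, Save to CSV
            batch.clear()                 # Clear List
    if len(batch) > 0:                    # Condition: $Lines.Count > 0
        flush(list(batch))                # process the remaining lines
```

Because the comparison is strictly greater than 20, a batch is flushed once it holds 21 lines. With 50 input lines, this sketch flushes batches of 21, 21, and 8 lines.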

  10. Get Embeddings Block obtains embeddings for a string or a list of strings using the OpenAI service. The following properties are specified for this block:
  • Text (input text for which embeddings will be calculated);
  • Model (the neural network model used to compute the embeddings);
  • As Data Table (when the flag is set, the result is returned as a data table with two columns: “Text” and “Embeddings”);
  • Timeout (maximum waiting time for a response in seconds).
  11. Join Data Tables Block joins two Data Tables into one. The following properties are specified for this block:
  • First Data Table;
  • Second Data Table;
  • Join Type (how the tables are joined if their schemas differ. In this case, the type is “Add”).
  12. Save to CSV Block saves the Data Table to a CSV document. The following properties are specified for this block:
  • Data Table (the Data Table that needs to be saved in the document);
  • File Path (the path to the file where the Data Table needs to be saved);
  • Separator (the separator character);
  • Encoding (the file encoding).
  13. Clear List Block clears the list by removing all its elements. The property “List” (the list that needs to be cleared) is specified for this block.
  14. End Block (this block concludes the script execution or returns the subprocess diagram to the main process).
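
To make the Get Embeddings and Join Data Tables blocks more concrete, here is a minimal Python sketch of the data shapes involved. Everything in it is an assumption made for illustration: `toy_embed` is a hypothetical stand-in for the OpenAI model and does not produce real embeddings, a data table is modelled as a list of dictionaries, the result without the “As Data Table” flag is assumed to be a plain list of vectors, and the “Add” join type is assumed to append the rows of the second table to the first.

```python
from typing import Callable, Union

Vector = list[float]
Row = dict[str, object]


def get_embeddings(
    text: Union[str, list[str]],
    embed: Callable[[str], Vector],
    as_data_table: bool = True,
):
    """Embed a single string or a list of strings (illustrative only)."""
    texts = [text] if isinstance(text, str) else list(text)
    vectors = [embed(t) for t in texts]
    if as_data_table:
        # Two columns, as described for the "As Data Table" flag.
        return [{"Text": t, "Embeddings": v} for t, v in zip(texts, vectors)]
    return vectors  # assumption: plain vectors when the flag is not set


def join_tables(first: list[Row], second: list[Row]) -> list[Row]:
    # Assumption: the "Add" join type appends the rows of the second table.
    return first + second


def toy_embed(text: str) -> Vector:
    # Hypothetical stand-in for the real model: NOT a meaningful embedding.
    return [float(len(text)), float(text.count(" "))]


table = get_embeddings("Check the oil level.", toy_embed)
table = join_tables(
    table,
    get_embeddings(["Inflate the tyres.", "Replace the filter."], toy_embed),
)
print(len(table))  # 3
```
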