You may have noticed that there has been some discussion lately about Large Language Models (LLMs).
If we set aside, for the moment, issues of gross overhype, there are many useful things they can do, such as helping to generate some of the code below.
Much of the heavy lifting in the analysis of unstructured data is done by vector embeddings of text. Building a RAG (Retrieval-Augmented Generation) pipeline, for example, relies on them.
So we want to deal with embeddings.
Local
One objection to the use of these tools concerns privacy: you may not want to send your data to someone else's server in the cloud, where it can be used for training or other purposes. That matters if, for instance, the data consists of writeups of your own dreams, as here.
If you are looking to avoid this, you will want a scheme that can be run completely in the safety and comfort of your own laptop.
So we want our embeddings to be local.
Portable
Ideally, we would like to be able to take our private local analysis of embeddings and apply it to any given corpus of documents without burdensome overhead in setup.
So we want our local embeddings to be portable.
While the current implementation succeeds, for the most part, in doing everything locally, the portability still needs work.
This is an initial implementation of a general vision for Embeddings, Local and Portable, or ELP. Since this is the first such implementation, why not name it after ELP's first album? Now you get it.
Dreams
The dreams live in postgres in the following table.
```{sql}
CREATE TABLE IF NOT EXISTS public.doc_embeddings
(
    idx_ bigint NOT NULL DEFAULT nextval('doc_idx_sequence'::regclass),
    doc_ character varying(8192) COLLATE pg_catalog."default",
    url_ character varying(2048) COLLATE pg_catalog."default",
    emb_ vector(768),
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
    date_ date,
    title_ character varying COLLATE pg_catalog."default",
    CONSTRAINT doc_embeddings_pkey PRIMARY KEY (idx_)
)
```
idx_ : primary key, auto-generated
doc_ : the actual text of the dream
url_ : reserved for future use; not used here
emb_ : the embedding (nomic-embed-text, length 768, generated locally with ollama)
date_ : the date of the dream
title_ : a short description
Both postgres and ollama run in docker containers on the host, at their default ports (5432 and 11434 respectively). The host runs Pop!_OS 22.04 LTS (an Ubuntu derivative). postgres has the pgvector extension installed.
To generate the embeddings, the docs are retrieved in a C++ program using libpqxx. The embeddings are then generated in the same program with ollama-hpp and pushed back to postgres.
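For orientation, here is a minimal sketch of the same embedding call, expressed in R against ollama's local HTTP API rather than in C++. The endpoint, model name, and field names are ollama's; the sample text and the `embed_text` helper are made up for illustration.

```{r}
# Sketch only: what the C++ program does, via ollama's /api/embeddings
# endpoint on the default port. Requires the httr package.
library(httr)
library(jsonlite)

embed_text <- function(txt) {
  resp <- POST(
    "http://localhost:11434/api/embeddings",
    body = toJSON(list(model = "nomic-embed-text", prompt = txt),
                  auto_unbox = TRUE),
    content_type_json()
  )
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$embedding
}

emb <- embed_text("a sample dream writeup")  # numeric vector of length 768
```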
The rows are then brought down to the host by running the following psql command.
```{psql}
postgres=# \copy (SELECT row_to_json(t) FROM doc_embeddings t) TO '/home/pma/Documents/gh-source/ELP01/incl_/tarkus/RRR/output.json';
```
This is a manual step which needs to be automated in the sequel.
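One way that automation might look (a sketch, not the post's code: host, database name, and credentials are all assumptions) is to query postgres directly from R with DBI and RPostgres, skipping the intermediate JSON file. Note that the pgvector column would still come back in its text form and need parsing, as later in this post.

```{r}
# Hypothetical alternative to the manual \copy step: pull the rows
# straight into R. Connection details here are assumed, not the post's.
library(DBI)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  host = "localhost", port = 5432, dbname = "postgres",
  user = "postgres", password = Sys.getenv("PGPASSWORD")
)
de_json <- dbGetQuery(con, "SELECT * FROM doc_embeddings")
dbDisconnect(con)
```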
Pull in all the libraries.
Since we are working in R, we start by loading all our libraries.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(dplyr)
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
I had to remove instances of \" from the file before it could be read.
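For the record, here is a minimal sketch of that cleanup in R; it assumes the offending sequences are literal backslash-quote pairs (the post does not say how the removal was actually done).

```{r}
# Sketch of the cleanup step: strip literal \" sequences so the file parses.
raw   <- readLines("output.json", warn = FALSE)
clean <- gsub('\\"', '', raw, fixed = TRUE)   # '\\"' is the two chars \ and "
writeLines(clean, "output.json")
```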
stream_in is a jsonlite function.
de_json <- stream_in(file("output.json"))
opening file input connection.
Found 500 records...
Found 881 records...
Imported 881 records. Simplifying...
closing file input connection.
json -> dataframe
Now that we have the embeddings and the rest of the information from the postgres database, we get it into a regular R dataframe. I asked Claude for help with these steps.
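A minimal sketch of what that conversion might look like. It assumes emb_ arrives as pgvector's text form, a string like "[0.1,0.2,...]"; if stream_in has already simplified it into a list of numeric vectors, the fromJSON step can be dropped. The names emb_mat and dreams are mine, not the post's.

```{r}
# Sketch: parse each emb_ string into a numeric vector and stack them
# into a matrix, keeping the metadata in a tibble alongside.
emb_mat <- do.call(rbind,
                   lapply(de_json$emb_, function(e) as.numeric(fromJSON(e))))

dreams <- tibble(
  idx   = de_json$idx_,
  date  = as.Date(de_json$date_),
  title = de_json$title_,
  doc   = de_json$doc_
)

dim(emb_mat)   # expect 881 rows, 768 columns
```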