Tarkus

Introduction

Why Tarkus?

The Album Cover of Emerson Lake and Palmer's first album, Tarkus

Embeddings

You may have noticed that there has been some discussion lately about Large Language Models (LLMs).
If we set aside, for the moment, issues of gross overhype, there are many useful things they can do, such as helping to generate some of the code below.

Much of the heavy lifting in the analysis of unstructured data is done by creating vector embeddings of text. Building a RAG (Retrieval-Augmented Generation) pipeline, for example, relies on this.

So we want to deal with embeddings.

Local

One objection to the use of these tools is privacy: you may not want to send your data to someone else's server in the cloud, where it can be used for training or other purposes. Especially if, for instance, the data consists of write-ups of your own dreams, as it does here.

If you are looking to avoid this, you will want a scheme that can be run completely in the safety and comfort of your own laptop.

So we want our embeddings to be local.

Portable

Ideally, we would like to be able to take our private local analysis of embeddings and apply it to any given corpus of documents without burdensome overhead in setup.

So we want our local embeddings to be portable.

Whereas the current implementation succeeds for the most part in doing everything locally, the portability still needs work.

This is an initial implementation of a general vision for Embeddings, Local and Portable, or ELP. Since this is the first such implementation, why not call it after ELP's first album? Now you get it.

Dreams

The dreams are stored in Postgres in the following table.

```{sql}
CREATE TABLE IF NOT EXISTS public.doc_embeddings
(
    idx_ bigint NOT NULL DEFAULT nextval('doc_idx_sequence'::regclass),
    doc_ character varying(8192) COLLATE pg_catalog."default",
    url_ character varying(2048) COLLATE pg_catalog."default",
    emb_ vector(768),
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
    date_ date,
    title_ character varying COLLATE pg_catalog."default",
    CONSTRAINT doc_embeddings_pkey PRIMARY KEY (idx_)
);
```
  • idx_ : primary key, auto-generated
  • doc_ : the actual text of the dream
  • url_ : reserved for future use; not used here
  • emb_ : the embedding (nomic-embed-text, length 768, generated locally with Ollama)
  • created_at : insertion timestamp, auto-generated
  • date_ : the date of the dream
  • title_ : a short description

Both Postgres and Ollama are running in Docker containers on the host at their default ports (5432 and 11434, respectively). The host is Pop!_OS 22.04 LTS (Ubuntu-based). Postgres has the pgvector extension installed.

To generate the embeddings, the docs are retrieved in a C++ program using libpqxx. The embeddings are then generated in the same program with ollama-hpp and pushed back to Postgres.
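
For illustration only (the embedding step in this post is done in C++ with ollama-hpp), here is a minimal sketch in R of the same request against the local Ollama server, assuming its standard /api/embeddings endpoint:

```{r}
#| eval: false

# Sketch: ask the local Ollama server for a nomic-embed-text embedding.
# This mirrors what the C++ program does; it is not that program.
library(httr2)

embed_one <- function(text, model = "nomic-embed-text") {
  resp <- request("http://localhost:11434/api/embeddings") |>
    req_body_json(list(model = model, prompt = text)) |>
    req_perform() |>
    resp_body_json()
  unlist(resp$embedding)   # numeric vector, length 768 for nomic-embed-text
}
```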

These are then brought down to the host by running the following psql command, which writes one JSON object per row (the line-delimited format that jsonlite::stream_in expects).

```{psql}

postgres=# \copy (SELECT row_to_json(t) FROM doc_embeddings t) TO '/home/pma/Documents/gh-source/ELP01/incl_/tarkus/RRR/output.json';

```

This is a manual step which needs to be automated in the sequel.

Pull in all the libraries.

Since we are working in R, we start by loading all our libraries.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(dplyr)
library(jsonlite)

Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten
library(readr)

library(RPostgreSQL)
Loading required package: DBI
library(DBI)

library(Rcpp)

Connect to the Postgres database:

```{r}
#| echo: false

con <- dbConnect(
  odbc::odbc(),
  .connection_string = "Driver={PostgreSQL Unicode};Server=localhost;Database=postgres;UID=postgres;PWD=xxx;Port=5432;max.char = 100000",
  timeout = 10
)
```
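
As an aside, with this connection open the rows could in principle be pulled straight from Postgres, skipping the manual \copy export above; a minimal sketch, not the route taken here:

```{r}
#| eval: false

# Sketch only: read the documents and embeddings directly over the DBI connection.
# pgvector's vector column arrives as text of the form "[0.1,0.2,...]", so each
# emb_ value still needs to be parsed, e.g. with jsonlite::fromJSON.
de_direct <- dbGetQuery(con, "SELECT idx_, doc_, emb_, date_, title_ FROM doc_embeddings")
```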

Read the JSON file

I had to remove instances of \” in the file in order to be able to read it.
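
That edit was done by hand; a small sketch of the same cleanup in R, assuming the escaped quotes can simply be dropped (output_clean.json is a hypothetical file name):

```{r}
#| eval: false

# Strip the \" sequences that trip up the JSON parser and write a cleaned copy.
raw_lines <- readLines("output.json")
writeLines(gsub('\\"', "", raw_lines, fixed = TRUE), "output_clean.json")
```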

stream_in is a jsonlite function.

de_json <- stream_in(file("output.json"))
opening file input connection.

 Found 500 records...
 Found 881 records...
 Imported 881 records. Simplifying...
closing file input connection.

JSON -> data frame

Now that we have the embeddings and the rest of the information from the Postgres database, we get it into a regular R data frame. I asked Claude for help with these functions.

json_to_df <- function(q0) {
    # q0 is one row of de_json; emb_ holds the embedding as a JSON array string
    qi <- q0$idx_
    q1 <- q0$emb_ %>% fromJSON()
    # one row per document, one column per embedding dimension
    df <- data.frame(matrix(q1, nrow=1))
    row.names(df) <- qi
    return(df)
}


# convert record k of the streamed-in JSON to a 1-row data frame
fn <- function(k) {
  de_json[k,] %>% json_to_df()
}


xx <- seq_len(nrow(de_json)) %>% map(fn)

# stack the single-row data frames into one 881 x 768 data frame
combined_df <- do.call(rbind, xx)
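
A quick sanity check that the reshaping worked: one row per record, one column per embedding dimension.

```{r}
#| eval: false

# Expect 881 rows (documents) by 768 columns (embedding dimensions).
dim(combined_df)
```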

PCA

Now that we have the vector embeddings in the data frame, we can carry out the principal component analysis.

ppp <-  prcomp(combined_df)

biplot(ppp)
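
Before reading too much into the leading components, it is worth checking how much of the variance they actually capture; a quick way, using the standard deviations that prcomp returns:

```{r}
#| eval: false

# Proportion of total variance explained by each principal component.
var_explained <- ppp$sdev^2 / sum(ppp$sdev^2)
head(var_explained)
```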

And why don't we save the principal components as a .csv, in case we want to read them back into our database later? It might be useful.

# project row cdf_idx of the_cdf onto principal component pc_idx
get_principal_component <- function( the_cdf, cdf_idx, pc_idx) 
{
  q0 <- ppp %>% predict( the_cdf[cdf_idx,] )
  q0[pc_idx]
}

# recover the document indices from the row names
idx <- rownames(combined_df) %>% as.numeric()

PC1 <- seq_len(nrow(combined_df)) %>% map( function(it) get_principal_component(combined_df, it, 1) ) %>% unlist()

PC2 <- seq_len(nrow(combined_df)) %>% map( function(it) get_principal_component(combined_df, it, 2) ) %>% unlist()


some_nice_principal_components <- data.frame(idx, PC1, PC2)

some_nice_principal_components %>% write.csv("some_nice_principal_components.csv")
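
Since the DBI connection con is still open, the same data frame could also go straight back into Postgres; a sketch, with pca_components as a hypothetical table name:

```{r}
#| eval: false

# Sketch: push the principal components back into Postgres as a new table.
# "pca_components" is a made-up name, not a table used elsewhere in this post.
dbWriteTable(con, "pca_components", some_nice_principal_components, overwrite = TRUE)
```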

Now we can have some fun exploring our data.

# the same PC1 vs PC2 view that the biplot above gives us, without the loading arrows
plot(PC2 ~ PC1, data=some_nice_principal_components) 
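
Since ggplot2 is already loaded, the same scatter can also be drawn with the document index mapped to colour, which makes any drift over time easier to see; a sketch:

```{r}
#| eval: false

# PC1 vs PC2, coloured by idx (a rough proxy for time).
ggplot(some_nice_principal_components, aes(x = PC1, y = PC2, colour = idx)) +
  geom_point(alpha = 0.6)
```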

# is PC1 a function of time? (idx is increasing with time)
plot(PC1 ~ idx, data=some_nice_principal_components) 

# no correlation!
hm <- lm (PC1 ~ idx, data=some_nice_principal_components) 

summary(hm)

Call:
lm(formula = PC1 ~ idx, data = some_nice_principal_components)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.24001 -0.10115 -0.04864  0.02485  0.68380 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.387e-02  3.515e-02  -0.964    0.335
idx          2.370e-05  2.422e-05   0.979    0.328

Residual standard error: 0.1828 on 879 degrees of freedom
Multiple R-squared:  0.001089,  Adjusted R-squared:  -4.761e-05 
F-statistic: 0.9581 on 1 and 879 DF,  p-value: 0.3279