Transfer learning

Connect to superduper

from superduper import superduper

db = superduper('mongomock:///test_db')

Get useful sample data

!curl -O https://superduperdb-public-demo.s3.amazonaws.com/text_classification.json
import json

with open("text_classification.json", "r") as f:
    data = json.load(f)
num_classes = 2

After loading the data, we reshape it into the records that will be inserted into the database.

datas = [{'txt': d['x'], 'label': d['y']} for d in data]
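
To make the reshaping step concrete, here is a minimal sketch using two made-up records. It assumes, as the comprehension above implies, that each entry in text_classification.json is an object with an `x` field (the text) and a `y` field (the label). The sample values are hypothetical and are not taken from the real dataset.

```python
# Hypothetical sample mirroring the assumed {'x': text, 'y': label} layout.
sample = [
    {"x": "The movie was wonderful.", "y": 1},
    {"x": "I would not watch it again.", "y": 0},
]

# Same comprehension as in the tutorial: rename x -> txt and y -> label.
records = [{"txt": d["x"], "label": d["y"]} for d in sample]

for r in records:
    print(r)
```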

Insert simple data

With auto_schema enabled, we can insert data directly: superduper inspects the inserted values, infers their data types, and creates a matching table schema.

from superduper import Document

table_or_collection = db['docs']

ids = db.execute(table_or_collection.insert([Document(data) for data in datas]))
select = table_or_collection.select()
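
As a rough illustration of what inferring a schema from inserted data can look like, the sketch below maps each field of a record to the type of its value. The `infer_schema` helper is hypothetical and far simpler than whatever superduper does internally; it only conveys the general idea.

```python
def infer_schema(record):
    """Map each field name to the Python type name of its value."""
    return {field: type(value).__name__ for field, value in record.items()}


# Hypothetical record in the same shape as the tutorial's documents.
record = {"txt": "The movie was wonderful.", "label": 1}

print(infer_schema(record))
```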

Compute features

key = 'txt'
import sentence_transformers
from superduper import vector, Listener
from superduper_sentence_transformers import SentenceTransformer

superdupermodel = SentenceTransformer(
    identifier="embedding",
    object=sentence_transformers.SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2"),
    postprocess=lambda x: x.tolist(),
)

jobs, listener = db.apply(
    Listener(
        model=superdupermodel,
        select=select,
        key=key,
        identifier="features"
    )
)

Choose features key from feature listener

input_key = listener.outputs
training_select = select.outputs(listener.predict_id)

We can now read the computed features back from the database.

feature = list(training_select.limit(1).execute())[0][input_key]
feature_size = len(feature)
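
As a rough mental model of what the listener does, the sketch below applies a function to the `txt` field of every record, stores the result under an output key, and then reads one feature back to measure its length. Everything here is a stand-in: `toy_embed`, `apply_listener`, and the `"outputs_features"` key are hypothetical names, whereas the real listener chooses its own output key (available as `listener.outputs`).

```python
def toy_embed(text, size=8):
    """Hypothetical stand-in for a sentence embedding: a fixed-length
    vector of character-code buckets. Not a real language model."""
    vec = [0.0] * size
    for ch in text:
        vec[ord(ch) % size] += 1.0
    return vec


def apply_listener(records, key, model, output_key):
    """Return copies of the records with model(record[key]) stored under output_key."""
    return [{**r, output_key: model(r[key])} for r in records]


records = [
    {"txt": "good film", "label": 1},
    {"txt": "bad film", "label": 0},
]

output_key = "outputs_features"  # hypothetical; the tutorial uses listener.outputs
enriched = apply_listener(records, "txt", toy_embed, output_key)

# Same pattern as the tutorial: take one record and measure its feature length.
feature = enriched[0][output_key]
feature_size = len(feature)
print(feature_size)
```

The original records are left untouched, and the label travels alongside the new feature, which is what lets a later training step select both at once.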

Build and train classifier

from superduper_sklearn import Estimator, SklearnTrainer
from sklearn.svm import SVC

model = Estimator(
    identifier="my-model",
    object=SVC(),
    trainer=SklearnTrainer(
        "my-trainer",
        key=(input_key, "label"),
        select=training_select,
    ),
)
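
The idea behind transfer learning here is that the embedding model stays fixed and only a small classifier is fit on its outputs. The sketch below illustrates that with a nearest-centroid classifier written in plain Python. This is a deliberately simpler technique swapped in for the SVC used above, and the feature vectors are made up.

```python
import math


def fit_centroids(features, labels):
    """Average the feature vectors of each class to get one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Made-up 2-d "embeddings"; in the tutorial these come from the listener.
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 0, 0]

centroids = fit_centroids(features, labels)
print(predict(centroids, [0.7, 0.3]))
print(predict(centroids, [0.3, 0.7]))
```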

Define a validation to evaluate the model's performance after training.

from superduper import Dataset, Metric, Validation


def acc(x, y):
    return sum([xx == yy for xx, yy in zip(x, y)]) / len(x)


accuracy = Metric(identifier="acc", object=acc)
validation = Validation(
    "transfer_learning_performance",
    key=(input_key, "label"),
    datasets=[
        Dataset(identifier="my-valid", select=training_select.add_fold('valid'))
    ],
    metrics=[accuracy],
)
model.validation = validation
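
As a quick check of what the `acc` metric computes, the snippet below runs the same function on hypothetical predictions and labels. Reading `x` as predictions and `y` as targets is an assumption based on how the function is used in the validation.

```python
def acc(x, y):
    # Fraction of positions where the two sequences agree.
    return sum([xx == yy for xx, yy in zip(x, y)]) / len(x)


predictions = [1, 0, 1, 1]  # hypothetical model outputs
targets = [1, 0, 0, 1]      # hypothetical ground-truth labels

print(acc(predictions, targets))  # 3 of 4 positions match -> 0.75
```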

Calling db.apply adds the model to the database. Because the model has a trainer attached, applying it also runs the training job.

db.apply(model)
model.encode()

Get the training metrics

model = db.load('model', model.identifier)
model.metric_values

from superduper import Template

t = Template('transfer-learner', template=model, substitutions={'docs': 'table'})
t.export('.')
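
The `substitutions={'docs': 'table'}` argument suggests that occurrences of the concrete name `docs` are swapped for a placeholder named `table`, so that the exported template can be reused with a different table. The sketch below shows that idea on a plain dictionary. The `<var:table>` placeholder syntax and the `substitute` helper are assumptions made for illustration, not superduper's actual implementation.

```python
def substitute(obj, substitutions):
    """Recursively replace each concrete value with a '<var:name>' placeholder."""
    if isinstance(obj, str):
        for concrete, name in substitutions.items():
            obj = obj.replace(concrete, f"<var:{name}>")
        return obj
    if isinstance(obj, dict):
        # Only values are rewritten; dictionary keys are kept as-is.
        return {k: substitute(v, substitutions) for k, v in obj.items()}
    if isinstance(obj, list):
        return [substitute(v, substitutions) for v in obj]
    return obj


# Hypothetical serialized component that refers to the 'docs' table.
component = {
    "identifier": "my-model",
    "select": {"table": "docs", "fields": ["txt", "label"]},
}

template = substitute(component, {"docs": "table"})
print(template["select"]["table"])
```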