download

superduper.misc.download

download_from_one

download_from_one(r: superduper.base.document.Document)

| Parameter | Description |
|---|---|
| r | document to download from |

Download content from a single document.

This function will find all URIs in the document and download them.

download_content

download_content(db,
    query: Union[superduper.backends.base.query.Query, Dict],
    ids: Optional[Sequence[str]] = None,
    documents: Optional[List[superduper.base.document.Document]] = None,
    raises: bool = True,
    n_workers: Optional[int] = None) -> Optional[Sequence[superduper.base.document.Document]]

| Parameter | Description |
|---|---|
| db | database instance |
| query | query to be executed |
| ids | ids to be downloaded |
| documents | documents to be downloaded |
| raises | whether to raise errors |
| n_workers | number of download workers |

Download content contained in uploaded data.

Items to be downloaded are identified via subdocuments of the form exemplified below. By default, items are downloaded to the database, unless a download_update function is provided.

>>> d = {"_content": {"uri": "<uri>", "encoder": "<encoder-identifier>"}}
>>> def update(key, id, bytes):
...     with open(f'/tmp/{key}+{id}', 'wb') as f:
...         f.write(bytes)
>>> download_content(None, None, ids=["0"], documents=[d])

gather_uris

gather_uris(documents: Sequence[superduper.base.document.Document],
    gather_ids: bool = True) -> Tuple[List[str], List[str], List[Any], List[str]]

| Parameter | Description |
|---|---|
| documents | list of dictionaries |
| gather_ids | if True then gather ids of documents |

Get the uris out of all documents as denoted by {"_content": ...}.
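The traversal described above can be illustrated with a minimal sketch. It is not the library's implementation: `gather_uris_sketch` is a hypothetical helper, and the meaning and order of the four returned lists (URIs, key paths, encoder identifiers, ids), the dotted key paths, and the use of `_id` as the document id are all assumptions made for illustration.

```python
from typing import Any, List, Tuple


def _walk(node: Any, prefix: str, uris: List[str], keys: List[str], encoders: List[Any]) -> int:
    """Recursively collect every {"_content": {"uri": ...}} below node; return the count found."""
    count = 0
    if isinstance(node, dict):
        content = node.get("_content")
        if isinstance(content, dict) and "uri" in content:
            uris.append(content["uri"])
            keys.append(prefix)
            encoders.append(content.get("encoder"))
            return 1
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            count += _walk(value, path, uris, keys, encoders)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            path = f"{prefix}.{i}" if prefix else str(i)
            count += _walk(value, path, uris, keys, encoders)
    return count


def gather_uris_sketch(documents: List[dict], gather_ids: bool = True) -> Tuple[list, list, list, list]:
    """Hypothetical stand-in: return parallel lists (uris, keys, encoders, ids)."""
    uris: List[str] = []
    keys: List[str] = []
    encoders: List[Any] = []
    ids: List[Any] = []
    for position, doc in enumerate(documents):
        found = _walk(doc, "", uris, keys, encoders)
        # Assumption: with gather_ids the document's "_id" is used, otherwise its position.
        ids.extend([doc["_id"] if gather_ids else position] * found)
    return uris, keys, encoders, ids


docs = [
    {"_id": "0", "img": {"_content": {"uri": "https://example.com/a.png", "encoder": "image"}}},
    {"_id": "1", "meta": {"files": [{"_content": {"uri": "file:///tmp/b.bin", "encoder": "bytes"}}]}},
]
uris, keys, encoders, ids = gather_uris_sketch(docs)
for row in zip(ids, keys, uris, encoders):
    print(row)
```

Keeping the four lists parallel means entry `i` of each list describes the same download, which is convenient when handing the result to a downloader that works by index.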

timeout

timeout(seconds)

| Parameter | Description |
|---|---|
| seconds | seconds until timeout |

Context manager to set a timeout.

timeout_handler

timeout_handler(signum, frame)

| Parameter | Description |
|---|---|
| signum | signal number |
| frame | frame |

Timeout handler that raises a TimeoutException.
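How a signal handler and a context manager can cooperate to enforce a timeout is easier to see in code. The following is a sketch under stated assumptions, not the module's source: it assumes the mechanism is `signal.SIGALRM` (the `signum, frame` signature suggests a signal handler), which works only on Unix-like systems and only in the main thread. The three names are redefined locally so the example is self-contained.

```python
import signal
import time
from contextlib import contextmanager


class TimeoutException(Exception):
    """Raised when the guarded block runs longer than allowed."""


def timeout_handler(signum, frame):
    # Invoked when the alarm signal arrives; convert it into a Python exception.
    raise TimeoutException("operation timed out")


@contextmanager
def timeout(seconds):
    # Install the handler and arm the alarm before running the block.
    previous = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        # Always cancel the pending alarm and restore the previous handler.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)


timed_out = False
try:
    with timeout(1):
        time.sleep(3)  # stands in for a slow download
except TimeoutException:
    timed_out = True
print("timed out:", timed_out)
```

Restoring the previous handler in `finally` keeps the context manager from leaking state into unrelated code that also relies on `SIGALRM`.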

BaseDownloader

BaseDownloader(self,
    uris: List[str],
    n_workers: int = 0,
    timeout: Optional[int] = None,
    headers: Optional[Dict] = None,
    raises: bool = True)

| Parameter | Description |
|---|---|
| uris | list of URIs/file names to fetch |
| n_workers | number of multiprocessing workers |
| timeout | seconds until a request times out |
| headers | dictionary of request headers passed to the requests package |
| raises | whether to raise errors (True/False) |

Base class for downloading files.
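To make the roles of `n_workers` and `raises` concrete, here is a rough sketch of a downloader loop. `SketchDownloader`, its `go` method, and the injected `fetch` callable are hypothetical names, not the library's API. It also assumes that `n_workers=0` means sequential execution and that parallel work runs on a thread pool from `multiprocessing.pool`; the real class may differ on both points.

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, Dict, List


class SketchDownloader:
    """Hypothetical stand-in for a base downloader; not the library class."""

    def __init__(self, uris: List[str], fetch: Callable[[str], bytes],
                 n_workers: int = 0, raises: bool = True):
        self.uris = uris
        self.fetch = fetch          # injected so the sketch needs no network access
        self.n_workers = n_workers
        self.raises = raises
        self.results: Dict[int, bytes] = {}

    def _download(self, i: int) -> None:
        try:
            self.results[i] = self.fetch(self.uris[i])
        except Exception:
            if self.raises:
                raise
            # With raises=False the failed item is skipped silently.

    def go(self) -> None:
        indices = range(len(self.uris))
        if self.n_workers == 0:
            # Assumption: zero workers means "run in the calling thread".
            for i in indices:
                self._download(i)
        else:
            with ThreadPool(self.n_workers) as pool:
                pool.map(self._download, indices)


def fake_fetch(uri: str) -> bytes:
    # Pretend download: fails for one URI, echoes the others back as bytes.
    if uri.endswith("missing"):
        raise FileNotFoundError(uri)
    return uri.encode()


d = SketchDownloader(["a", "b", "missing"], fake_fetch, n_workers=2, raises=False)
d.go()
print(sorted(d.results))
```

Results are stored by index rather than by URI so that duplicate URIs in the input do not overwrite each other.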

DownloadFiles

DownloadFiles(self,
    identifier: str,
    db: dataclasses.InitVar[typing.Optional[ForwardRef('Datalayer')]] = None,
    uuid: None = <factory>,
    *,
    upstream: "t.Optional[t.List['Component']]" = None,
    plugins: "t.Optional[t.List['Plugin']]" = None,
    artifacts: 'dc.InitVar[t.Optional[t.Dict]]' = None,
    cache: 't.Optional[bool]' = True,
    status: 't.Optional[Status]' = None,
    signature: str = 'singleton',
    datatype: 'EncoderArg' = None,
    output_schema: 't.Optional[Schema]' = None,
    model_update_kwargs: None = <factory>,
    predict_kwargs: None = <factory>,
    compute_kwargs: None = <factory>,
    validation: 't.Optional[Validation]' = None,
    metric_values: None = <factory>,
    num_workers: int = (10,),
    serve: 'bool' = False,
    trainer: 't.Optional[Trainer]' = None,
    example: 'dc.InitVar[t.Any | None]' = None,
    postprocess: Optional[Callable] = None,
    timeout: Optional[int] = None,
    headers: Optional[Dict] = None,
    raises: bool = True) -> None

| Parameter | Description |
|---|---|
| identifier | Identifier of the leaf. |
| db | Datalayer instance. |
| uuid | UUID of the leaf. |
| artifacts | A dictionary of artifact paths and DataType objects. |
| upstream | A list of upstream components. |
| plugins | A list of plugins to be used in the component. |
| cache | (Optional) If set to true, the component will not be cached during the primary job of the component, i.e. on a distributed cluster this component will be reloaded on every component task, e.g. model prediction. |
| status | What part of the lifecycle the component is in. |
| signature | Signature of the model. |
| datatype | DataType instance. |
| output_schema | Output schema (mapping of encoders). |
| model_update_kwargs | The kwargs to use for model update. |
| predict_kwargs | Additional arguments to use at prediction time. |
| compute_kwargs | Kwargs used for compute backend job submit. Example (Ray backend): compute_kwargs = dict(resources=...). |
| validation | The validation Dataset instances to use. |
| metric_values | The metrics to evaluate on. |
| num_workers | Number of multiprocessing workers. |
| serve | Creates an HTTP endpoint and serves the model with compute_kwargs on a distributed cluster. |
| trainer | Trainer instance to use for training. |
| example | An example to auto-determine the schema/datatype. |
| postprocess | Postprocess function to apply to the results. |
| timeout | Seconds until a request times out. |
| headers | Dictionary of request headers passed to the requests package. |
| raises | Whether to raise errors (True/False). |

Download files from a list of URIs.

Downloader

Downloader(self,
    uris,
    update_one: Optional[Callable] = None,
    ids: Union[List[str], List[int], NoneType] = None,
    keys: Optional[List[str]] = None,
    datatypes: Optional[List[str]] = None,
    n_workers: int = 20,
    headers: Optional[Dict] = None,
    skip_existing: bool = True,
    timeout: Optional[int] = None,
    raises: bool = True)

| Parameter | Description |
|---|---|
| uris | list of URIs/file names to fetch |
| update_one | function to call to insert data into the table |
| ids | list of ids of rows/documents to update |
| keys | list of keys in rows/documents to insert to |
| datatypes | list of datatypes of rows/documents to insert to |
| n_workers | number of multiprocessing workers |
| headers | dictionary of request headers passed to the requests package |
| skip_existing | if True, skip data that is already present |
| timeout | seconds until a request times out |
| raises | whether to raise errors (True/False) |

Download files from a list of URIs.
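The relationship between `uris`, `ids`, `keys` and `update_one` can be sketched as follows. This is an illustration only: `download_and_update` and the injected `fetch` callable are hypothetical, and the keyword arguments passed to `update_one` (`id`, `key`, `bytes`) are an assumption modeled on the `update(key, id, bytes)` example in the `download_content` section. `skip_existing`, `datatypes` and parallelism are not modeled here.

```python
from typing import Callable, Dict, List, Tuple


def download_and_update(uris: List[str], ids: List[str], keys: List[str],
                        fetch: Callable[[str], bytes],
                        update_one: Callable[..., None]) -> None:
    """Fetch each URI and hand the bytes to update_one with its matching id and key."""
    for uri, id_, key in zip(uris, ids, keys):
        content = fetch(uri)
        # Assumed callback contract: keyword arguments id, key and bytes.
        update_one(id=id_, key=key, bytes=content)


# An in-memory "table" standing in for the database being updated.
store: Dict[Tuple[str, str], bytes] = {}


def update(key, id, bytes):
    store[(id, key)] = bytes


download_and_update(
    uris=["mem://a", "mem://b"],
    ids=["0", "1"],
    keys=["img", "img"],
    fetch=lambda uri: uri.encode(),  # placeholder for a real fetcher
    update_one=update,
)
print(store)
```

Passing the callback in, rather than writing to a database directly, is what lets the same download loop target a file system, a cache, or a test double.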

Fetcher

Fetcher(self,
    headers: Optional[Dict] = None,
    n_workers: int = 0)

| Parameter | Description |
|---|---|
| headers | headers to be used for download |
| n_workers | number of download workers |

Fetches data from a URI.
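A fetcher of this kind typically dispatches on the URI scheme. The sketch below shows one way that could look; `SketchFetcher` is a hypothetical class, the set of supported schemes is an assumption, and the standard library's `urllib` is swapped in for the `requests` package mentioned in the parameter tables so that the example has no third-party dependencies.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional
from urllib.parse import urlparse
from urllib.request import Request, url2pathname, urlopen


class SketchFetcher:
    """Hypothetical stand-in that returns the bytes behind a URI."""

    def __init__(self, headers: Optional[Dict[str, str]] = None):
        self.headers = headers or {}

    def __call__(self, uri: str) -> bytes:
        parsed = urlparse(uri)
        if parsed.scheme == "file":
            # Local file: convert the URL path back to a file system path.
            return Path(url2pathname(parsed.path)).read_bytes()
        if parsed.scheme in ("http", "https"):
            # Remote resource: send the configured headers with the request.
            with urlopen(Request(uri, headers=self.headers)) as response:
                return response.read()
        raise ValueError(f"unsupported URI scheme: {uri!r}")


# Demonstrate with a local file so the example runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.bin"
    target.write_bytes(b"hello")
    data = SketchFetcher()(target.as_uri())
print(data)
```

Raising on unknown schemes, instead of returning empty bytes, lets a caller with `raises=False` decide explicitly whether to skip the item.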

TimeoutException

TimeoutException(self, /, *args, **kwargs)

| Parameter | Description |
|---|---|
| args | *args of Exception |
| kwargs | **kwargs of Exception |

Timeout exception.

Updater

Updater(self, db, query)

| Parameter | Description |
|---|---|
| db | Datalayer instance |
| query | query to be executed |

Updater class to update the artifact.