Parallel Ingest with pingest.pl
A program named pingest.pl allows fast bibliographic record
ingest. It performs ingest in parallel so that multiple batches can
be done simultaneously. It operates by splitting the records to be
ingested up into batches and running all of the ingest methods on each
batch. You may pass in options to control how many batches are run at
the same time, how many records there are per batch, and which ingest
operations to skip.
Note
The browse ingest is presently done in a single process over all
of the input records as it cannot run in parallel with itself. It
does, however, run in parallel with the other ingests.
pingest.pl accepts the following command line options:
-
--host
-
The server where PostgreSQL runs (either host name or IP address).
The default is read from the PGHOST environment variable or
"localhost."
-
--port
-
The port that PostgreSQL listens to on host. The default is read
from the PGPORT environment variable or 5432.
-
--db
-
The database to connect to on the host. The default is read from
the PGDATABASE environment variable or "evergreen."
-
--user
-
The username for database connections. The default is read from
the PGUSER environment variable or "evergreen."
-
--password
-
The password for database connections. The default is read from
the PGPASSWORD environment variable or "evergreen."
-
--batch-size
-
Number of records to process per batch. The default is 10,000.
-
--max-child
-
Max number of worker processes (i.e. the number of batches to
process simultaneously). The default is 8.
-
--skip-browse
,
--skip-attrs
,
--skip-search
,
--skip-facets
,
--skip-display
-
Skip the selected reingest component.