NetBSD/usr.sbin/makemandb/apropos-utils.h
abhinav 188f922ddf Add a custom tokenizer which does not stem certain keywords.
The keywords that should not be stemmed are listed in the nostem.txt file.
(For now, the list consists of all the man page names, split at
underscores, with common English words removed and everything converted to
lowercase.)

The tokenizer itself is based on the Porter stemming tokenizer shipped with
SQLite. The code in custom_apropos_tokenizer.c is a copy of that code with
some modifications to prevent stemming of the keywords specified in nostem.txt.

Additionally, it now also uses the underscore `_' as a token delimiter. It is
therefore now possible to query for `lwp' and match all `_lwp_*' man page
names, or to query for `unconst' and match `__UNCONST'.
This was not possible earlier because the underscore was not a delimiter, so
the index would have __UNCONST as a key rather than UNCONST.
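
The effect of the extra delimiter can be illustrated with a minimal sketch.
This is not the code from custom_apropos_tokenizer.c: next_token() is a
hypothetical helper that assumes every character other than an ASCII letter
or digit is a delimiter and that tokens are folded to lowercase. Stemming and
the nostem.txt lookup are left out.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (an illustration only, not the real tokenizer):
 * copy the next token from *cursor into buf, treating every character
 * that is not a letter or digit, including '_', as a delimiter, and
 * folding the token to lowercase.
 * Returns the token length, or 0 when no token is left.
 */
static size_t
next_token(const char **cursor, char *buf, size_t bufsize)
{
	const char *p = *cursor;
	size_t n = 0;

	if (bufsize == 0)
		return 0;

	/* Skip leading delimiters, e.g. the two underscores in "__UNCONST". */
	while (*p != '\0' && !isalnum((unsigned char)*p))
		p++;

	/* Copy the token in lowercase; overlong tokens are truncated. */
	while (*p != '\0' && isalnum((unsigned char)*p)) {
		if (n + 1 < bufsize)
			buf[n++] = (char)tolower((unsigned char)*p);
		p++;
	}
	buf[n] = '\0';
	*cursor = p;
	return n;
}
```

With this rule, `__UNCONST' is indexed under the single key `unconst', and
`_lwp_create' yields the two keys `lwp' and `create', so a query for either
word can match the page.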

The tokenizer needs the fts3_tokenizer.h file, which is not shipped with the
amalgamation build of SQLite; therefore it needs to be added here (unless
we decide there is a better place for it).

To enforce use of the new tokenizer, a schema version bump is needed.

Since tokenization is done both at indexing time (via makemandb) and at
query time (via apropos or whatis), the schema version will need to be
bumped every time nostem.txt is modified. Otherwise the index will consist
of old tokens and the desired changes will not be seen with apropos.
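
The staleness rule can be sketched as a simple comparison. index_is_stale()
is a hypothetical helper, and how the stored version is read back from the
database is an assumption here (SQLite's `PRAGMA user_version' would be one
way to keep such a number); only the APROPOS_SCHEMA_VERSION value comes from
apropos-utils.h.

```c
#include <stdbool.h>

/* Value defined in apropos-utils.h as of this commit. */
#define APROPOS_SCHEMA_VERSION 20170618

/*
 * Hypothetical helper: decide whether an existing index must be rebuilt.
 * stored_version is assumed to be the schema version recorded in the
 * database when it was last built by makemandb.
 */
static bool
index_is_stale(int stored_version)
{
	/* Any mismatch means the index may hold tokens from an old tokenizer. */
	return stored_version != APROPOS_SCHEMA_VERSION;
}
```

Under this sketch, editing nostem.txt without bumping the macro leaves the
comparison equal, so the old tokens would stay in the index.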

This should also fix the issue reported in PR bin/46255. A similar suggestion
was also made on tech-userlevel@ recently:
<http://mail-index.netbsd.org/tech-userlevel/2017/06/08/msg010620.html>

Thanks to christos@ for multiple rounds of reviews of the tokenizer code.
2017-06-18 16:24:10 +00:00

/* $NetBSD: apropos-utils.h,v 1.13 2017/06/18 16:24:10 abhinav Exp $ */
/*-
 * Copyright (c) 2011 Abhinav Upadhyay <er.abhinav.upadhyay@gmail.com>
 * All rights reserved.
 *
 * This code was developed as part of Google's Summer of Code 2011 program.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
#ifndef APROPOS_UTILS_H
#define APROPOS_UTILS_H
#include <stddef.h>	/* size_t, used in the query_args callback type */

#include "sqlite3.h"
#define MANCONF "/etc/man.conf"
/* Flags for opening the database */
typedef enum mandb_access_mode {
	MANDB_READONLY = SQLITE_OPEN_READONLY,
	MANDB_WRITE = SQLITE_OPEN_READWRITE,
	MANDB_CREATE = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE
} mandb_access_mode;
#define APROPOS_SCHEMA_VERSION 20170618
/*
 * Used to identify the section of a man(7) page.
 * This is similar to the enum mdoc_sec defined in mdoc.h from mdocml project.
 */
enum man_sec {
	MANSEC_NAME = 0,
	MANSEC_SYNOPSIS,
	MANSEC_LIBRARY,
	MANSEC_ERRORS,
	MANSEC_FILES,
	MANSEC_RETURN_VALUES,
	MANSEC_EXIT_STATUS,
	MANSEC_DESCRIPTION,
	MANSEC_ENVIRONMENT,
	MANSEC_DIAGNOSTICS,
	MANSEC_EXAMPLES,
	MANSEC_STANDARDS,
	MANSEC_HISTORY,
	MANSEC_BUGS,
	MANSEC_AUTHORS,
	MANSEC_COPYRIGHT,
	MANSEC_NONE
};
typedef struct query_args {
	const char *search_str;	// user query
	char **sections;	// sections in which to do the search
	int nrec;		// number of records to fetch
	int offset;		// position from which to start processing the records
	int legacy;
	const char *machine;
	int (*callback) (void *, const char *, const char *, const char *,
	    const char *, size_t);	// the callback function
	void *callback_data;	// data to pass to the callback function
	char **errmsg;		// buffer for storing the error msg
} query_args;
typedef enum query_format {
	APROPOS_NONE,
	APROPOS_PAGER,
	APROPOS_TERM,
	APROPOS_HTML
} query_format;
char *lower(char *);
void concat(char **, const char *);
void concat2(char **, const char *, size_t);
sqlite3 *init_db(mandb_access_mode, const char *);
void close_db(sqlite3 *);
char *get_dbpath(const char *);
int run_query(sqlite3 *, query_format, query_args *);
#endif
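
As an illustration of how the callback member of query_args might be used,
here is a minimal sketch of a function with the matching signature. The
header leaves the parameters unnamed, so the names below (section, name,
name_desc, snippet, snippet_len) are assumptions about their meaning, as is
the convention that returning 0 lets processing continue. struct
match_counter and count_match() are hypothetical and not part of the header.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-query state, handed back through callback_data. */
struct match_counter {
	size_t count;
};

/*
 * Has the same type as query_args.callback.  The parameter names are
 * assumptions: section, page name, one-line description, matching
 * snippet and snippet length.
 */
static int
count_match(void *data, const char *section, const char *name,
    const char *name_desc, const char *snippet, size_t snippet_len)
{
	struct match_counter *mc = data;

	/* The snippet is ignored in this sketch. */
	(void)snippet;
	(void)snippet_len;

	mc->count++;
	printf("%s(%s)\t%s\n", name, section, name_desc);
	return 0;	/* assumed to mean "keep going" */
}
```

A caller would presumably set `args.callback = count_match' and
`args.callback_data = &mc' in a query_args before passing it to run_query().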