COMP6714 2025T2 Project Specification
1. Project Overview
In this project, you will implement, in Python 3 on the CSE Linux machines, a simple search engine that ranks documents based on:
- Query term coverage
- Proximity of matched terms
- Preservation of query term order
A search query consists of space-separated terms containing only alphanumeric characters (no punctuation).
2. Core Requirements
- Implement an indexer (index.py) and a search program (search.py).
- Use an inverted index with positional information (as described in Week 1 lectures).
- Additional indexes may be implemented if needed.
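A minimal sketch of an inverted index with positional information is shown here for orientation. The function name build_positional_index and the in-memory layout (term -> document ID -> list of positions) are assumptions for illustration, not a required design, and the sketch assumes the tokens have already been normalised.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}.

    `docs` is a hypothetical dict of doc_id -> list of normalised tokens.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            # Positions are token offsets within the document, starting at 0.
            index[term][doc_id].append(position)
    return index


# Tiny made-up corpus, for demonstration only.
docs = {
    1: ["apple", "pie", "apple"],
    2: ["pie", "chart"],
}
index = build_positional_index(docs)
print(dict(index["apple"]))  # {1: [0, 2]}
print(dict(index["pie"]))    # {1: [1], 2: [0]}
```

Storing positions per document is what makes the proximity and order checks possible at query time. How the index is written to the index folder is left open here.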
3. Term Matching Rules
- Case insensitive (e.g., “Apple” = “apple”).
- Abbreviations: Ignore full stops (e.g., “U.S.” = “US”).
- Hyphenated terms:
  - Preserve if the first part has fewer than 3 letters (e.g., “D-Kans”, “co-author”).
  - Split otherwise (e.g., “set-aside” → “set”, “aside”).
- Singular/Plural/Tense ignored (e.g., “cat” = “cats”; “breach” = “breached”).
- Sentence endings: Only “.”, “?”, and “!” mark sentence boundaries.
- Numbers:
  - Decimal numbers can be ignored (“.” is invalid in search terms).
  - Years/integers should be indexed (commas ignored, e.g., “1,000,000” = “1000000”).
- Other punctuation: Treated as token dividers.
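The rules above could be applied to a single whitespace-separated token roughly as follows. This is a sketch under stated assumptions: normalise_token is a hypothetical helper, "letters" in the hyphen rule is approximated by character count, and two rules are deliberately left out (singular/plural/tense matching, which would need a stemmer or lemmatiser, and sentence-boundary detection).

```python
import re
import string


def normalise_token(raw):
    """Return the list of index terms produced by one raw token."""
    # Case-insensitive matching; drop punctuation wrapped around the token.
    token = raw.lower().strip(string.punctuation)
    # Decimal numbers such as "3.14" may be ignored entirely.
    if re.fullmatch(r"\d[\d,]*\.\d+", token):
        return []
    # Commas between digits are dropped: "1,000,000" -> "1000000".
    token = re.sub(r"(?<=\d),(?=\d)", "", token)
    # Full stops are dropped so that "u.s." matches "us".
    token = token.replace(".", "")
    # Anything other than letters, digits and hyphens divides tokens.
    pieces = [p for p in re.split(r"[^a-z0-9-]+", token) if p]
    terms = []
    for piece in pieces:
        parts = [p for p in piece.split("-") if p]
        if len(parts) > 1 and len(parts[0]) < 3:
            # Short first part: keep the hyphenated term whole.
            terms.append("-".join(parts))
        else:
            # Otherwise split on the hyphens.
            terms.extend(parts)
    return terms


print(normalise_token("U.S."))       # ['us']
print(normalise_token("D-Kans"))     # ['d-kans']
print(normalise_token("set-aside"))  # ['set', 'aside']
print(normalise_token("1,000,000"))  # ['1000000']
print(normalise_token("3.14"))       # []
```

Applying the same function to both documents and queries keeps the two sides consistent, which is the main point of the sketch.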
4. Ranking Criteria
Documents are ranked by:
- Term coverage (proportion of query terms matched).
- Proximity (average distance between matched terms).
- Order preservation (consecutive query terms appearing in the same left-to-right order).
Scoring Formula:
\[
\text{Score}(d) = \alpha \cdot \frac{\#\text{matched\_terms}}{\#\text{query\_terms}} + \beta \cdot \frac{1}{1 + \text{avg\_distance}} + \gamma \cdot \text{ordered\_pairs}
\]
Where:
- \(\alpha = 1.0\), \(\beta = 1.0\), \(\gamma = 0.1\) (default values).
- For single-term queries, proximity and order scores are 0.
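As a worked illustration, the scoring formula can be written as a small function. The sketch assumes that avg_distance and ordered_pairs have already been computed from the positional index; how they are measured is not fixed here, and the function name score and its parameter names are hypothetical.

```python
def score(matched_terms, query_terms, avg_distance, ordered_pairs,
          alpha=1.0, beta=1.0, gamma=0.1):
    """Combine coverage, proximity and order into one document score."""
    coverage = alpha * matched_terms / query_terms
    if query_terms == 1:
        # Single-term queries: proximity and order scores are 0.
        return coverage
    proximity = beta / (1 + avg_distance)
    order = gamma * ordered_pairs
    return coverage + proximity + order


# Two-term query, both terms matched, average distance 1, one ordered pair:
# 1.0 * 2/2 + 1.0 * 1/(1 + 1) + 0.1 * 1, which is approximately 1.6.
print(score(2, 2, avg_distance=1, ordered_pairs=1))
```

The case where a multi-term query matches fewer than two terms in a document is not handled specially in this sketch; the caller would need to decide what avg_distance means there.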
5. Indexer (index.py)
Command:
python3 index.py [folder-of-documents] [folder-of-indexes]
Output:
- Total documents, tokens, and terms indexed.
Example:
$ python3 index.py ~cs6714/Public/data ./MyTestIndex
Total number of documents: 1000
Total number of tokens: 268,568
Total number of terms: 259,182
6. Search Program (search.py)
Command:
python3 search.py [folder-of-indexes]
Behavior:
- Accepts queries from stdin until Ctrl-D.
- Outputs ranked document IDs (one per line).
Example:
$ python3 search.py ~/Proj/MyTestIndex
Apple
1361
Australia Technology
3454
10
18
...
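The read-until-Ctrl-D behaviour can be sketched as a loop over an input stream, since Ctrl-D closes stdin and ends the iteration. The names run_queries and demo_search are hypothetical, and the demo uses canned results taken from the example above rather than a real index.

```python
import io
import sys


def run_queries(search, stream=sys.stdin, out=sys.stdout):
    """Read one query per line until end of input; print ranked doc IDs."""
    for line in stream:
        query = line.strip()
        if not query:
            continue  # skip blank lines
        for doc_id in search(query):
            print(doc_id, file=out)


def demo_search(query):
    # Canned results for demonstration only; a real search.py would
    # rank documents using the index folder given on the command line.
    canned = {"apple": [1361], "australia technology": [3454, 10, 18]}
    return canned.get(query.lower(), [])


# In-memory streams stand in for stdin and stdout in this demo.
buffer = io.StringIO()
run_queries(demo_search,
            stream=io.StringIO("Apple\nAustralia Technology\n"),
            out=buffer)
print(buffer.getvalue(), end="")
```

Passing the streams as parameters keeps the loop testable without typing queries by hand; calling run_queries(search) with the defaults reads from the terminal.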
7. Displaying Matching Lines (Optional)
For queries starting with “>”:
- Displays document IDs prefixed with “>” followed by lines containing the closest matching terms.
- Only one line per matched term (prioritizing earliest occurrence).
Example:
$ python3 search.py ~/Proj/MyTestIndex
> Apples
> 1361
The department said stocks of fresh apples in cold storage
8. Marking (40 Points Total)
- Correctness: Exact match of document IDs and order required for full marks.
- Partial Marks: F-measure used for ranking errors (precision/recall).
- Runtime Limits:
  - Indexer: 1 minute.
  - Search: 10 seconds per query.
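For self-checking, precision, recall and F-measure can be computed between your output and an expected result list. The sketch below uses the standard set-based F1 definition and ignores ordering; exactly how the markers apply the F-measure is not stated here, so treat this as an assumption. The name f_measure and the document IDs in the demo are illustrative.

```python
def f_measure(returned, expected):
    """Set-based F1 between returned and expected document IDs."""
    returned_set, expected_set = set(returned), set(expected)
    true_positives = len(returned_set & expected_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(returned_set)
    recall = true_positives / len(expected_set)
    return 2 * precision * recall / (precision + recall)


# Two of three returned IDs are correct and two of three expected IDs
# are found, so precision and recall are both 2/3.
print(f_measure([3454, 10, 99], [3454, 10, 18]))
```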
9. Submission
- Deadline: Friday, 1st August 23:59.
- Format: All .py files in a .zip folder submitted via Moodle.
- Late Penalty: 5% deduction per day (up to 5 days).
10. Permitted Libraries
- Python Standard Library only.
- NLTK allowed (pre-downloaded for marking; remove nltk.download() calls).
11. Plagiarism Policy
- Individual work only.
- Penalties apply for copied code or public repositories.