Help with a few questions about backing up a wiki

    Networking
• wumingshi (last edited by wumingshi)

  1. I want to create a local backup of a small part of a certain wiki site. I went through the manual at
  https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki/zh#
  but found no way to back up only a subset. Is this something that just takes writing a specific script, and if so, what would I need to learn? (See the first sketch below.)

  2. As I understand it, the backup file is stored in database form, so it can't be browsed locally the way the site is browsed online in a browser. Is there a way to browse it locally like that? I searched but couldn't find a solution. Wikipedia has dedicated apps, but presumably they don't work with other wiki sites?
  Searching a bit more, Kiwix looks like it fits the bill; I'll study it further.

  3. Kiwix offers an online tool,
  https://youzim.it/
  which can crawl a site and produce a file readable by the Kiwix reader. It also lets you set rules, but I can't make sense of them, and the GitHub repository seems to document a different version? Do I need to know the basics of web crawling to use this? (See the second sketch below.)
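
  For question 1, a minimal sketch of one approach, assuming the target wiki exposes the standard MediaWiki Action API (the endpoint and page titles below are placeholders):

  ```python
  import requests

  # Assumption: the wiki runs standard MediaWiki with the Action API enabled.
  # The endpoint path and page titles are placeholders for the real wiki.
  API = "https://example-wiki.org/w/api.php"
  PAGES = ["Main_Page", "Help:Contents"]

  # action=query with export=1 returns the pages as MediaWiki export XML,
  # the same format Special:Export and dumpBackup.php produce, so the file
  # can later be re-imported into another MediaWiki via Special:Import.
  resp = requests.get(API, params={
      "action": "query",
      "titles": "|".join(PAGES),
      "export": 1,
      "exportnowrap": 1,  # return bare XML rather than an API wrapper
  })
  resp.raise_for_status()

  with open("partial_dump.xml", "w", encoding="utf-8") as f:
      f.write(resp.text)
  ```

  In other words, a page-level partial backup needs no database access at all; the things to learn are the MediaWiki Action API (or Special:Export) plus a little scripting.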
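
  And for question 3: the "rules" are just regular expressions matched against page URLs (the Include/Exclude fields quoted later in this thread), so basic regex knowledge is enough; no deep crawler background is needed. A toy illustration with made-up patterns and URLs:

  ```python
  import re

  # Hypothetical rules: keep pages under /wiki/, skip edit/history views.
  INCLUDE = re.compile(r"^https://example-wiki\.org/wiki/")
  EXCLUDE = re.compile(r"[?&]action=(edit|history)")

  def in_scope(url: str) -> bool:
      """Apply include/exclude URL rules the way a crawler would."""
      return bool(INCLUDE.search(url)) and not EXCLUDE.search(url)

  for url in [
      "https://example-wiki.org/wiki/Main_Page",             # True
      "https://example-wiki.org/wiki/Main_Page?action=edit", # False
      "https://example-wiki.org/about",                      # False
  ]:
      print(in_scope(url), url)
  ```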

• pku_jerry

  The PKU unofficial-history wiki, right?.. (padding this out to the eight-character minimum)

• wumingshi @pku_jerry

  @pku_jerry Not really; it's another niche foreign wiki.

          • ?
            老用户
            最后由 编辑

  A rough explanation of the zimit advanced settings:
  Language: despite the name, this takes an ISO-639-3 language code such as eng (see the official description below), not a charset like UTF-8.
  Depth: how many links away from the starting URL the crawler follows; per the official list below, -1 (the default) means unlimited. I filled in -7.
  Extra hops: literally "hops", i.e. how many extra links to follow beyond the crawl scope. I'm not sure about this one.
  Crawl scope: the range of the crawl. When defining a crawl you must select a crawl scope setting; the following settings are available (the sketch after the four descriptions makes the differences concrete):
  Limit to URL hostname (abc.xyz)
  Select this setting to limit crawling to the hostname within the URL, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/. All links discovered in the www.example.org domain will be crawled, including http://www.example.org/support and https://www.example.org:8080/logout. No links will be followed from subdomains of www.example.org, so http://www2.example.org and http://cdn.www.example.org/ will not be crawled.

  Limit to content located at or below URL subdirectory
  Select this setting to crawl all links starting with a URL subdirectory, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/. All links starting with http://www.example.org/news/ will be crawled, including http://www.example.org/news/headlines and https://www.example.org:8080/news/. Links like http://www.example.org/agenda and http://www2.example.org will not be crawled.

  Limit to URL hostname and specified sub-domain
  Select this setting to crawl only the URL hostname and one specified sub-domain, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/ and the sub-domain is cdn.example.org. All links discovered in www.example.org, in cdn.example.org and in any of its subdomains will be crawled, e.g. http://www.example.org/support, https://www.example.org:8080/logout, http://cdn.example.org/images/ and http://videos.cdn.example.org. Links whose domain neither matches the URL hostname nor is a sub-domain of cdn.example.org will not be followed, so http://videos.example.org will not be crawled.

  Limit to URL hostname and specified domains
  Select this setting to crawl only the URL hostname and the specified domains, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/ and the specified domains are cdn.example.org and site.example.org. All links discovered in www.example.org, in cdn.example.org and in the other specified domains will be crawled, e.g. http://www.example.org/support, https://www.example.org:8080/logout and http://cdn.example.org/images/. Links whose domain matches neither the URL hostname nor one of the specified domains will not be followed, so http://videos.example.org and http://videos.cdn.example.org will not be crawled.
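
  To make the scope options above concrete, here is a small sketch (my own paraphrase of the quoted rules, not zimit code) that classifies URLs the way the first two settings would:

  ```python
  from urllib.parse import urlsplit

  SEED = "http://www.example.org/news/"
  seed = urlsplit(SEED)

  def in_host_scope(url: str) -> bool:
      # "Limit to URL hostname": same host, any scheme, port or path.
      return urlsplit(url).hostname == seed.hostname

  def in_prefix_scope(url: str) -> bool:
      # "Limit to content located at or below URL subdirectory".
      u = urlsplit(url)
      return u.hostname == seed.hostname and u.path.startswith(seed.path)

  for url in [
      "https://www.example.org:8080/news/headlines",  # prefix: True,  host: True
      "http://www.example.org/agenda",                # prefix: False, host: True
      "http://www2.example.org/news/",                # prefix: False, host: False
  ]:
      print(in_prefix_scope(url), in_host_scope(url), url)
  ```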
  The rest probably doesn't matter much; I don't really understand crawlers either. Below are all the parameters with their brief official descriptions (a sketch of running zimit locally follows the list); help from a crawler expert would be welcome.
  Language: ISO-639-3 (3 chars) language code of content. Defaults to eng
  Title: Custom title for ZIM. Defaults to title of main page
  Description: Description for ZIM
  Illustration: URL for illustration. If unspecified, will attempt to use favicon from main page.
  ZIM filename: ZIM file name (based on --name if not provided). Make sure to end with _{period}.zim
  ZIM Tags: List of tags for the ZIM file.
  Content Creator: Name of content creator.
  Content Source: Source name/URL of content
  New Context: The context for each new capture. Defaults to page
  WaitUntil: Puppeteer page.goto() condition to wait for before continuing. Defaults to load
  Depth: The depth of the crawl for all seeds. Defaults to -1
  Extra Hops: Number of extra 'hops' to follow, beyond the current scope. Defaults to 0
  Scope Type: A predefined scope of the crawl. For more customization, use 'custom' and set include regexes. Defaults to prefix.
  Include: Regex of page URLs that should be included in the crawl (defaults to the immediate directory of the URL)
  Exclude: Regex of page URLs that should be excluded from the crawl
  Allow Hashtag URLs: Allow hashtag URLs; useful for single-page-application crawling or when different hashtags load dynamic content
  As device: Device to crawl as. Defaults to iPhone X. See Puppeteer's DeviceDescriptors.
  User Agent: Override the user agent with the specified value
  Use sitemap: Use a sitemap to get additional URLs for the crawl (usually at /sitemap.xml)
  Behaviors: Which background behaviors to enable on each page. Defaults to autoplay,autofetch,siteSpecific.
  Behavior Timeout: If >0, timeout (in seconds) for the in-page behavior run on each page. If 0, a behavior can run until it finishes. Defaults to 90
  Size Limit: If set, save state and exit if the size limit exceeds this value, in bytes
  Time Limit: If set, save state and exit after the time limit, in seconds
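
  For completeness, the same parameters can be passed on the command line if you run zimit yourself instead of going through youzim.it. A rough sketch, assuming Docker is installed; the flag names mirror the list above (zimit wraps browsertrix-crawler) but vary between releases (newer versions take --seeds instead of --url), so check `zimit --help` first:

  ```python
  import os
  import subprocess

  # Assumption: flag names follow the parameter list above; verify them
  # against `zimit --help` for the image version you pull.
  subprocess.run([
      "docker", "run",
      "-v", f"{os.getcwd()}/output:/output",
      "ghcr.io/openzim/zimit", "zimit",
      "--url", "https://www.pkuanvil.com/",
      "--name", "pkuanvil",
      "--scopeType", "domain",  # whole-domain crawl, as suggested later in this thread
  ], check=True)
  ```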

• wumingshi @老用户

  @kgdjcb46158 Thanks for the explanation; I'll give it a try another day.

              • ?
                老用户
                最后由 老用户 编辑

  https://s3.us-west-1.wasabisys.com/org-kiwix-zimit/other/www.pkuanvil.com_c59aa3b1.zim
  This time the zim file can open links internally; come try it.
  The link above is the download. It doesn't work well in some Kiwix clients, but the Kiwix PWA is decent: https://pwa.kiwix.org

                • ?
                  老用户
                  最后由 编辑

  The browser extension should work too.

                  • ?
                    老用户
                    最后由 编辑

  @admin I can't find where to submit it; it seems you have to fork first and then open a pull request. I'll just keep it in my own repository, and you can fork it directly: https://github.com/pkej1236/pkuanvil_zim

• wumingshi @老用户

  @kgdjcb46158 Is this the current zim of this site?

                      • ?
                        老用户 @wumingshi
                        最后由 编辑

  @wumingshi Yes.

                        • ?
                          老用户
                          最后由 编辑

  The zim file Zimit generates depends on the Kiwix PWA, so it needs a network connection, and it is a bit slow without a proxy, but it does pull together all of the site's links.

                          • ?
                            老用户
                            最后由 编辑

  For crawl scope, choose domain, so that resources under the whole pkuanvil domain get crawled.
