pkuanvil

    Help needed with a few questions about backing up a wiki

    Networking
    • wumingshi @pku_jerry

      @pku_jerry Not really, it's a different, fairly niche foreign wiki

    • 老用户

        A rough explanation of the Zimit advanced settings:
        Language: going by the official description further down, this wants an ISO-639-3 language code (e.g. eng), not an encoding like UTF-8.
        Depth: "A website's crawl depth refers to the extent to which a search engine indexes the site's content. A site with high crawl depth will get a lot more indexed than a site with low crawl depth." I set it to -7.
        Extra hops: literally "extra hops"; I'm not sure what it does.
        Crawl scope: which pages the crawler is allowed to follow. The text below is pasted from a web-scanner wizard's documentation; a rough mapping to this crawler's own Scope Type values is sketched right after the four descriptions.
        When defining a web application in the wizard, you must select a crawl scope setting. In case of authenticated scan, ensure that you always put the login link as the first link. The following settings are available.
        Limit to URL hostname (abc.xyz)

        Select this setting to limit crawling to the hostname within the URL, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/. All links discovered in www.example.org domain will be crawled. Also all links discovered in http://www.example.org/support and https://www.example.org:8080/logout will be crawled. No links will be followed from subdomains of www.example.org. This means http://www2.example.org and http://cdn.www.example.org/ will not be crawled.
        Limit to content located at or below URL subdirectory

        Select this setting to crawl all links starting with a URL subdirectory using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/. All links starting with http://www.example.org/news/ will be crawled. Also http://www.example.org/news/headlines and https://www.example.org:8080/news/ will be crawled. Links like http://www.example.org/agenda and http://www2.example.org will not be crawled.
        Limit to URL hostname and specified sub-domain

        Select this setting to crawl only the URL hostname and one specified sub-domain, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/ and the sub-domain is cdn.example.org. All links discovered in www.example.org and in cdn.example.org and any of its subdomains will be crawled. Also these domains will be crawled: http://www.example.org/support, https://www.example.org:8080/logout, http://cdn.example.org/images/ and http://videos.cdn.example.org. Links whose domain does not match the web application URL hostname or is not a sub-domain of cdn.example.org will not be followed. This means http://videos.example.org will not be crawled.
        Limit to URL hostname and specified domains

        Select this setting to crawl only the URL hostname and specified domains, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/ and the specified domains are cdn.example.org and site.example.org. All links discovered in www.example.org and in cdn.example.org and all other domains specified will be crawled. This means these domains will be crawled: http://www.example.org/support, https://www.example.org:8080/logout and http://cdn.example.org/images/. Links whose domain does not match web application URL hostname or one of the domains specified will not be followed. This means http://videos.example.org and http://videos.cdn.example.org will not be crawled.
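        As a side note, here is a rough mapping from the four descriptions above to the Scope Type values the crawler behind Zimit (browsertrix-crawler) accepts. These value names come from my reading of the crawler's docs, not from this thread, so treat them as assumptions and check them against zimit --help:

            # Assumed Scope Type values (verify against your Zimit version):
            #   "Limit to content located at or below URL subdirectory" -> prefix  (the default, per the list further down)
            #   "Limit to URL hostname"                                 -> host
            #   hostname plus all of its subdomains                     -> domain
            #   hostname plus extra, explicitly listed domains          -> custom, combined with Include regexes
            # e.g. to cover every pkuanvil subdomain: --scopeType domain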
        The rest probably don't matter much; I don't really understand crawlers either. Below are all the parameters with their brief official descriptions (a command-line sketch follows the list). Help from anyone who knows crawlers would be appreciated.
        Language: ISO-639-3 (3 chars) language code of content. Defaults to eng
        Title: Custom title for ZIM. Defaults to title of main page
        Description: Description for ZIM
        Illustration: URL for illustration. If unspecified, will attempt to use favicon from main page.
        ZIM filename: ZIM file name (based on --name if not provided). Make sure to end with _{period}.zim
        ZIM Tags: List of tags for the ZIM file.
        Content Creator: Name of content creator.
        Content Source: Source name/URL of content
        New Context: The context for each new capture. Defaults to page
        WaitUntil: Puppeteer page.goto() condition to wait for before continuing. Defaults to load
        Depth: The depth of the crawl for all seeds. Defaults to -1
        Extra Hops: Number of extra 'hops' to follow, beyond the current scope. Defaults to 0
        Scope Type: A predefined scope of the crawl. For more customization, use 'custom' and set include regexes. Defaults to prefix.
        Include: Regex of page URLs that should be included in the crawl (defaults to the immediate directory of the URL)
        Exclude: Regex of page URLs that should be excluded from the crawl
        Allow Hashtag URLs: Allow hashtag URLs, useful for single-page-application crawling or when different hashtags load dynamic content
        As device: Device to crawl as. Defaults to iPhone X. See Puppeteer's DeviceDescriptors.
        User Agent: Override the user-agent with the specified value
        Use sitemap: Use a sitemap to get additional URLs for the crawl (usually at /sitemap.xml)
        Behaviors: Which background behaviors to enable on each page. Defaults to autoplay,autofetch,siteSpecific.
        Behavior Timeout: If >0, timeout (in seconds) for the in-page behavior run on each page. If 0, a behavior can run until it finishes. Defaults to 90
        Size Limit: If set, save state and exit if the size exceeds this value, in bytes
        Time Limit: If set, save state and exit after this time limit, in seconds
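        To tie the parameters above together, here is a minimal sketch of running Zimit yourself instead of through the web form. Zimit is normally run from its Docker image and forwards most of these options to browsertrix-crawler; the image name and the exact flag spellings below are assumptions taken from Zimit's README and vary between versions (newer releases use --seeds instead of --url), so double-check them with zimit --help:

            # Minimal sketch, not a verified command line.
            # --lang wants an ISO-639-3 code (eng), not an encoding;
            # Depth already defaults to -1 and Extra Hops to 0, so they are left out;
            # --scopeType domain crawls the whole domain including subdomains.
            docker run -v "$PWD/output":/output ghcr.io/openzim/zimit zimit \
                --url https://www.pkuanvil.com/ \
                --name pkuanvil \
                --title pkuanvil \
                --lang eng \
                --scopeType domain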

    • wumingshi @老用户

      @kgdjcb46158 Thanks for the explanation, I'll give it another try some other day

    • 老用户

      https://s3.us-west-1.wasabisys.com/org-kiwix-zimit/other/www.pkuanvil.com_c59aa3b1.zim
      This time the zim file can open internal links, come try it.
      The link above is the download; it doesn't work well in some Kiwix clients, but the Kiwix PWA is decent: https://pwa.kiwix.org
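      If the PWA is too slow, one alternative is serving the file from a locally hosted reader. A minimal sketch with kiwix-serve from kiwix-tools (the port is arbitrary, and note that ZIMs produced by Zimit generally need a reasonably recent reader):

          # Download the ZIM from the link above and serve it locally
          wget https://s3.us-west-1.wasabisys.com/org-kiwix-zimit/other/www.pkuanvil.com_c59aa3b1.zim
          kiwix-serve --port 8080 www.pkuanvil.com_c59aa3b1.zim
          # then browse http://localhost:8080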

    • 老用户

      The browser extension should also work

    • 老用户

      @admin I can't find where to submit it; it seems you have to fork first and then open a pull request, so I'll just leave it in my own repo and you can fork it directly: https://github.com/pkej1236/pkuanvil_zim
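      For reference, a minimal sketch of the usual GitHub fork-and-pull-request flow, in case submitting it upstream becomes possible later. The target repository below is a hypothetical placeholder (the thread never names one), and it assumes the GitHub CLI gh is installed and logged in:

          # Hypothetical target repo; replace with the real one
          gh repo fork TARGET_OWNER/TARGET_REPO --clone
          cd TARGET_REPO
          git checkout -b add-pkuanvil-zim
          # copy in the files from https://github.com/pkej1236/pkuanvil_zim, then:
          git add . && git commit -m "Add pkuanvil zim"
          git push -u origin add-pkuanvil-zim
          gh pr create --fill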

    • wumingshi @老用户

      @kgdjcb46158 Is this the zim of the current site?

    • 老用户 @wumingshi

      @wumingshi Yes

    • 老用户

      The zim file generated by Zimit depends on the Kiwix PWA, so it needs a network connection; it's also a bit slow if you aren't going through a proxy, but it does bundle all of the site's links

    • 老用户

      The crawl scope should be set to domain; that is the only way to get everything under the whole pkuanvil domain crawled
