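Can't find the right programming book? I built a search site for the most recommended ones (PDF book list included)

Original author: Vlad Wetzel

Translated and compiled by the CDA translation team

This article is an original work of CDA Data Analyst; reprinting requires authorization.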

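Choosing the right programming book is never easy. A programmer in the US took every programming book recommended on Stack Overflow, the well-known programming Q&A site, and built a website for searching the most popular ones.

Finding a programming book that suits you is no easy task.

As a developer, your time is limited, and reading a book takes a lot of it. You could spend that time coding, resting, or doing any number of other things. Instead, you spend that precious time reading and improving your skills.

So which books should you read? My colleagues and I discuss this often, but I have noticed that our opinions of any given book differ widely.

So I decided to dig deeper into the question: how do you choose the right programming book?

I turned to Stack Overflow (a well-known programming Q&A site), where many experts recommend books. I set out to analyze Stack Overflow's data on programming books and find out which books are recommended most often.

Fortunately, Stack Exchange (the parent company of Stack Overflow) had just published its data dump. Building on it, I created dev-books.com, where you can search by keyword and see a list of the programming books most recommended on Stack Overflow. The site now has more than 100,000 users.

Overall, if you are curious, I recommend reading Working Effectively with Legacy Code; Design Patterns: Elements of Reusable Object-Oriented Software is also a good choice. The titles may sound dry, but the content is solid. You can sort books by tag (JavaScript, C, graphics, and so on). This is obviously not an exhaustive list of recommendations, but if you are just getting started with programming or want to broaden your knowledge, these two books are a good place to begin.

Below I describe how the site was built.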

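Getting and importing the data

I grabbed the Stack Exchange database dump from archive.org.

From the very beginning I realized it would not be possible to import a 48 GB XML file into a freshly created database (PostgreSQL) with common approaches such as myxml := pg_read_file('path/to/my_file.xml'), because my server did not have 48 GB of RAM. So I decided to use a SAX parser.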
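To illustrate why a SAX parser sidesteps the memory problem, here is a minimal sketch, not the script used for the site. It streams a tiny in-memory XML sample shaped like the dump (the sample data and the RowCounter name are made up for illustration) and handles one row at a time, keeping only a counter and the last Id seen, so memory use does not grow with the size of the input.

```python
import io
import xml.sax


class RowCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <row> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.last_id = None

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing from earlier rows is retained
        if name == 'row':
            self.count += 1
            self.last_id = attrs.get('Id')


# Made-up sample in the same shape as the dump: values live in attributes
SAMPLE = b'<posts><row Id="1" PostTypeId="1" /><row Id="2" ParentId="1" /></posts>'

handler = RowCounter()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.count, handler.last_id)
```

For a real dump, the BytesIO object would be replaced by the path of the XML file; the handler stays the same.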

All the values are stored in <row> tags (as XML attributes), so I used a Python script to parse them:

def startElement(self, name, attributes):
  # Insert one database row per <row> element; missing attributes become NULL
  if name == 'row':
    self.cur.execute("INSERT INTO posts (Id, Post_Type_Id, Parent_Id, Accepted_Answer_Id, Creation_Date, Score, View_Count, Body, Owner_User_Id, Last_Editor_User_Id, Last_Editor_Display_Name, Last_Edit_Date, Last_Activity_Date, Community_Owned_Date, Closed_Date, Title, Tags, Answer_Count, Comment_Count, Favorite_Count) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
      (
        (attributes['Id'] if 'Id' in attributes else None),
        (attributes['PostTypeId'] if 'PostTypeId' in attributes else None),
        # Note the spelling 'ParentID': this is the mistake discussed below
        (attributes['ParentID'] if 'ParentID' in attributes else None),
        (attributes['AcceptedAnswerId'] if 'AcceptedAnswerId' in attributes else None),
        (attributes['CreationDate'] if 'CreationDate' in attributes else None),
        (attributes['Score'] if 'Score' in attributes else None),
        (attributes['ViewCount'] if 'ViewCount' in attributes else None),
        (attributes['Body'] if 'Body' in attributes else None),
        (attributes['OwnerUserId'] if 'OwnerUserId' in attributes else None),
        (attributes['LastEditorUserId'] if 'LastEditorUserId' in attributes else None),
        (attributes['LastEditorDisplayName'] if 'LastEditorDisplayName' in attributes else None),
        (attributes['LastEditDate'] if 'LastEditDate' in attributes else None),
        (attributes['LastActivityDate'] if 'LastActivityDate' in attributes else None),
        (attributes['CommunityOwnedDate'] if 'CommunityOwnedDate' in attributes else None),
        (attributes['ClosedDate'] if 'ClosedDate' in attributes else None),
        (attributes['Title'] if 'Title' in attributes else None),
        (attributes['Tags'] if 'Tags' in attributes else None),
        (attributes['AnswerCount'] if 'AnswerCount' in attributes else None),
        (attributes['CommentCount'] if 'CommentCount' in attributes else None),
        (attributes['FavoriteCount'] if 'FavoriteCount' in attributes else None)
      )
    )

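After nearly three days of importing (almost half of the XML had been imported by then), I realized I had made a mistake: the attribute ParentID should have been ParentId.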
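A mistake like this is easy to miss because the conditional lookup never raises an error: a misspelled attribute name simply produces None for every row. A small sketch, with a plain dict standing in for the SAX attributes object (the sample values are made up):

```python
# A plain dict stands in for the SAX attributes of one <row> (made-up values)
attributes = {'Id': '7', 'PostTypeId': '2', 'ParentId': '4'}

# Same pattern as the import script: fall back to None when the key is absent
wrong = attributes['ParentID'] if 'ParentID' in attributes else None
right = attributes['ParentId'] if 'ParentId' in attributes else None

print(wrong)  # None: the misspelled key silently matches nothing
print(right)  # 4
```

Checking a handful of imported rows for unexpectedly empty columns early on would likely surface this kind of typo long before the import finishes.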

I did not want to waste another week, so I switched from an AMD E-350 (2 x 1.35GHz) to an Intel G2020 (2 x 2.90GHz). That still did not speed up the process.

Next decision: batch inserts:

import xml.sax
from io import StringIO

class docHandler(xml.sax.ContentHandler):
  def __init__(self, cursor):
    super().__init__()
    self.cursor = cursor
    self.queue = 0
    self.output = StringIO()
  def startElement(self, name, attributes):
    if name == 'row':
      # One tab-separated line per row; '\\N' is how COPY's text format spells NULL
      self.output.write(
          attributes['Id'] + '\t' +
          (attributes['PostTypeId'] if 'PostTypeId' in attributes else '\\N') + '\t' +
          (attributes['ParentId'] if 'ParentId' in attributes else '\\N') + '\t' +
          (attributes['AcceptedAnswerId'] if 'AcceptedAnswerId' in attributes else '\\N') + '\t' +
          (attributes['CreationDate'] if 'CreationDate' in attributes else '\\N') + '\t' +
          (attributes['Score'] if 'Score' in attributes else '\\N') + '\t' +
          (attributes['ViewCount'] if 'ViewCount' in attributes else '\\N') + '\t' +
          (attributes['Body'].replace('\\', '\\\\').replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t') if 'Body' in attributes else '\\N') + '\t' +
          (attributes['OwnerUserId'] if 'OwnerUserId' in attributes else '\\N') + '\t' +
          (attributes['LastEditorUserId'] if 'LastEditorUserId' in attributes else '\\N') + '\t' +
          (attributes['LastEditorDisplayName'].replace('\n', '\\n') if 'LastEditorDisplayName' in attributes else '\\N') + '\t' +
          (attributes['LastEditDate'] if 'LastEditDate' in attributes else '\\N') + '\t' +
          (attributes['LastActivityDate'] if 'LastActivityDate' in attributes else '\\N') + '\t' +
          (attributes['CommunityOwnedDate'] if 'CommunityOwnedDate' in attributes else '\\N') + '\t' +
          (attributes['ClosedDate'] if 'ClosedDate' in attributes else '\\N') + '\t' +
          (attributes['Title'].replace('\\', '\\\\').replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t') if 'Title' in attributes else '\\N') + '\t' +
          (attributes['Tags'].replace('\n', '\\n') if 'Tags' in attributes else '\\N') + '\t' +
          (attributes['AnswerCount'] if 'AnswerCount' in attributes else '\\N') + '\t' +
          (attributes['CommentCount'] if 'CommentCount' in attributes else '\\N') + '\t' +
          (attributes['FavoriteCount'] if 'FavoriteCount' in attributes else '\\N') + '\n'
      )
      self.queue += 1
    # Send the buffer to the database every 100,000 rows
    if self.queue >= 100000:
      self.queue = 0
      self.flush()
  def endDocument(self):
    # Flush the final partial batch so the last rows are not lost
    self.flush()
  def flush(self):
    self.output.seek(0)
    self.cursor.copy_from(self.output, 'posts')
    self.output.close()
    self.output = StringIO()

StringIO lets you pass a file-like variable to the function copy_from, which uses COPY. This way, the whole import took just one night.
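As a compact illustration of the same idea, the sketch below builds the tab-separated text that COPY expects, using a generic helper instead of one expression per column. The names escape_copy and to_copy_line and the three-column layout are assumptions made for illustration; the resulting buffer would be handed to copy_from in the same way as in the handler above.

```python
from io import StringIO


def escape_copy(value):
    """Escape characters that are special in COPY's text format."""
    return (value.replace('\\', '\\\\')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r')
                 .replace('\t', '\\t'))


def to_copy_line(row, columns):
    """Render one row as a COPY text line; missing values become \\N (NULL)."""
    fields = [escape_copy(row[c]) if c in row else '\\N' for c in columns]
    return '\t'.join(fields) + '\n'


columns = ['Id', 'ParentId', 'Body']  # hypothetical subset of the real columns
buffer = StringIO()
buffer.write(to_copy_line({'Id': '1', 'Body': 'a\tb\nc'}, columns))
buffer.write(to_copy_line({'Id': '2', 'ParentId': '1', 'Body': 'x\\y'}, columns))
buffer.seek(0)
print(buffer.getvalue())
```

The backslash has to be escaped first; otherwise the backslashes introduced by the later replacements would be doubled as well.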

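Next came index creation. In theory, GiST indexes are slower than GIN indexes but take up less space, so I decided to use GiST. A day later I had an index that took up 70 GB.

When I tried a few queries, I found that they took a very long time to process. The cause was disk I/O wait. A GOODRAM C40 120 GB SSD helped a lot, even though it is not the fastest SSD available.

I created a brand-new PostgreSQL cluster:

initdb -D /media/ssd/postgres/data

Then I changed the path in the service configuration (I use the Manjaro operating system):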

vim /usr/lib/systemd/system/postgresql.service

Environment=PGROOT=/media/ssd/postgres
PIDFile=/media/ssd/postgres/data/postmaster.pid

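Then I reloaded the configuration and started PostgreSQL:

systemctl daemon-reload
systemctl start postgresql

This time I used GIN, and the import took only a few hours. The index took up 20 GB on the SSD, and queries took less than a minute.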

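Extracting books from the database

With the data finally imported, I started looking for posts that mentioned books, then copied them into a separate table using SQL:

CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE '%book%';

The next step was to find all the hyperlinks:

CREATE TABLE http_books AS SELECT * FROM posts WHERE body LIKE '%http%';

At this point I found that Stack Overflow proxies all such links, like this:

rads.stackoverflow.com/[$isbn]/

I created another table containing all the posts with these links:

CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE '%http://rads.stackoverflow.com%';

Then I used regular expressions to extract all the ISBNs. I extracted the Stack Overflow tags into another table with regexp_split_to_table.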
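A rough Python equivalent of both extraction steps is sketched below; the SQL actually used for them is not shown in the article. The regular expression assumes that the ISBN is a 10-character run (nine digits plus a digit or X) appearing as a path segment of the proxied link, and the tag format <tag1><tag2> is an assumption about how the dump stores tags. The sample post and its URL path are made up.

```python
import re

# Assumption: a 10-character ISBN appears as a path segment after the proxy host
ISBN_RE = re.compile(
    r'rads\.stackoverflow\.com/(?:[^/\s"]+/)*?(\d{9}[\dXx])(?=[/"\s]|$)')
# Assumption: tags are stored as "<tag1><tag2>..."
TAG_RE = re.compile(r'<([^<>]+)>')


def extract_isbns(body):
    """Return every ISBN-like path segment found in proxied links."""
    return ISBN_RE.findall(body)


def split_tags(tags):
    """Split a tag string such as "<c#><.net>" into a list of tag names."""
    return TAG_RE.findall(tags)


body = '<a href="http://rads.stackoverflow.com/amzn/click/0131177052">a book</a>'
print(extract_isbns(body))       # ['0131177052']
print(split_tags('<c#><.net>'))  # ['c#', '.net']
```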

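Once the most popular tags had been extracted and counted, I had the 20 most recommended books (the book list is attached at the end of this article).

Next step: refining the tags.

The idea was to take the top 20 books for each tag and exclude the books that had already been processed.

Since this was a one-off job, I decided to use PostgreSQL arrays. I wrote a script to produce the query: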

SELECT *
    , ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude))
    , ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1)
FROM (
   SELECT *
      , ARRAY['isbn1', 'isbn2', 'isbn3'] AS to_exclude
   FROM (
      SELECT
           tag
         , ARRAY_AGG(DISTINCT isbn) AS isbns
         , COUNT(DISTINCT isbn)
      FROM (
         SELECT *
         FROM (
            SELECT
                 it.*
               , t.popularity
            FROM isbn_tags AS it
            LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
            LEFT OUTER JOIN tags AS t ON t.tag = it.tag
            WHERE it.tag IN (
               SELECT tag
               FROM tags
               ORDER BY popularity DESC
               LIMIT 1 OFFSET 0
            )
            ORDER BY post_count DESC
            LIMIT 20
         ) AS t1
         UNION ALL
         SELECT *
         FROM (
            SELECT
                 it.*
               , t.popularity
            FROM isbn_tags AS it
            LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
            LEFT OUTER JOIN tags AS t ON t.tag = it.tag
            WHERE it.tag IN (
               SELECT tag
               FROM tags
               ORDER BY popularity DESC
               LIMIT 1 OFFSET 1
            )
            ORDER BY post_count DESC
            LIMIT 20
         ) AS t2
         UNION ALL
         SELECT *
         FROM (
            SELECT
                 it.*
               , t.popularity
            FROM isbn_tags AS it
            LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
            LEFT OUTER JOIN tags AS t ON t.tag = it.tag
            WHERE it.tag IN (
               SELECT tag
               FROM tags
               ORDER BY popularity DESC
               LIMIT 1 OFFSET 2
            )
            ORDER BY post_count DESC
            LIMIT 20
         ) AS t3
...
         UNION ALL
         SELECT *
         FROM (
            SELECT
                 it.*
               , t.popularity
            FROM isbn_tags AS it
            LEFT OUTER JOIN isbns AS i ON i.isbn = it.isbn
            LEFT OUTER JOIN tags AS t ON t.tag = it.tag
            WHERE it.tag IN (
               SELECT tag
               FROM tags
               ORDER BY popularity DESC
               LIMIT 1 OFFSET 78
            )
            ORDER BY post_count DESC
            LIMIT 20
         ) AS t79
      ) AS tt
      GROUP BY tag
      ORDER BY MAX(popularity) DESC
   ) AS ttt
) AS tttt
ORDER BY ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1) DESC;
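The intent of the query (take the top 20 books for each of the most popular tags, then drop the ISBNs that were already handled) may be easier to follow in a few lines of Python. This is a sketch on made-up in-memory data, not a replacement for the SQL; the function name and the tuple layout (tag, isbn, post_count) are assumptions.

```python
from collections import defaultdict


def top_books_per_tag(isbn_tags, popular_tags, to_exclude, limit=20):
    """For each popular tag, keep its `limit` most-mentioned ISBNs minus `to_exclude`.

    isbn_tags: iterable of (tag, isbn, post_count) tuples (assumed layout).
    Returns {tag: sorted list of remaining ISBNs}.
    """
    by_tag = defaultdict(list)
    for tag, isbn, post_count in isbn_tags:
        if tag in popular_tags:
            by_tag[tag].append((post_count, isbn))

    result = {}
    for tag, rows in by_tag.items():
        # Equivalent of ORDER BY post_count DESC LIMIT 20
        top = sorted(rows, reverse=True)[:limit]
        # Equivalent of UNNEST(isbns) EXCEPT UNNEST(to_exclude)
        result[tag] = sorted({isbn for _, isbn in top} - set(to_exclude))
    return result


rows = [('java', 'isbn1', 50), ('java', 'isbn2', 30), ('java', 'isbn3', 10),
        ('c', 'isbn2', 40), ('c', 'isbn4', 5), ('perl', 'isbn5', 1)]
print(top_books_per_tag(rows, {'java', 'c'}, ['isbn1'], limit=2))
```

With limit=2, the tag java keeps isbn1 and isbn2 and then loses isbn1 to the exclusion list, while perl is ignored because it is not among the popular tags.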

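With this data in hand, I started building the website.

Building the web app

Since I am neither a web developer nor a web interface expert, I decided to create a very simple single-page app based on the default Bootstrap theme.

I created a "search by tag" option, then extracted the most popular tags so that each one can be clicked to run a search.

I used a bar chart to display the search results. I tried Highcharts and D3, but they are better suited to dashboards. They also had some responsiveness issues and were quite complex to configure. So I created my own responsive chart based on SVG. To keep it responsive, it has to be redrawn when the screen orientation changes: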

var w = $('#plot').width();
var bars = "";
var imgs = "";
var texts = "";
var rx = 10;
var tx = 25;
var max = Math.floor(w / 60);  // each bar takes a 60px slot
var maxPop = 0;
var obj, h, dt;
// First pass: find the highest popularity among the visible books
for (var i = 0; i < max; i++) {
  if (i > books.length - 1) {
    break;
  }
  obj = books[i];
  if (maxPop < Number(obj.pop)) {
    maxPop = Number(obj.pop);
  }
}
// Second pass: build the SVG markup for bars, covers and labels
for (i = 0; i < max; i++) {
  if (i > books.length - 1) {
    break;
  }
  obj = books[i];
  h = Math.floor((180 / maxPop) * obj.pop);
  // Nudge the label horizontally depending on the number of digits
  dt = 0;
  if (('' + obj.pop + '').length == 1) {
    dt = 5;
  }
  if (('' + obj.pop + '').length == 3) {
    dt = -3;
  }
  var scrollTo = 'onclick="scrollTo(\'' + obj.id + '\'); return false;"';
  bars += '<rect id="rect' + obj.id + '" x="' + rx + '" y="' + (180 - h + 30) + '" width="50" height="' + h + '" ' + scrollTo + '>';
  bars += '<title>' + obj.name + '</title>';
  bars += '</rect>';
  imgs += '<image height="70" x="' + rx + '" y="220" href="img/ol/jpeg/' + obj.id + '.jpeg" onmouseout="unhoverbar(' + obj.id + ');" onmouseover="hoverbar(' + obj.id + ');" width="50" ' + scrollTo + '>';
  imgs += '<title>' + obj.name + '</title>';
  imgs += '</image>';
  texts += '<text x="' + (tx + dt) + '" y="' + (180 - h + 20) + '" class="bar-label" style="font-size: 16px;" ' + scrollTo + '>' + obj.pop + '</text>';
  rx += 60;
  tx += 60;
}
$('#plot').html(
    ' <svg width="100%" height="300" aria-labelledby="title desc" role="img">'
  + '  <defs> '
  + '    <style type="text/css"><![CDATA['
  + '      .cla {'
  + '        fill: #337ab7;'
  + '      }'
  + '      .cla:hover {'
  + '        fill: #5bc0de;'
  + '      }'
  + '      ]]></style>'
  + '  </defs>'
  + '  <g>'
  + bars
  + '  </g>'
  + '  <g>'
  + imgs
  + '  </g>'
  + '  <g>'
  + texts
  + '  </g>'
  + '</svg>');
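The layout arithmetic in the script is compact, so here it is restated as a small Python sketch (the function name and the sample numbers are made up): each bar occupies a 60-pixel slot, the number of bars is limited by the container width, and heights are scaled so that the most popular visible book gets the full 180 pixels.

```python
import math


def bar_layout(pops, width, slot=60, max_height=180, top_margin=30, left=10):
    """Return (x, y, height) for each bar that fits into `width` pixels."""
    visible = pops[:math.floor(width / slot)]  # as many bars as slots fit
    if not visible:
        return []
    max_pop = max(visible)
    bars = []
    for i, pop in enumerate(visible):
        h = math.floor(max_height / max_pop * pop)
        # y grows downward in SVG, so taller bars start higher up
        bars.append((left + i * slot, max_height - h + top_margin, h))
    return bars


print(bar_layout([90, 45, 30, 10], width=200))
# [(10, 30, 180), (70, 120, 90), (130, 150, 60)]
```

A 200-pixel container fits three slots, so the fourth book is dropped; a wider container shows more bars, which is why the chart has to be rebuilt when the available width changes.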

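Web server failure

Right after I published dev-books.com, a crowd of visitors hit the site. Apache could not serve more than 500 visitors at the same time, so I quickly switched to Nginx. I was really surprised when the number of real-time visitors reached 800.

Book list download:

Stack Overflow recommended book list (PDF)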