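About the Book
Now that people have realized data can make the difference in an election or a business model, data science as a profession is growing steadily.
But how should you go about working in such a broad and intricate interdisciplinary field? This book, Data Science (Reprint Edition) by Schutt and O'Neil, tells you everything you need to know.
It is rich in insight and is compiled from the lecture notes of Columbia University's data science course.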
Table of Contents
Preface
1. Introduction: What Is Data Science?
Big Data and Data Science Hype
Getting Past the Hype
Why Now?
Datafication
The Current Landscape (with a Little History)
Data Science Jobs
A Data Science Profile
Thought Experiment: Meta-Definition
OK, So What Is a Data Scientist, Really?
In Academia
In Industry
2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
Statistical Inference
Populations and Samples
Populations and Samples of Big Data
Big Data Can Mean Big Assumptions
Modeling
Exploratory Data Analysis
Philosophy of Exploratory Data Analysis
Exercise: EDA
The Data Science Process
A Data Scientist's Role in This Process
Thought Experiment: How Would You Simulate Chaos?
Case Study: RealDirect
How Does RealDirect Make Money?
Exercise: RealDirect Data Strategy
3. Algorithms
Machine Learning Algorithms
Three Basic Algorithms
Linear Regression
k-Nearest Neighbors (k-NN)
k-means
Exercise: Basic Machine Learning Algorithms
Solutions
Summing It All Up
Thought Experiment: Automated Statistician
4. Spam Filters, Naive Bayes, and Wrangling
Thought Experiment: Learning by Example
Why Won't Linear Regression Work for Filtering Spam?
How About k-nearest Neighbors?
Naive Bayes
Bayes Law
A Spam Filter for Individual Words
A Spam Filter That Combines Words: Naive Bayes
Fancy It Up: Laplace Smoothing
Comparing Naive Bayes to k-NN
Sample Code in bash
Scraping the Web: APIs and Other Tools
Jake's Exercise: Naive Bayes for Article Classification
Sample R Code for Dealing with the NYT API
5. Logistic Regression
Thought Experiments
Classifiers
Runtime
You
Interpretability
Scalability
M6D Logistic Regression Case Study
Click Models
The Underlying Math
6. Time Stamps and Financial Modeling
7. Extracting Meaning from Data
8. Recommendation Engines: Building a User-Facing Data Product at Scale
9. Data Visualization and Fraud Detection
10. Social Networks and Data Journalism
11. Causality
12. Epidemiology
13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
14. Data Engineering: MapReduce, Pregel, and Hadoop
15. The Students Speak
16. Next-Generation Data Scientists, Hubris, and Ethics
Index