<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <meta charset="utf-8"/>
    <title>Data + Design</title>
    <link rel="stylesheet" type="text/css" href="theme/html/html.css"/>
<script src="js/retina.min.js" type="text/javascript"> </script>
<script src="js/jquery.min.js" type="text/javascript"> </script>
<script src="js/data-design.js" type="text/javascript"> </script>
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
  </head>
  <body data-type="book">
    <span class="btn open">Open</span>
    <div class="navbar">
      <div class="title">
        <span class="btn close">Close</span>
        <h1>Data + Design</h1>
        <h2>A Brief Introduction to Preparing and Visualizing Information</h2>
      </div>
      <nav data-type="toc" id="idp97216">
  <ol>
    <li data-type="part">
      <a href="titlepage01.html">Introduction</a>
      <ol>
        <li data-type="copyright-page">
          <a href="copyright-page01.html">Copyright and License</a>
        </li>
        <li data-type="preface">
          <a href="preface01.html">Preface</a>
        </li>
        <li data-type="foreword">
          <a href="foreword01.html">Foreword</a>
        </li>
        <li data-type="introduction">
          <a href="introduction01.html">How to Use This Book</a>
        </li>
      </ol>
    </li>

    <li data-type="part">
      <a href="part01.html">Data Fundamentals</a>
      <ol>
        <li data-type="chapter">
          <a href="ch01.html">Basic Data Types</a>
        </li>
        <li data-type="chapter">
          <a href="ch02.html">About Data Aggregation/Statistics</a>
        </li>
      </ol>
    </li>

    <li data-type="part">
      <a href="part02.html">Collecting Data</a>
      <ol>
        <li data-type="chapter">
          <a href="ch03.html">Introduction to Survey Data</a>
        </li>

        <li data-type="chapter">
          <a href="ch04.html">Types of Survey Questions</a>
        </li>

        <li data-type="chapter">
          <a href="ch05.html">Other Data Collection Methods</a>
        </li>

        <li data-type="chapter">
          <a href="ch06.html">Finding External Data</a>
        </li>
      </ol>
    </li>

    <li data-type="part">
      <a href="part03.html">Getting Data Ready</a>
      <ol>
        <li data-type="chapter">
          <a href="ch07.html">Data Preparation</a>
        </li>

        <li data-type="chapter">
          <a href="ch08.html">Data Cleaning</a>
        </li>

        <li data-type="chapter">
          <a href="ch09.html">Types of Data Checks</a>
        </li>

        <li data-type="chapter">
          <a href="ch10.html">What Data Cleaning Can and Can’t Catch</a>
        </li>

        <li data-type="chapter">
          <a href="ch11.html">Data Transformations</a>
        </li>
      </ol>
    </li>

    <li data-type="part">
      <a href="part04.html">Visualizing Data</a>
      <ol>
        <li data-type="chapter">
          <a href="ch12.html">Deciding Which and How Much Data to Present</a>
        </li>

        <li data-type="chapter">
          <a href="ch13.html">Graphing Survey Responses</a>
        </li>

        <li data-type="chapter">
          <a href="ch14.html">Anatomy of an Infographic</a>
        </li>

        <li data-type="chapter">
          <a href="ch15.html">The Importance of Color, Font, and Icons</a>
        </li>

        <li data-type="chapter">
          <a href="ch16.html">Print Vs. Web, Static Vs. Interactive</a>
        </li>
      </ol>
    </li>

    <li data-type="part">
      <a href="part05.html">What Not to Do</a>
      <ol>
        <li data-type="chapter">
          <a href="ch17.html">Perception Deception</a>
        </li>

        <li data-type="chapter"><a href="ch18.html">Common Visualization Mistakes</a>
        </li>
      </ol>
    </li>

    <li data-type="part">
      <a href="#">Conclusion</a>
      <ol>
        <li data-type="chapter">
          <a href="app01.html">Resources</a>
        </li>
        <li data-type="chapter">
          <a href="glossary01.html">Glossary</a>
        </li>
        <li data-type="chapter">
          <a href="acknowledgments01.html">Contributors/Acknowledgments</a>
        </li>
      </ol>
    </li>
  </ol>
</nav>

    </div>
    <section class="blue" data-type="chapter" data-pdf-bookmark="Chapter 28. What Data Cleaning Can and Can’t Catch" id="idp6086928">
<header>
  <div class="icon"><img src="images/sections/04/close-inspection.png"/></div>
  <p>Chapter 10</p>
  <h1>What Data Cleaning Can and Can’t Catch</h1>
  <p data-type="author">By Dyanna Gregory</p>
</header>

<section data-type="sect1" id="idp6083840">
<p>Now that we understand what data cleaning is for and what methods and approaches there are to shape up our dataset, there is still the question of what cleaning can and can’t catch.</p>

<p>A general rule for cleaning a dataset where each column is a variable and the rows represent the records is:</p>

<ul>
	<li>if the number of incorrect or missing values in a row is greater than the number of correct values, it is recommended to exclude that row.</li>
	<li>if the number of incorrect or missing values in a column is greater than the number of correct values in that column, it is recommended to exclude that column.</li>
</ul>
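<p>The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset is a list of rows, every incorrect or missing cell has already been marked as <code>None</code>, and the function names are hypothetical rather than taken from any library.</p>

```python
MISSING = None  # assumption: missing or known-incorrect cells are stored as None

def bad_count(values):
    """Count the cells that are missing or flagged as incorrect."""
    return sum(1 for v in values if v is MISSING)

def rows_to_exclude(table):
    """Indices of rows in which bad cells outnumber good cells."""
    return [i for i, row in enumerate(table)
            if bad_count(row) > len(row) - bad_count(row)]

def columns_to_exclude(table):
    """Indices of columns in which bad cells outnumber good cells."""
    columns = list(zip(*table))  # transpose rows into columns
    return [j for j, col in enumerate(columns)
            if bad_count(col) > len(col) - bad_count(col)]

# Made-up example data: age, sex, an unusable third variable, height.
table = [
    [34,   "F",  None, 170],
    [None, None, None, 165],
    [51,   "M",  None, 180],
]

print(rows_to_exclude(table))     # [1]
print(columns_to_exclude(table))  # [2]
```

<p>The functions only report which rows and columns are candidates for exclusion; they do not modify the table.</p>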

<p>It should be made clear that exclusion is not the same as deletion! If you decide that you don’t want to include a row or column in your analysis or visualization, set it aside in a separate dataset rather than deleting it altogether. Once data are deleted, you can no longer retrieve them, even if you realize later on that there was a way to fill in the missing values. Unless you are absolutely certain that you will not use a record or variable again, do not delete it.</p>
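<p>One way to exclude without deleting is to split the records into two collections and keep both. The sketch below assumes records are dictionaries with <code>None</code> for missing values; <code>set_aside</code> and <code>mostly_missing</code> are hypothetical helper names used only for illustration.</p>

```python
def set_aside(records, should_exclude):
    """Split records into (kept, excluded) without discarding anything."""
    kept, excluded = [], []
    for record in records:
        (excluded if should_exclude(record) else kept).append(record)
    return kept, excluded

def mostly_missing(record):
    """Hypothetical rule: more missing (None) values than present ones."""
    missing = sum(1 for v in record.values() if v is None)
    return missing > len(record) - missing

# Made-up example records.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": None},
]

kept, excluded = set_aside(records, mostly_missing)
print(len(kept), len(excluded))  # 1 1
```

<p>Every original record ends up in exactly one of the two lists, so the excluded list can be saved as its own dataset and revisited later if a way to fill in the missing values turns up.</p>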

<p>In the last few chapters, we have talked about several different processes for data cleaning and have seen the types of problems they can help identify and fix. When we’re searching for errors and mistakes, we are able to detect potential problems such as:</p>

<ul>
	<li>inconsistent labels, misspellings, and errors in punctuation;</li>
	<li>outliers, <a class="glossterm" target="_blank" href="glossary01.html#data-invalid">invalid values</a>, and extreme values;</li>
	<li>data that aren’t internally consistent within the dataset (e.g. 200 lbs. of morphine);</li>
	<li>lack or excess of data;</li>
	<li>odd patterns in distributions; and</li>
	<li>missing values.</li>
</ul>
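<p>A few of the problems listed above lend themselves to simple automated checks. The following is a minimal sketch, not a complete cleaning routine: the helper names, the example labels, and the 50 to 700 lb. range are all assumptions made for illustration.</p>

```python
from collections import Counter

def normalize_label(label):
    """Collapse case, surrounding whitespace, and trailing punctuation."""
    return label.strip().rstrip(".,;").lower()

def out_of_range(values, low, high):
    """Return (index, value) pairs outside [low, high]; None is skipped."""
    return [(i, v) for i, v in enumerate(values)
            if v is not None and (v > high or low > v)]

def missing_positions(values):
    """Return the indices of missing (None) values."""
    return [i for i, v in enumerate(values) if v is None]

# Made-up example data.
labels = ["Female", "female ", "FEMALE.", "Male"]
weights_lb = [150, 200000, None, 182]

# Inconsistent labels collapse into two categories after normalizing.
print(Counter(normalize_label(x) for x in labels))

print(out_of_range(weights_lb, 50, 700))  # [(1, 200000)]
print(missing_positions(weights_lb))      # [2]
```

<p>Checks like these only flag candidates. Whether a flagged value is an error or a genuine extreme value still has to be decided by looking at the record.</p>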

<p>What we haven’t talked much about yet is what data cleaning can’t catch. Some incorrect values fall within the acceptable range for the data and look entirely plausible. For example, if someone enters the number 45 instead of 54 into your dataset and your valid range of numbers is 0-100, you are unlikely to catch that error unless you’re cross-checking that field against another field or verifying the information against an outside source record.</p>
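<p>The 45-versus-54 case can be made concrete. In the sketch below, the value is assumed to be an age, and the record is assumed to also carry a birth year that can serve as the cross-check field; both the field names and the survey year are hypothetical.</p>

```python
def in_valid_range(value, low=0, high=100):
    """Range check: True when the value lies within [low, high]."""
    return high >= value >= low

def age_matches_birth_year(age, birth_year, survey_year, tolerance=1):
    """Cross-check: age should agree with the age implied by birth year.

    A tolerance of one year allows for a birthday that has not yet
    occurred in the survey year.
    """
    implied_age = survey_year - birth_year
    return tolerance >= abs(age - implied_age)

# Made-up record: the true age was 54, but the digits were transposed.
record = {"age": 45, "birth_year": 1970}

print(in_valid_range(record["age"]))  # True: the range check passes
print(age_matches_birth_year(record["age"], record["birth_year"], 2024))  # False
```

<p>The range check accepts the value because 45 is a perfectly valid age. Only the comparison with a second field reveals the disagreement, and even then it cannot say which of the two fields is wrong.</p>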

<p>Similarly, you may be receiving information from an online survey form, and the person filling it out may have selected the button for “Strongly Agree” when they actually meant to select “Strongly Disagree.” Again, unless this answer is somehow cross-checked with another variable or source, you will have no easy way to detect this error. Some errors of this type are more critical than others. If a person selects “Strongly Agree” instead of “Agree” on an opinion survey, that is unlikely to affect the results as much as someone accidentally marking the wrong gender on a form for a research study in which gender is a grouping category for treatment assignments.</p>
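<p>One possible cross-check for a survey answer, assuming the questionnaire happens to include a second item that states the opposite of the first, is to compare the two answers. The five-point coding, the function name, and the <code>max_gap</code> threshold below are assumptions for illustration, not part of any particular survey tool.</p>

```python
# Assumed five-point coding of the answer choices.
SCALE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

def contradicts(answer, opposite_item_answer, max_gap=2):
    """Flag a pair of answers that agree with two opposite statements.

    Reverse-scoring maps 1..5 onto 5..1, so for a consistent respondent
    the first answer and the reverse-scored second answer are close.
    """
    reverse_scored = 6 - SCALE[opposite_item_answer]
    return abs(SCALE[answer] - reverse_scored) > max_gap

print(contradicts("Strongly Agree", "Strongly Agree"))     # True
print(contradicts("Strongly Agree", "Strongly Disagree"))  # False
print(contradicts("Strongly Agree", "Disagree"))           # False
```

<p>A flagged pair shows that one of the two answers is probably a slip, but not which one. A smaller slip, such as “Strongly Agree” for “Agree,” passes this check unnoticed.</p>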

<p>Data cleaning also can’t tell whether a missing value is truly missing (i.e. the question was accidentally skipped or the data were not collected for some reason) or was purposely skipped (i.e. the participant declined to answer) unless “Prefer not to answer” was an answer choice. The distinction may be relevant in some cases (particularly in demographics), though in others you may decide to treat both as missing data. This is why, as mentioned before, you need to include a “Prefer not to answer” choice for any question of a personal nature where you want to know whether the data are truly missing, since some people may actively choose not to answer questions about race/ethnicity, income, political affiliation, sexual orientation, etc.</p>
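<p>When “Prefer not to answer” is offered as a choice, the two kinds of non-response can be kept apart during cleaning. A minimal sketch, assuming responses arrive as strings, that skipped questions arrive as <code>None</code> or an empty string, and using made-up income categories:</p>

```python
from collections import Counter

DECLINED = "Prefer not to answer"

def classify(response):
    """Label a response as 'answered', 'declined', or 'missing'."""
    if response is None or response.strip() == "":
        return "missing"   # skipped by accident or never collected
    if response == DECLINED:
        return "declined"  # the participant chose not to answer
    return "answered"

# Made-up responses to an income question.
responses = ["Under 25,000", DECLINED, None, "", "25,000 to 50,000"]

counts = Counter(classify(r) for r in responses)
print(counts["answered"], counts["declined"], counts["missing"])  # 2 1 2
```

<p>Without the explicit answer choice, the declined response would arrive as a blank and be counted with the other missing values.</p>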
</section>
</section>
    <div class="navigation">
      <ul>
        <li id="next_page"><a href="ch11.html">Next</a></li>
        <li id="previous_page"><a href="ch09.html">Previous</a></li>
      </ul>
    </div>

    <script>
      var _hmt = _hmt || [];
      (function() {
        var hm = document.createElement("script");
        hm.src = "//hm.baidu.com/hm.js?27111432badf47d1f6260dcd3c815289";
        var s = document.getElementsByTagName("script")[0];
        s.parentNode.insertBefore(hm, s);
      })();
    </script>
  </body>
</html>