针对数据分析工作的Python入门简介（上篇）

本文是极客时间专栏《数据分析实战 45 讲》第 3 讲的笔记，由于原文章对于 Python 基础的讲解比较简洁，出于复习和分享的目的，对笔记进行了很多细节上的扩充，于是有了这篇文章。

为什么要学习 Python ？

有很多人会问，要掌握数据分析，一定要学习 Python 语言吗？

首先，根据调查，使用过 Python 的开发者中，有 80% 都会把 Python 当做自己的主力开发语言，Python 已经是现如今发展最快的编程语言。

其次，在数据分析领域里，使用 Python 的开发者是最多的。感兴趣的朋友可以参考这篇对比 Python、R 和 Matlab/Octave 的文章《R vs Python vs MATLAB vs Octave》，链接附在文末。

最后，Python 语法简洁，有一个说法是，当你需要用 500 行 C/C++ 代码来完成一项功能的时候，使用 Java 你需要 200 行，但使用 Python 你只需要 50 行。另外，Python 的第三方库众多且强大，足以解决大部分情况下的数据分析问题。在数据分析领域，Python 有许多著名的工具库，如科学计算工具 NumPy 和 Pandas，机器学习 Scikit-learn，深度学习 Keras、TensorFlow、PyTorch 和 MXNet，爬虫框架 Scrapy。

安装 Python 和 IDE

Python 的版本选择

虽然 Python 2.7 在 2020 年就要停止支持，但是很多商业项目中 2.7 版本仍然是主流，所以不应该忽视 Python 2.7。我个人认为，如果是零基础刚开始上手的朋友，直接学习 Python 3 就可以了。

IDE 推荐

市面上有众多的 Python IDE 可以很方便的使用，如 PyCharm、Sublime Text、Vim、Eclipse + PyDev 等，但我个人推荐使用 Anaconda 来配置 Python 虚拟环境，并使用 Jupyter Notebook 作为学习环境，在配备一些插件之后它甚至可以作为 IDE 来使用（关于 Jupyter Notebook 的插件可以看今天的第三篇文章），另外 PyCharm 和 Visual Studio Code 也是两个我常用的开发工具。

基础语法

输入和输出

name = input('Please enter your name: ')
money = 100 + 200
print('Hello, {0}! You have {1} RMB.'.format(name, money))
print('Hello, %s! You have %d RMB.' % (name, money))

Please enter your name: Alain
Hello, Alain! You have 300 RMB.
Hello, Alain! You have 300 RMB.

这里使用了两种格式化的写法，在后文中的字符串部分会谈到。

分支与循环

if … else … 语句

score = int(input('Please enter a score: '))
if 90 <= score <= 100:
	print('Excellent!')
elif 60 <= score < 90:
	print('Good job!')
elif score < 0 or score > 100:
	print('Error input!')
else:
	print('Failed.')

Python 采用缩进的方式来区分代码块之间的层级关系，这种强制规定使得代码必须正确对齐，让程序员决定else应该属于哪一个if，从而避免了 C/C++ 或 Java 中的「悬挂else」的问题，减少了不确定性，并且使得代码变得简洁易读。

条件表达式（三元操作符 Ternary Operator）

所谓的「多少元」操作符的意思是，这个操作符有多少个操作数。例如-负号就是「一元操作符」，=是「二元操作符」。Python 原先是没有三元操作符的，Python 的作者 Guido van Rossum 认为，三元操作符会让程序的结构变得复杂。但是由于社区的需求非常强烈，最终 Guido 在 2005 年发布的 Python 2.5 版本中加入了三元操作符。

a = x if 条件 else y

三元操作符可以让一些条件判断和赋值操作变得非常的简洁。比如原先的代码是这样写的：

if x > y:
	big = x
else:
	big = y

big = x if x > y else y

while 循环语句

Python 的while语句和if语句非常相似，都是在条件为真的时候执行一段代码，不同的是，只要条件为真，while的循环就会一直执行那段代码，即循环体。

while 条件:
	循环体

for 循环语句

在 Python 中，for循环经常会和range()内建函数（Built-In Function，BIF）一起使用。 range()的使用方法如下：

range([start,] stop[, step = 1])

>>> for i in range(5):
		print(i)
0
1
2
3
4
>>> for i in range(1, 5):
		print(i)
1
2
3
4
>>> for i in range(1, 10, 2):
		print(i)
1
3
5
7
9

对比 while 与 for 循环，while 适合用于循环次数不确定的循环，for 适合循环次数相对确定的循环。

break 与 continue 语句

break语句的作用是，终止当前循环，并跳出循环体。而continue语句的作用是，终止本轮循环并开始下一轮循环。举例：

for i in range(10):
    if i % 2 != 0:
        print(i)
        continue
        # break
    i += 2

这一段代码的作用是打印输出 10 以内的单数。continue在这里表示，当当前数字不能被 2 整除的时候，打印之，并终止本轮循环，然后直接开始下一轮循环。当把continue注释掉使用break时，整段代码就只能打印出一个1。

数据结构

列表、元组与字符串

列表 list

Python 中的列表很像 C/C++ 中的数组，具有增删查改的功能，但不需要像 C/C++ 的数组一样需要数组内部元素的数据类型一致。举例：

number = [1, 2, 3, 4, 5]
mix = [1, 'hello world', 1.11, ['a', 1, 'b', 2]]
empty = []

>>> number = [1, 2, 3, 4, 5]
>>> number[1] # 从列表中获取相应位置的元素
2
>>> number.append(6) # 向列表尾部添加一个元素
>>> number
[1, 2, 3, 4, 5, 6]
>>> number.pop() # 删除列表尾部的元素
1
>>> len(number) # 计算列表的长度
5
>>> number.insert(1, 'a') # 在列表中插入元素
>>> number
[1, 'a', 2, 3, 4, 5]
>>> number.remove('a') # 删除列表中的元素，不需要知道元素的位置
>>> number
[1, 2, 3, 4, 5]
>>> del number[1] # 删除指定位置的元素，del 是一个语句而不是方法，不需要添加括号
>>> number
[1, 3, 4, 5]
>>> del number # del 也可以用来删除整个列表
>>> number
Traceback (most recent call last):
  File "<pyshell#43>", line 1, in <module>
    number
NameError: name 'number' is not defined

如果需要一次性从列表中获取多个元素，可以使用列表分片（slice）的方式：

>>> number = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> number[:3]
[1, 2, 3]
>>> number[0:3]
[1, 2, 3]
>>> number[1:3]
[2, 3]
>>> number[0:9:2]
[1, 3, 5, 7, 9]
>>> number[:]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> number[::-1]
[9, 8, 7, 6, 5, 4, 3, 2, 1]

列表还有一些其他的常用操作符和方法，限于篇幅就不展开，留给大家自己去研究了。

元组 tuple

元组和列表最大的区别就在于，列表中的元素可以被任意地修改，而元组是不可改变的，后文提到的字符串也是如此。举例：

>>> number = (1, 2, 3, 4, 5, 6, 7, 8, 9)
>>> number[1] # 访问元素的方法和列表一致
2
>>> number[:3] # 也有分片操作
(1, 2, 3)
>>> number[1] = 1 # 如果试图修改元组，则告诉你元组对象不支持赋值操作
Traceback (most recent call last):
  File "<pyshell#58>", line 1, in <module>
    number[1] = 1
TypeError: 'tuple' object does not support item assignment
>>> empty = ()
>>> number + ('a', 'b', 'c', 'd') # 元组支持连接操作符 +
(1, 2, 3, 4, 5, 6, 7, 8, 9, 'a', 'b', 'c', 'd')

字符串 string

字符串也可叫做「文本」，文本和数字是两个截然不同的东西。如果让两个数字相加，那么会得到一个相加的结果：

>>> '1' + '2'
'12'

所以在 Python 中要创建字符串，就需要在字符两边加上引号，单引号双引号皆可，但必须成对。如果字符串中需要出现引号，则应使用反斜杠\来进行转义，或使用单双引号嵌套的做法：

>>> 'abc"
SyntaxError: EOL while scanning string literal

>>> 'Let\'s go!'
"Let's go!"
>>> "Let's go!"
"Let's go!"

>>> str1 = '123456789'
>>> str1[1]
'2'
>>> str1[:3]
'123'
>>> str1[1] = 1
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    str1[1] = 1
TypeError: 'str' object does not support item assignment
>>> str1 + 'abcd' # 字符串支持连接操作符 +
'123456789abcd'

在上文中有提到字符串的格式化。所谓格式化字符串，就是按照一个统一的规则去输出一个字符串。

name = input('Please enter your name: ')
money = 100 + 200
print('Hello, {0}! You have {1} RMB.'.format(name, money))
print('Hello, %s! You have %d RMB.' % (name, money))

Please enter your name: Alain
Hello, Alain! You have 300 RMB.
Hello, Alain! You have 300 RMB.

这里有两种格式化的写法，但format()能提供更精确和简洁的控制。如在某些场景下，对比%这种格式化方式，format()可以同时传递一个变量和元组，如下所示：

>>> name = (1, 2, 3)
>>> print('ok, %s' % name)
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    print('ok, %s' % name)
TypeError: not all arguments converted during string formatting
>>> print('ok, {}'.format(name))
ok, (1, 2, 3)

更多的字符串格式化的内容也请大家自行研究探讨，这里就不展开了。

字典与集合

字典 dictionary

字典是 Python 中唯一的映射类型，在 Python 中，字典以「键值对（key-value）」的形式储存，有些地方会把字典称为「哈希（Hash）」、「散列」或者「关系数组」，其实都和「字典」是同一个概念。

>>> testScore = {'a': 96, 'b': 85, 'c': 90}
>>> testScore
{'a': 96, 'b': 85, 'c': 90}
>>> print("a's score is:", testScore['a'])
a's score is: 96

>>> empty = {} # 声明空字典
>>> empty
{}
>>> type(empty)
<class 'dict'>

>>> testSocre = dict((('a', 96), ('b', 85), ('c', 90))) # 也可以通过这几种方式创建字典
>>> testScore
{'a': 96, 'b': 85, 'c': 90}
>>> testScore = dict(a=96, b=85, c=90)
>>> testScore
{'a': 96, 'b': 85, 'c': 90}
>>> testScore = dict('a'=96, 'b'=85, 'c'=90) # 注意此处键的位置不能加上字符串的引号
SyntaxError: keyword can't be an expression

>>> testScore
{'a': 96, 'b': 85, 'c': 90}
>>> testScore['d']
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    testScore['d']
KeyError: 'd'
>>> testScore['d'] = 100
>>> testScore['d']
100

下面这一段代码本人觉得非常高效地给出了创建字典的 5 中方法（引用自小甲鱼《零基础入门学习 Python》书籍）：

>>> a = dict(one=1, two=2, three=3)
>>> b = {'one':1, 'two':2, 'three':3}
>>> c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
>>> d = dict([('two', 2), ('one', 1), ('three', 3)])
>>> e = dict({'three':3, 'one':1, 'two':2})
>>> a == b == c == d == e
True

>>> example = {}
>>> example.fromkeys(range(32), '赞') # fromkeys() 方法，用于创建并返回一个新的字典
{0: '赞', 1: '赞', 2: '赞', 3: '赞', 4: '赞', 5: '赞', 6: '赞', 7: '赞', 8: '赞', 9: '赞', 10: '赞', 11: '赞', 12: '赞', 13: '赞', 14: '赞', 15: '赞', 16: '赞', 17: '赞', 18: '赞', 19: '赞', 20: '赞', 21: '赞', 22: '赞', 23: '赞', 24: '赞', 25: '赞', 26: '赞', 27: '赞', 28: '赞', 29: '赞', 30: '赞', 31: '赞'}
>>> example # 记得要覆盖原字典，否则并未对原字典进行修改
{}
>>> example = example.fromkeys(range(32), '赞')
>>> example
{0: '赞', 1: '赞', 2: '赞', 3: '赞', 4: '赞', 5: '赞', 6: '赞', 7: '赞', 8: '赞', 9: '赞', 10: '赞', 11: '赞', 12: '赞', 13: '赞', 14: '赞', 15: '赞', 16: '赞', 17: '赞', 18: '赞', 19: '赞', 20: '赞', 21: '赞', 22: '赞', 23: '赞', 24: '赞', 25: '赞', 26: '赞', 27: '赞', 28: '赞', 29: '赞', 30: '赞', 31: '赞'}
>>> example.keys() # keys() 方法用于获取所有键
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
>>> example.values() # values() 方法用于获取所有值
dict_values(['赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞', '赞'])
>>> example.items() # items() 方法用于获取所有键值对
dict_items([(0, '赞'), (1, '赞'), (2, '赞'), (3, '赞'), (4, '赞'), (5, '赞'), (6, '赞'), (7, '赞'), (8, '赞'), (9, '赞'), (10, '赞'), (11, '赞'), (12, '赞'), (13, '赞'), (14, '赞'), (15, '赞'), (16, '赞'), (17, '赞'), (18, '赞'), (19, '赞'), (20, '赞'), (21, '赞'), (22, '赞'), (23, '赞'), (24, '赞'), (25, '赞'), (26, '赞'), (27, '赞'), (28, '赞'), (29, '赞'), (30, '赞'), (31, '赞')])
>>> example.get(31) # get() 方法可以按照键去索引字典内容
'赞'
>>> example.get(32) # 但如果字典中不存在这个键，get() 方法返回一个 None 表示什么也没有找到
>>>
>>> example.get(32, '木有') # 如果这个键不存在，也可以给它一个默认值
'木有'
>>> example.get(31, '木有')
'赞'
>>> 31 in example # 另外也可以使用 in 或 not in 这两个「成员资格操作符」来判断键是否在字典中
True
>>> 32 in example
False

其他关于字典的方法还请读者自行查找练习。文末有一篇关于 Python 中字典的存储方式，解释了为什么给相同的键赋值会直接覆盖的原因，大家可以参考。

集合 set

>>> test = {}
>>> type(test)
<class 'dict'>

>>> test = {1, 2, 3}
>>> type(test)
<class 'set'>

学过高中数学的朋友们都知道（可能忘了），集合有三大特点（摘自维基百科）：

>>> test = {1, 2, 3, 4, 5, 1, 5, 3, 2, 0}
>>> test
{0, 1, 2, 3, 4, 5}
>>> test[2]
Traceback (most recent call last):
  File "<pyshell#52>", line 1, in <module>
    test[2]
TypeError: 'set' object does not support indexing

这里可以很好的体现集合的两大特点：集合具有互异性，所以重复的数据会被清理掉；集合具有无序性，所以不能通过索引去访问集合中的某一元素（这里 Python 好心办坏事，体现出来似乎有「有序性」，但实际上集合内部是无序的）。

>>> set1 = {'a', 'b', 'c', 'a'}
>>> set2 = set(['a', 'b', 'c', 'a'])
>>> set1 == set2
True

>>> for each in test:    # 使用迭代的方法可以将集合中的数据读取出来
	print(each, end=' ')

0 1 2 3 4 5
>>> 5 in test # 使用成员资格操作符来判断一个元素是否属于一个集合
True
>>> 6 in test
False
>>> test.add(6) # add() 方法添加元素
>>> test
{0, 1, 2, 3, 4, 5, 6}
>>> test.remove(0) # remove() 方法删除元素
>>> test
{1, 2, 3, 4, 5, 6}