Python语法速查:数据处理

neevop 十二月 10, 2022

JSON

Text file format for storing collections of strings and numbers.

import json
<str>    = json.dumps(<object>)     # Converts object to JSON string.
<object> = json.loads(<str>)        # Converts JSON string to object.

Read Object from JSON File

def read_json_file(filename):
    with open(filename, encoding='utf-8') as file:
        return json.load(file)

Write Object to JSON File

def write_to_json_file(filename, an_object):
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(an_object, file, ensure_ascii=False, indent=2)

Pickle

Binary file format for storing Python objects.

import pickle
<bytes>  = pickle.dumps(<object>)   # Converts object to bytes object.
<object> = pickle.loads(<bytes>)    # Converts bytes object to object.

Read Object from File

def read_pickle_file(filename):
    with open(filename, 'rb') as file:
        return pickle.load(file)

Write Object to File

def write_to_pickle_file(filename, an_object):
    with open(filename, 'wb') as file:
        pickle.dump(an_object, file)

CSV

Text file format for storing spreadsheets.

import csv

Read

<reader> = csv.reader(<file>)       # Also: `dialect='excel', delimiter=','`.
<list>   = next(<reader>)           # Returns next row as a list of strings.
<list>   = list(<reader>)           # Returns a list of remaining rows.
  • File must be opened with a 'newline=""' argument, or newlines embedded inside quoted fields will not be interpreted correctly!
  • To print the spreadsheet to the console use Tabulate library.
  • For XML and binary Excel files (xlsx, xlsm and xlsb) use Pandas library.

Write

<writer> = csv.writer(<file>)       # Also: `dialect='excel', delimiter=','`.
<writer>.writerow(<collection>)     # Encodes objects using `str(<el>)`.
<writer>.writerows(<coll_of_coll>)  # Appends multiple rows.
  • File must be opened with a 'newline=""' argument, or ‘\r’ will be added in front of every ‘\n’ on platforms that use ‘\r\n’ line endings!

Parameters

  • 'dialect' - Master parameter that sets the default values. String or a Dialect object.
  • 'delimiter' - A one-character string used to separate fields.
  • 'quotechar' - Character for quoting fields that contain special characters.
  • 'doublequote' - Whether quotechars inside fields are/get doubled or escaped.
  • 'skipinitialspace' - Is space character at the start of the field stripped by the reader.
  • 'lineterminator' - How writer terminates rows. Reader is hardcoded to ‘\n’, ‘\r’, ‘\r\n’.
  • 'quoting' - 0: As necessary, 1: All, 2: All but numbers which are read as floats, 3: None.
  • 'escapechar' - Character for escaping quotechars if doublequote is False.

Dialects

+------------------+--------------+--------------+--------------+
|                  |     excel    |   excel-tab  |     unix     |
+------------------+--------------+--------------+--------------+
| delimiter        |       ','    |      '\t'    |       ','    |
| quotechar        |       '"'    |       '"'    |       '"'    |
| doublequote      |      True    |      True    |      True    |
| skipinitialspace |     False    |     False    |     False    |
| lineterminator   |    '\r\n'    |    '\r\n'    |      '\n'    |
| quoting          |         0    |         0    |         1    |
| escapechar       |      None    |      None    |      None    |
+------------------+--------------+--------------+--------------+

Read Rows from CSV File

def read_csv_file(filename, dialect='excel'):
    with open(filename, encoding='utf-8', newline='') as file:
        return list(csv.reader(file, dialect))

Write Rows to CSV File

def write_to_csv_file(filename, rows, dialect='excel'):
    with open(filename, 'w', encoding='utf-8', newline='') as file:
        writer = csv.writer(file, dialect)
        writer.writerows(rows)

SQLite

A server-less database engine that stores each database into a separate file.

import sqlite3
<conn> = sqlite3.connect(<path>)                # Opens existing or new file. Also ':memory:'.
<conn>.close()                                  # Closes the connection.

Read

<cursor> = <conn>.execute('<query>')            # Can raise a subclass of sqlite3.Error.
<tuple>  = <cursor>.fetchone()                  # Returns next row. Also next(<cursor>).
<list>   = <cursor>.fetchall()                  # Returns remaining rows. Also list(<cursor>).

Write

<conn>.execute('<query>')                       # Can raise a subclass of sqlite3.Error.
<conn>.commit()                                 # Saves all changes since the last commit.
<conn>.rollback()                               # Discards all changes since the last commit.

Or:

with <conn>:                                    # Exits the block with commit() or rollback(),
    <conn>.execute('<query>')                   # depending on whether any exception occurred.

Placeholders

<conn>.execute('<query>', <list/tuple>)         # Replaces '?'s in query with values.
<conn>.execute('<query>', <dict/namedtuple>)    # Replaces ':<key>'s with values.
<conn>.executemany('<query>', <coll_of_above>)  # Runs execute() multiple times.
  • Passed values can be of type str, int, float, bytes, None, bool, datetime.date or datetime.datetime.
  • Bools will be stored and returned as ints and dates as ISO formatted strings.

Example

Values are not actually saved in this example because 'conn.commit()' is omitted!

>>> conn = sqlite3.connect('test.db')
>>> conn.execute('CREATE TABLE person (person_id INTEGER PRIMARY KEY, name, height)')
>>> conn.execute('INSERT INTO person VALUES (NULL, ?, ?)', ('Jean-Luc', 187)).lastrowid
1
>>> conn.execute('SELECT * FROM person').fetchall()
[(1, 'Jean-Luc', 187)]

SqlAlchemy

# $ pip3 install sqlalchemy
from sqlalchemy import create_engine, text
<engine> = create_engine('<url>').connect()     # Url: 'dialect://user:password@host/dbname'.
<conn>   = <engine>.connect()                   # Creates a connection. Also <conn>.close().
<cursor> = <conn>.execute(text('<query>'), )   # Replaces ':<key>'s with keyword arguments.
with <conn>.begin(): ...                        # Exits the block with commit or rollback.
+------------+--------------+-----------+-----------------------------------+
| Dialects   | pip3 install | import    | Dependencies                      |
+------------+--------------+-----------+-----------------------------------+
| mysql      | mysqlclient  | MySQLdb   | www.pypi.org/project/mysqlclient  |
| postgresql | psycopg2     | psycopg2  | www.psycopg.org/docs/install.html |
| mssql      | pyodbc       | pyodbc    | apt install g++ unixodbc-dev      |
| oracle     | cx_oracle    | cx_Oracle | Oracle Instant Client             |
+------------+--------------+-----------+-----------------------------------+

Bytes

Bytes object is an immutable sequence of single bytes. Mutable version is called bytearray.

<bytes> = b'<str>'                          # Only accepts ASCII characters and \x00-\xff.
<int>   = <bytes>[<index>]                  # Returns an int in range from 0 to 255.
<bytes> = <bytes>[<slice>]                  # Returns bytes even if it has only one element.
<bytes> = <bytes>.join(<coll_of_bytes>)     # Joins elements using bytes as a separator.

Encode

<bytes> = bytes(<coll_of_ints>)             # Ints must be in range from 0 to 255.
<bytes> = bytes(<str>, 'utf-8')             # Or: <str>.encode('utf-8')
<bytes> = <int>.to_bytes(n_bytes, )        # `byteorder='little/big', signed=False`.
<bytes> = bytes.fromhex('<hex>')            # Hex pairs can be separated by whitespaces.

Decode

<list>  = list(<bytes>)                     # Returns ints in range from 0 to 255.
<str>   = str(<bytes>, 'utf-8')             # Or: <bytes>.decode('utf-8')
<int>   = int.from_bytes(<bytes>, )        # `byteorder='little/big', signed=False`.
'<hex>' = <bytes>.hex()                     # Returns hex pairs. Accepts `sep=<str>`.

Read Bytes from File

def read_bytes(filename):
    with open(filename, 'rb') as file:
        return file.read()

Write Bytes to File

def write_bytes(filename, bytes_obj):
    with open(filename, 'wb') as file:
        file.write(bytes_obj)

Struct

  • Module that performs conversions between a sequence of numbers and a bytes object.
  • System’s type sizes, byte order, and alignment rules are used by default.
from struct import pack, unpack
<bytes> = pack('<format>', <el_1> [, ...])  # Packages arguments into bytes object.
<tuple> = unpack('<format>', <bytes>)       # Use iter_unpack() for iterator of tuples.
>>> pack('>hhl', 1, 2, 3)
b'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('>hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)

Format

For standard type sizes and manual alignment (padding) start format string with:

  • '=' - System’s byte order (usually little-endian).
  • '<' - Little-endian.
  • '>' - Big-endian (also '!').

Besides numbers, pack() and unpack() also support bytes objects as part of the sequence:

  • 'c' - A bytes object with a single element. For pad byte use 'x'.
  • '<n>s' - A bytes object with n elements.

Integer types. Use a capital letter for unsigned type. Minimum and standard sizes are in brackets:

  • 'b' - char (1/1)
  • 'h' - short (2/2)
  • 'i' - int (2/4)
  • 'l' - long (4/4)
  • 'q' - long long (8/8)

Floating point types:

  • 'f' - float (4/4)
  • 'd' - double (8/8)

Array

List that can only hold numbers of a predefined type. Available types and their minimum sizes in bytes are listed above. Sizes and byte order are always determined by the system.

from array import array
<array> = array('<typecode>', <collection>)    # Array from collection of numbers.
<array> = array('<typecode>', <bytes>)         # Array from bytes object.
<array> = array('<typecode>', <array>)         # Treats array as a sequence of numbers.
<bytes> = bytes(<array>)                       # Or: <array>.tobytes()
<file>.write(<array>)                          # Writes array to the binary file.

Memory View

  • A sequence object that points to the memory of another object.
  • Each element can reference a single or multiple consecutive bytes, depending on format.
  • Order and number of elements can be changed with slicing.
  • Casting only works between char and other types and uses system’s sizes.
  • Byte order is always determined by the system.
<mview> = memoryview(<bytes/bytearray/array>)  # Immutable if bytes, else mutable.
<real>  = <mview>[<index>]                     # Returns an int or a float.
<mview> = <mview>[<slice>]                     # Mview with rearranged elements.
<mview> = <mview>.cast('<typecode>')           # Casts memoryview to the new format.
<mview>.release()                              # Releases the object's memory buffer.

Decode

<bytes> = bytes(<mview>)                       # Returns a new bytes object.
<bytes> = <bytes>.join(<coll_of_mviews>)       # Joins mviews using bytes object as sep.
<array> = array('<typecode>', <mview>)         # Treats mview as a sequence of numbers.
<file>.write(<mview>)                          # Writes mview to the binary file.
<list>  = list(<mview>)                        # Returns a list of ints or floats.
<str>   = str(<mview>, 'utf-8')                # Treats mview as a bytes object.
<int>   = int.from_bytes(<mview>, )           # `byteorder='little/big', signed=False`.
'<hex>' = <mview>.hex()                        # Treats mview as a bytes object.

Deque

A thread-safe list with efficient appends and pops from either side. Pronounced “deck”.

from collections import deque
<deque> = deque(<collection>, maxlen=None)
<deque>.appendleft(<el>)                       # Opposite element is dropped if full.
<deque>.extendleft(<collection>)               # Collection gets reversed.
<el> = <deque>.popleft()                       # Raises IndexError if empty.
<deque>.rotate(n=1)                            # Rotates elements to the right.